
keypoint training nan (#resolved: you need to apply the linear scaling rule when adjusting the number of GPUs during training) #16

Closed
dragonfly90 opened this issue Jan 24, 2018 · 1 comment

Comments


dragonfly90 commented Jan 24, 2018

python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_1x.yaml OUTPUT_DIR /tmp/detectron-output NUM_GPUS 1

I get a negative or NaN loss while training the keypoint model, as shown below, and cannot figure out the reason. The COCO data was downloaded from the official website. I also tried e2e_keypoint_rcnn_R-50-FPN_1x.yaml and other configs and got the same errors. Printing out the losses:

print(np.array([self.losses_and_metrics[k] for k in self.model.losses]))
json_stats: {"accuracy_cls": 0.804688, "eta": "10:24:55", "iter": 120, "loss": 8.408110, "loss_bbox": 0.352813, "loss_cls": 0.407682, "loss_kps": 7.525052, "loss_rpn_bbox_fpn2": 0.000000, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.003086, "loss_rpn_bbox_fpn6": 0.000943, "loss_rpn_cls_fpn2": 0.032497, "loss_rpn_cls_fpn3": 0.021612, "loss_rpn_cls_fpn4": 0.010768, "loss_rpn_cls_fpn5": 0.007526, "loss_rpn_cls_fpn6": 0.002437, "lr": 0.009867, "mb_qsize": 64, "mem": 8873, "time": 0.417168}
[  3.47561250e-03   0.00000000e+00   3.90737087e-01   1.66270062e-02
   0.00000000e+00   0.00000000e+00   4.36257482e-01   1.92877371e-02
   6.92374632e-02   8.01100396e-03   3.68567742e-03   4.04049922e-03
   1.31183434e+01]
total:  14.069702921
E0123 22:41:16.943624  9917 pybind_state.h:422] Exception encountered running PythonOp function: AssertionError: Negative areas founds
[             nan              nan              nan   1.87069760e+26
   3.42327067e+24              nan              nan              nan
              nan   1.00860528e+28              nan   2.96251020e+27
              nan]
total:  nan
CRITICAL train_net.py: 239: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread

rbgirshick (Contributor) commented

@dragonfly90: I encourage you to study the multi-GPU training section of the getting started guide here: https://github.com/facebookresearch/Detectron/blob/master/GETTING_STARTED.md#2-multi-gpu-training. In particular, pay attention to the note about 8-GPU training and the need to apply the linear scaling rule when adjusting the minibatch size (which happens implicitly when going from 8 GPUs to 1). It is not surprising that the loss quickly goes to NaN, since your learning rate is 8x higher than it should be.
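
For concreteness, here is a minimal sketch of the adjusted command, assuming the config uses the standard 1x schedule for 8 GPUs (BASE_LR: 0.02, MAX_ITER: 90000, STEPS: [0, 60000, 80000]); the specific override values are illustrative, not taken from this issue, and should be checked against your YAML. With NUM_GPUS 1 the effective minibatch is 8x smaller, so the linear scaling rule says to divide the base learning rate by 8 (0.02 / 8 = 0.0025) and multiply the iteration counts by 8:

python2 tools/train_net.py \
    --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_1x.yaml \
    OUTPUT_DIR /tmp/detectron-output \
    NUM_GPUS 1 \
    SOLVER.BASE_LR 0.0025 \
    SOLVER.MAX_ITER 720000 \
    SOLVER.STEPS "[0, 480000, 640000]"

Equivalently, you can edit SOLVER.BASE_LR, SOLVER.MAX_ITER, and SOLVER.STEPS directly in the YAML config; either way, keep the learning rate and schedule scaled consistently with the number of GPUs.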
