
keypoint training nan (#resolved: you need to apply the linear scaling rule when adjusting the number of GPUs during training) #16

Closed
dragonfly90 opened this issue Jan 24, 2018 · 1 comment

Comments


dragonfly90 commented Jan 24, 2018

python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_1x.yaml OUTPUT_DIR /tmp/detectron-output NUM_GPUS 1

I get a negative or NaN loss while training the keypoint model, as shown below, and cannot figure out the reason. The COCO data was downloaded from the official website. I also tried e2e_keypoint_rcnn_R-50-FPN_1x.yaml and other configs and got the same errors. Printing out the losses:

print(np.array([self.losses_and_metrics[k] for k in self.model.losses]))
json_stats: {"accuracy_cls": 0.804688, "eta": "10:24:55", "iter": 120, "loss": 8.408110, "loss_bbox": 0.352813, "loss_cls": 0.407682, "loss_kps": 7.525052, "loss_rpn_bbox_fpn2": 0.000000, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.003086, "loss_rpn_bbox_fpn6": 0.000943, "loss_rpn_cls_fpn2": 0.032497, "loss_rpn_cls_fpn3": 0.021612, "loss_rpn_cls_fpn4": 0.010768, "loss_rpn_cls_fpn5": 0.007526, "loss_rpn_cls_fpn6": 0.002437, "lr": 0.009867, "mb_qsize": 64, "mem": 8873, "time": 0.417168}
[  3.47561250e-03   0.00000000e+00   3.90737087e-01   1.66270062e-02
   0.00000000e+00   0.00000000e+00   4.36257482e-01   1.92877371e-02
   6.92374632e-02   8.01100396e-03   3.68567742e-03   4.04049922e-03
   1.31183434e+01]
total:  14.069702921
E0123 22:41:16.943624  9917 pybind_state.h:422] Exception encountered running PythonOp function: AssertionError: Negative areas founds
[             nan              nan              nan   1.87069760e+26
   3.42327067e+24              nan              nan              nan
              nan   1.00860528e+28              nan   2.96251020e+27
              nan]
total:  nan
CRITICAL train_net.py: 239: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread

rbgirshick (Contributor) commented

@dragonfly90: I encourage you to study the multi-GPU training section of the getting started guide here: https://github.com/facebookresearch/Detectron/blob/master/GETTING_STARTED.md#2-multi-gpu-training. In particular, pay attention to the note about 8-GPU training and the need to apply the linear scaling rule when adjusting the minibatch size (which happens implicitly when going from 8 GPUs to 1). It is not surprising that the loss quickly goes to NaN, since your learning rate is 8x higher than it should be.
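
For concreteness, here is a minimal sketch of the adjusted command, assuming the config uses the standard 1x schedule for 8 GPUs (BASE_LR: 0.02, MAX_ITER: 90000, STEPS: [0, 60000, 80000]); the specific override values are illustrative, not taken from this issue, and should be checked against your YAML. With NUM_GPUS 1 the effective minibatch is 8x smaller, so the linear scaling rule says to divide the base learning rate by 8 (0.02 / 8 = 0.0025) and multiply the iteration counts by 8:

python2 tools/train_net.py \
    --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-101-FPN_1x.yaml \
    OUTPUT_DIR /tmp/detectron-output \
    NUM_GPUS 1 \
    SOLVER.BASE_LR 0.0025 \
    SOLVER.MAX_ITER 720000 \
    SOLVER.STEPS "[0, 480000, 640000]"

Equivalently, you can edit SOLVER.BASE_LR, SOLVER.MAX_ITER, and SOLVER.STEPS directly in the YAML config; either way, keep the learning rate and schedule scaled consistently with the number of GPUs.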
