Error running inference on CPU for network with BatchNorm #694
Comments
Some more information. This is the BN definition that worked well in 3.0.
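(The prototxt block itself is not reproduced here. Purely as an illustration of the kind of definition being discussed, and not the author's exact layer, a BatchNorm layer with the two zeroed param fields mentioned further down might look like this; layer and blob names are placeholders.)

layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  # the two extra param fields with everything set to zero, as questioned later in the thread
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}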
Can you explain this in more detail? What are you seeing when you do Classify One? Is the softmax the only output of your deploy network?
The issue still happens after a few more epochs. The bug does not happen if I abort the training and only then run Classify One.
If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.
This does not happen with the 3.0 version, which indicates that something has changed.
Yes indeed, in DIGITS 3.0 GPU resources were not explicitly reserved during inference. This would cause other issues, e.g. running out of GPU memory during training. This changed with #573: now there needs to be a free GPU (i.e. totally free, we don't have granularity for partially-free GPUs) to run inference on GPU. Sorry to hear that this might cause trouble, but we felt this would fix a "bigger" issue. Now there is the question of why Caffe is crashing during CPU inference. Is this only happening in the presence of batch normalization? I am not seeing this issue when using the vanilla AlexNet. Is this because CuDNN is the default batch normalization engine?
Maybe. The BatchNormalization prototxt I am using is the following:
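(The actual snippet is missing from this copy. Judging from the exchange below about explicitly declaring the CuDNN engine, it presumably looked something like the sketch below, assuming nv-caffe exposes the engine choice through batch_norm_param; layer and blob names are placeholders.)

layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
  batch_norm_param {
    # explicitly selects the CuDNN implementation, which only runs on the GPU
    engine: CUDNN
  }
}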
What is the other option? I thought I had to explicitly declare the CUDNN engine to use the CuDNN one?
Actually, if I add a BN layer to AlexNet I get this error when training on the CPU:
I think nv-caffe should fall back to the Caffe (vs. CuDNN) engine when running on the CPU. See this for a list of options.
I understand. Please correct me if I am wrong... So for future reference: engine: CUDNN means the CuDNN implementation only (GPU). For others that might be interested (issue #629): I haven't had convergence issues using the default engine.
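(For future readers, a sketch of how the engine choice would be spelled out in the layer definition, assuming nv-caffe's batch_norm_param follows the usual Caffe DEFAULT / CAFFE / CUDNN engine pattern discussed above:)

batch_norm_param { engine: CUDNN }    # CuDNN implementation; GPU only, fails for CPU-only inference
batch_norm_param { engine: CAFFE }    # Caffe implementation; also runs on the CPU
batch_norm_param { engine: DEFAULT }  # let the framework choose (per the thread, CuDNN is the default in nv-caffe)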
So it seems this was more a problem with documentation than a bug itself. I do not think there is a need to change Classify One to fix this. But if anyone thinks differently, maybe a good solution would be to check whether Classify One is running in CPU mode and, if so, whether all the layers are available in CPU mode, and emit a warning whenever that check fails? @gheinrich please feel free to close the issue if you judge so. Thanks for the help.
If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.
It doesn't look like BatchNorm is unavailable on the CPU. It looks to me like this is the issue:
I don't know why the CPU would encounter an overflow when the GPU doesn't, though...
The Caffe and CuDNN implementations of BN are known to be incompatible in nv-caffe 0.14. That might explain the overflow?
About the change introduced in #573: I fully see the reason (trainings were crashing in 3.0), but would it be possible to implement an option marked "non-default", "unsafe", "deprecated", or "use at your own risk" that restores the previous behavior?
By the way, @igorbb, do you know why there are two additional param fields with everything set to zero in your BN definition? Also, unrelated to the main issue, the error from your log, "Couldn't import dot_parser, loading of dot files will not be possible.", can be resolved (at least that worked for me) by installing an older pyparsing version: pip uninstall pyparsing
Hey guys!
I was using Batch Normalization in my network with DIGITS 3.0.
This was the end of my network (which was working fine):
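(The original snippet is missing here. As a rough illustration only, and not the author's actual network, a DIGITS 3.0 style ending without any deploy-stage annotations could have looked like this, with placeholder layer names:)

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "fc8"
  top: "softmax"
}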
Now, after updating to the master branch (3.3), I had to change the end of my network for the new
include { stage: "deploy" }
definitions. Thus, the end of the network now looks like this:
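(Again the snippet itself was not preserved. A hypothetical version of the same ending rewritten for the DIGITS 3.3 all-in-one convention, where train/val-only layers exclude the deploy stage and the softmax is included only in deploy, might be:)

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
  exclude { stage: "deploy" }
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "fc8"
  top: "softmax"
  include { stage: "deploy" }
}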
The issue is that the Classify One output now seems not to be using the softmax terms; I cannot classify an image that I could classify before.
I cannot put my finger on exactly where, but it seems to be related to
include { stage: "deploy" }
which is not mandatory. It also seems to be a big issue when using Batch Normalization.