
Error running inference on CPU for network with BatchNorm #694

Closed
igorbb opened this issue Apr 20, 2016 · 16 comments

@igorbb

igorbb commented Apr 20, 2016

Hey guys!

I was using Batch Normalization in my network with DIGITS 3.0.

This was the end of my network (which was working fine):

layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  accuracy_param {
    top_k: 5
  }
}

Now, after updating to the master branch (3.3), I had to change the end of my network to use the new include { stage: "deploy" } definitions.

Thus, the end of the network now looks like this:

layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
  exclude { stage: "deploy" }
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  include { stage: "val" }
  accuracy_param {
    top_k: 5
  }
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "loss1/classifier"
  top: "softmax"
  include { stage: "deploy" }
}

The issue is that Classify One now seems not to be using the softmax output: I can no longer classify an image that I could classify before.

I cannot put my finger on where exactly, but it seems to be related to the include { stage: "deploy" } definitions, which are not mandatory.

It also seems to be a big issue when using Batch Normalization.

@igorbb
Author

igorbb commented Apr 20, 2016

Some more information. This is the BN definition that worked well on 3.0.

## BN
layer {
  bottom: "loss1/fc"
  name: "loss1/fc/bn"
  top: "loss1/fc/bn"
  type: "BatchNorm"
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  batch_norm_param {
    moving_average_fraction: 0.980000019073
    eps: 9.99999974738e-05
    scale_filler {
      type: "constant"
      value: 1.0
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}

@gheinrich
Contributor

The issue is that Classify One now seems not to be using the softmax output: I can no longer classify an image that I could classify before.

Can you explain this in more detail? What are you seeing when you do Classify One?

Is the softmax the only output of your deploy network?

@igorbb
Author

igorbb commented Apr 20, 2016

Yes, it was the only output.

When using DIGITS 3.0 (from the NVIDIA Docker image) and an Inception-based network (GoogLeNet), everything works well. If I add Batch Normalization before every ReLU, it still works well and we get faster convergence, as expected.

Now, to reproduce the error:

If I do the same on DIGITS 3.3, it does not go well.
The only difference in the prototxt is the stage definition parameters (i.e. include { stage: "val" }).
This excerpt shows the difference:
[screenshot: selection_047]

After a few epochs, I run my classification (while still training), and the following messages appear on the terminal.

2016-04-20 14:09:26 [20160420-140924-09b0] [INFO ] Infer Model task started.
2016-04-20 14:09:26 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: libdc1394 error: Failed to initialize libdc1394
2016-04-20 14:09:29 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: /usr/lib/python2.7/dist-packages/numpy/core/_methods.py:102: RuntimeWarning: overflow encountered in multiply
2016-04-20 14:09:29 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: x = um.multiply(x, x, out=x)
2016-04-20 14:09:31 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: Couldn't import dot_parser, loading of dot files will not be possible.
2016-04-20 14:09:31 [20160420-140924-09b0] [INFO ] Infer Model task completed.
2016-04-20 14:09:31 [20160420-140924-09b0] [INFO ] Job complete.

I will wait a few more epochs to let you know if the error continues. I will then stop the training process and see if the error still happens.

@igorbb
Author

igorbb commented Apr 20, 2016

The issue still happens after a few more epochs.

The bug does not happen if I abort the training and only then do Classify One.
Maybe it is a memory issue? My batch was large enough to use > 90% of the GPU's memory while training.

@gheinrich
Contributor

If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.

@igorbb
Author

igorbb commented Apr 20, 2016

This does not happen with the 3.0 version, which indicates that something has changed.
I also left 80% of the GPU available and the same error message appears.

@gheinrich
Contributor

Yes, indeed: in DIGITS 3.0, GPU resources were not explicitly reserved during inference. This would cause some other issues, e.g. running out of GPU memory during training. This changed with #573. Now there needs to be a free GPU (i.e. totally free, we don't have granularity for partially-free GPUs) to run inference on a GPU. Sorry to hear that this might cause trouble, but we felt this would fix a "bigger" issue.

Now there is the question of why Caffe is crashing during CPU inference. Is this only happening in the presence of batch normalization? I am not seeing this issue when using the vanilla Alexnet. Is this because CuDNN is the default batch normalization engine?

@igorbb
Author

igorbb commented Apr 20, 2016

Maybe.

The BatchNormalization prototxt I am using is the following:

## BN
layer {
  bottom: "loss1/fc"
  name: "loss1/fc/bn"
  top: "loss1/fc/bn"
  type: "BatchNorm"
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  batch_norm_param {
    moving_average_fraction: 0.980000019073
    eps: 9.99999974738e-05
    scale_filler {
      type: "constant"
      value: 1.0
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}

What is the other option? I thought I had to explicitly declare the CUDNN engine to use the CuDNN one?
Which one is the recommended one?
I can try another one if you can point me to a simple BN prototxt template.

@gheinrich
Contributor

Actually, if I add a BN layer to Alexnet I get this error when training on the CPU:

*** Aborted at 1461179733 (unix time) try "date -d @1461179733" if you are using GNU date ***
PC: @     0x7fae8dc27855 (unknown)
*** SIGSEGV (@0x727084000) received by PID 13900 (TID 0x7fae99c4ba40) from PID 654852096; stack trace: ***
@     0x7fae97cf2d40 (unknown)
@     0x7fae8dc27855 (unknown)
@     0x7fae9969f23a caffe::CuDNNBatchNormLayer<>::Backward_gpu()
@     0x7fae9957c4e7 caffe::Net<>::BackwardFromTo()
@     0x7fae9957c651 caffe::Net<>::Backward()
@     0x7fae99520f31 caffe::Solver<>::Step()
@     0x7fae99521775 caffe::Solver<>::Solve()
@           0x40b993 train()
@           0x4092e8 main
@     0x7fae97cddec5 (unknown)
@           0x409acb (unknown)
@                0x0 (unknown)

I think nv-caffe should resort to using the Caffe (vs. CuDNN) engine when running on the CPU. See this for a list of options.
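
In the meantime, a possible workaround might be to pin the BN layers to the Caffe implementation in the prototxt. This is an untested sketch, assuming this nv-caffe version exposes an engine field in batch_norm_param (values DEFAULT / CAFFE / CUDNN):

# Sketch only: force the Caffe BN implementation so the layer can also run on the CPU.
# Assumes batch_norm_param accepts an `engine` field in this nv-caffe build.
# (The four param { } blocks from the original layer are omitted here for brevity.)
layer {
  bottom: "loss1/fc"
  name: "loss1/fc/bn"
  top: "loss1/fc/bn"
  type: "BatchNorm"
  batch_norm_param {
    engine: CAFFE   # CAFFE runs on CPU and GPU; CUDNN is GPU-only
    moving_average_fraction: 0.98
    eps: 1e-4
    scale_filler { type: "constant" value: 1.0 }
    bias_filler { type: "constant" value: 0.0 }
  }
}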

@igorbb
Author

igorbb commented Apr 20, 2016

I understand. Please correct me if I am wrong...

So, for future reference:

engine: CUDNN - CuDNN only (GPU)
engine: DEFAULT - CUDA-based only (GPU)
engine: CAFFE - CPU + GPU

For others that might be interested (see issue #629): I haven't had convergence issues using the DEFAULT engine.

@igorbb
Author

igorbb commented Apr 20, 2016

So it seems this was more a problem with documentation than a bug itself.

I do not think there is a need to change Classify One to fix this. But if anyone thinks differently, maybe a good solution would be to check whether Classify One is running in CPU mode and, if so, whether all the layers are available in CPU mode. Whenever that check fails, maybe emit a warning?

@gheinrich please feel free to close the issue if you see fit. Thanks for the help.

@TimZaman
Contributor

TimZaman commented Apr 20, 2016

If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.
I have this quite often when running 4 GPUs (using Torch) and I want to classify one image: it gives an error which is not very descriptive, if I recall correctly; it does not want to run because I am using all GPUs.

@lukeyeager
Member

It doesn't look like BatchNorm is unavailable on the CPU. It looks to me like this is the issue:

[WARNING] Infer Model unrecognized output: /usr/lib/python2.7/dist-packages/numpy/core/_methods.py:102: RuntimeWarning: overflow encountered in multiply
[WARNING] Infer Model unrecognized output: x = um.multiply(x, x, out=x)

I don't know why the CPU would encounter an overflow when the GPU doesn't though...

@gheinrich
Contributor

The Caffe and CuDNN implementations of BN are known to be incompatible on nv-caffe 0.14. That might explain the overflow???

@pansk

pansk commented Apr 23, 2016

About the change introduced in #573, I perfectly see the reason (trainings were crashing in 3.0), but would it be possible to implement an option marked "non-default", "unsafe", "deprecated", or "use at your own risk" that restores the previous behavior?
Sometimes it's really handy to make a quick test while training a network, to see how well (or, in my case, how badly) things are going, and one can decide to take that chance when running a training that uses about half of the device memory or less.

@gheinrich gheinrich added the bug label Apr 25, 2016
@mfernezir

mfernezir commented Apr 28, 2016

Btw, @igorbb do you know why there are two additional param fields with everything set to zero in your BN definition?

Also, unrelated to the main issue, the error from your log "Couldn't import dot_parser, loading of dot files will not be possible." can be resolved (at least that worked for me) if you install an older pyparsing version:

pip uninstall pyparsing
pip install pyparsing==1.5.7

@lukeyeager lukeyeager changed the title from "New stage definition brakes Classify One and Batch Normalization" to "Error running inference on CPU for network with BatchNorm" on Apr 28, 2016
@igorbb igorbb closed this as completed Apr 21, 2022