
Error running inference on CPU for network with BatchNorm #694

Closed
igorbb opened this issue Apr 20, 2016 · 16 comments

@igorbb

igorbb commented Apr 20, 2016

Hey guys!

I was using Batch Normalization in my network with DIGITS 3.0.

This was the end of my network (which was working fine):

layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  accuracy_param {
    top_k: 5
  }
}

Now, after updating to the master branch (3.3), I had to change the end of my network to use the new include { stage: "deploy" } definitions.

Thus, the end of the network now looks like this:

layer {
  bottom: "loss1/fc/bn"
  top: "loss1/classifier"
  name: "loss1/classifier"
  type: "InnerProduct"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    #num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "loss1/loss"
  type: "SoftmaxWithLoss"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss"
  loss_weight: 1
  exclude { stage: "deploy" }
}
layer {
  name: "loss1/top-1"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy"
  include { stage: "val" }
}
layer {
  name: "loss1/top-5"
  type: "Accuracy"
  bottom: "loss1/classifier"
  bottom: "label"
  top: "accuracy-top5"
  include { stage: "val" }
  accuracy_param {
    top_k: 5
  }
}
layer {
  name: "softmax"
  type: "Softmax"
  bottom: "loss1/classifier"
  top: "softmax"
  include { stage: "deploy" }
}

The issue is that Classify One now seems not to be using the softmax output: I can no longer classify an image that I could classify before.

I cannot put my finger on where exactly, but it seems to be related to the include { stage: "deploy" } definitions, which are not mandatory.

It also seems to be a big issue when using Batch Normalization.

@igorbb
Author

igorbb commented Apr 20, 2016

Some more information. This is the BN definition that worked well on 3.0.

## BN
layer {
  bottom: "loss1/fc"
  name: "loss1/fc/bn"
  top: "loss1/fc/bn"
  type: "BatchNorm"
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  batch_norm_param {
    moving_average_fraction: 0.980000019073
    eps: 9.99999974738e-05
    scale_filler {
      type: "constant"
      value: 1.0
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}

@gheinrich
Contributor

The issue is that Classify One now seems not to be using the softmax output: I can no longer classify an image that I could classify before.

Can you explain this in more detail? What are you seeing when you do Classify One?

Is the softmax the only output of your deploy network?

@igorbb
Author

igorbb commented Apr 20, 2016

Yes, it was the only output.

When using DIGITS 3.0 (from the NVIDIA Docker image) and an Inception-based network (GoogLeNet), everything works well. If I add Batch Normalization before every ReLU, it still works well and we get faster convergence, as expected.

Now, to reproduce the error:

If I do the same on DIGITS 3.3, it does not go well.
The only difference in the prototxt is the stage definition parameters (i.e. include { stage: "val" }).
This excerpt shows the difference:
[screenshot: selection_047]

After a few epochs, I run my classification (while still training), and the following messages appear on the terminal.

2016-04-20 14:09:26 [20160420-140924-09b0] [INFO ] Infer Model task started.
2016-04-20 14:09:26 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: libdc1394 error: Failed to initialize libdc1394
2016-04-20 14:09:29 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: /usr/lib/python2.7/dist-packages/numpy/core/_methods.py:102: RuntimeWarning: overflow encountered in multiply
2016-04-20 14:09:29 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: x = um.multiply(x, x, out=x)
2016-04-20 14:09:31 [20160420-140924-09b0] [WARNING] Infer Model unrecognized output: Couldn't import dot_parser, loading of dot files will not be possible.
2016-04-20 14:09:31 [20160420-140924-09b0] [INFO ] Infer Model task completed.
2016-04-20 14:09:31 [20160420-140924-09b0] [INFO ] Job complete.

I will wait a few more epochs to let you know if the error continues. I will then stop the training process and see if the error still happens.

@igorbb
Author

igorbb commented Apr 20, 2016

The issue still happens after a few more epochs.

The bug does not happen if I abort the training and only then do Classify One.
Maybe it is a memory issue? My batch was large enough to use > 90% of the GPU's memory while training.

@gheinrich
Contributor

If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.

@igorbb
Author

igorbb commented Apr 20, 2016

This does not happen with the 3.0 version, which indicates that something has changed.
I also left 80% of the GPU available and the same error message appears.

@gheinrich
Contributor

Yes, indeed: in DIGITS 3.0, GPU resources were not explicitly reserved during inference. This would cause some other issues, e.g. running out of GPU memory during training. This changed with #573. Now there needs to be a free GPU (i.e. totally free, we don't have granularity for partially-free GPUs) to run inference on a GPU. Sorry to hear that this might cause trouble, but we felt this would fix a "bigger" issue.

Now there is the question of why Caffe is crashing during CPU inference. Is this only happening in the presence of batch normalization? I am not seeing this issue when using the vanilla Alexnet. Is this because CuDNN is the default batch normalization engine?

@igorbb
Author

igorbb commented Apr 20, 2016

Maybe.

The BatchNormalization prototxt I am using is the following:

## BN
layer {
  bottom: "loss1/fc"
  name: "loss1/fc/bn"
  top: "loss1/fc/bn"
  type: "BatchNorm"
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 1.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  batch_norm_param {
    moving_average_fraction: 0.980000019073
    eps: 9.99999974738e-05
    scale_filler {
      type: "constant"
      value: 1.0
    }
    bias_filler {
      type: "constant"
      value: 0.0
    }
  }
}

What is the other option? I thought I had to explicitly declare the CUDNN engine to use the CuDNN one?
Which one is the recommended one?
I can try another one if you can point me to a simple BN prototxt template.

@gheinrich
Contributor

Actually, if I add a BN layer to Alexnet I get this error when training on the CPU:

*** Aborted at 1461179733 (unix time) try "date -d @1461179733" if you are using GNU date ***
PC: @     0x7fae8dc27855 (unknown)
*** SIGSEGV (@0x727084000) received by PID 13900 (TID 0x7fae99c4ba40) from PID 654852096; stack trace: ***
@     0x7fae97cf2d40 (unknown)
@     0x7fae8dc27855 (unknown)
@     0x7fae9969f23a caffe::CuDNNBatchNormLayer<>::Backward_gpu()
@     0x7fae9957c4e7 caffe::Net<>::BackwardFromTo()
@     0x7fae9957c651 caffe::Net<>::Backward()
@     0x7fae99520f31 caffe::Solver<>::Step()
@     0x7fae99521775 caffe::Solver<>::Solve()
@           0x40b993 train()
@           0x4092e8 main
@     0x7fae97cddec5 (unknown)
@           0x409acb (unknown)
@                0x0 (unknown)

I think nv-caffe should resort to using the Caffe (vs. CuDNN) engine when running on the CPU. See this for a list of options.
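
In the meantime, a possible workaround might be to pin the BN layers to the Caffe implementation in the prototxt. This is an untested sketch, assuming this nv-caffe version exposes an engine field in batch_norm_param (values DEFAULT / CAFFE / CUDNN):

# Sketch only: force the Caffe BN implementation so the layer can also run on the CPU.
# Assumes batch_norm_param accepts an `engine` field in this nv-caffe build.
# (The four param { } blocks from the original layer are omitted here for brevity.)
layer {
  bottom: "loss1/fc"
  name: "loss1/fc/bn"
  top: "loss1/fc/bn"
  type: "BatchNorm"
  batch_norm_param {
    engine: CAFFE   # CAFFE runs on CPU and GPU; CUDNN is GPU-only
    moving_average_fraction: 0.98
    eps: 1e-4
    scale_filler { type: "constant" value: 1.0 }
    bias_filler { type: "constant" value: 0.0 }
  }
}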

@igorbb
Author

igorbb commented Apr 20, 2016

I understand. Please correct me if I am wrong...

So, for future reference:

engine: CUDNN - CuDNN only (GPU)
engine: DEFAULT - CUDA-based only (GPU)
engine: CAFFE - CPU + GPU

For others that might be interested (see issue #629): I haven't had convergence issues using the DEFAULT engine.

@igorbb
Author

igorbb commented Apr 20, 2016

So it seems this was more a problem with documentation than a bug itself.

I do not think there is a need to change Classify One to fix this. But if anyone thinks differently, maybe a good solution would be to check whether Classify One is running in CPU mode and, if so, whether all the layers are available in CPU mode. Whenever that check fails, maybe emit a warning?

@gheinrich please feel free to close the issue if you see fit. Thanks for the help.

@TimZaman
Contributor

TimZaman commented Apr 20, 2016

If all your GPUs are used for training then Caffe will resort to using the CPU for inference. That might be why you are getting this error.
I have this quite often when running 4 GPUs (using Torch) and I want to classify one image: it gives an error which is not very descriptive, if I recall correctly; it does not want to run because I am using all GPUs.

@lukeyeager
Member

It doesn't look like BatchNorm is unavailable on the CPU. It looks to me like this is the issue:

[WARNING] Infer Model unrecognized output: /usr/lib/python2.7/dist-packages/numpy/core/_methods.py:102: RuntimeWarning: overflow encountered in multiply
[WARNING] Infer Model unrecognized output: x = um.multiply(x, x, out=x)

I don't know why the CPU would encounter an overflow when the GPU doesn't though...

@gheinrich
Contributor

The Caffe and CuDNN implementations of BN are known to be incompatible on nv-caffe 0.14. That might explain the overflow???

@pansk

pansk commented Apr 23, 2016

About the change introduced in #573, I perfectly see the reason (trainings were crashing in 3.0), but would it be possible to implement an option marked "non-default", "unsafe", "deprecated", or "use at your own risk" that restores the previous behavior?
Sometimes it's really handy to make a quick test while training a network, to see how well (or, in my case, how badly) things are going, and one can decide to take that chance when running a training that uses about half of the device memory or less.

@gheinrich gheinrich added the bug label Apr 25, 2016
@mfernezir

mfernezir commented Apr 28, 2016

Btw, @igorbb do you know why there are two additional param fields with everything set to zero in your BN definition?

Also, unrelated to the main issue, the error from your log "Couldn't import dot_parser, loading of dot files will not be possible." can be resolved (at least that worked for me) if you install an older pyparsing version:

pip uninstall pyparsing
pip install pyparsing==1.5.7

@lukeyeager lukeyeager changed the title from "New stage definition brakes Classify One and Batch Normalization" to "Error running inference on CPU for network with BatchNorm" on Apr 28, 2016
@igorbb igorbb closed this as completed Apr 21, 2022