
New sanity check for "invalid bottom" blobs breaks valid networks #601

Closed
Shadfield opened this issue Feb 25, 2016 · 14 comments

@Shadfield

Revision d3cbdca introduced a new sanity check into digits/model/tasks/caffe_train.py, from line 1515 onwards.

I don't understand the code well enough to pinpoint the error further than this, but there seems to be some problem with the implementation of the sanity check.

Below is an MWE network definition. In previous revisions of DIGITS this executes successfully and plots the performance on the training and validation (test) data. Obviously this MWE has nothing to learn, so the error would be constant for all epochs.

In the new revision of DIGITS the training process quits before it even starts, with the error message:

ERROR: Layer 'flatlabel' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.

Traceback (most recent call last):
  File "/home/mofo/digits_new/digits/scheduler.py", line 482, in run_task
    task.run(resources)
  File "/home/mofo/digits_new/digits/task.py", line 179, in run
    self.before_run()
  File "/home/mofo/digits_new/digits/model/tasks/caffe_train.py", line 133, in before_run
    self.save_files_generic()
  File "/home/mofo/digits_new/digits/model/tasks/caffe_train.py", line 669, in save_files_generic
    CaffeTrainTask.net_sanity_check(deploy_network, caffe_pb2.TEST)
  File "/home/mofo/digits_new/digits/model/tasks/caffe_train.py", line 1451, in net_sanity_check
    layer.name, bottom, "TRAIN" if phase == caffe_pb2.TRAIN else "TEST"))
CaffeTrainSanityCheckError: Layer 'flatlabel' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.

The sanity check is complaining that the flatten layer requires the label blob, even though the label blob is available in both train and test modes. I've also tried including two flatten layers (one specified with "phase: TEST" and the other with "phase: TRAIN"), with no luck.

If we remove the flatten layer from the MWE (so label is connected straight to the loss layer) it passes the sanity check, and correctly computes the validation loss on the test data. So the label blob is definitely available at test time, even in the most recent DIGITS revision.

MWE network definition copied from the train_val.prototxt that DIGITS creates, with filepaths anonymized.

layer {
  name: "data"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "xxx/train/images.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "xxx/train/labels.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "xxx/test/images.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "xxx/test/labels.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "flatlabel"
  type: "Flatten"
  bottom: "label"
  top: "flatlabel"
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "data"
  bottom: "flatlabel"
  top: "l2_error"
}
@gheinrich
Contributor

Thanks for the report. The check was introduced to catch cases when one would want to use the label blob during inference (i.e. in the deploy network). When you test a single image with your model (before the sanity check was introduced), aren't you getting an error like:

Unknown bottom blob 'label' (layer 'flatlabel', bottom index 0)

A way to solve your problem would be to add this to your flatlabel and loss layers:

include { phase: TRAIN }
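
Applied to the MWE above, the flatten layer would look something like the sketch below (the loss layer would get the same include block):

layer {
  name: "flatlabel"
  type: "Flatten"
  bottom: "label"
  top: "flatlabel"
  # limit this layer to the training phase so the deploy network never references it
  include {
    phase: TRAIN
  }
}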

However that would mean you wouldn't see the validation loss and this is probably not what you want.

There is already a test in DIGITS to prevent loss functions from being added to the deploy network; however, your case introduces an intermediate layer, and I think this case wasn't really anticipated in DIGITS... it would be nice if we could discriminate between the training, validation and test phases, as @lukeyeager suggested on BVLC/caffe#1245 (comment).

@lukeyeager
Member

We could add the train_ and deploy_ hack for classification networks as well as generic ones:
https://github.com/NVIDIA/DIGITS/blob/v3.2.0/digits/model/tasks/caffe_train.py#L579-L584

It's hacky, but it works.

@gheinrich
Contributor

So with this change @Shadfield would just need to rename her/his flatlabel layer to e.g. train_val_flatlabel?

@lukeyeager
Member

To use a layer for training and validation but not deploy, he'd set the name to train_flatlabel and leave the phase unset.

The way it works now is:

Desired phase            Layer name        included phase
-----------------------  ----------------  ----------------
All phases               mylayer           unset
Training + Validation    train_mylayer     unset
Validation + Deploy      mylayer           TEST
Training only            train_mylayer     TRAIN
Validation only          train_mylayer     TEST
Deploy only              deploy_mylayer    TEST (or unset)

Not exactly what I'd call obvious.
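
Concretely, for the MWE above the flatten layer would be renamed roughly like this (a sketch, assuming the prefix hack is extended to generic networks as proposed):

layer {
  # the train_ prefix asks DIGITS to keep this layer for training and validation
  # but strip it from the deploy network; the phase is deliberately left unset
  name: "train_flatlabel"
  type: "Flatten"
  bottom: "label"
  top: "flatlabel"
}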

@Shadfield
Author

@lukeyeager One thing which I perhaps didn't make clear: I was testing with generic models, not classification networks.

In answer to the earlier question by @gheinrich: yes when doing "test a single image" in the earlier DIGITS revision, I get the caffe error you pointed out (and it crashes the DIGITS server).

Regarding the proposed solution: I previously thought the train_ layer prefix was the same as using include { phase: TRAIN }, but the table above shows that's wrong. I had already tried include { phase: TRAIN } as suggested by @gheinrich, but as he also points out, I don't see the validation results (as the table says I shouldn't).

According to the table it seems like the train_ prefix should be the solution. But when I try this I get the error below.

new_queue_pairs_.size() == 0 (1 vs. 0)
Creating layer data
Creating Layer data
data -> data
Check failed: new_queue_pairs_.size() == 0 (1 vs. 0)
Waiting for data

I've attached the logfile, train_val and deploy files generated by DIGITS (the "train_" prefix has been automatically removed). Looking at the logfile, it seems like it built the train network successfully and died while building the test network (which still has the flatten layer in). Maybe someone else could try to replicate this and see if the prefix works for them?
MWE_flat_label.zip

@lukeyeager
Member

Check BVLC/caffe#3394

  // Check no additional readers have been created. This can happen if
  // more than one net is trained at a time per process, whether single
  // or multi solver. It might also happen if two data layers have same
  // name and same source.

@lukeyeager
Member

Looks like all four of your data layers are reading from the same source? That can't be what you want.

@twistedmage

Oh, sorry, that's my mistake. After moving back and forth between different revisions of DIGITS, I tried to save time creating the dataset by re-using the same LMDB file for all four data sources, just for the purposes of this MWE.

When I create a proper dataset and use the suggested train_ prefix, everything works in the newest version of DIGITS. It passes the sanity check and I'm also able to see the validation performance during training.

I suppose this "bug report" was a bit of a waste of everyone's time; there was a working solution, even if it wasn't obvious! Perhaps instead I could make a feature request: to stop other people running into the same difficulties, maybe it would be useful to unify the behaviour of include phases and prefixes? And also to add "validation" as a third option, similar to @gheinrich's suggestion of train_val_flatlabel above.

@lukeyeager
Member

Replacing the entire phase/stage framework with layer name prefixes would make me sad. But you're right - we could add even more hackery to make that option more usable.

I'm still crossing my fingers for BVLC/caffe#3211 (comment).

I suppose this "bug report" was a bit of a waste of everyones time

Not at all! If I don't know something is bothering you, I can't fix it!

gheinrich added a commit to gheinrich/DIGITS that referenced this issue Feb 26, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
gheinrich added a commit to gheinrich/DIGITS that referenced this issue Feb 29, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
lukeyeager pushed a commit to lukeyeager/DIGITS that referenced this issue Mar 9, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
lukeyeager pushed a commit to lukeyeager/DIGITS that referenced this issue Mar 10, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
@lukeyeager
Member

Closing this issue. You should be able to get around any failing sanity checks with the new power of #628, and the new_queue_pairs_.size() error was caused by improper usage of Data layers.

@khurram-amin

khurram-amin commented Aug 29, 2016

@lukeyeager Can you please put up a tutorial regarding this sanity-check feature? Does one need to append train_ (or deploy_) only to the name field, or does the top field also need to be changed?
I have tried (almost) all combinations and none works for me. Another option that may work is to essentially duplicate the whole network (i.e. one copy with deploy_ and the other with train_). Is this the correct way?

The following snippet is not working for me.

layer {
  name: "train_data"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "XXX/LMDB/TrainingImages"
    batch_size: 2
    crop_size: 100
    mirror: true
    backend: LMDB
  }
}
layer {
  name: "train_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "XXX/LMDB/TrainingLabels"
    batch_size: 2
    backend: LMDB
  }
}
layer {
  name: "deploy_data"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "XXX/LMDB/TestingImages"
    batch_size: 1
    crop_size: 100
    mirror: true
    backend: LMDB
  }
}
layer {
  name: "deploy_label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "XXX/LMDB/TestingLabels"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "train_conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  include { phase: TRAIN }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 100
    kernel_size: 3
    stride: 1
  }
}

layer {
  name: "deploy_conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  include { phase: TEST }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 100
    kernel_size: 3
    stride: 1
  }
}

...........................
...........................
...........................

layer {
  name: "train_loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  include {
    phase: TRAIN
  }
  loss_param {
    ignore_label: 0
    normalize: false
  }
}
layer {
  name: "deploy_accuracy"
  type: "Accuracy"
  bottom: "score"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}

@gheinrich
Contributor

Hi @khurram-amin, you should preferably use include and exclude statements. See here and there for examples.
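
Roughly, those examples select layers per stage with include/exclude rules like the following (a sketch; the stage names "val" and "deploy" reflect the all-in-one network convention from #628 and may differ in the linked examples):

layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "score"
  bottom: "label"
  top: "accuracy"
  # only run the accuracy layer during validation
  include { stage: "val" }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  # keep the loss layer out of the deploy network
  exclude { stage: "deploy" }
}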

@sulthanashafi

screen shot 2017-04-07 at 10 41 32 pm (attached screenshot)
Please, can someone figure out the problem?
My mAP is always 0, and I am trying to change the stride to a lower value in order to detect small objects. Please suggest a way.

@lukeyeager
Member

Try reading through this popup and see if it helps: #628 (comment)
(You can also get to it by clicking on the little question mark next to "Visualize")
