
New sanity check for "invalid bottom" blobs breaks valid networks #601

Closed
Shadfield opened this issue Feb 25, 2016 · 14 comments

@Shadfield

Revision d3cbdca introduced a new sanity check into digits/model/tasks/caffe_train.py, from line 1515 onwards.

I don't understand the code well enough to pinpoint the error further than this, but there seems to be some problem with the implementation of the sanity check.

Below is an MWE network definition. In previous revisions of DIGITS this executes successfully and plots the performance on the training and validation (test) data. Obviously this MWE has nothing to learn, so the error would be constant for all epochs.

In the new revision of DIGITS the training process quits before it even starts, with the error message:

ERROR: Layer 'flatlabel' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.

Traceback (most recent call last):
  File "/home/mofo/digits_new/digits/scheduler.py", line 482, in run_task
    task.run(resources)
  File "/home/mofo/digits_new/digits/task.py", line 179, in run
    self.before_run()
  File "/home/mofo/digits_new/digits/model/tasks/caffe_train.py", line 133, in before_run
    self.save_files_generic()
  File "/home/mofo/digits_new/digits/model/tasks/caffe_train.py", line 669, in save_files_generic
    CaffeTrainTask.net_sanity_check(deploy_network, caffe_pb2.TEST)
  File "/home/mofo/digits_new/digits/model/tasks/caffe_train.py", line 1451, in net_sanity_check
    layer.name, bottom, "TRAIN" if phase == caffe_pb2.TRAIN else "TEST"))
CaffeTrainSanityCheckError: Layer 'flatlabel' references bottom 'label' at the TEST stage however this blob is not included at that stage. Please consider using an include directive to limit the scope of this layer.

The sanity check is complaining that the flatten layer requires the label blob, even though the label blob is available in both train and test modes. I've also tried including two flatten layers (one specified with "phase: TEST" and the other with "phase: TRAIN"), with no luck.

If we remove the flatten layer from the MWE (so label is connected straight to the loss layer) it passes the sanity check, and correctly computes the validation loss on the test data. So the label blob is definitely available at test time, even in the most recent DIGITS revision.

MWE network definition copied from the train_val.prototxt that DIGITS creates, with filepaths anonymized.

layer {
  name: "data"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "xxx/train/images.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "xxx/train/labels.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "xxx/test/images.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "xxx/test/labels.lmdb"
    batch_size: 1000
    backend: LMDB
  }
}
layer {
  name: "flatlabel"
  type: "Flatten"
  bottom: "label"
  top: "flatlabel"
}
layer {
  name: "loss"
  type: "EuclideanLoss"
  bottom: "data"
  bottom: "flatlabel"
  top: "l2_error"
}
@gheinrich
Contributor

Thanks for the report. The check was introduced to catch cases when one would want to use the label blob during inference (i.e. in the deploy network). When you test a single image with your model (before the sanity check was introduced), aren't you getting an error like:

Unknown bottom blob 'label' (layer 'flatlabel', bottom index 0)

A way to solve your problem would be to add this to your flatlabel and loss layers:

include { phase: TRAIN }
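
Applied to the MWE above, the flatten layer would look something like the sketch below (the loss layer would get the same include block):

layer {
  name: "flatlabel"
  type: "Flatten"
  bottom: "label"
  top: "flatlabel"
  # limit this layer to the training phase so the deploy network never references it
  include {
    phase: TRAIN
  }
}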

However that would mean you wouldn't see the validation loss and this is probably not what you want.

There is already a test in DIGITS to prevent loss functions from being added to the deploy network; however, your case introduces an intermediate layer, and I think this case wasn't really anticipated in DIGITS... it would be nice if we could discriminate between the training, validation and test phases, as @lukeyeager suggested on BVLC/caffe#1245 (comment).

@lukeyeager
Member

We could add the train_ and deploy_ hack for classification networks as well as generic ones:
https://github.com/NVIDIA/DIGITS/blob/v3.2.0/digits/model/tasks/caffe_train.py#L579-L584

It's hacky, but it works.

@gheinrich
Contributor

So with this change @Shadfield would just need to rename her/his flatlabel layer to e.g. train_val_flatlabel?

@lukeyeager
Member

To use a layer for training and validation but not deploy, he'd set the name to train_flatlabel and leave the phase unset.

The way it works now is:

Desired phase            Layer name        included phase
-----------------------  ----------------  ----------------
All phases               mylayer           unset
Training + Validation    train_mylayer     unset
Validation + Deploy      mylayer           TEST
Training only            train_mylayer     TRAIN
Validation only          train_mylayer     TEST
Deploy only              deploy_mylayer    TEST (or unset)

Not exactly what I'd call obvious.
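
Concretely, for the MWE above the flatten layer would be renamed roughly like this (a sketch, assuming the prefix hack is extended to generic networks as proposed):

layer {
  # the train_ prefix asks DIGITS to keep this layer for training and validation
  # but strip it from the deploy network; the phase is deliberately left unset
  name: "train_flatlabel"
  type: "Flatten"
  bottom: "label"
  top: "flatlabel"
}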

@Shadfield
Author

@lukeyeager One thing which I perhaps didn't make clear: I was testing with generic models, not classification networks.

In answer to the earlier question by @gheinrich: yes when doing "test a single image" in the earlier DIGITS revision, I get the caffe error you pointed out (and it crashes the DIGITS server).

Regarding the proposed solution: I previously thought the train_ layer prefix was the same as using include { phase: TRAIN }, but the table above shows that's wrong. I had already tried include { phase: TRAIN } as suggested by @gheinrich, but as he also points out, I don't see the validation results (as the table says I shouldn't).

According to the table it seems like the train_ prefix should be the solution. But when I try this I get the error below.

new_queue_pairs_.size() == 0 (1 vs. 0)
Creating layer data
Creating Layer data
data -> data
Check failed: new_queue_pairs_.size() == 0 (1 vs. 0)
Waiting for data

I've attached the logfile, train_val and deploy files generated by DIGITS (the "train_" prefix has been automatically removed). Looking at the logfile, it seems like it built the train network successfully and died while building the test network (which still has the flatten layer in). Maybe someone else could try to replicate this and see if the prefix works for them?
MWE_flat_label.zip

@lukeyeager
Member

Check BVLC/caffe#3394

  // Check no additional readers have been created. This can happen if
  // more than one net is trained at a time per process, whether single
  // or multi solver. It might also happen if two data layers have same
  // name and same source.

@lukeyeager
Member

Looks like all four of your data layers are reading from the same source? That can't be what you want.

@twistedmage

Oh, sorry, that's my mistake. After moving back and forth between different revisions of DIGITS, I tried to save time creating the dataset by re-using the same LMDB file for all four data sources, just for the purposes of this MWE.

When I create a proper dataset and use the suggested train_ prefix, everything works in the newest version of DIGITS. It passes the sanity check and I'm also able to see the validation performance during training.

I suppose this "bug report" was a bit of a waste of everyone's time; there was a working solution, even if it wasn't obvious! Perhaps instead I could make a feature request: to stop other people running into the same difficulties, maybe it would be useful to unify the behaviour of include phases and prefixes? And also to add "validation" as a third option, similar to @gheinrich's suggestion of train_val_flatlabel above.

@lukeyeager
Member

Replacing the entire phase/stage framework with layer name prefixes would make me sad. But you're right - we could add even more hackery to make that option more usable.

I'm still crossing my fingers for BVLC/caffe#3211 (comment).

I suppose this "bug report" was a bit of a waste of everyones time

Not at all! If I don't know something is bothering you, I can't fix it!

gheinrich added a commit to gheinrich/DIGITS that referenced this issue Feb 26, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
gheinrich added a commit to gheinrich/DIGITS that referenced this issue Feb 29, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
lukeyeager pushed a commit to lukeyeager/DIGITS that referenced this issue Mar 9, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
lukeyeager pushed a commit to lukeyeager/DIGITS that referenced this issue Mar 10, 2016
In case someone hits a problem like that mentioned in NVIDIA#601 for a classification network.
Once Caffe implements input layers and phase control from Python we should be able to remove those workarounds.
@lukeyeager
Member

Closing this issue. You should be able to get around any failing sanity checks with the new power of #628, and the new_queue_pairs_.size() error was caused by improper usage of Data layers.

@khurram-amin

khurram-amin commented Aug 29, 2016

@lukeyeager Can you please put up a tutorial regarding this sanity-check feature? Does one need to append train_ (or deploy_) only to the name field, or does the top field also need to be changed?
I have tried (almost) all combinations and none works for me. Another option that may work is to essentially duplicate the whole network (i.e. one copy with deploy_ and the other with train_). Is this the correct way?

The following snippet is not working for me.

layer {
  name: "train_data"
  type: "Data"
  top: "data"
  include {
    phase: TRAIN
  }
  data_param {
    source: "XXX/LMDB/TrainingImages"
    batch_size: 2
    crop_size: 100
    mirror: true
    backend: LMDB
  }
}
layer {
  name: "train_label"
  type: "Data"
  top: "label"
  include {
    phase: TRAIN
  }
  data_param {
    source: "XXX/LMDB/TrainingLabels"
    batch_size: 2
    backend: LMDB
  }
}
layer {
  name: "deploy_data"
  type: "Data"
  top: "data"
  include {
    phase: TEST
  }
  data_param {
    source: "XXX/LMDB/TestingImages"
    batch_size: 1
    crop_size: 100
    mirror: true
    backend: LMDB
  }
}
layer {
  name: "deploy_label"
  type: "Data"
  top: "label"
  include {
    phase: TEST
  }
  data_param {
    source: "XXX/LMDB/TestingLabels"
    batch_size: 1
    backend: LMDB
  }
}

layer {
  name: "train_conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  include { phase: TRAIN }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 100
    kernel_size: 3
    stride: 1
  }
}

layer {
  name: "deploy_conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  include { phase: TEST }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 64
    pad: 100
    kernel_size: 3
    stride: 1
  }
}

...........................
...........................
...........................

layer {
  name: "train_loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  include {
    phase: TRAIN
  }
  loss_param {
    ignore_label: 0
    normalize: false
  }
}
layer {
  name: "deploy_accuracy"
  type: "Accuracy"
  bottom: "score"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}

@gheinrich
Contributor

Hi @khurram-amin, you should preferably use include and exclude statements. See here and there for examples.
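
Roughly, those examples select layers per stage with include/exclude rules like the following (a sketch; the stage names "val" and "deploy" reflect the all-in-one network convention from #628 and may differ in the linked examples):

layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "score"
  bottom: "label"
  top: "accuracy"
  # only run the accuracy layer during validation
  include { stage: "val" }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "score"
  bottom: "label"
  top: "loss"
  # keep the loss layer out of the deploy network
  exclude { stage: "deploy" }
}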

@sulthanashafi

screen shot 2017-04-07 at 10 41 32 pm (attached screenshot)
Please, can someone figure out the problem?
My mAP is always 0, and I am trying to change the stride to a lower value in order to detect small objects. Please suggest a way.

@lukeyeager
Member

Try reading through this popup and see if it helps: #628 (comment)
(You can also get to it by clicking on the little question mark next to "Visualize")
