This repository was archived by the owner on Jan 7, 2025. It is now read-only.

Inference jobs #573

Merged
merged 1 commit on Feb 23, 2016

Conversation

gheinrich (Contributor)

Move inference to separate job

Motivation

  • Allow inference to run on a different machine within the cluster
  • Allow long inference jobs to report progress through SocketIO
  • Allow inference to reserve resources (GPUs)
  • Image resizing is done outside of the server context
  • Provide a command-line interface for inference that uses exactly the same code path as DIGITS

Features

No new features; SocketIO updates will be implemented in a separate pull request.

Summary of changes

  • A new job and task are created for inference
  • To limit changes to the source code, the existing inference API from the model job/task is re-used
  • No code duplication between Caffe/Torch, Classification/Generic
  • A new inference.py tool performs image resizing and inference
  • Results are communicated back to DIGITS through an HDF5 file on the filesystem (a sketch follows this list)
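
For illustration, a minimal sketch of that HDF5 hand-off, assuming h5py; the dataset names ("input_ids", "outputs/<layer>") are hypothetical, not the actual layout used by inference.py:

    import h5py
    import numpy as np

    def write_results(db_path, input_ids, outputs):
        # Inference tool side: one dataset per output blob/layer.
        with h5py.File(db_path, 'w') as db:
            db.create_dataset('input_ids', data=np.array(input_ids))
            for layer_name, data in outputs.items():
                db.create_dataset('outputs/%s' % layer_name, data=data)

    def read_results(db_path):
        # DIGITS server side: read everything back once the job completes.
        with h5py.File(db_path, 'r') as db:
            input_ids = db['input_ids'][...]
            outputs = {name: db['outputs/%s' % name][...] for name in db['outputs']}
        return input_ids, outputs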

Progress

  • Caffe single-image classification
  • Torch single-image classification
  • Caffe single-image inference
  • Torch single-image inference
  • Caffe multiple-image classification
  • Torch multiple-image classification
  • Caffe multiple-image inference
  • Torch multiple-image inference
  • Caffe layer visualizations
  • Torch layer visualizations
  • Delete job when done
  • Code Clean-up
  • Pass nosetests
  • Fix code coverage in presence of sub-processes
  • Fix nosetest slowness

@gheinrich gheinrich force-pushed the dev/separate-inference branch from 7af5168 to b7d3875 Compare February 15, 2016 14:04
@gheinrich (Contributor, Author)

A couple of issues:

Moving inference to a dedicated job adds a significant amount of latency

Loading the model and classifying one image with LeNet takes less than 0.2s on my machine; however, the whole process of running the job and reporting results back to the user takes over three seconds.

First, there is a 1-second delay (when not in test mode) before starting the job, followed by a couple of random 0.3s-0.5s delays to move the scheduler state machine to a state where tasks can be started. Then the inference sub-process takes 1.1s, of which 0.5s is spent loading the DIGITS config. Most of the remaining time is spent starting the Python process and importing packages.

Even in test mode, the overhead is significant, as it takes over 20 minutes to run the nose test suite (Travis test time is now about 40 minutes).

Possible areas of improvement would be to reduce the delays and to rework the code so the DIGITS config does not have to be loaded from the inference sub-process. Is this something we need to do before merging this pull request?

Code coverage down as sub-processes not taken into account

Two solutions are presented there; neither sounds particularly elegant.
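
For reference, the standard coverage.py recipe for measuring sub-processes (one of the hacks mentioned above, paraphrased) is to point COVERAGE_PROCESS_START at the coverage config and have every Python process call coverage.process_startup() via a sitecustomize.py or .pth file on its path:

    # sitecustomize.py (somewhere on the Python path of the sub-processes)
    import coverage
    coverage.process_startup()

The .coveragerc then needs parallel = True under [run], COVERAGE_PROCESS_START must be exported in the environment before the tests run, and the per-process data files are merged afterwards with coverage combine.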

@gheinrich gheinrich changed the title Inference jobs [DON'T MERGE] Inference jobs Feb 15, 2016
@gheinrich gheinrich force-pushed the dev/separate-inference branch from b7d3875 to 719d583 Compare February 15, 2016 21:38
@lukeyeager (Member)

Neither of those issues seems like a showstopper to me.

Speed

That all sounds correct to me. Nice profiling work, what did you use to get the timings?

  • 1-second delay - we could try to work around this with some more clever logic for dealing with SocketIO message race conditions. The delay is a hack we should eliminate at some point anyway.
  • 0.3-0.5 second delays - It'd be cool to move to a more event-based scheduler that didn't require all those calls to sleep().
  • Subprocess loading delays - I don't think there's any way around that, unfortunately.
  • Docker startup time - we haven't even gotten to this yet, but I expect it may be substantial despite what @3XX0 promises me 😉.
Code coverage

Bummer, I hadn't thought of that. Have you tried any of those hacks they suggested? They would be nice to have for our other worker processes (like create_db or parse_folder), too.

@gheinrich (Contributor, Author)

Nice profiling work, what did you use to get the timings?

Oh, nothing particularly clever, sorry! Just a code review and a few print statements. I'm feeling dumb :-)
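
(In sketch form, that kind of print-statement timing is nothing more than the following; load_model() is a hypothetical stand-in, not a DIGITS function:)

    import time

    start = time.time()
    net = load_model()  # hypothetical stand-in for loading the trained model
    print('load model: %.3fs' % (time.time() - start))

    start = time.time()
    outputs = net.forward()  # single-image forward pass, pycaffe-style
    print('forward pass: %.3fs' % (time.time() - start))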

@lukeyeager (Member)

Oh, nothing particularly clever, sorry! Just a code review and a few print statements.

Haha that works too 😏

@gheinrich gheinrich force-pushed the dev/separate-inference branch from 719d583 to 41be2df Compare February 16, 2016 20:52

from digits.utils import subclass

@subclass
@lukeyeager (Member)

FYI @subclass doesn't really do anything if you're not also using @override. But I'm fine with leaving it in if you think it makes the code more readable.
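
To illustrate the point, a generic sketch of how such a @subclass/@override pair typically works together (not the actual digits.utils implementation):

    def override(method):
        # Mark the method as intending to override something in a base class.
        method._is_override = True
        return method

    def subclass(cls):
        # Verify that every method marked with @override really overrides
        # an attribute defined somewhere in the base classes.
        for name, attr in vars(cls).items():
            if callable(attr) and getattr(attr, '_is_override', False):
                if not any(hasattr(base, name) for base in cls.__mro__[1:]):
                    raise TypeError('%s.%s does not override anything'
                                    % (cls.__name__, name))
        return cls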

@lukeyeager (Member)

You've got some UI issues with classify_many:
[screenshot: separate-inference-ui-bug]

@lukeyeager (Member)

Let's remove the "Computing visualizations for ..." print statement to avoid all these warnings:

2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "scale" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "slice_triplet" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv1_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool1_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv2_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool2_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip1_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "relu1" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip2_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "feat_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv2_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool2_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "relu1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip2_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "feat_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "discard" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "loss" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [INFO ] Infer Model task completed.

@lukeyeager (Member)

I've got two related requests, but we may decide to push them back to a later PR.

  1. Can we show the inference page immediately, like we do for other jobs? That would require sending the resulting data to the page through SocketIO.
  2. If we're already sending data over SocketIO, can we send it incrementally as we get it? (A rough sketch follows below.)
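
A rough sketch of what (2) could look like with Flask-SocketIO; the event name and payload fields here are made up, not an existing DIGITS message format:

    from flask_socketio import SocketIO

    socketio = SocketIO()  # in DIGITS this would be the app's existing SocketIO instance

    def emit_partial_result(job_id, image_index, prediction):
        # Push each result to the job's room as soon as it is available,
        # instead of waiting for the whole inference job to finish.
        socketio.emit('job update',
                      {
                          'job_id': job_id,
                          'update': 'inference_result',
                          'index': image_index,
                          'prediction': prediction,
                      },
                      namespace='/jobs',
                      room=job_id)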

self.inference_layers = visualizations

@override
def offer_resources(self, resources):
@lukeyeager (Member)

Can we reserve a GPU for inference, since we are in fact using one? That seems logical, and would have saved me from some confusing out-of-memory errors. I think you can tell pycaffe which one you want to use.

It'd be cool if we could fall back to CPU-only if there are no GPUs available, so you don't have to wait for all the long-running training jobs to finish first. We could still use the inference_task_pool resource to limit the number of CPU inference tasks running at once.
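
A sketch of that fallback idea, assuming the resources dict and resource objects (with identifier and remaining()) behave as they do for the other DIGITS tasks; the resource names and return format are illustrative:

    def offer_resources_sketch(resources):
        # Prefer a free GPU so inference gets a dedicated device.
        for resource in resources.get('gpus', []):
            if resource.remaining() >= 1:
                return {'gpus': [(resource.identifier, 1)]}
        # Otherwise fall back to a CPU slot so inference does not queue
        # behind long-running training jobs.
        for resource in resources.get('inference_task_pool', []):
            if resource.remaining() >= 1:
                return {'inference_task_pool': [(resource.identifier, 1)]}
        return None  # nothing available yet; the scheduler will offer again later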

@lukeyeager (Member)

Looks great now, thanks!

@gheinrich (Contributor, Author)

Hi Luke, I have pushed a few more commits to address your remarks. I will squash everything once that is OK.
If you don't mind, I'd like to address the incremental inference notifications in a separate PR, as this will require more work.

@@ -54,6 +54,14 @@ def upgrade_network(network):
#TODO
pass

@staticmethod
def set_mode(gpu):
if gpu>=0:
@lukeyeager (Member)

Can we use gpu=None and gpu is None instead of gpu=-1 and gpu>=0? Sorry for being nitpicky, but I feel like it's a little less cryptic and more pythonic.

@lukeyeager (Member)

If you don't mind, I'd like to address the incremental inference notifications in a separate PR, as this will require more work.

Yeah that's fine. We'll have to decide what to do about the JSON interface, too.

@lukeyeager (Member)

While we're fixing stuff, can you remove this line:

https://github.com/gheinrich/DIGITS/blob/8b6022162c/digits/model/tasks/caffe_train.py#L1242

As @anuragphadke discovered in #581, it's unnecessary.

@staticmethod
def set_mode(gpu):
if gpu is not None:
caffe.set_mode_gpu()
@lukeyeager (Member)

I think you need to put set_device() BEFORE set_mode_gpu(). Odd.

https://groups.google.com/d/msg/caffe-users/HqdziMqpu6o/2zgy7MK3-hAJ
BVLC/caffe#507
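
With the order swapped as suggested, the helper would look roughly like this (a sketch using the standard pycaffe calls):

    import caffe

    def set_mode(gpu):
        # Select the device before switching to GPU mode, per the threads above.
        if gpu is not None:
            caffe.set_device(gpu)
            caffe.set_mode_gpu()
        else:
            caffe.set_mode_cpu()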

@gheinrich (Contributor, Author)

Rebased/squashed, and I switched the order of set_device() and set_mode_gpu().

@lukeyeager (Member)

The UI leaves something to be desired. The "Edit Name", "Edit Notes" and "Delete Job" buttons all return errors if you try to use them because the job is already deleted. However, I like the efficiency of re-using the job template, and the "Abort" button will make sense once we implement #70 (comment).

@lukeyeager (Member)

Ok, with those last few fixes this looks good to me!

Please squash and merge.

@gheinrich gheinrich force-pushed the dev/separate-inference branch from 35e0373 to 9dba452 Compare February 23, 2016 20:08