This repository was archived by the owner on Jan 7, 2025. It is now read-only.

Inference jobs #573

Merged
merged 1 commit on Feb 23, 2016

Conversation

gheinrich (Contributor)

Move inference to separate job

Motivation

  • Allow inference to run on a different machine within the cluster
  • Allow long inference jobs to report progress through SocketIO
  • Allow inference to reserve resources (GPUs)
  • Image resizing is done outside of the server context
  • Provide a command-line interface for inference that uses exactly the same code path as DIGITS

Features

No new features; SocketIO updates will be implemented in a separate pull request.

Summary of changes

  • A new job and task are created for inference
  • To limit changes to the source code, the existing inference API from the model job/task is re-used
  • No code duplication between Caffe/Torch, Classification/Generic
  • A new inference.py tool performs image resizing and inference
  • Results are communicated back to DIGITS through an HDF5 file on the filesystem (a sketch follows this list)
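
For illustration, a minimal sketch of that HDF5 hand-off, assuming h5py; the dataset names ("input_ids", "outputs/<layer>") are hypothetical, not the actual layout used by inference.py:

    import h5py
    import numpy as np

    def write_results(db_path, input_ids, outputs):
        # Inference tool side: one dataset per output blob/layer.
        with h5py.File(db_path, 'w') as db:
            db.create_dataset('input_ids', data=np.array(input_ids))
            for layer_name, data in outputs.items():
                db.create_dataset('outputs/%s' % layer_name, data=data)

    def read_results(db_path):
        # DIGITS server side: read everything back once the job completes.
        with h5py.File(db_path, 'r') as db:
            input_ids = db['input_ids'][...]
            outputs = {name: db['outputs/%s' % name][...] for name in db['outputs']}
        return input_ids, outputs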

Progress

  • Caffe single-image classification
  • Torch single-image classification
  • Caffe single-image inference
  • Torch single-image inference
  • Caffe multiple-image classification
  • Torch multiple-image classification
  • Caffe multiple-image inference
  • Torch multiple-image inference
  • Caffe layer visualizations
  • Torch layer visualizations
  • Delete job when done
  • Code Clean-up
  • Pass nosetests
  • Fix code coverage in presence of sub-processes
  • Fix nosetest slowness

@gheinrich gheinrich force-pushed the dev/separate-inference branch from 7af5168 to b7d3875 Compare February 15, 2016 14:04
@gheinrich (Contributor, Author)

A couple of issues:

Moving inference to a dedicated job adds a significant amount of latency

Loading the model and classifying one image with LeNet takes less than 0.2s on my machine; however, the whole process of running the job and reporting results back to the user takes over three seconds.

First, there is a 1-second delay (when not in test mode) before starting the job, followed by a couple of random 0.3s-0.5s delays to move the scheduler state machine to a state where tasks can be started. Then the inference sub-process takes 1.1s, of which 0.5s is spent loading the DIGITS config. Most of the remaining time is spent starting the Python process and importing packages.

Even in test mode, the overhead is significant, as it takes over 20 minutes to run the nose test suite (Travis test time is now about 40 minutes).

Possible areas of improvement would be to reduce the delays and to rework the code so the DIGITS config does not have to be loaded from the inference sub-process. Is this something we need to do before merging this pull request?

Code coverage down as sub-processes not taken into account

Two solutions are presented there; neither sounds particularly elegant.
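
For reference, the standard coverage.py recipe for measuring sub-processes (one of the hacks mentioned above, paraphrased) is to point COVERAGE_PROCESS_START at the coverage config and have every Python process call coverage.process_startup() via a sitecustomize.py or .pth file on its path:

    # sitecustomize.py (somewhere on the Python path of the sub-processes)
    import coverage
    coverage.process_startup()

The .coveragerc then needs parallel = True under [run], COVERAGE_PROCESS_START must be exported in the environment before the tests run, and the per-process data files are merged afterwards with coverage combine.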

@gheinrich gheinrich changed the title Inference jobs [DON'T MERGE] Inference jobs Feb 15, 2016
@gheinrich gheinrich force-pushed the dev/separate-inference branch from b7d3875 to 719d583 Compare February 15, 2016 21:38
@lukeyeager (Member)

Neither of those issues seems like a showstopper to me.

Speed

That all sounds correct to me. Nice profiling work, what did you use to get the timings?

  • 1-second delay - we could try to work around this with some more clever logic for dealing with SocketIO message race conditions. The delay is a hack we should eliminate at some point anyway.
  • 0.3-0.5 second delays - It'd be cool to move to a more event-based scheduler that didn't require all those calls to sleep().
  • Subprocess loading delays - I don't think there's any way around that, unfortunately.
  • Docker startup time - we haven't even gotten to this yet, but I expect it may be substantial despite what @3XX0 promises me 😉.
Code coverage

Bummer, I hadn't thought of that. Have you tried any of those hacks they suggested? They would be nice to have for our other worker processes (like create_db or parse_folder), too.

@gheinrich (Contributor, Author)

Nice profiling work, what did you use to get the timings?

Oh, nothing particularly clever, sorry! Just a code review and a few print statements. I'm feeling dumb :-)
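
(In sketch form, that kind of print-statement timing is nothing more than the following; load_model() is a hypothetical stand-in, not a DIGITS function:)

    import time

    start = time.time()
    net = load_model()  # hypothetical stand-in for loading the trained model
    print('load model: %.3fs' % (time.time() - start))

    start = time.time()
    outputs = net.forward()  # single-image forward pass, pycaffe-style
    print('forward pass: %.3fs' % (time.time() - start))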

@lukeyeager (Member)

Oh, nothing particularly clever, sorry! Just a code review and a few print statements.

Haha that works too 😏

@gheinrich gheinrich force-pushed the dev/separate-inference branch from 719d583 to 41be2df Compare February 16, 2016 20:52

from digits.utils import subclass

@subclass
@lukeyeager (Member)

FYI @subclass doesn't really do anything if you're not also using @override. But I'm fine with leaving it in if you think it makes the code more readable.
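
To illustrate the point, a generic sketch of how such a @subclass/@override pair typically works together (not the actual digits.utils implementation):

    def override(method):
        # Mark the method as intending to override something in a base class.
        method._is_override = True
        return method

    def subclass(cls):
        # Verify that every method marked with @override really overrides
        # an attribute defined somewhere in the base classes.
        for name, attr in vars(cls).items():
            if callable(attr) and getattr(attr, '_is_override', False):
                if not any(hasattr(base, name) for base in cls.__mro__[1:]):
                    raise TypeError('%s.%s does not override anything'
                                    % (cls.__name__, name))
        return cls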

@lukeyeager (Member)

You've got some UI issues with classify_many:
[screenshot: separate-inference-ui-bug]

@lukeyeager (Member)

Let's remove the "Computing visualizations for ..." print statement to avoid all these warnings:

2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "scale" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "slice_triplet" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv1_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool1_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv2_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool2_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip1_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "relu1" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip2_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "feat_left" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "conv2_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "pool2_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "relu1_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "ip2_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "feat_right" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "discard" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [WARNING] Infer Model unrecognized output: Computing visualizations for "loss" ...
2016-02-16 17:52:28 [20160216-175223-c0c4] [INFO ] Infer Model task completed.

@lukeyeager (Member)

I've got two related requests, but we may decide to push them back to a later PR.

  1. Can we show the inference page immediately, like we do for other jobs? That would require sending the resulting data to the page through SocketIO.
  2. If we're already sending data over SocketIO, can we send it incrementally as we get it? (A rough sketch follows below.)
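
A rough sketch of what (2) could look like with Flask-SocketIO; the event name and payload fields here are made up, not an existing DIGITS message format:

    from flask_socketio import SocketIO

    socketio = SocketIO()  # in DIGITS this would be the app's existing SocketIO instance

    def emit_partial_result(job_id, image_index, prediction):
        # Push each result to the job's room as soon as it is available,
        # instead of waiting for the whole inference job to finish.
        socketio.emit('job update',
                      {
                          'job_id': job_id,
                          'update': 'inference_result',
                          'index': image_index,
                          'prediction': prediction,
                      },
                      namespace='/jobs',
                      room=job_id)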

self.inference_layers = visualizations

@override
def offer_resources(self, resources):
@lukeyeager (Member)

Can we reserve a GPU for inference, since we are in fact using one? That seems logical, and would have saved me from some confusing out-of-memory errors. I think you can tell pycaffe which one you want to use.

It'd be cool if we could fall back to CPU-only if there are no GPUs available, so you don't have to wait for all the long-running training jobs to finish first. We could still use the inference_task_pool resource to limit the number of CPU inference tasks running at once.
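
A sketch of that fallback idea, assuming the resources dict and resource objects (with identifier and remaining()) behave as they do for the other DIGITS tasks; the resource names and return format are illustrative:

    def offer_resources_sketch(resources):
        # Prefer a free GPU so inference gets a dedicated device.
        for resource in resources.get('gpus', []):
            if resource.remaining() >= 1:
                return {'gpus': [(resource.identifier, 1)]}
        # Otherwise fall back to a CPU slot so inference does not queue
        # behind long-running training jobs.
        for resource in resources.get('inference_task_pool', []):
            if resource.remaining() >= 1:
                return {'inference_task_pool': [(resource.identifier, 1)]}
        return None  # nothing available yet; the scheduler will offer again later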

@lukeyeager (Member)

Looks great now, thanks!

@gheinrich (Contributor, Author)

Hi Luke, I have pushed a few more commits to address your remarks. I will squash everything once that is OK.
If you don't mind, I'd like to address the incremental inference notifications in a separate PR, as this will require more work.

@@ -54,6 +54,14 @@ def upgrade_network(network):
#TODO
pass

@staticmethod
def set_mode(gpu):
if gpu>=0:
@lukeyeager (Member)

Can we use gpu=None and gpu is None instead of gpu=-1 and gpu>=0? Sorry for being nitpicky, but I feel like it's a little less cryptic and more pythonic.

@lukeyeager (Member)

If you don't mind, I'd like to address the incremental inference notifications in a separate PR, as this will require more work.

Yeah that's fine. We'll have to decide what to do about the JSON interface, too.

@lukeyeager (Member)

While we're fixing stuff, can you remove this line:

https://github.com/gheinrich/DIGITS/blob/8b6022162c/digits/model/tasks/caffe_train.py#L1242

As @anuragphadke discovered in #581, it's unnecessary.

@staticmethod
def set_mode(gpu):
if gpu is not None:
caffe.set_mode_gpu()
@lukeyeager (Member)

I think you need to put set_device() BEFORE set_mode_gpu(). Odd.

https://groups.google.com/d/msg/caffe-users/HqdziMqpu6o/2zgy7MK3-hAJ
BVLC/caffe#507
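
With the order swapped as suggested, the helper would look roughly like this (a sketch using the standard pycaffe calls):

    import caffe

    def set_mode(gpu):
        # Select the device before switching to GPU mode, per the threads above.
        if gpu is not None:
            caffe.set_device(gpu)
            caffe.set_mode_gpu()
        else:
            caffe.set_mode_cpu()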

@gheinrich (Contributor, Author)

Rebased/squashed, and I switched the order of set_device() and set_mode_gpu().

@lukeyeager (Member)

The UI leaves something to be desired. The "Edit Name", "Edit Notes" and "Delete Job" buttons all return errors if you try to use them because the job is already deleted. However, I like the efficiency of re-using the job template, and the "Abort" button will make sense once we implement #70 (comment).

@lukeyeager (Member)

Ok, with those last few fixes this looks good to me!

Please squash and merge.

@gheinrich gheinrich force-pushed the dev/separate-inference branch from 35e0373 to 9dba452 Compare February 23, 2016 20:08