force-pushed from 7af5168 to b7d3875
A couple of issues:

1. Moving inference to a dedicated job adds a significant amount of latency. Loading the model and classifying one image on LeNet takes less than 0.2s on my machine, yet the whole process of running the job and reporting results back to the user takes over three seconds. First there is a 1-second delay (when not in test mode) before starting the job, followed by a couple of random 0.3s-0.5s delays to move the scheduler state machine into a state where tasks can be started. Then the inference sub-process takes 1.1s, of which 0.5s is spent loading the DIGITS config. Most of the remaining time is spent starting the Python process and importing packages. Even in test mode the overhead is significant: it now takes over 20 minutes to run the nose test suite (Travis test time is now about 40 minutes). Possible areas of improvement: reduce the delays, and rework the code so the inference sub-process does not have to load the DIGITS config. Is this something we need to do before merging this Pull Request?

2. Code coverage is down because sub-processes are not taken into account. Two solutions are presented there; neither sounds particularly elegant.
force-pushed from b7d3875 to 719d583
Neither of those issues seems like a showstopper to me.

Speed: That all sounds correct to me. Nice profiling work, what did you use to get the timings?

Code coverage: Bummer, I hadn't thought of that. Have you tried any of those hacks they suggested? They would be nice to have for our other worker processes (like …).
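For reference, coverage.py documents a mechanism for measuring sub-processes (this may or may not be one of the hacks referred to above): point the `COVERAGE_PROCESS_START` environment variable at a coverage config file and call `coverage.process_startup()` very early in each child interpreter, typically from a `sitecustomize.py` on the workers' Python path. A sketch, with hypothetical paths:

```python
# sitecustomize.py -- must be importable by the worker sub-processes.
# Assumes coverage.py is installed in the workers' environment.
import coverage

# Starts coverage measurement in this (sub)process, but only if the
# COVERAGE_PROCESS_START environment variable is set; otherwise a no-op.
coverage.process_startup()
```

Before launching the test suite you would then do something like `export COVERAGE_PROCESS_START=/path/to/.coveragerc` so that every spawned Python process records its own coverage data, to be combined later with `coverage combine`.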
Oh, nothing particularly clever, sorry! Just a code review and a few …

Haha that works too 😏
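The snippet referred to above is elided in the conversation, but "nothing particularly clever" timings like these are usually just a few `time.time()` deltas around the steps of interest. A hedged sketch (not the actual code used; `load_model` and the labels are placeholders):

```python
import time

def timed(label, fn, *args, **kwargs):
    # Run fn, print how long it took, and pass its result through.
    # Crude, but enough to attribute seconds to individual steps.
    start = time.time()
    result = fn(*args, **kwargs)
    print('%s: %.3fs' % (label, time.time() - start))
    return result

# e.g. net = timed('load model', load_model, model_path)
#      out = timed('classify',  net.forward)
```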
force-pushed from 719d583 to 41be2df
```python
from digits.utils import subclass

@subclass
```
FYI `@subclass` doesn't really do anything if you're not also using `@override`. But I'm fine with leaving it in if you think it makes the code more readable.
Let's remove the …
I've got two related requests, but we may decide to push them back to a later PR.
```python
self.inference_layers = visualizations

@override
def offer_resources(self, resources):
```
Can we reserve a GPU for inference, since we are in fact using one? That seems logical, and would have saved me from some confusing out-of-memory errors. I think you can tell pycaffe which one you want to use.

It'd be cool if we could fall back to CPU-only if there are no GPUs available, so you don't have to wait for all the long-running training jobs to finish first. We could still use the `inference_task_pool` resource to limit the number of CPU inference tasks running at once.
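The GPU-or-CPU-fallback reservation being requested could look roughly like this. This is a simplified sketch, not DIGITS's actual scheduler code: the resource names `gpus` and `inference_task_pool` come from the discussion, but the dict-of-(identifier, remaining)-pairs shape is an assumption made here so the logic is self-contained:

```python
def offer_resources(resources, need_gpu=True):
    """Try to reserve one whole GPU; fall back to a CPU slot if none are free.

    `resources` maps resource names to lists of (identifier, remaining)
    pairs -- a simplified stand-in for the scheduler's resource table.
    Returns the reservation to claim, or None to leave the task queued.
    """
    if need_gpu:
        for gpu_id, remaining in resources.get('gpus', []):
            if remaining > 0:
                # Reserve the GPU exclusively for this inference task,
                # avoiding the confusing out-of-memory errors.
                return {'gpus': [(gpu_id, 1)]}
    # No free GPU: fall back to the CPU inference pool, which still
    # bounds how many CPU inference tasks run at once.
    for slot_id, remaining in resources.get('inference_task_pool', []):
        if remaining > 0:
            return {'inference_task_pool': [(slot_id, 1)]}
    return None
```

With all GPUs busy the task degrades to a pool slot instead of waiting behind long-running training jobs, which is the behavior asked for above.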
Looks great now, thanks!
Hi Luke, I have pushed a few more commits to address your remarks. I will squash everything once that is OK.
```
@@ -54,6 +54,14 @@ def upgrade_network(network):
    #TODO
    pass

@staticmethod
def set_mode(gpu):
    if gpu>=0:
```
Can we use `gpu=None` and `gpu is None` instead of `gpu=-1` and `gpu>=0`? Sorry for being nitpicky, but I feel like it's a little less cryptic and more pythonic.
Yeah that's fine. We'll have to decide what to do about the JSON interface, too.
While we're fixing stuff, can you remove this line: https://github.com/gheinrich/DIGITS/blob/8b6022162c/digits/model/tasks/caffe_train.py#L1242 As @anuragphadke discovered in #581, it's unnecessary.
```python
@staticmethod
def set_mode(gpu):
    if gpu is not None:
        caffe.set_mode_gpu()
```
I think you need to put `set_device()` BEFORE `set_mode_gpu()`. Odd.

https://groups.google.com/d/msg/caffe-users/HqdziMqpu6o/2zgy7MK3-hAJ
BVLC/caffe#507
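Putting the two review points together, the corrected method would look roughly like this. The sketch below injects the caffe module as a parameter purely so the call ordering is visible and testable without a caffe install; the real code calls `caffe` directly:

```python
def set_mode(gpu, backend):
    # backend stands in for the pycaffe module (caffe).
    # gpu is None -> CPU mode; an integer -> use that GPU id.
    if gpu is not None:
        backend.set_device(gpu)   # must run BEFORE set_mode_gpu()
        backend.set_mode_gpu()
    else:
        backend.set_mode_cpu()
```

The `None` sentinel replaces the `gpu=-1` convention, and `set_device()` is called before `set_mode_gpu()` per the caffe-users thread linked above.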
force-pushed from 8c006a2 to 0802d96
Rebased/squashed, and I switched the order of …
The UI leaves something to be desired. The "Edit Name", "Edit Notes" and "Delete Job" buttons all return errors if you try to use them, because the job is already deleted. However, I like the efficiency of re-using the job template, and the "Abort" button will make sense once we implement #70 (comment).
OK, with those last few fixes this looks good to me! Please squash and merge.
force-pushed from 35e0373 to 9dba452
Move inference to separate job
Motivation
Features
No new features; socketio updates will be implemented in a separate pull request.
Summary of changes
Progress