clustermq with multiprocess on Windows drops to single thread after some number of tasks #1305
Comments
Do you have access to other systems to try this? Multiprocess is fairly new in clustermq.
Lemme see if I can get it running on my Linux box. Any pointers as to what to look for or ways to try to trigger an issue like this? (Mainly, any way to try to detect where a subtarget ran?)
Sorry, this could be caused by any number of factors, and I don’t know how I would detect it. My only guesses are a weird Windows-specific security/firewall thing or that multiprocess in clustermq is new enough to be rough around the edges. Don’t know if either is true.
FYI, I've tried to make a few different plans that could possibly be reprexes, and I've not yet succeeded. I've not yet had a chance to try to run it on Linux.
As I've looked at this more, I think it may not be an issue after all. Data transfer may be happening between the main and worker processes, so the workers do not appear busy by CPU. On top of that, I have a couple of targets for which the transfer to the workers appears to take about as long as the operation itself (it is still useful, because it keeps me within the memory limits of the system). I'll keep an eye out for this one, and if it pops back up in a way that appears to be real, I'll either come back here or open a new issue with an actual reprex.
If all the workers use the same file system,
Ah, that could be better. (In my scenario, they are on the same filesystem.) If it is definitively faster, could it be automatically detected?
The bottlenecks here depend on the system, so I don't think this can be detected or automated.
That's fair. In my case, it does appear to speed things up and reduce memory usage significantly.
The issue resurfaced again today. This time, I was patient and let it run to completion with a single thread, and I got the following info at the end. (Note: "report_cts" is the name of the ultimate target.)

```
> target report_cts
Warning in super$cleanup(quiet = quiet, timeout = timeout) :
  2/2 workers did not shut down properly
Master: [12783.4s 0.1% CPU]; Worker: [avg 99.0% CPU, max 12076.6 Mb]
Warning in self$finalize(quiet = (quiet || self$workers_running == 0)) :
  Unclean shutdown for PIDs: 22592, 18204
```

Does that give any hints? (If not, feel free to re-close this. I still don't have a reprex.)
Maybe @mschubert can confirm, but that warning does not point to anything specific. Do you have the worker logs?
I'm afraid I can only help if you provide the worker logs.
How do I get worker logs? My quick look suggests that I don't have them, based on the fact that I don't immediately see a .log file in my working directory and what I see in the clustermq user guide. Is there a way to recreate the logs after the fact? I still have the session open, and it has not been touched since that error message.
You'll need to call
I just exposed
@billdenney, any update on the worker logs?
I haven't been able to replicate the problem with worker logging enabled yet. I have some more time today to try again. Update on 2020-08-19: I'm still working on this. The project was in an intermediate state and not able to be run, so I'm working through that, and then I'm going to try rerunning with an empty cache.
I've not been able to replicate the behavior despite several attempts. I will now leave this issue closed unless I can generate a reprex or, at minimum, provide logs that appear to be informative.
Prework
- I have read `drake`'s code of conduct.
- I have confirmed that this describes a genuine bug in `drake` and most likely not a user error. (If you run into an error and do not know the cause, please submit a "Trouble" issue instead.)
Description
I have a plan that takes a relatively long time to complete (probably 12-24 hours), so I have tried to break it up into more dynamic targets and then use clustermq to make it faster. Overall, I have 10 dynamic targets with 23 sub-targets each. Most of the sub-targets take about 5-20 minutes to complete (probably not that important, but just in case it is...).
After the first few dynamic targets complete, at some point that I've not yet identified, it switches to using just one of the clustermq tasks, so it is effectively running serially. It definitely runs in parallel for the first 2 dynamic targets, and maybe more.
I have confirmed in the Windows Task Manager that only one task is taking up the processor, and I see that the number of processes waiting for tasks matches the `jobs` argument from `make()` (specifically, with `make(..., parallelism = "clustermq", jobs = 3)`, there are three R processes in addition to the interactive R process). Those 3 processes were all in use when I started `make()`, but when I came back about an hour later, only one was in use. When I restarted R and `make()`, it went back to using 3 jobs (as confirmed by CPU usage per process).
Reproducible example
I haven't found a way to make a good reprex yet. Is there a way that I can detect which clustermq backend a target or subtarget is running on? The simple-to-run targets complete so quickly that I can't confirm if one or more than one backend is used, and I can't share the data set for my long-running task.
Some of the code parts that I think are relevant are:
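As a rough sketch (not the author's exact code), the setup described in this issue would look something like the following, assuming `plan` is the drake plan object holding the 10 dynamic targets:

```r
# Assumed sketch of the setup described in this issue, not the author's code.
# Select clustermq's local multiprocess backend before calling make().
options(clustermq.scheduler = "multiprocess")

library(drake)

# `plan` is assumed to be the drake plan with the 10 dynamic targets.
# Launch 3 persistent clustermq workers, as quoted in the description above.
make(plan, parallelism = "clustermq", jobs = 3)
```

The `options()` call selects the backend globally for the session, so it has to run before `make()` starts its workers.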
Expected result
Three jobs should have been running simultaneously.
Session info