clustermq with multiprocess on Windows drops to single thread after some number of tasks #1305
Comments
Do you have access to other systems to try this? Multiprocess is fairly new in clustermq.
Lemme see if I can get it running on my Linux box. Any pointers as to what to look for or ways to try to trigger an issue like this? (Mainly, any way to try to detect where a subtarget ran?)
Sorry, this could be caused by any number of factors, and I don’t know how I would detect it. My only guesses are a weird Windows-specific security/firewall thing or that multiprocess in clustermq is new enough to be rough around the edges. Don’t know if either is true.
FYI, I've tried to make a few different plans that could possibly be reprexes, and I've not yet succeeded. I've not yet had a chance to try to run it on Linux.
As I've looked at this more, I think it may not be an issue after all. Data transfer may be happening between the main and worker processes, so the workers do not appear busy by CPU. On top of that, I have a couple of targets for which the transfer to the workers appears to take about as long as the operation itself (it is still useful, because it keeps me within the memory limits of the system). I'll keep an eye out for this one, and if it pops back up in a way that appears to be real, I'll either come back here or open a new issue with an actual reprex.
If all the workers use the same file system,
Ah, that could be better. (In my scenario, they are on the same filesystem.) If it is definitively faster, could it be automatically detected?
The bottlenecks here depend on the system, so I don't think this can be detected or automated.
That's fair. In my case, it does appear to speed things up and reduce memory usage significantly.
The issue resurfaced again today. This time, I was patient and let it run to completion with a single thread, and I got the following info at the end. (Note: "report_cts" is the name of the ultimate target.)

```
> target report_cts
Warning in super$cleanup(quiet = quiet, timeout = timeout) :
  2/2 workers did not shut down properly
Master: [12783.4s 0.1% CPU]; Worker: [avg 99.0% CPU, max 12076.6 Mb]
Warning in self$finalize(quiet = (quiet || self$workers_running == 0)) :
  Unclean shutdown for PIDs: 22592, 18204
```

Does that give any hints? (If not, feel free to re-close this. I still don't have a reprex.)
Maybe @mschubert can confirm, but that warning does not point to anything specific. Do you have the worker logs?
I'm afraid I can only help if you provide the worker logs.
How do I get worker logs? My quick look suggests that I don't have them, based on the fact that I don't immediately see a .log file in my working directory and what I see in the clustermq user guide. Is there a way to recreate the logs after the fact? I still have the session open, and it has not been touched since that error message.
You'll need to call
I just exposed
@billdenney, any update on the worker logs?
I haven't been able to replicate the problem with worker logging enabled yet. I have some more time today to try again. Update on 2020-08-19: I'm still working on this. The project was in an intermediate state and not able to be run, so I'm working through that, and then I'm going to try rerunning with an empty cache.
I've not been able to replicate the behavior despite several attempts. I will now leave this issue closed unless I can generate a reprex or, at minimum, provide logs that appear to be informative.
Prework
- I have read `drake`'s code of conduct.
- I have confirmed that this describes a genuine bug in `drake` and most likely not a user error. (If you run into an error and do not know the cause, please submit a "Trouble" issue instead.)
Description
I have a plan that takes a relatively long time to complete (probably 12-24 hours), so I have tried to break it up into more dynamic targets and then use clustermq to make it faster. Overall, I have 10 dynamic targets with 23 sub-targets each. Most of the sub-targets take about 5-20 minutes to complete (probably not that important, but just in case it is...).
After the first few dynamic targets complete, at some point that I've not yet identified, it switches to using just one of the clustermq tasks, so it is effectively running serially. It definitely runs in parallel for the first 2 dynamic targets, and maybe more.
I have confirmed in the Windows Task Manager that only one task is taking up the processor, and I see that the number of processes waiting for tasks matches the `jobs` argument from `make()` (specifically, with `make(..., parallelism = "clustermq", jobs = 3)`, there are three R processes in addition to the interactive R process). Those 3 processes were all in use when I started `make()`, but when I came back about an hour later, only one was in use. When I restarted R and `make()`, it went back to using 3 jobs (as confirmed by CPU usage per process).
Reproducible example
I haven't found a way to make a good reprex yet. Is there a way that I can detect which clustermq backend a target or subtarget is running on? The simple-to-run targets complete so quickly that I can't confirm if one or more than one backend is used, and I can't share the data set for my long-running task.
Some of the code parts that I think are relevant are:
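As a rough sketch (not the author's exact code), the setup described in this issue would look something like the following, assuming `plan` is the drake plan object holding the 10 dynamic targets:

```r
# Assumed sketch of the setup described in this issue, not the author's code.
# Select clustermq's local multiprocess backend before calling make().
options(clustermq.scheduler = "multiprocess")

library(drake)

# `plan` is assumed to be the drake plan with the 10 dynamic targets.
# Launch 3 persistent clustermq workers, as quoted in the description above.
make(plan, parallelism = "clustermq", jobs = 3)
```

The `options()` call selects the backend globally for the session, so it has to run before `make()` starts its workers.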
Expected result
Three jobs should have been running simultaneously.
Session info