
Steadily increasing memory consumption even with memory_strategy = "autoclean" #1257

Closed
2 of 3 tasks
matthiasgomolka opened this issue May 15, 2020 · 21 comments

@matthiasgomolka
Contributor

Prework

  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • Advanced users: verify that the bug still persists in the current development version (i.e. remotes::install_github("ropensci/drake")) and mention the SHA-1 hash of the Git commit you install.
    [Sorry, I cannot install from GitHub on this machine.]

Description

I use drake to produce a large research dataset (~1 TB), which is chunked into pieces of 5 to 10 GB each. My machine has 192 GB of RAM, and I run only 3 jobs in parallel.

In order to keep memory usage low, I specify the following configuration:

drake_config(
  plan,
  parallelism = "future",
  jobs = 3,
  keep_going = TRUE,
  garbage_collection = TRUE,
  memory_strategy = "autoclean",
  caching = "worker"
)

I run my plan via r_make() and everything seems to work nicely. However, despite memory_strategy = "autoclean" and garbage_collection = TRUE, memory usage grows steadily over several hours. Eventually, the machine crashes and I have to start over (thanks to drake, I can pick up right where it crashed).

From what I read in the documentation, I would expect roughly constant memory usage, since every target is discarded from memory after it finishes and only direct dependencies are loaded beforehand. None of my targets has dependencies larger than 3 GB (stored as fst in the cache), so I would not expect memory usage above 40 to 60 GB.
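(Editor's note: the fst storage mentioned above corresponds to drake's specialized target formats. A minimal sketch of declaring one, where build_chunk() is a hypothetical function standing in for the actual data-producing code:)

```r
# Sketch only; assumes drake >= 7.5.2 and the fst package installed.
library(drake)

plan <- drake_plan(
  chunk = target(
    build_chunk(),    # hypothetical: returns a data frame
    format = "fst"    # store this target in the cache via fst
  )
)
```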

Reproducible example

Since the error occurs only after several hours and the dataset is confidential, it is hard to produce a simple reproducible example. Please comment if you have suggestions.

Expected result

Memory usage should not grow steadily and r_make() should finish without issues.

Session info

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 
@wlandau
Member

wlandau commented May 15, 2020

What happens with parallelism = "loop"?

@matthiasgomolka
Contributor Author

Then the memory usage remains constant and low (after more than 4 hours now).

@wlandau
Member

wlandau commented May 15, 2020

Thanks for checking. So now it looks like futures are somehow holding onto superfluous data.

@wlandau
Member

wlandau commented May 15, 2020

Also, what future::plan() did you use?

@wlandau
Member

wlandau commented May 15, 2020

I am trying to reproduce your issue, and so far I am not successful. Here is a workflow with targets large enough to noticeably impact memory.

library(drake)
plan <- drake_plan(
  x = target(
    do.call(rbind, replicate(1e5, mtcars)),
    transform = map(x = !!seq_len(100))
  )
)
future::plan(future::multisession, workers = 2)
make(
  plan,
  parallelism = "future",
  jobs = 2,
  garbage_collection = TRUE,
  memory_strategy = "autoclean"
)

Each target takes around 282 MB in memory.

> x <- do.call(rbind, replicate(1e5, mtcars))
> pryr::object_size(x)
282 MB

With the default drake storage format (i.e. no target(format = "fst")), storr duplicates targets in memory, so each target should theoretically consume at most 282 * 2 = 564 MB at any given time; with make(jobs = 2), at most 1128 MB. When I ran this workflow on a Linux machine and watched memory usage with watch -n .1 ps -o pid,%mem,rss,command, this is indeed what I saw: memory fluctuated up and down, spending the majority of the time around 500 MB, but it never rose above 1128 MB and stayed constant over time on average.

So it looks like drake's autoclean memory strategy is actually working properly, at least on Linux. I should probably also try Windows.

By the way, I also noticed you set keep_going = TRUE. Because your pipeline continues after targets fail, this may actually be an instance of #1253, which is fixed in the CRAN update I released yesterday. So you might try installing drake 7.12.1 from CRAN.

@wlandau
Member

wlandau commented May 15, 2020

Confirmed: memory is constant on average (around 530 MB) on Windows as well.

@wlandau
Member

wlandau commented May 15, 2020

In #1257 (comment) I had a typo in the code that made the targets small. Memory on Windows was still constant over time on average, but up around 2400 MB. I am not sure why it was so much higher, but autoclean + garbage collection still appears to be working.

@matthiasgomolka
Contributor Author

Thanks for your effort so far!

I use plan("multisession").

I will update to the new version and try again. I'll let you know if it works.

@matthiasgomolka
Contributor Author

Upgrading to 7.12.1 does not seem to have an effect.

Another thought: in my plan, I call Stata via PowerShell. Might this be a problem? Apart from that, there is nothing unusual, I think...

@wlandau
Member

wlandau commented May 18, 2020

Upgrading to 7.12.1 does not seem to have an effect.

Are there any failed targets? That's where I thought 7.12.1 would help.

Another thought: in my plan, I call Stata via PowerShell. Might this be a problem? Apart from that, there is nothing unusual, I think...

I am not sure; I am not familiar with Stata. How are you calling it? Does Stata run in a child process? Is there a way to test your workflow without Stata?

@matthiasgomolka
Contributor Author

No, all targets build just fine. And yes, I can skip the Stata targets. I'll let you know tomorrow if this has any impact.

@matthiasgomolka
Contributor Author

Skipping the Stata targets didn't solve the problem either. I think I need to dig deeper to create a reprex so that you can actually see what's going on. Thanks for the time you have already spent on this! I'll post a reprex as soon as I've figured out exactly when the problem occurs.

@wlandau wlandau changed the title memory_strategy = "autoclean" does not seem to work as described Steadily increasing memory consumption even with memory_strategy = "autoclean" May 22, 2020
@matthiasgomolka
Contributor Author

matthiasgomolka commented Jun 3, 2020

Small update:
Unfortunately, I still cannot determine what causes the steadily increasing memory consumption. But I observed that after I cleaned the cache (with gc), I did not have any trouble for a while. Now my cache is ~690 GB again, and the issue has returned.

If the issue is in fact related to a large cache, that might explain why it's hard to create a reprex that shows the problem.

Are there any known issues with large caches? And is ~690 GB considered large in drake terms?
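(Editor's note: the cache cleaning mentioned above can be done through drake's own API. A minimal sketch, assuming the project's default cache location:)

```r
library(drake)

# Collect garbage in the default cache: removes stored data that is no
# longer referenced by any target, shrinking the cache on disk.
drake_gc()

# Equivalently, via the underlying cache object itself:
cache <- drake_cache()
cache$gc()
```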

@wlandau
Member

wlandau commented Jun 4, 2020

690 GB is larger than most drake projects get, and we certainly do not want it to affect memory. I am not exactly sure why cache$gc() would mitigate memory consumption. However, I do know that storrs keep their own in-memory caches and that use_cache is TRUE by default in methods like get(). So I just set use_cache = FALSE in a bunch more places (7bb9b51). It is a shot in the dark, but it might help.
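(Editor's note: the storr behavior referenced here can be observed directly. A small sketch using a throwaway RDS-backed storr; the key name and value are arbitrary:)

```r
library(storr)

st <- storr_rds(tempfile())          # disposable on-disk storr
st$set("x", runif(1e6))

v1 <- st$get("x")                    # by default, the value is also memoized
                                     # in storr's in-memory (environment) cache
v2 <- st$get("x", use_cache = FALSE) # read from disk, bypassing the memo cache

st$destroy()                         # clean up the temporary store
```

The hypothesis in the comment above is that those memoized copies accumulate across many targets unless use_cache = FALSE is passed.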

@wlandau
Member

wlandau commented Jun 23, 2020

Any change since 7bb9b51 on this issue? Also, do you use any dynamic branching in your plan?

@matthiasgomolka
Contributor Author

No changes yet. I finished the plan sequentially since I needed the results, and I haven't had time to revisit the issue yet, sorry! But I will work on a plan with targets of similar size this week; maybe I'll get some insights from that.

I use only static branching in my plan.

@wlandau
Member

wlandau commented Jun 23, 2020

Thanks, that helps.

Also, I wonder if it has something to do with the data structures you are using. ggplot and lm objects contain their own special environments, and the data inside them can get surprisingly large. Maybe check whether later targets are actually larger than earlier ones.

@matthiasgomolka
Contributor Author

Sorry for the late reply. The large objects are all data.tables; apart from that, just small lists or vectors. The targets differ in size, but not substantially.

@wlandau
Member

wlandau commented Jul 3, 2020

Glad you eliminated that possible explanation, that helps.

To troubleshoot further, I think we've reached the point where we really do need a reprex: as much of the code as possible in a workflow as scaled down as possible. Is that doable on your end?

@matthiasgomolka
Contributor Author

I'm pretty busy at the moment but I'll try to make a reprex.

Side note: the stakes here are quite high for me, since my colleagues and I are working on a proof of concept for our future data production, which (at the moment) includes drake (and I'm pretty excited about that). This was only possible thanks to your patient help over the last months!

@wlandau
Member

wlandau commented Jul 14, 2020

This issue has been up for a while, so I am closing it until we can reproduce it. Please ping me again when you have an end-to-end reprex.

@wlandau wlandau closed this as completed Jul 14, 2020