
Steadily increasing memory consumption even with memory_strategy = "autoclean" #1257

Closed
2 of 3 tasks
matthiasgomolka opened this issue May 15, 2020 · 21 comments

@matthiasgomolka
Contributor

Prework

  • Read and abide by drake's code of conduct.
  • Search for duplicates among the existing issues, both open and closed.
  • Advanced users: verify that the bug still persists in the current development version (i.e. remotes::install_github("ropensci/drake")) and mention the SHA-1 hash of the Git commit you install.
    [Sorry, I cannot install from GitHub on this machine.]

Description

I use drake to produce a large research dataset (~1 TB), which is chunked into pieces of 5 to 10 GB each. My machine has 192 GB of RAM, and I run only 3 jobs in parallel.

In order to keep memory usage low, I specify the following configuration:

drake_config(
  plan,
  parallelism = "future",
  jobs = 3,
  keep_going = TRUE,
  garbage_collection = TRUE,
  memory_strategy = "autoclean",
  caching = "worker"
)

I run my plan via r_make() and everything seems to work nicely. However, despite memory_strategy = "autoclean" and garbage_collection = TRUE, memory usage grows steadily over several hours. Eventually, the machine crashes and I have to start over (thanks to drake, I can pick up right where it crashed).

From what I read in the documentation, I would expect roughly constant memory usage, since every target is discarded from memory after it finishes and only direct dependencies are loaded beforehand. None of my targets has dependencies larger than 3 GB (stored as fst in the cache), so I would not expect memory usage above 40 to 60 GB.
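(Editor's note: the fst storage mentioned above corresponds to drake's specialized target formats. A minimal sketch of declaring one, where build_chunk() is a hypothetical function standing in for the actual data-producing code:)

```r
# Sketch only; assumes drake >= 7.5.2 and the fst package installed.
library(drake)

plan <- drake_plan(
  chunk = target(
    build_chunk(),    # hypothetical: returns a data frame
    format = "fst"    # store this target in the cache via fst
  )
)
```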

Reproducible example

Since the error occurs only after several hours and the dataset is confidential, it is hard to produce a simple reproducible example. Please comment if you have suggestions.

Expected result

Memory usage should not grow steadily and r_make() should finish without issues.

Session info

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1 
@wlandau
Member

wlandau commented May 15, 2020

What happens with parallelism = "loop"?

@matthiasgomolka
Contributor Author

Then the memory usage remains constant and low (after more than 4 hours now).

@wlandau
Member

wlandau commented May 15, 2020

Thanks for checking. So now it looks like futures are somehow holding onto superfluous data.

@wlandau
Member

wlandau commented May 15, 2020

Also, what future::plan() did you use?

@wlandau
Member

wlandau commented May 15, 2020

I am trying to reproduce your issue, and so far I am not successful. Here is a workflow with targets large enough to noticeably impact memory.

library(drake)
plan <- drake_plan(
  x = target(
    do.call(rbind, replicate(1e5, mtcars)),
    transform = map(x = !!seq_len(100))
  )
)
future::plan(future::multisession, workers = 2)
make(
  plan,
  parallelism = "future",
  jobs = 2,
  garbage_collection = TRUE,
  memory_strategy = "autoclean"
)

Each target takes around 282 MB in memory.

> x <- do.call(rbind, replicate(1e5, mtcars))
> pryr::object_size(x)
282 MB

With the default drake storage format (i.e. no target(format = "fst")), storr duplicates targets in memory, so each target should theoretically consume at most 282 * 2 = 564 MB at any given time; with make(jobs = 2), at most 1128 MB. When I ran this workflow on a Linux machine and watched memory usage with watch -n .1 ps -o pid,%mem,rss,command, this is indeed what I saw: memory fluctuated up and down, spending the majority of the time around 500 MB, but it never rose above 1128 MB and stayed constant over time on average.

So it looks like drake's autoclean memory strategy is actually working properly, at least on Linux. I should probably also try Windows.

By the way, I also noticed you set keep_going = TRUE. Because your pipeline continues after targets fail, this may actually be an instance of #1253, which is fixed in the CRAN update I released yesterday. So you might try installing drake 7.12.1 from CRAN.

@wlandau
Member

wlandau commented May 15, 2020

Confirmed: memory is constant on average (around 530 MB) on Windows as well.

@wlandau
Member

wlandau commented May 15, 2020

In #1257 (comment) I had a typo in the code that made the targets small. Memory on Windows was still constant over time on average, but up around 2400 MB. I am not sure why it was so much higher, but autoclean + garbage collection still appears to be working.

@matthiasgomolka
Contributor Author

Thanks for your effort so far!

I use plan("multisession").

I will update to the new version and try again. I'll let you know if it works.

@matthiasgomolka
Contributor Author

Upgrading to 7.12.1 does not seem to have an effect.

Another thought: in my plan, I call Stata via PowerShell. Might this be a problem? Apart from that, there is nothing unusual, I think...

@wlandau
Member

wlandau commented May 18, 2020

Upgrading to 7.12.1 does not seem to have an effect.

Are there any failed targets? That's where I thought 7.12.1 would help.

Another thought: in my plan, I call Stata via PowerShell. Might this be a problem? Apart from that, there is nothing unusual, I think...

I am not sure; I am not familiar with Stata. How are you calling it? Does Stata run in a child process? Is there a way to test your workflow without Stata?

@matthiasgomolka
Contributor Author

No, all targets build just fine. And yes, I can skip the Stata targets. I'll let you know tomorrow if this has any impact.

@matthiasgomolka
Contributor Author

Skipping the Stata targets didn't solve the problem either. I think I need to dig deeper to create a reprex so that you can actually see what's going on. Thanks for the time you have already spent on this! I'll post a reprex as soon as I've figured out exactly when the problem occurs.

@wlandau wlandau changed the title memory_strategy = "autoclean" does not seem to work as described Steadily increasing memory consumption even with memory_strategy = "autoclean" May 22, 2020
@matthiasgomolka
Contributor Author

matthiasgomolka commented Jun 3, 2020

Small update:
Unfortunately, I still cannot determine what causes the steadily increasing memory consumption. But I observed that after I cleaned the cache (with gc), I did not have any trouble for a while. Now my cache is ~690 GB again, and the issue has returned.

If the issue is in fact related to a large cache, that might explain why it's hard to create a reprex that shows the problem.

Are there any known issues with large caches? And is ~690 GB considered large in drake terms?
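(Editor's note: the cache cleaning mentioned above can be done through drake's own API. A minimal sketch, assuming the project's default cache location:)

```r
library(drake)

# Collect garbage in the default cache: removes stored data that is no
# longer referenced by any target, shrinking the cache on disk.
drake_gc()

# Equivalently, via the underlying cache object itself:
cache <- drake_cache()
cache$gc()
```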

@wlandau
Member

wlandau commented Jun 4, 2020

690 GB is larger than most drake projects get, and we certainly do not want it to affect memory. I am not exactly sure why cache$gc() would mitigate memory consumption. However, I do know that storrs keep their own in-memory caches and that use_cache is TRUE by default in methods like get(). So I just set use_cache = FALSE in a bunch more places (7bb9b51). It is a shot in the dark, but it might help.
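(Editor's note: the storr behavior referenced here can be observed directly. A small sketch using a throwaway RDS-backed storr; the key name and value are arbitrary:)

```r
library(storr)

st <- storr_rds(tempfile())          # disposable on-disk storr
st$set("x", runif(1e6))

v1 <- st$get("x")                    # by default, the value is also memoized
                                     # in storr's in-memory (environment) cache
v2 <- st$get("x", use_cache = FALSE) # read from disk, bypassing the memo cache

st$destroy()                         # clean up the temporary store
```

The hypothesis in the comment above is that those memoized copies accumulate across many targets unless use_cache = FALSE is passed.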

@wlandau
Member

wlandau commented Jun 23, 2020

Any change since 7bb9b51 on this issue? Also, do you use any dynamic branching in your plan?

@matthiasgomolka
Contributor Author

No changes yet. I finished the plan sequentially since I needed the results, and I haven't had time to revisit the issue yet, sorry! But I will work on a plan with targets of similar size this week; maybe I'll get some insights from that.

I use only static branching in my plan.

@wlandau
Member

wlandau commented Jun 23, 2020

Thanks, that helps.

Also, I wonder if it has something to do with the data structures you are using. ggplot and lm objects contain their own special environments, and the data inside them can get surprisingly large. Maybe check whether later targets are actually larger than earlier ones.

@matthiasgomolka
Contributor Author

Sorry for the late reply. The large objects are all data.tables; apart from that, just small lists or vectors. The targets differ in size, but not substantially.

@wlandau
Member

wlandau commented Jul 3, 2020

Glad you eliminated that possible explanation, that helps.

To troubleshoot further, I think we've reached the point where we really do need a reprex: as much of the code as possible in a workflow as scaled down as possible. Is that doable on your end?

@matthiasgomolka
Contributor Author

I'm pretty busy at the moment but I'll try to make a reprex.

Side note: the stakes here are quite high for me, since my colleagues and I are working on a proof of concept for our future data production, which (at the moment) includes drake (and I'm pretty excited about that). This was only possible thanks to your patient help over the last months!

@wlandau
Member

wlandau commented Jul 14, 2020

This issue has been up for a while, so I am closing it until we can reproduce it. Please ping me again when you have an end-to-end reprex.

@wlandau wlandau closed this as completed Jul 14, 2020