Lock contention in burn-autodiff #2729
Comments
Scratch that, after doing a release-with-debug-symbols build and profiling, it's very clearly the spinlock.
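(For reference, a release build with debug symbols for profiling can be produced with the standard Cargo profile setting below; nothing burn-specific about it.)

```toml
# Cargo.toml: keep release optimizations, but emit debug info for the profiler
[profile.release]
debug = true
```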
Thanks for reporting this with relevant info! Related: #715. Many things have changed since the discussion in the linked issue, but this was never officially supported. We're seeing more RL use cases for which this should be helpful, so the reported issue needs to be fixed for multiple training runs to be supported.
I'm viewing this as more of a bug, fwiw. I need to use burn from multiple threads, but the global mutex means I can't. This is pretty much a deal breaker for my application, so I'd really like to find a fix.
I tried to apply some lock hygiene just in the place it jumped out to me (https://github.com/mullr/burn/tree/narrower-autodiff-lock). This helped a little; I gained about 20% when running on 4 cores. But it's clearly not nearly enough. It really seems like the global lock shouldn't exist at all. But dealing with that is probably extreme surgery.
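A minimal sketch of the idea, i.e. narrowing the critical section so the lock is dropped before any heavy work runs (hypothetical names, not the actual code from that branch):

```rust
use std::sync::Mutex;

struct GraphState {
    nodes: Vec<u64>,
}

static STATE: Mutex<GraphState> = Mutex::new(GraphState { nodes: Vec::new() });

// Before: the guard is held across the expensive work, so every caller
// serializes on it for the whole computation.
fn register_then_compute_coarse(node: u64) -> u64 {
    let mut state = STATE.lock().unwrap();
    state.nodes.push(node);
    expensive_compute(node) // lock is still held here
}

// After: take the lock only for the quick bookkeeping, drop it, then compute.
fn register_then_compute_narrow(node: u64) -> u64 {
    {
        let mut state = STATE.lock().unwrap();
        state.nodes.push(node);
    } // guard dropped at the end of this block
    expensive_compute(node)
}

// Stand-in for the actual numeric work.
fn expensive_compute(node: u64) -> u64 {
    (0..100_000u64).fold(node, |acc, x| acc.wrapping_add(x))
}

fn main() {
    println!(
        "{} {}",
        register_then_compute_coarse(1),
        register_then_compute_narrow(2)
    );
}
```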
I forgot we also have an mpsc channel implementation under the …
Yeah, I figured you'd probably face the same issue at the server level... 😅
@mullr The problem isn't with burn-autodiff, but with ndarray! We need to keep track of nodes and graphs across threads, and the state registration is very, very quick. The problem is that computation with the ndarray backend is synchronous, meaning each operation has to wait for the autodiff backend registration to complete, which isn't the case with GPU backends. The real fix would be to make the computation in the ndarray backend asynchronous.
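To illustrate the sync vs. async distinction (a toy sketch only, not burn's actual backend API): with a synchronous backend the caller, and anything locked around the call, waits for the whole computation, while an asynchronous backend just enqueues the op and returns immediately, so registration stays cheap.

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a synchronous tensor operation (e.g. an ndarray matmul).
fn matmul_cpu() {
    let _ = (0..100_000u64).fold(0u64, |acc, x| acc.wrapping_add(x));
}

// Synchronous backend: the caller blocks for the full computation.
fn sync_op() {
    matmul_cpu();
}

// "Asynchronous" backend: the op is only enqueued; the caller returns
// immediately and a worker executes it later.
fn async_op(queue: &mpsc::Sender<Box<dyn FnOnce() + Send>>) {
    queue.send(Box::new(matmul_cpu)).expect("worker is alive");
}

fn main() {
    let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
    // Dedicated worker thread actually runs the enqueued ops.
    let worker = thread::spawn(move || {
        for job in rx {
            job();
        }
    });

    sync_op();     // blocks until the computation is done
    async_op(&tx); // returns right away

    drop(tx); // close the channel so the worker loop ends
    worker.join().unwrap();
}
```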
Describe the bug
I am trying to train a bunch of models in parallel using the NdArray backend, with rayon driving the parallelism. I find that the added parallelism makes the process much slower. For completing my batch of training, I've measured:
This is clearly the opposite of what I'd expect from such a CPU-bound task. All CPU cores are completely busy as well.
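Roughly the shape of the setup (a sketch; `train_one` is a hypothetical stand-in for the real per-model training loop):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for training one model on the NdArray backend;
// the real version would build a model and run its training loop here.
fn train_one(seed: u64) -> f64 {
    (0..1_000_000u64)
        .fold(seed, |acc, x| acc.wrapping_mul(6364136223846793005).wrapping_add(x)) as f64
}

fn main() {
    // Train a batch of models in parallel; with the global autodiff lock this
    // ends up slower than the plain sequential version instead of faster.
    let losses: Vec<f64> = (0..32u64).into_par_iter().map(train_one).collect();
    println!("trained {} models, e.g. {}", losses.len(), losses[0]);
}
```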
After profiling, the culprit appears to be https://github.com/tracel-ai/burn/blob/main/crates/burn-autodiff/src/runtime/mutex.rs#L19. Specifically, https://github.com/tracel-ai/burn/blob/main/crates/burn-autodiff/src/runtime/mutex.rs#L23. The threads are just burning CPU trying to acquire the spinlock.
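(For context, a spinlock acquire is essentially a busy-wait like the sketch below, which is why waiting threads keep their cores pegged instead of sleeping. This is illustrative only, not the code in mutex.rs.)

```rust
use std::sync::atomic::{AtomicBool, Ordering};

pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        Self { locked: AtomicBool::new(false) }
    }

    pub fn lock(&self) {
        // Busy-wait: every failed attempt burns CPU cycles, so heavy
        // contention shows up in a profile exactly like this issue describes.
        while self
            .locked
            .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}

fn main() {
    let lock = SpinLock::new();
    lock.lock();
    lock.unlock();
}
```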
Additional context
Burn 0.16