Lock contention in burn-autodiff #2729

Open

mullr opened this issue Jan 22, 2025 · 8 comments
Labels: bug (Something isn't working), enhancement (Enhance existing features)

Comments

@mullr commented Jan 22, 2025

Describe the bug
I am trying to train a bunch of models with the NdArray backend, using rayon to run the training jobs in parallel. I find that the added parallelism makes the process much slower. For completing my batch of training, I've measured:

  • 1 thread: 8:12
  • 2 threads: 8:46
  • 4 threads: 10:40
  • 8 threads: 15:16
  • 16 threads: 29:18
  • 32 threads: 54:47

This is clearly the opposite of what I'd expect from such a CPU-bound task, and all CPU cores are completely busy as well.
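
For reference, the setup looks roughly like this (a minimal sketch only; `train_one` is a hypothetical stand-in for my actual per-model training loop, which runs on `Autodiff<NdArray>`):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for the real per-model training loop; the actual code
// builds a model on burn's Autodiff<NdArray> backend and runs optimizer steps.
fn train_one(model_id: usize) {
    let _ = model_id;
    // ... forward + backward passes ...
}

fn main() {
    // Train many independent models in parallel. Each rayon worker is doing
    // pure CPU work, yet the total time grows with the number of threads.
    (0..32usize).into_par_iter().for_each(train_one);
}
```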

After profiling, the culprit appears to be https://github.com/tracel-ai/burn/blob/main/crates/burn-autodiff/src/runtime/mutex.rs#L19, specifically https://github.com/tracel-ai/burn/blob/main/crates/burn-autodiff/src/runtime/mutex.rs#L23: the threads are just burning CPU trying to acquire the spinlock.

Additional context
Burn 0.16

@mullr (Author) commented Jan 22, 2025

I'm not 100% confident in this diagnosis, fwiw. What I know for sure is that I'm burning time in MutexClient::register; I can't easily see beyond that because of inlining.

Scratch that, after doing a release-with-debug-symbols build and profiling, it's very clearly the spinlock.
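
For anyone skimming, the shape of the problem is roughly this (an illustrative sketch, not the actual burn-autodiff source): a single global spin lock guards the server, and every `register` call from every thread has to take it.

```rust
use spin::Mutex;

// Illustrative only: not the actual burn-autodiff code.
struct Server { /* graph and node state */ }

static SERVER: Mutex<Option<Server>> = Mutex::new(None);

fn register(/* node state, backward step, ... */) {
    // Under heavy parallel training, all threads spin here on every tracked op.
    let mut guard = SERVER.lock();
    let _server = guard.get_or_insert_with(|| Server {});
    // ... record the node/step on the single shared server ...
}
```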

@laggui (Member) commented Jan 22, 2025

Thanks for reporting this with relevant info!

Related: #715

Many things have changed since the discussion in the linked issue, but parallel training was never officially supported. We're seeing more RL use cases where this would be helpful, so the reported issue needs to be fixed for multiple concurrent training runs to be supported.

laggui added the enhancement (Enhance existing features) label Jan 22, 2025
@mullr (Author) commented Jan 22, 2025

I'm viewing this as more of a bug, fwiw. I need to use burn from multiple threads, but the global mutex means I can't. This is pretty much a deal breaker for my application, so I'd really like to find a fix.

@mullr (Author) commented Jan 22, 2025

I tried to apply some lock hygiene just in the place where it jumped out at me (https://github.com/mullr/burn/tree/narrower-autodiff-lock). This helped a little; I gained about 20% when running on 4 cores, but it's clearly not nearly enough. It really seems like the global lock shouldn't exist at all, though dealing with that is probably extreme surgery.
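
The kind of change I mean, sketched on a made-up registry rather than the actual branch: do the expensive work outside the critical section and only hold the lock for the final insertion.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Made-up example of narrowing a critical section (not the actual branch contents).
static REGISTRY: Mutex<Option<HashMap<u64, Vec<u8>>>> = Mutex::new(None);

fn register(id: u64, raw: &[u8]) {
    // Expensive preparation happens without the lock held...
    let prepared = raw.to_vec();

    // ...and the lock is only taken for the cheap insertion itself.
    let mut guard = REGISTRY.lock().unwrap();
    guard.get_or_insert_with(HashMap::new).insert(id, prepared);
}
```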

@laggui (Member) commented Jan 23, 2025

I forgot we also have an mpsc channel implementation under the "async" feature flag in burn-autodiff, though it's not used much. The channel doesn't block on sends, so maybe that will help?
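
The idea, sketched generically (this is not the actual burn-autodiff "async" client, just an illustration of the pattern): a dedicated thread owns the server state and callers only send messages, so they never wait on a lock.

```rust
use std::sync::mpsc;
use std::thread;

// Illustration of a channel-based client (not the real implementation).
enum Message {
    Register { node_id: u64 /* plus the step/state in the real thing */ },
}

fn spawn_server() -> mpsc::Sender<Message> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The single owner of the graph state processes registrations in order.
        for msg in rx {
            match msg {
                Message::Register { node_id } => {
                    let _ = node_id; // record the node/step here
                }
            }
        }
    });
    tx
}

fn register(client: &mpsc::Sender<Message>, node_id: u64) {
    // std mpsc `send` does not block the caller, so there is no spinning.
    let _ = client.send(Message::Register { node_id });
}
```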

@mullr (Author) commented Jan 23, 2025

The async feature does reduce CPU usage, but the contention is still present, so performance is still bad. The problem seems to be that there is a single AutodiffServer shared between all threads.

laggui added the bug (Something isn't working), accessibility (Everything that is related to making Burn more accessible), and enhancement labels and removed the enhancement and accessibility labels Jan 24, 2025
@laggui (Member) commented Jan 24, 2025

Yeah, I figured you'd probably face the same issue at the server level... 😅

@nathanielsimard (Member) commented

@mullr The problem isn't with burn-autodiff, but with ndarray! We need to keep track of nodes and graphs across threads, and the state registration is very, very quick. The problem is that computation with the ndarray backend is synchronous, meaning each operation has to wait for the autodiff backend registration to complete, which isn't the case with GPU backends.

The real fix would be to make the computation in the ndarray backend asynchronous.
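
To illustrate the difference (an illustrative contrast only, not actual backend code): a synchronous backend computes each result eagerly on the calling thread, so the autodiff registration sits on the hot path of every operation, while an asynchronous backend just enqueues the work and returns, letting the registration overlap with the computation.

```rust
// Illustrative contrast only: not actual backend code.
struct DeviceHandle(u64);

fn register_autodiff_step() { /* very quick bookkeeping */ }

// ndarray-style (synchronous): the caller computes the result eagerly, so each
// operation pays for computation + registration before the next op can start.
fn op_sync(input: &[f32]) -> Vec<f32> {
    let out: Vec<f32> = input.iter().map(|x| x * 2.0).collect(); // blocks the thread
    register_autodiff_step();
    out
}

// GPU-style (asynchronous): the op is enqueued and a handle is returned right
// away, so the registration overlaps with the device doing the actual work.
fn op_async(input: DeviceHandle) -> DeviceHandle {
    let queued = DeviceHandle(input.0 + 1); // stand-in for enqueueing a kernel
    register_autodiff_step();
    queued
}
```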
