Lock contention in burn-autodiff #2729

Open

mullr opened this issue Jan 22, 2025 · 8 comments
Labels: bug (Something isn't working), enhancement (Enhance existing features)

Comments

@mullr commented Jan 22, 2025

Describe the bug
I am trying to train a bunch of models with the NdArray backend, using rayon to run the training jobs in parallel. I find that the added parallelism makes the process much slower. For completing my batch of training, I've measured:

  • 1 thread: 8:12
  • 2 threads: 8:46
  • 4 threads: 10:40
  • 8 threads: 15:16
  • 16 threads: 29:18
  • 32 threads: 54:47

This is clearly the opposite of what I'd expect from such a CPU-bound task, and all CPU cores are completely busy as well.
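
For reference, the setup looks roughly like this (a minimal sketch only; `train_one` is a hypothetical stand-in for my actual per-model training loop, which runs on `Autodiff<NdArray>`):

```rust
use rayon::prelude::*;

// Hypothetical stand-in for the real per-model training loop; the actual code
// builds a model on burn's Autodiff<NdArray> backend and runs optimizer steps.
fn train_one(model_id: usize) {
    let _ = model_id;
    // ... forward + backward passes ...
}

fn main() {
    // Train many independent models in parallel. Each rayon worker is doing
    // pure CPU work, yet the total time grows with the number of threads.
    (0..32usize).into_par_iter().for_each(train_one);
}
```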

After profiling, the culprit appears to be https://github.com/tracel-ai/burn/blob/main/crates/burn-autodiff/src/runtime/mutex.rs#L19, specifically https://github.com/tracel-ai/burn/blob/main/crates/burn-autodiff/src/runtime/mutex.rs#L23: the threads are just burning CPU trying to acquire the spinlock.

Additional context
Burn 0.16

@mullr (Author) commented Jan 22, 2025

I'm not 100% confident in this diagnosis, fwiw. What I know for sure is that I'm burning time in MutexClient::register; I can't easily see beyond that because of inlining.

Scratch that, after doing a release-with-debug-symbols build and profiling, it's very clearly the spinlock.
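
For anyone skimming, the shape of the problem is roughly this (an illustrative sketch, not the actual burn-autodiff source): a single global spin lock guards the server, and every `register` call from every thread has to take it.

```rust
use spin::Mutex;

// Illustrative only: not the actual burn-autodiff code.
struct Server { /* graph and node state */ }

static SERVER: Mutex<Option<Server>> = Mutex::new(None);

fn register(/* node state, backward step, ... */) {
    // Under heavy parallel training, all threads spin here on every tracked op.
    let mut guard = SERVER.lock();
    let _server = guard.get_or_insert_with(|| Server {});
    // ... record the node/step on the single shared server ...
}
```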

@laggui (Member) commented Jan 22, 2025

Thanks for reporting this with relevant info!

Related: #715

Many things have changed since the discussion in the linked issue, but parallel training was never officially supported. We're seeing more RL use cases where this would be helpful, so the reported issue needs to be fixed for multiple concurrent training runs to be supported.

laggui added the enhancement (Enhance existing features) label Jan 22, 2025
@mullr (Author) commented Jan 22, 2025

I'm viewing this as more of a bug, fwiw. I need to use burn from multiple threads, but the global mutex means I can't. This is pretty much a deal breaker for my application, so I'd really like to find a fix.

@mullr (Author) commented Jan 22, 2025

I tried to apply some lock hygiene just in the place where it jumped out at me (https://github.com/mullr/burn/tree/narrower-autodiff-lock). This helped a little; I gained about 20% when running on 4 cores, but it's clearly not nearly enough. It really seems like the global lock shouldn't exist at all, though dealing with that is probably extreme surgery.
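
The kind of change I mean, sketched on a made-up registry rather than the actual branch: do the expensive work outside the critical section and only hold the lock for the final insertion.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Made-up example of narrowing a critical section (not the actual branch contents).
static REGISTRY: Mutex<Option<HashMap<u64, Vec<u8>>>> = Mutex::new(None);

fn register(id: u64, raw: &[u8]) {
    // Expensive preparation happens without the lock held...
    let prepared = raw.to_vec();

    // ...and the lock is only taken for the cheap insertion itself.
    let mut guard = REGISTRY.lock().unwrap();
    guard.get_or_insert_with(HashMap::new).insert(id, prepared);
}
```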

@laggui (Member) commented Jan 23, 2025

I forgot we also have an mpsc channel implementation under the "async" feature flag in burn-autodiff, though it's not used much. The channel doesn't block on sends, so maybe that will help?
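
The idea, sketched generically (this is not the actual burn-autodiff "async" client, just an illustration of the pattern): a dedicated thread owns the server state and callers only send messages, so they never wait on a lock.

```rust
use std::sync::mpsc;
use std::thread;

// Illustration of a channel-based client (not the real implementation).
enum Message {
    Register { node_id: u64 /* plus the step/state in the real thing */ },
}

fn spawn_server() -> mpsc::Sender<Message> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The single owner of the graph state processes registrations in order.
        for msg in rx {
            match msg {
                Message::Register { node_id } => {
                    let _ = node_id; // record the node/step here
                }
            }
        }
    });
    tx
}

fn register(client: &mpsc::Sender<Message>, node_id: u64) {
    // std mpsc `send` does not block the caller, so there is no spinning.
    let _ = client.send(Message::Register { node_id });
}
```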

@mullr (Author) commented Jan 23, 2025

The async feature does reduce CPU usage, but the contention is still present, so performance is still bad. The problem seems to be that there is a single AutodiffServer shared between all threads.

laggui added the bug (Something isn't working), accessibility (Everything that is related to making Burn more accessible), and enhancement labels and removed the enhancement and accessibility labels Jan 24, 2025
@laggui (Member) commented Jan 24, 2025

Yeah, I figured you'd probably face the same issue at the server level... 😅

@nathanielsimard (Member) commented

@mullr The problem isn't with burn-autodiff, but with ndarray! We need to keep track of nodes and graphs across threads, and the state registration is very, very quick. The problem is that computation with the ndarray backend is synchronous, meaning each operation has to wait for the autodiff backend registration to complete, which isn't the case with GPU backends.

The real fix would be to make the computation in the ndarray backend asynchronous.
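
To illustrate the difference (an illustrative contrast only, not actual backend code): a synchronous backend computes each result eagerly on the calling thread, so the autodiff registration sits on the hot path of every operation, while an asynchronous backend just enqueues the work and returns, letting the registration overlap with the computation.

```rust
// Illustrative contrast only: not actual backend code.
struct DeviceHandle(u64);

fn register_autodiff_step() { /* very quick bookkeeping */ }

// ndarray-style (synchronous): the caller computes the result eagerly, so each
// operation pays for computation + registration before the next op can start.
fn op_sync(input: &[f32]) -> Vec<f32> {
    let out: Vec<f32> = input.iter().map(|x| x * 2.0).collect(); // blocks the thread
    register_autodiff_step();
    out
}

// GPU-style (asynchronous): the op is enqueued and a handle is returned right
// away, so the registration overlaps with the device doing the actual work.
fn op_async(input: DeviceHandle) -> DeviceHandle {
    let queued = DeviceHandle(input.0 + 1); // stand-in for enqueueing a kernel
    register_autodiff_step();
    queued
}
```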
