Hi!

I've been testing the recently added feature from #155.

Overall, the performance improvement from that feature is great. However, there seems to be an issue with scaling. We're using the default CPU EP, and we have more than 20 models (sessions) that are shared (`Arc`'d) between all threads and on which we call `run` concurrently from all worker threads (note: each thread does not run an inference request on every model, but chooses a specific one depending on certain conditions).
As the number of threads increases, I see an increase in system (kernel) CPU load. At 88 threads, our system CPU load increased from <5% to 12-15%. `strace` showed that ~90% of kernel time is spent in `futex` syscalls. Take a look at what `perf` shows:

[perf flamegraph screenshot]
I'm assuming that if we had a single shared model, then the contention would be even higher.
There are essentially no other `futex` syscalls in the whole flamegraph (unfortunately, I cannot share the raw `.svg`, sorry about that).
Then I stumbled upon the following documentation (the "Share allocator(s) between sessions" part): https://onnxruntime.ai/docs/get-started/with-c.html#features
I hypothesized that if there's a global session object and many threads are calling `run` on it, then `run` could be getting stuck on some kind of arena mutex. So I tried changing the application to have sessions per worker thread instead of shared ones. If sessions have their own local arenas, I expected to see increased memory usage but reduced contention.
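Roughly, the two setups I compared look like this. This is a minimal sketch assuming the ort 1.x `Environment`/`SessionBuilder` API; the `build_*` helpers and `model_paths` are placeholders of mine, not our actual code:

```rust
use std::sync::Arc;
use ort::{Environment, OrtResult, Session, SessionBuilder};

// Shared setup: one session per model, `Arc`'d across all worker threads.
fn build_shared(env: &Arc<Environment>, model_paths: &[&str]) -> OrtResult<Vec<Arc<Session>>> {
    model_paths
        .iter()
        .map(|&path| Ok(Arc::new(SessionBuilder::new(env)?.with_model_from_file(path)?)))
        .collect()
}

// Per-thread setup: called once in every worker, so each thread owns its own
// sessions, in the hope that each session gets its own arena and therefore
// the threads contend less.
fn build_per_thread(env: &Arc<Environment>, model_paths: &[&str]) -> OrtResult<Vec<Session>> {
    model_paths
        .iter()
        .map(|&path| SessionBuilder::new(env)?.with_model_from_file(path))
        .collect()
}
```

In both cases, each worker picks one of its sessions based on the routing condition and calls `run` on it; only the ownership of the sessions differs.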
Unfortunately, pretty much nothing changed, and the before/after flamegraphs look more or less identical.
So, I'm not familiar with ONNX Runtime's internals, but could it be that the arena allocator is shared between all sessions by default? Do you think it makes sense to make that configurable? Is it an arena mutex at all, or is my assumption simply wrong? I'm assuming it is an arena mutex because these syscalls show up in `Value::from_array`, `drop` calls, etc.
Also, somewhat related, take a look at the zoomed-in `Session::run`:

[zoomed-in `Session::run` flamegraph screenshot]

There are two `Drop::drop` calls; zooming in on them:

[zoomed-in `Drop::drop` flamegraph screenshot]
Again, I'm not familiar with ONNX Runtime's internals, but arenas have to reset their chunk pointer at some point, and when new values are written, the old memory simply gets overwritten. As such, it makes sense (at least in the other cases where I've used arenas) to avoid calling `Drop` at all. With that in mind, does it make sense to skip calling `ReleaseMemoryInfo`/`ReleaseValue` entirely when the allocator is an arena? That could be a nice optimization. The relevant call sites:

- ort/src/memory.rs, line 146 (at d1ae982)
- ort/src/value.rs, line 703 (at d1ae982)
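To illustrate the reasoning, here is a toy bump arena in plain Rust (explicitly not ONNX Runtime's actual allocator): freeing an individual allocation is a no-op, and all memory is reclaimed at once by resetting a cursor, so a per-value release call buys nothing:

```rust
/// A toy bump arena: `alloc` only moves a cursor forward and keeps no
/// per-allocation bookkeeping, so an individual "free" has nothing to do.
struct BumpArena {
    buf: Vec<u8>,
    cursor: usize,
}

impl BumpArena {
    fn with_capacity(capacity: usize) -> Self {
        Self { buf: vec![0; capacity], cursor: 0 }
    }

    /// Hand out the next `len` bytes, or `None` if the arena is exhausted.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        let start = self.cursor;
        let end = start.checked_add(len).filter(|&end| end <= self.buf.len())?;
        self.cursor = end;
        Some(&mut self.buf[start..end])
    }

    /// Reclaim everything at once; the old contents are simply overwritten
    /// by later allocations.
    fn reset(&mut self) {
        self.cursor = 0;
    }
}
```

If the runtime's arena behaves like this, a `ReleaseValue` per output would mostly pay for synchronization without actually returning memory.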
Those `futex` calls are probably from ort, as each call to an ONNX Runtime API would (needlessly) lock a `Mutex`. I removed the mutex in #160; does ort @ 04df44d help with the contention at all?
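For context on why one lock turns into `futex` load, here is a minimal, self-contained illustration of the pattern (not ort's actual code) in which every API call funnels through a single global `Mutex`, so with many threads nearly every call contends and parks in the kernel:

```rust
use std::sync::Mutex;
use std::thread;

// One global lock taken on every "API call". This mimics the pattern
// removed in #160; it is an illustration, not ort's real code.
static API_LOCK: Mutex<()> = Mutex::new(());

fn api_call() {
    // An uncontended lock stays in userspace; a contended one falls back to
    // a futex syscall, which is exactly what shows up in strace/perf.
    let _guard = API_LOCK.lock().unwrap();
    std::hint::black_box(()); // stand-in for the real FFI work
}

fn main() {
    let workers: Vec<_> = (0..88)
        .map(|_| thread::spawn(|| (0..1_000_000).for_each(|_| api_call())))
        .collect();
    for worker in workers {
        worker.join().unwrap();
    }
}
```

Profiling this under `strace -c` should show the same futex-dominated kernel time; dropping the per-call lock should make it largely disappear.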