4096 thread limit in safekeepers. #3036
Comments
We should fix this quickly; 4k is a low limit. I'd probably open a PR to tracing which would make the sharded-slab config adjustable. However, tracing-subscriber is evidently tailored to the async world, so we'd better understand the potential problems here -- e.g. how much memory a high MAX_THREADS would consume. For now I'd just switch the safekeeper to async in roughly the pageserver mode: with separate runtimes (to prevent different jobs from blocking each other) and sync IO.
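For illustration, a minimal sketch of such separate runtimes (subsystem names and thread counts are made up, not the real safekeeper wiring); distinct thread names also make per-subsystem CPU usage visible in `top -H`:

```rust
// One named multi-threaded runtime per subsystem, so a stalled subsystem
// cannot starve the others and its threads are identifiable in `top -H`.
fn build_runtime(name: &str, threads: usize) -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name(name)
        .worker_threads(threads)
        .enable_all()
        .build()
        .expect("failed to build runtime")
}

fn main() {
    let wal_service = build_runtime("wal_service", 4);
    let broker = build_runtime("broker", 1);
    let http = build_runtime("http", 2);

    // Each subsystem's entry point is spawned onto its own runtime;
    // the empty async blocks are placeholders for the real main loops.
    wal_service.spawn(async { /* WAL receivers / senders */ });
    broker.spawn(async { /* broker push/pull loops */ });

    // Park the main thread; in a real binary this would run the HTTP endpoint.
    http.block_on(std::future::pending::<()>());
}
```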
In the broader picture, async without async disk IO doesn't make much sense for disk-intensive safekeepers -- they should saturate the disk long before they run out of CPU due to a large number of connections. However, we use tokio for handy cancellation, and several of the libraries we use are async, so the basic model is thread-per-connection with a current-thread runtime in each thread (see the sketch below). It would be nice to reconcile tracing-subscriber with that, or to use something different (slog or something custom) -- though the latter doesn't look appealing with tracing being used in other parts of the project. An alternative is actual async disk IO -- io_uring etc. -- but right now this is not a priority.
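A minimal sketch of that thread-per-connection model, assuming a hypothetical `handle_connection` and listen address (the real safekeeper wiring differs):

```rust
use std::net::TcpListener;

// Hypothetical connection handler; in the real safekeeper the WAL service
// protocol would run over the accepted socket.
async fn handle_connection(stream: tokio::net::TcpStream) -> std::io::Result<()> {
    let _ = stream;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:5454")?;
    for stream in listener.incoming() {
        let stream = stream?;
        // One OS thread per connection, each running its own single-threaded tokio
        // runtime, so async libraries and cancellation work without a shared executor.
        std::thread::spawn(move || {
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .expect("failed to build per-connection runtime");
            rt.block_on(async move {
                stream
                    .set_nonblocking(true)
                    .expect("set_nonblocking failed");
                match tokio::net::TcpStream::from_std(stream) {
                    Ok(stream) => {
                        if let Err(e) = handle_connection(stream).await {
                            eprintln!("connection error: {e}");
                        }
                    }
                    Err(e) => eprintln!("failed to register stream: {e}"),
                }
            });
        });
    }
    Ok(())
}
```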
Please note that this created other problems. Mixing sync IO with async sometimes doesn't go well. To me it's clear that we should move to full async and not stop here. Regarding the "doesn't go well" part, see this issue by @koivunej.
I cannot agree with that. Without spawn_blocking you starve executor threads with blocking operations. Fsyncs can take seconds; if the number of in-flight fsyncs equals the number of executor threads, everything is stuck. Such calls cannot be preempted, so at best they cause a latency spike for other tasks in the queue -- not to mention deadlocks, e.g. with the non-stealable queue slot in the tokio scheduler (see the issue I mentioned above). Tokio's own File operations offload the actual IO to a thread pool whose threads are started in advance, so it is unclear to me how that can make things significantly worse performance-wise. It even makes it easy to increase write concurrency (e.g. schedule several writes at once). In general, async allows for higher concurrency, which IMO suits our use case and the number of things we need to do concurrently.
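To make the offloading concrete, a minimal sketch (hypothetical path and buffer, simplified error handling) of moving an append+fsync off the executor threads with spawn_blocking -- roughly what tokio's own fs operations do internally:

```rust
use std::io::Write;

// Append a WAL chunk and fsync it without blocking the async executor threads.
async fn append_and_sync(path: std::path::PathBuf, buf: Vec<u8>) -> std::io::Result<()> {
    tokio::task::spawn_blocking(move || {
        let mut file = std::fs::OpenOptions::new()
            .create(true)
            .append(true)
            .open(&path)?;
        file.write_all(&buf)?;
        // The call that can take seconds on a busy disk; on the blocking pool it
        // only parks one pool thread instead of stalling the async scheduler.
        file.sync_all()
    })
    .await
    .expect("blocking task panicked")
}
```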
To me this model is rather pointless: async makes a lot of sense for networking when you have many clients, so it would be beneficial to have a runtime that can multiplex several connections in one thread. With a per-connection runtime it seems we pick the worst parts of both worlds. Do we really have 4096 active threads? If you don't want to go down the async road, it's probably worth considering being more aggressive about shutting down idle things. To me that looks like the fastest and least disruptive way to solve the issue.
What gets stuck if basically the only thing we do is reading/writing to disk? It's actually a non-trivial question; it would be interesting to benchmark.
How? I thought each spawn_blocking is an individual operation, no?
Currently no (we have had several such situations, but those were bugs), but 1-2 thousand active computes would do that. In general, unless we are talking about async disk IO, async makes sense only to save the CPU from context switching so it can do actually useful work instead. But if all we do is actively write to disk, we don't have the problem async helps with (we won't have tens of thousands of threads making the CPU spend most of its time context switching -- the disk would be saturated earlier), and we do the job at which async is less efficient, since disk IO has to be passed to a separate pool. How severe these effects are (what the cost of offloading is) I actually don't know; it would be interesting to benchmark.
The queue of tasks inside the executor. It won't move forward while we're inside a blocking syscall.
It is, but you can have several of them and join them.
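For instance, a sketch (paths and payloads are made up) of several offloaded writes scheduled at once and then joined:

```rust
// Schedule several blocking writes concurrently and wait for all of them.
async fn write_segments(segments: Vec<(std::path::PathBuf, Vec<u8>)>) -> std::io::Result<()> {
    // Each spawn_blocking is an individual operation, but several of them can be
    // in flight at the same time on the blocking pool.
    let handles: Vec<_> = segments
        .into_iter()
        .map(|(path, data)| tokio::task::spawn_blocking(move || std::fs::write(path, data)))
        .collect();
    for handle in handles {
        handle.await.expect("blocking task panicked")?;
    }
    Ok(())
}
```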
And not only that. It also allows you to increase concurrency without increasing the number of threads. Cancellation support. Plain convenience of coding when you do want to implement some sort of pipelining or the like (see the cancellation sketch below).
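On the cancellation point, a small sketch assuming tokio_util's `CancellationToken` (whether the project would use it here is an assumption); the task can be cancelled cleanly at any await point:

```rust
use tokio_util::sync::CancellationToken;

// Hypothetical long-running job that stops as soon as the token is cancelled.
async fn run_until_cancelled(cancel: CancellationToken) {
    loop {
        tokio::select! {
            _ = cancel.cancelled() => {
                // Cleanup runs here; no thread has to be interrupted or joined.
                break;
            }
            _ = tokio::time::sleep(std::time::Duration::from_secs(1)) => {
                // One unit of (hypothetical) work per iteration.
            }
        }
    }
}
```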
Looks like a big number of threads is one of those :)
And in this case we will have "tens of thousands of threads" doing exactly that context switching. Do we know where our limits are? How many active computes can one safekeeper serve? There are some techniques from Scylla's IO scheduler, where it tries to find the sweet spot between throughput and latency so the disk is not overloaded and latency stays predictable. I can look for materials explaining this if you're interested.
It heavily depends on the concurrency, the size of tasks, etc. I've seen papers where people offload writes to persistent memory (where a write takes around 200 ns) to another thread and get a performance uplift partially based on that. The whole thing requires investing in benchmarking anyway. Going partially async is problematic, so the easiest solution to me is still to find ways to decrease the number of threads. An overloaded safekeeper (with 1000 active computes) looks like a bug too -- but rather in the control plane, which should be able to balance the load.
Sure. What I meant is that if all threads are in fsync, the disk obviously can't carry out any more requests, so giving out more execution time is (mostly) pointless if writing to disk is the only job.
That's why we have a thread-local executor.
We don't have that yet; the 4k limit (this issue) is an artificial limitation.
More investigation would be nice, but clearly this heavily depends on the workload. Even a dozen instances can fully utilize the disk, e.g. if they do COPY IN; OTOH, with a weak write load (almost read-only instances) I think up to 10k computes per safekeeper might be okay. Perhaps less.
Yes, that may be interesting.
That doesn't seem so obvious to me. Disk bandwidth is asymmetric: if you have n threads and all of them are in fsync, that doesn't mean the read bandwidth is saturated. Modern NVMes have many IO queues. What number of threads are we considering? One per core? One per core plus a thread pool? Thousands?
As I said, having thousands of threads, each running a single-threaded executor, is pointless. You don't use async's multiplexing capabilities this way; it is only the overhead of two schedulers.
To me it is already a problem: 4k constantly context-switching threads waste many more cycles on context switches than nproc threads would.
10k+ threads sounds really bad to me (even more if we count the ones that stream data to pageservers), especially with a single-threaded executor running in each one. Resources: https://www.p99conf.io/session/a-new-io-scheduler-algorithm-for-mixed-workloads/ -- a video of the P99 CONF talk; the written version is https://www.scylladb.com/2022/08/03/implementing-a-new-io-scheduler-algorithm-for-mixed-read-write-workloads/ (there are links inside the post to previous iterations of the algorithm).
This is a full switch: fs IO operations are also tokio ones, working through the thread pool. Similar to the pageserver, we have multiple runtimes for easier `top` usage and isolation. Notable points:

- Now that the guts of safekeeper.rs are full of .await's, we need to be very careful not to drop a task at a random point, leaving the timeline in an unclear state. Currently the only writer is the walreceiver and we don't have top-level cancellation there, so we are good. But to be safe we should probably add a fuse that panics if a task is dropped while an operation on a timeline is in progress.
- The timeline lock is a tokio one now, as we do disk IO under it.
- Collecting metrics got a crutch: since the prometheus Collector is synchronous, it spawns a thread with a current-thread runtime collecting data (see the sketch below).
- Anything involving closures becomes significantly more complicated, as async fns are already kinda closures, and async closures are unstable.
- The main thread now tracks the other main tasks, which got much easier.
- The only sync place left is initial data loading, since otherwise clippy complains about the timeline map lock being held across await points -- which is not bad here, as it happens only in the single-threaded runtime of the main thread. But having it sync doesn't hurt either.

I'm concerned about the performance of thread pool IO offloading, async traits, and the many await points, but we can try and see how it goes.

Fixes #3036, fixes #3966
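As an illustration of that metrics crutch, a minimal sketch under assumed names (`Stats` and `spawn_metrics_bridge` are hypothetical; the real code differs): a dedicated thread runs a small current-thread runtime that periodically copies data out from under the async lock into a sync mutex the synchronous Collector can read.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Hypothetical shape of the per-timeline stats.
type Stats = Vec<u64>;

// Bridge async state to a synchronous metrics collector: copy a snapshot once
// per (roughly) scrape period so the Collector never touches async locks.
fn spawn_metrics_bridge(
    async_state: Arc<tokio::sync::Mutex<Stats>>,
    sync_snapshot: Arc<Mutex<Stats>>,
) {
    std::thread::spawn(move || {
        let rt = tokio::runtime::Builder::new_current_thread()
            .enable_time()
            .build()
            .expect("failed to build metrics runtime");
        rt.block_on(async move {
            loop {
                let data = async_state.lock().await.clone();
                *sync_snapshot.lock().unwrap() = data;
                tokio::time::sleep(Duration::from_secs(10)).await;
            }
        });
    });
}
```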
Follow-up of #2722.
Essentially, due to tracing-subscriber usage we are limited to at most 4096 threads; exceeding that causes a panic.
I had a look at the tracing-subscriber and sharded-slab crates. The former depends on the latter; sharded-slab allows statically configuring `MAX_THREADS`, but tracing-subscriber just uses the default config of 4096.

See also tokio-rs/tracing#1485.
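For reference, a rough repro sketch of the limitation (the thread count, span names, and sleep are made up, and spawning thousands of OS threads is intentionally abusive, so don't run this casually). Each thread that creates a span registers itself in the registry's sharded slab, and with the default config going past 4096 live threads is expected to panic:

```rust
fn main() {
    use tracing_subscriber::prelude::*;

    // Registry-backed subscriber, i.e. the setup that hits the sharded-slab limit.
    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .init();

    let handles: Vec<_> = (0..5000)
        .map(|i| {
            std::thread::spawn(move || {
                // Creating a span forces per-thread registration in the registry.
                tracing::info_span!("conn", id = i).in_scope(|| {
                    tracing::info!("handling connection");
                });
                // Keep the thread alive so all of them exist simultaneously.
                std::thread::sleep(std::time::Duration::from_secs(30));
            })
        })
        .collect();
    for h in handles {
        let _ = h.join();
    }
}
```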