Blocking throughout pageserver in async contexts #2975

Comments
A step towards more async code in our repo, to help avoid most of the odd blocking calls that might deadlock, as mentioned in #2975.
Still, I think there's benefit in considering a single runtime with very few threads (1-2) and using, for example, 8 threads for the blocking pool. We most likely don't need that much async code going on in the first place. The blocking pool will be utilized regardless.
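For illustration, a minimal sketch of that shape of runtime, assuming a hand-rolled `main` rather than the pageserver's actual `task_mgr` runtimes; the thread name and sizes are illustrative only:

```rust
use tokio::runtime::Builder;

fn main() -> std::io::Result<()> {
    // One small runtime for all async work (not the pageserver's actual configuration).
    let runtime = Builder::new_multi_thread()
        .worker_threads(2)          // very few executor threads
        .max_blocking_threads(8)    // e.g. 8 threads for the blocking pool
        .thread_name("pageserver-rt")
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // Blocking work still goes through spawn_blocking and therefore uses
        // the capped blocking pool, regardless of the runtime's shape.
        let sum = tokio::task::spawn_blocking(|| (0u64..1_000).sum::<u64>())
            .await
            .expect("blocking task panicked");
        println!("sum = {sum}");
    });
    Ok(())
}
```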
This comment was moved to #4744 (comment).
A lot has happened since opening this issue. Closing it as a duplicate in favor of the two currently known remaining sources of executor stalling:
Before this PR, the `nix::poll::poll` call would stall the executor. This PR refactors the `walredo::process` module to allow for different implementations, and adds a new `async` implementation which uses `tokio::process::ChildStd{in,out}` for IPC. The `sync` variant remains the default for now; we'll do more testing in staging and a gradual rollout to prod using the config variable.

Performance
-----------

I updated `bench_walredo.rs`, demonstrating that a single `async`-based walredo manager used by N=1...128 tokio tasks has lower latency and higher throughput.

I further did manual, less-micro benchmarking in the real pageserver binary. Methodology & results are published here: https://neondatabase.notion.site/2024-04-08-async-walredo-benchmarking-8c0ed3cc8d364a44937c4cb50b6d7019?pvs=4

tl;dr:
- use pagebench against a pageserver patched to answer getpage requests, with a small-enough working set to fit into PS PageCache / kernel page cache
- compare the knee in the latency/throughput curve
- N tenants, each with 1 pagebench client
- sync has better throughput at N < 30, async is better at higher N
- async has generally noticeable, but not much worse, p99.X tail latencies
- eyeballing CPU efficiency in htop, `async` seems significantly more CPU efficient at ca. N=[0.5*ncpus, 1.5*ncpus], and worse than `sync` outside of that band

Mental Model For Walredo & Scheduler Interactions
-------------------------------------------------

Walredo is CPU-/DRAM-only work. This means that as soon as the Pageserver writes to the pipe, the walredo process becomes runnable.

To the Linux kernel scheduler, the `$ncpus` executor threads and the walredo process thread are just `struct task_struct`, and it will divide CPU time fairly among them.

In `sync` mode, there are always `$ncpus` runnable `struct task_struct` because the executor thread blocks while `walredo` runs, and the executor thread becomes runnable when the `walredo` process is done handling the request. In `async` mode, the executor threads remain runnable unless there are no more runnable tokio tasks, which is unlikely in a production pageserver.

The above means that in `sync` mode, there is an implicit concurrency limit on concurrent walredo requests (`$num_runtimes * $num_executor_threads_per_runtime`), and executor threads do not compete in the Linux kernel scheduler for CPU time, due to the blocked-runnable ping-pong. In `async` mode, there is no concurrency limit, and the walredo tasks compete with the executor threads for CPU time in the kernel scheduler.

If we're not CPU-bound, `async` has a pipelining and hence throughput advantage over `sync`, because one executor thread can continue processing requests while a walredo request is in flight.

If we're CPU-bound, under a fair CPU scheduler, the *fixed* number of executor threads has to share CPU time with the aggregate of walredo processes. It's trivial to reason about this in `sync` mode due to the blocked-runnable ping-pong. In `async` mode, at 100% CPU, the system arrives at some (potentially sub-optimal) equilibrium where the executor threads get just enough CPU time to fill up the remaining CPU time with runnable walredo processes.

Why `async` mode Doesn't Limit Walredo Concurrency
--------------------------------------------------

To control that equilibrium in `async` mode, one may add a tokio semaphore to limit the number of in-flight walredo requests. However, the placement of such a semaphore is non-trivial, because it means that tasks queuing up behind it hold on to their request-scoped allocations.

In the case of walredo, that might be the entire reconstruct data. We don't limit the total number of in-flight Timeline::get (we only throttle admission), so that queue might lead to an OOM.

The alternative is to acquire the semaphore permit *before* collecting reconstruct data. However, what if we need to do an on-demand download?

A combination of semaphores might help: one for reconstruct data, one for walredo. The reconstruct data semaphore permit is dropped after acquiring the walredo semaphore permit. This scheme effectively enables both a limit on in-flight reconstruct data and a limit on walredo concurrency. However, sizing the number of permits for the semaphores is tricky:

- Reconstruct data retrieval is a mix of disk IO and CPU work.
- If we need to do on-demand downloads, it's network IO + disk IO + CPU work.
- At this time, we have no good data on how the wall clock time is distributed.

It turns out that, in my benchmarking, the system worked fine without a semaphore. So, we're shipping async walredo without one for now.

Future Work
-----------

We will do more testing of `async` mode and a gradual rollout to prod using the config flag. Once that is done, we'll remove `sync` mode to avoid the temporary code duplication introduced by this PR, and the flag will be removed.

The `wait()` for the child process to exit is still synchronous; the comment [here](https://github.com/neondatabase/neon/blob/655d3b64681b6562530665c9ab5f2f806f30ad01/pageserver/src/walredo.rs#L294-L306) is still a valid argument in favor of that.

The `sync` mode had another implicit advantage: from tokio's perspective, the calling task was using up coop budget. But with `async` mode, that's no longer the case -- to tokio, the writes to the child process pipe look like IO. We could/should inform tokio about the CPU time budget consumed by the task to achieve fairness similar to `sync`. However, the [runtime function for this is `tokio_unstable`](https://docs.rs/tokio/latest/tokio/task/fn.consume_budget.html).

Refs
----

refs #6628
refs #2975
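To make the two-semaphore idea concrete, here is a hedged sketch, not code from this PR: the struct, field names, and permit usage below are made up for illustration; only the ordering (acquire reconstruct-data permit, then walredo permit, then drop the former) comes from the text above.

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Illustrative limits only; the PR text notes we lack data to size these well.
struct WalredoLimits {
    reconstruct_permits: Arc<Semaphore>, // bounds in-flight reconstruct data
    walredo_permits: Arc<Semaphore>,     // bounds in-flight walredo requests
}

impl WalredoLimits {
    async fn get_page(&self) -> Vec<u8> {
        // Hold a reconstruct-data permit while gathering data
        // (disk IO, possibly on-demand downloads, some CPU work).
        let reconstruct_guard = self
            .reconstruct_permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        let reconstruct_data = vec![0u8; 8192]; // stand-in for collecting reconstruct data

        // Switch to a walredo permit before the CPU-bound redo; dropping the
        // reconstruct permit frees capacity for other tasks to gather data.
        let _walredo_guard = self
            .walredo_permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed");
        drop(reconstruct_guard);

        // stand-in for sending `reconstruct_data` to the walredo process
        reconstruct_data
    }
}
```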
During #2875 I hit an interesting deadlock, which is reproduced at this branch. I managed to find multiple "waker woken, but deadlock" cases. The most interesting was:

A thread of `task_mgr::BACKGROUND_RUNTIME` was executing periodic compaction from `pageserver::tenant_tasks`, while the same thread had previously executed the async walredo task. The thread became deadlocked on a blocking `flume::Receiver::recv`, because the async walredo task was blocked on the same runtime worker thread, in its LIFO slot. The async walredo task would eventually have been the one to process the `Timeline::get`.

I think the repro commit does have other deadlocks, but I failed to really understand them. Originally I suspected a thread blocked on locking `Timeline::layer_removal_cs` while having the async walredo task in its LIFO slot, but I haven't seen that in `rr`, at least.
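As a minimal, self-contained sketch of this class of hazard (not the actual pageserver code paths; it is made deterministic here by using a `current_thread` runtime, whereas the real deadlock needed the multi-threaded runtime's LIFO slot):

```rust
// Cargo deps assumed: tokio (with rt + macros features) and flume.
#[tokio::main(flavor = "current_thread")]
async fn main() {
    let (tx, rx) = flume::bounded::<u32>(0);

    // The only task that will ever send -- it can only make progress if the
    // (single) worker thread gets back to polling tasks.
    tokio::spawn(async move {
        tx.send_async(42).await.unwrap();
    });

    // Blocking the worker thread inside async code: the sender above never
    // runs, so this recv never returns. On a multi-threaded runtime the same
    // pattern hangs whenever the sender sits in this worker's LIFO slot.
    let value = rx.recv().unwrap(); // deadlock
    println!("{value}");
}
```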
More reproduction info:
- `NEON_LOCAL_RECORD_COMMANDS="pageserver"` support if you want to try your hand at `rr` (closed: neon_local: allow rr recordings via env var #2958)
- `cargo build_testing`
- `./scripts/pytest --capture=no test_runner/regress/test_branch_and_gc.py::test_branch_and_gc`
I don't think this is an issue with `tokio`, though they are working on making the LIFO slot stealable in the future. I think it's an issue of blocking in our codebase. More precisely, I think it's the many paths that are called from async context but which forbid using async, like in the `deadlock-repro` branch.

The straightforward solution is to never block on tokio runtime threads. That is, however, difficult to implement, and it will likely come with some performance loss, at least with the current tokio version we use. A "test case" I've used in the past to keep the blocking in check is to run the whole app within one `current_thread` runtime instead of multiple runtimes with multiple threads; a sketch of that setup follows below. We should be able to do this with the regression tests. I think it should be a long-term goal.
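A hedged sketch of how that test-only constraint could look; the env var name `NEON_SINGLE_THREAD_RUNTIME` and this structure are hypothetical, not something the repo currently has:

```rust
use tokio::runtime::{Builder, Runtime};

// Hypothetical helper: regression tests would set NEON_SINGLE_THREAD_RUNTIME=1
// so that any accidental blocking call in async code shows up as a hang
// instead of being masked by spare worker threads.
fn build_runtime() -> std::io::Result<Runtime> {
    if std::env::var_os("NEON_SINGLE_THREAD_RUNTIME").is_some() {
        Builder::new_current_thread().enable_all().build()
    } else {
        Builder::new_multi_thread().enable_all().build()
    }
}

fn main() -> std::io::Result<()> {
    let rt = build_runtime()?;
    rt.block_on(async {
        // start the pageserver / test workload here
    });
    Ok(())
}
```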
During #2875 I tried sprinkling `tokio::task::block_in_place` and `tokio::task::spawn_blocking` around. The use of `tokio::task::block_in_place` has the downside of making it impossible to aim for the `current_thread`-flavored executor. Eventually the tests passed, but it's difficult to estimate how complete a solution this is. In the course of the sprinkling, I used a very rough tool to help guide the process, but it's a prolonged process and I didn't complete it.
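For reference, a sketch of the difference between the two escape hatches mentioned above; the blocking function `read_page_blocking` is a made-up stand-in for real call sites:

```rust
// Stand-in for a blocking call site (e.g. synchronous file IO or a blocking recv).
fn read_page_blocking() -> Vec<u8> {
    std::thread::sleep(std::time::Duration::from_millis(10));
    vec![0u8; 8192]
}

// Offloads the work to tokio's blocking thread pool; works on both the
// multi_thread and current_thread runtime flavors.
async fn via_spawn_blocking() -> Vec<u8> {
    tokio::task::spawn_blocking(read_page_blocking)
        .await
        .expect("blocking task panicked")
}

// Runs the work on the current worker thread and tells the runtime to move
// other tasks off it. This panics on a current_thread runtime, which is why
// sprinkling it rules out the "one current_thread runtime" test setup.
async fn via_block_in_place() -> Vec<u8> {
    tokio::task::block_in_place(read_page_blocking)
}

#[tokio::main(flavor = "multi_thread")]
async fn main() {
    let a = via_spawn_blocking().await;
    let b = via_block_in_place().await;
    assert_eq!(a.len(), b.len());
}
```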