assert failed: cx_core.is_none() #5239
Comments
`RUST_BACKTRACE=full` with tokio 1.22.0: …

---
We have the same problem in our private repository, and it is very hard to debug because it is nested deep inside some really heavyweight logic (tens of minutes to re-run). The main problem is that this panic is highly flaky and hard to reproduce.

---
I did not try it yet but am hoping to try this week; catch the bug happening in an …

---

Neglected to mention in the issue: we have multiple runtimes in our codebase as well (because of the blocking in async contexts).

---
With …

---
The assertion you triggered certainly shouldn't happen under normal circumstances.

---
Found the single-file reproducer/MVCE. When run with … Please find it at https://github.com/koivunej/tokio-nested-block_in_place. Expected output:
Checked tokio versions:
With an AMD Ryzen 9 3900XT it reproduces regardless of …
Confirmed on:
---
Maybe it can still be slimmed down, but I don't understand how it can be this small given how difficult it was to catch originally:

```rust
#[tokio::main]
async fn main() {
    use tokio::{
        runtime::Handle,
        task::{block_in_place, spawn},
    };

    let res = spawn(async move {
        block_in_place(|| {
            Handle::current().block_on(async {
                block_in_place(|| {});
            });
        });
    })
    .await;

    let e = match res {
        Ok(()) => unreachable!("no panic?"),
        Err(e) => e,
    };

    let panic = e.into_panic();
    let s = *panic.downcast_ref::<&'static str>().unwrap();
    assert_eq!("assertion failed: cx_core.is_none()", s);
}
```

---
Okay, it seems this is simply beyond what was fixed in #2645. I had assumed that nested …

---
I'd also expect …

---
Yes, it should work at any level of nesting as long as the …

---
Looking at the existing tests, I was thinking maybe the expectation is to use a different or new current-thread runtime or futures executor to drive the more nested `tokio::task::block_in_place` (tokio/tokio/tests/task_blocking.rs, lines 212 to 215 in c693ccd).
So I don't really understand the response here. Still, I'm no longer looking at this approach; I've always thought the …

---
Using …

---
I'm facing a similar issue, but I can't quite figure out what's triggering it. I thought it was a combination of …

Looking at the backtrace, it seems this happens in the …

I guess this is evident, but I figured I'd write it out: it seems like some logic is putting back the … (tokio/tokio/src/runtime/scheduler/multi_thread/worker.rs, lines 391 to 395 in ec1f52e)
Edit: I might've gotten confused. It is expected that …

---
I added this log line before the assert:

```rust
println!(
    "Reset dropped worker core is_some? {}, context core is_some? {}",
    core.is_some(),
    cx_core.is_some()
);
```

When the assert failed, I got:
I'm assuming it's a problem that the core is put back when it's not expected to be put back. It sounds like some blocking task might've run on a thread that wasn't marked as blocking anymore.

Edit: I confirmed that the …

---
Some more debugging: added thread ID and name, plus a thread-local AtomicU16 acting as some kind of ID for the …
Compared to one where no panic occurs:
Everything is the same up until line 19 of each log output where, in the non-panicking output, there's no worker core (and still no context core). Somehow that prevents it from panicking. It seems odd to me that the worker's core would ever get put back, outside of …

My branch with the …

---
This patch fixes a bug where nesting `block_in_place` with a `block_on` between could lead to a panic. This happened because the nested `block_in_place` would try to acquire a core on return when it should not attempt to do so. The `block_on` between the two nested `block_in_place` altered the thread-local state to lead to the incorrect behavior. The fix is for each call to `block_in_place` to track if it needs to try to steal a core back. Fixes #5239
Version
tokio 1.21.1, with grep results prettified:
We use Rust 1.62.1.

Platform
Initially detected from a CI run on:
Linux hostname 5.10.144-127.601.amzn2.x86_64 #1 SMP Thu Sep 29 01:11:59 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Reproduced on:
Linux hostname 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Description
Assertion failure message with release build:
In our codebase this is, after a few tries, reproducible under load locally for me. Load as in, for example:

```shell
while true; do cargo clean && cargo build; done
```

running in the background. I can try out some patches if needed. I haven't been able to find an MVCE.
Full steps to reproduce in our codebase
Expect to see:
Then you will find the assertion failure in `test_output/test_gc_aggressive/repo/pageserver.log`. I have also copied the full stacktrace to the next `<details>`.

RUST_BACKTRACE=full of the assertion failure
In this branch, I've sprinkled a lot of `block_in_place` around the blocking parts after I ran into a deadlock caused by the unstealable LIFO slot, because there was blocking within the runtime threads. It is unlikely that I've caught all places of blocking within async context. If I ended up misusing `block_in_place` and `block_on`, then I wish the assertion would have a clear message about the misuse. However, since it only triggers under external load (and while being `nice -20`), I suspect it is a real tokio issue.