Force test failure on async task panic (for now) #2479

thomcc · 2022-02-01T18:03:55Z

Currently our tests occasionally hang in CI. I think I've seen it happen at least 3-4 times in the past week, so it's not exactly rare.

I believe the main reason this happens is because a task (or worker) panics, which is currently swallowed by tokio (tokio-rs/tokio#2002). This kind of issue is the kind of issue we should work out.

There are a few options:

Restructure code so that we handle task panics in a more principled way. Ultimately, we should do this, but I think it's not critical in the short term.
When running tests, wrap spawn(...) in a wrapper that forces test failure if the task panics.
- For example, there is likely a more elegant way, but you could do by wrapping it in catch_unwind, and sending a message on a channel in the case of panics. Who listens to this channel would have to be worked out, but an easy answer is that we make it happen in the test wrapper.
When running tests under nightly, use RUSTFLAGS="-Cpanic=abort -Zpanic-abort-tests" (see panic=abort testing / subprocess testing rust-lang/rust#67650). This is unstable, but will cause panics emitted anywhere during tests to immediately abort the process.
- The -Zpanic-abort-tests is an unstable feature required for the test runner to handle this abort gracefully -- otherwise, we might not even notice!
Something else? @spacekookie may have some thoughts on other options.

As it stands, given that we have to support no_std and probably other environments where we build with -Cpanic=abort (which isn't that rarely enabled of a flag), we should probably assume any panic is a fatal bug in our code, at least for the time being.

Anyway, I'm going to poke at 3, since it seems much less work than 2, and would (if nothing else) upgrade some CI hangs to actual test failures.

Long term thoughts?

However, in the long term, I think if/when -Cpanic=abort (or similar) is not enabled, we'd probably like users to be able to build workers which are robust to panics, since this kind of isolation is (sometimes) considered a hallmark of actor models. This seems like something we have some time to punt on though, as it is mostly desirable for panics caused by user code (rather than issues in our code -- ideally we should return Err rather than panic).

The text was updated successfully, but these errors were encountered:

thomcc · 2022-02-01T19:18:59Z

@spacekookie I think you're more informed than I am here, and probably have some opinions or plans around how you imagine this to be handled in the node runtime, so please feel free to leave any thoughts (even/especially if you disagree with what I've written!)

spacekookie · 2022-02-02T12:21:08Z

This is a tricky problem. My experience with test-wrappers that handle panics in a special way has so far mostly been negative, as it adds a load on maintenance burden to the project and corner-cases that may only exist in CI but not in production settings.

Going down this line of thinking though I'm wondering if we could make use of std::panic::set_hook to override whatever tokio is doing to catch panics in a way that doesn't break everything. But again: I think this is a good way to introduce unbound side effects into the system.

Anyway, my general approach to panics is that we shouldn't ever have one. So making panic=abort the default will lead to a lot of crashes, but will also force us to make the codebase more robust to failures.

As for the information we'll be losing: I think we have to face the logging and diagnostics problem already because on no_std environments we can't rely on pretty stack traces, so we shouldn't rely on them on "full" deployments either.

thomcc · 2022-02-22T23:34:58Z

Anyway, my general approach to panics is that we shouldn't ever have one. So making panic=abort the default will lead to a lot of crashes, but will also force us to make the codebase more robust to failures.

Yes, we're in agreement here. The reason for doing a special thing in CI is because panic=abort doesn't work with cargo test on stable, you need to pass a -Zpanic-abort-tests flag.

bsbds mentioned this issue Jun 29, 2023

[Bug]: The test_no_root_user_do_admin_ops will pass despite the panic xline-kv/Xline#267

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Force test failure on async task panic (for now) #2479

Force test failure on async task panic (for now) #2479

thomcc commented Feb 1, 2022 •

edited

Loading

thomcc commented Feb 1, 2022

spacekookie commented Feb 2, 2022

thomcc commented Feb 22, 2022

Force test failure on async task panic (for now) #2479

Force test failure on async task panic (for now) #2479

Comments

thomcc commented Feb 1, 2022 • edited Loading

Long term thoughts?

thomcc commented Feb 1, 2022

spacekookie commented Feb 2, 2022

thomcc commented Feb 22, 2022

thomcc commented Feb 1, 2022 •

edited

Loading