Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Force test failure on async task panic (for now) #2479

Open
thomcc opened this issue Feb 1, 2022 · 3 comments
Open

Force test failure on async task panic (for now) #2479

thomcc opened this issue Feb 1, 2022 · 3 comments

Comments

@thomcc
Copy link
Contributor

thomcc commented Feb 1, 2022

Currently our tests occasionally hang in CI. I think I've seen it happen at least 3-4 times in the past week, so it's not exactly rare.

I believe the main reason this happens is because a task (or worker) panics, which is currently swallowed by tokio (tokio-rs/tokio#2002). This kind of issue is the kind of issue we should work out.

There are a few options:

  1. Restructure code so that we handle task panics in a more principled way. Ultimately, we should do this, but I think it's not critical in the short term.

  2. When running tests, wrap spawn(...) in a wrapper that forces test failure if the task panics.

    • For example, there is likely a more elegant way, but you could do by wrapping it in catch_unwind, and sending a message on a channel in the case of panics. Who listens to this channel would have to be worked out, but an easy answer is that we make it happen in the test wrapper.
  3. When running tests under nightly, use RUSTFLAGS="-Cpanic=abort -Zpanic-abort-tests" (see panic=abort testing / subprocess testing rust-lang/rust#67650). This is unstable, but will cause panics emitted anywhere during tests to immediately abort the process.

    • The -Zpanic-abort-tests is an unstable feature required for the test runner to handle this abort gracefully -- otherwise, we might not even notice!
  4. Something else? @spacekookie may have some thoughts on other options.

As it stands, given that we have to support no_std and probably other environments where we build with -Cpanic=abort (which isn't that rarely enabled of a flag), we should probably assume any panic is a fatal bug in our code, at least for the time being.

Anyway, I'm going to poke at 3, since it seems much less work than 2, and would (if nothing else) upgrade some CI hangs to actual test failures.

Long term thoughts?

However, in the long term, I think if/when -Cpanic=abort (or similar) is not enabled, we'd probably like users to be able to build workers which are robust to panics, since this kind of isolation is (sometimes) considered a hallmark of actor models. This seems like something we have some time to punt on though, as it is mostly desirable for panics caused by user code (rather than issues in our code -- ideally we should return Err rather than panic).

@thomcc
Copy link
Contributor Author

thomcc commented Feb 1, 2022

@spacekookie I think you're more informed than I am here, and probably have some opinions or plans around how you imagine this to be handled in the node runtime, so please feel free to leave any thoughts (even/especially if you disagree with what I've written!)

@spacekookie
Copy link
Contributor

This is a tricky problem. My experience with test-wrappers that handle panics in a special way has so far mostly been negative, as it adds a load on maintenance burden to the project and corner-cases that may only exist in CI but not in production settings.

Going down this line of thinking though I'm wondering if we could make use of std::panic::set_hook to override whatever tokio is doing to catch panics in a way that doesn't break everything. But again: I think this is a good way to introduce unbound side effects into the system.

Anyway, my general approach to panics is that we shouldn't ever have one. So making panic=abort the default will lead to a lot of crashes, but will also force us to make the codebase more robust to failures.

As for the information we'll be losing: I think we have to face the logging and diagnostics problem already because on no_std environments we can't rely on pretty stack traces, so we shouldn't rely on them on "full" deployments either.

@thomcc
Copy link
Contributor Author

thomcc commented Feb 22, 2022

Anyway, my general approach to panics is that we shouldn't ever have one. So making panic=abort the default will lead to a lot of crashes, but will also force us to make the codebase more robust to failures.

Yes, we're in agreement here. The reason for doing a special thing in CI is because panic=abort doesn't work with cargo test on stable, you need to pass a -Zpanic-abort-tests flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants