
Conversation

MichaReiser (Member) commented Apr 25, 2025

Summary

Fixes #17537

Red Knot's CLI sometimes hangs after encountering a panic. This was due to non-determinism in how panics are propagated by rayon and Salsa.

  • Our main loop spawns a job to run the check command so that it stays responsive to Ctrl+C and incoming file-watcher changes.
  • This thread is spawned using rayon::spawn. Rayon terminates the entire process if a task scheduled with rayon::spawn panics (we don't have any other panic handling today).
  • This thread immediately calls db.check, which calls project.check wrapped in a salsa::Cancelled::catch. I assumed that this only catches cancellations caused by a pending write (e.g. a file-watcher change), but it turns out that it also catches panics if thread A depends on a query running in thread B and thread B panics.
  • The actual checking happens in project.check, where we use a thread pool and spawn a task for every file.
  • project.check uses rayon::scope: unlike rayon::spawn, it propagates panics instead of terminating the process. However, it propagates an arbitrary one of the panics if two threads panic (which is the case if thread A depends on a query in thread B and B panics); see the sketch after this list.
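
For illustration, a minimal, self-contained sketch (not the Red Knot code; it only needs the rayon crate) of the difference in panic behavior between rayon::spawn and rayon::scope:

```rust
fn main() {
    // `rayon::scope` catches a panic in one of its tasks and re-raises it
    // in the calling thread when the scope ends, so the caller can observe
    // it with `catch_unwind`.
    let result = std::panic::catch_unwind(|| {
        rayon::scope(|s| {
            s.spawn(|_| panic!("panic inside the scope"));
        });
    });
    assert!(result.is_err());
    eprintln!("rayon::scope propagated the panic to the caller");

    // By contrast, a panic in a task scheduled with `rayon::spawn` has no
    // caller to unwind into; without a panic handler installed on the
    // thread pool, rayon aborts the whole process (left commented out).
    // rayon::spawn(|| panic!("this would abort the process"));
}
```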

What happened is that rayon sometimes propagated the Cancelled::PropagatedPanic panic (the panic salsa raises in thread A when it depends on the panicking query running in thread B) instead of the actual panic from thread B, and that placeholder panic then got swallowed by our Cancelled::catch wrapper inside db.check.
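
As a self-contained illustration of that race (not the actual Red Knot or Salsa code; PropagatedPanic below is a stand-in for salsa's placeholder payload):

```rust
use std::panic::{catch_unwind, panic_any, resume_unwind};

// Stand-in for the placeholder payload salsa uses to unwind a thread whose
// dependency panicked (think salsa::Cancelled::PropagatedPanic).
struct PropagatedPanic;

fn main() {
    // Two tasks panic inside the scope: one with the real panic, one with
    // the placeholder. rayon re-raises an arbitrary one of the two.
    let outcome = catch_unwind(|| {
        rayon::scope(|s| {
            s.spawn(|_| panic!("real panic from thread B"));
            s.spawn(|_| panic_any(PropagatedPanic));
        });
    });

    match outcome {
        // If the placeholder wins the race, the outer layer treats it like
        // a benign cancellation and the real panic from thread B is lost.
        Err(payload) if payload.is::<PropagatedPanic>() => {
            eprintln!("swallowed as a cancellation; the real panic is lost");
        }
        // Otherwise the real panic surfaces and gets reported as usual.
        Err(payload) => resume_unwind(payload),
        Ok(()) => {}
    }
}
```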

We should probably change Salsa to use a different mechanism for propagating panics between dependent threads, because a panic is a logic error, whereas a pending write is not.

This PR fixes the hang by repeating the Cancelled::catch inside each spawned thread. This has the advantage that we propagate the real panic from thread B and never the placeholder Cancelled::PropagatedPanic from thread A (which doesn't contain any useful debug information). More specifically, we now catch panics inside check_file_impl and create a diagnostic for them, for example:

error: panic: Panicked while checking `/Users/micha/astral/ecosystem/hydpy/hydpy/core/devicetools.py`: `dependency graph cycle querying try_metaclass_(Id(250b7)); set cycle_fn/cycle_initial to fixpoint iterate`
info: This indicates a bug in Red Knot.
info: If you could open an issue at https://github.com/astral-sh/ruff/issues/new?title=%5BRed%20Knot%20panic%5D, we'd be very appreciative!

Ideally, we'd capture the backtrace too but that's a) more complicated and b) mostly empty in production builds.
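
As a rough, self-contained sketch of that recovery pattern (names and messages are made up; the actual fix goes through Cancelled::catch inside each spawned thread rather than a bare catch_unwind):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical stand-in for the real per-file check; it panics for one
// input so the recovery path below has something to catch.
fn check_file_impl(path: &str) -> Vec<String> {
    if path.ends_with("devicetools.py") {
        panic!("dependency graph cycle querying try_metaclass_");
    }
    vec![]
}

// Run the check behind a panic boundary and turn a panic into a
// diagnostic-style message instead of letting it unwind further.
fn check_file(path: &str) -> Vec<String> {
    match catch_unwind(AssertUnwindSafe(|| check_file_impl(path))) {
        Ok(diagnostics) => diagnostics,
        Err(payload) => {
            // Panic payloads are usually a `&str` or a `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic payload".to_string());
            vec![format!(
                "error: panic: Panicked while checking `{path}`: `{message}`"
            )]
        }
    }
}

fn main() {
    for diagnostic in check_file("hydpy/core/devicetools.py") {
        eprintln!("{diagnostic}");
    }
}
```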

Test Plan

I ran Red Knot on hydpy for a couple of minutes and couldn't reproduce the hang anymore (it normally reproduces after 30s or so).

@MichaReiser added the ty (Multi-file analysis & type inference) label on Apr 25, 2025
github-actions bot (Contributor) commented Apr 25, 2025

mypy_primer results

No ecosystem changes detected ✅

@MichaReiser marked this pull request as ready for review on April 25, 2025 16:31
@AlexWaygood removed their request for review on April 25, 2025 16:38
@MichaReiser marked this pull request as draft on April 25, 2025 16:41
@MichaReiser force-pushed the micha/cli-hang branch 2 times, most recently from 8e3a07b to c85b945, on April 25, 2025 17:13
@MichaReiser marked this pull request as ready for review on April 25, 2025 17:13
@AlexWaygood changed the title from "[red-knot] Fix CLI hang when a dependend query panics" to "[red-knot] Fix CLI hang when a dependent query panics" on Apr 25, 2025
@MichaReiser force-pushed the micha/cli-hang branch 3 times, most recently from c53a283 to b34bcd6, on April 25, 2025 17:48
carljm (Contributor) left a comment

Nice! Thank you so much for tracking this down.

The fix looks good, and also provides a better user experience for panics.

Is my understanding correct that this means new ecosystem panics will now again not show up as a failure in the ecosystem job, and instead just as a diagnostic output diff? I think that's OK, but it does mean we need to be careful to check ecosystem output on our diffs.

"This indicates a bug in Red Knot.",
));

let report_message = "If you could open an issue at https://github.com/astral-sh/ruff/issues/new?title=%5BRed%20Knot%20panic%5D, we'd be very appreciative!";

nit: today we use the [red-knot] prefix on all our issues and PRs; can we stay consistent with that? e.g. [red-knot] panic: maybe?

Of course we'll have to change this again soon :)

carljm (Contributor) commented Apr 25, 2025

Looks like clippy is not happy?

MichaReiser (Member, Author)

That's correct. But panics come first with Andrew's new sorting and should be easy to discover because of that (unless the output gets truncated).

@MichaReiser enabled auto-merge (squash) on April 26, 2025 06:26
@MichaReiser merged commit cfa1505 into main on Apr 26, 2025 (32 checks passed)
@MichaReiser deleted the micha/cli-hang branch on April 26, 2025 06:28
sharkdp (Contributor) commented Apr 28, 2025

Is my understanding correct that this means new ecosystem panics will now again not show up as a failure in the ecosystem job, and instead just as a diagnostic output diff? I think that's OK, but it does mean we need to be careful to check ecosystem output on our diffs.

Maybe we could still exit with a code different from 1 (type checking failed) and 2 (some other Red Knot error), like before? That way, it would still be easy to detect panics (requiring no changes in mypy_primer).

sharkdp (Contributor) commented Apr 28, 2025

Maybe we could still exit with a code different from 1 (type checking failed) and 2 (some other Red Knot error), like before? That way, it would still be easy to detect panics (requiring no changes in mypy_primer).

Oh, you already proposed that change in #17640. In that case, mypy_primer CI runs will still fail in case of panics.

Edit: well, not quite, still uses error code 2

MichaReiser (Member, Author)

Edit: well, not quite, still uses error code 2

We could change the error code to something else but 2 is what Ruff/Red Knot already used for other errors.
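
Purely as a hypothetical sketch of how a dedicated exit code for panics could sit next to the existing ones (this is not the actual Red Knot CLI code, and the specific code 3 is made up):

```rust
#[allow(dead_code)]
enum CheckOutcome {
    Clean,
    TypeCheckingFailed,
    OtherError,
    PanicDiagnosticEmitted,
}

fn exit_code(outcome: CheckOutcome) -> i32 {
    match outcome {
        CheckOutcome::Clean => 0,
        // 1: type checking found errors; 2: some other error (as today).
        CheckOutcome::TypeCheckingFailed => 1,
        CheckOutcome::OtherError => 2,
        // A distinct code would let mypy_primer flag panics without
        // having to parse the diagnostic output.
        CheckOutcome::PanicDiagnosticEmitted => 3,
    }
}

fn main() {
    std::process::exit(exit_code(CheckOutcome::Clean));
}
```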


Successfully merging this pull request may close these issues.

[red-knot] Panic leads to hang
