[red-knot] Fix CLI hang when a dependent query panics #17631
Conversation
Force-pushed from 8e3a07b to c85b945
Force-pushed from c85b945 to 49e9981
Force-pushed from c53a283 to b34bcd6
carljm left a comment
Nice! Thank you so much for tracking this down.
The fix looks good, and also provides a better user experience for panics.
Is my understanding correct that this means new ecosystem panics will once again not show up as a failure in the ecosystem job, but only as a diagnostic output diff? I think that's OK, but it does mean we need to be careful to check ecosystem output on our diffs.
crates/red_knot_project/src/lib.rs (outdated)
| "This indicates a bug in Red Knot.", | ||
| )); | ||
|
|
||
| let report_message = "If you could open an issue at https://github.com/astral-sh/ruff/issues/new?title=%5BRed%20Knot%20panic%5D, we'd be very appreciative!"; |
nit: today we use the `[red-knot]` prefix on all our issues and PRs; can we stay consistent with that? E.g. `[red-knot] panic:` maybe?
Of course we'll have to change this again soon :)
Looks like clippy is not happy?
That's correct. But panics come first with Andrew's new sorting, so they should be easy to discover (unless the output gets truncated).
Force-pushed from b34bcd6 to 7de4d9f
Force-pushed from 7de4d9f to db36802
Maybe we could still exit with a code different from 1 (type checking failed) and 2 (some other Red Knot error), like before? That way it would still be easy to detect panics, requiring no changes in mypy_primer.
Oh, you already proposed that change in #17640. In that case, mypy_primer CI runs will still fail in case of panics. Edit: well, not quite; it still uses exit code 2.
We could change the exit code to something else, but 2 is what Ruff/Red Knot already uses for other errors.
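For illustration, a hedged sketch of the exit-code idea being discussed. The `CheckOutcome` enum and the code 3 for panics are hypothetical, invented for this example; per the comments above, the CLI currently keeps using 2.

```rust
use std::process::ExitCode;

// Hypothetical outcome type; codes 1 and 2 match the meanings mentioned in
// this thread, while a dedicated code 3 for panics is invented here.
#[allow(dead_code)]
enum CheckOutcome {
    Success,
    TypeCheckFailures,
    OtherError,
    Panicked,
}

fn exit_code(outcome: CheckOutcome) -> ExitCode {
    match outcome {
        CheckOutcome::Success => ExitCode::SUCCESS,
        CheckOutcome::TypeCheckFailures => ExitCode::from(1),
        CheckOutcome::OtherError => ExitCode::from(2),
        // A distinct code would let tools like mypy_primer detect panics
        // without parsing diagnostic output.
        CheckOutcome::Panicked => ExitCode::from(3),
    }
}

fn main() -> ExitCode {
    exit_code(CheckOutcome::Success)
}
```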
Summary
Fixes #17537
Red Knot's CLI sometimes hangs after encountering a panic. This was due to non-determinism in how panics are propagated by rayon and Salsa:
- The main loop thread is responsible for handling `Ctrl + C` commands and incoming file watcher changes, so the actual check runs on a separate thread scheduled with `rayon::spawn`. Rayon terminates the entire process if the task scheduled with `rayon::spawn` panics (we don't have any other panic handling today).
- The spawned task calls `db.check` right away, which calls `project.check`, but is wrapped in a `salsa::Cancelled::catch`. I assumed that this only catches cancellations because of a pending write (e.g. a file watcher change), but it turns out that it also catches panics if thread A depends on a query running in thread B and thread B panics.
- Checking the individual files happens in `project.check`, where we use a thread pool and spawn a task for every file.
- `project.check` uses `rayon::scope`: unlike `rayon::spawn`, it propagates panics instead of terminating the process. However, it propagates an arbitrary panic if two threads panic (which is the case if thread A depends on a query in thread B and B panics).

What happened is that sometimes rayon propagated the `Cancelled::PropagatedPanic` panic (the panic raised by Salsa in thread A, which depends on the panicking query running in thread B) over the actual panic in thread B, and that placeholder panic then got swallowed by our `Cancelled::catch` wrapper inside `db.check`.

We should probably change Salsa to use a different mechanism for handling thread-dependent panics, because a panic in a dependent query is a logical error, whereas a pending write is not.
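To make the rayon behavior above concrete, here is a minimal standalone sketch (it assumes only the `rayon` crate and is not Red Knot code): a panic inside `rayon::scope` is re-raised on the calling thread, where it can be caught, and when several tasks panic, which payload wins is not guaranteed.

```rust
use std::panic;

fn main() {
    // A panic in any task spawned inside `rayon::scope` is re-raised on the
    // calling thread when the scope ends, so `catch_unwind` can observe it.
    let result = panic::catch_unwind(|| {
        rayon::scope(|s| {
            s.spawn(|_| panic!("panic in task A"));
            s.spawn(|_| panic!("panic in task B"));
        });
    });

    // When more than one task panics, rayon propagates only one of the
    // payloads, and which one wins is not guaranteed (the same ambiguity
    // that made the hang intermittent).
    if let Err(payload) = result {
        let message = payload
            .downcast_ref::<&str>()
            .copied()
            .unwrap_or("<non-string payload>");
        println!("scope propagated: {message}");
    }

    // By contrast, a panic in a task started with `rayon::spawn` has no
    // caller to propagate to; without a handler installed via
    // `ThreadPoolBuilder::panic_handler`, the process aborts. Left commented
    // out for that reason:
    // rayon::spawn(|| panic!("this would abort the process"));
}
```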
This PR fixes the hang by repeating the `Cancelled::catch` inside each spawned thread. This has the advantage that we propagate the real panic from thread B and never the placeholder `Cancelled::PropagatedPanic` from thread A (which doesn't contain any useful debug information). More specifically, we now catch panics inside `check_file_impl` and create a diagnostic for them.

Ideally, we'd capture the backtrace too, but that's a) more complicated and b) mostly empty in production builds.
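A minimal sketch of that per-task pattern, using `std::panic::catch_unwind` directly; `check_types`, `check_file`, and `Diagnostic` are hypothetical stand-ins here, not Red Knot's actual API:

```rust
use std::panic::{self, AssertUnwindSafe};

struct Diagnostic(String);

// Stand-in for the real per-file check; panics for one input to demonstrate.
fn check_types(path: &str) -> Vec<Diagnostic> {
    if path.ends_with("bad.py") {
        panic!("demo panic while checking {path}");
    }
    Vec::new()
}

// Catch the panic inside the spawned task itself, so the real payload from
// "thread B" is what gets reported rather than a propagated placeholder.
fn check_file(path: &str) -> Vec<Diagnostic> {
    match panic::catch_unwind(AssertUnwindSafe(|| check_types(path))) {
        Ok(diagnostics) => diagnostics,
        Err(payload) => {
            // Panic payloads are usually `&str` or `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| (*s).to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic payload".to_string());
            vec![Diagnostic(format!("panic while checking {path}: {message}"))]
        }
    }
}

fn main() {
    // Note: the default panic hook still prints to stderr when `bad.py`
    // panics; a real tool would install its own hook to keep output clean.
    for path in ["ok.py", "bad.py"] {
        for Diagnostic(message) in check_file(path) {
            println!("{message}");
        }
    }
}
```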
Test Plan
I ran Red Knot on hydpy for a couple of minutes and couldn't reproduce the hang anymore (it normally reproduces within 30 seconds or so).