
Conversation

MichaReiser (Member) commented Apr 25, 2025

Summary

Fixes #17537

Red Knot's CLI sometimes hangs after encountering a panic. This was due to non-determinism in how panics are propagated by rayon and Salsa.

  • Our main loop spawns a job to run the check command so that it stays responsive to Ctrl+C and incoming file-watcher changes.
  • This thread is spawned using rayon::spawn. Rayon terminates the entire process if a task scheduled with rayon::spawn panics (we don't have any other panic handling today).
  • This thread immediately calls db.check, which calls project.check wrapped in a salsa::Cancelled::catch. I assumed that this only catches cancellations caused by a pending write (e.g. a file-watcher change), but it turns out that it also catches panics if thread A depends on a query running in thread B and thread B panics.
  • The actual checking happens in project.check, where we use a thread pool and spawn a task for every file.
  • project.check uses rayon::scope: unlike rayon::spawn, it propagates panics instead of terminating the process. However, it propagates an arbitrary one of the panics if two threads panic (which is the case if thread A depends on a query in thread B and B panics); see the sketch after this list.
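
For illustration, a minimal, self-contained sketch (not the Red Knot code; it only needs the rayon crate) of the difference in panic behavior between rayon::spawn and rayon::scope:

```rust
fn main() {
    // `rayon::scope` catches a panic in one of its tasks and re-raises it
    // in the calling thread when the scope ends, so the caller can observe
    // it with `catch_unwind`.
    let result = std::panic::catch_unwind(|| {
        rayon::scope(|s| {
            s.spawn(|_| panic!("panic inside the scope"));
        });
    });
    assert!(result.is_err());
    eprintln!("rayon::scope propagated the panic to the caller");

    // By contrast, a panic in a task scheduled with `rayon::spawn` has no
    // caller to unwind into; without a panic handler installed on the
    // thread pool, rayon aborts the whole process (left commented out).
    // rayon::spawn(|| panic!("this would abort the process"));
}
```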

What happened is that rayon sometimes propagated the Cancelled::PropagatedPanic panic (the panic salsa raises in thread A when it depends on the panicking query running in thread B) instead of the actual panic from thread B, and that placeholder panic then got swallowed by our Cancelled::catch wrapper inside db.check.
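
As a self-contained illustration of that race (not the actual Red Knot or Salsa code; PropagatedPanic below is a stand-in for salsa's placeholder payload):

```rust
use std::panic::{catch_unwind, panic_any, resume_unwind};

// Stand-in for the placeholder payload salsa uses to unwind a thread whose
// dependency panicked (think salsa::Cancelled::PropagatedPanic).
struct PropagatedPanic;

fn main() {
    // Two tasks panic inside the scope: one with the real panic, one with
    // the placeholder. rayon re-raises an arbitrary one of the two.
    let outcome = catch_unwind(|| {
        rayon::scope(|s| {
            s.spawn(|_| panic!("real panic from thread B"));
            s.spawn(|_| panic_any(PropagatedPanic));
        });
    });

    match outcome {
        // If the placeholder wins the race, the outer layer treats it like
        // a benign cancellation and the real panic from thread B is lost.
        Err(payload) if payload.is::<PropagatedPanic>() => {
            eprintln!("swallowed as a cancellation; the real panic is lost");
        }
        // Otherwise the real panic surfaces and gets reported as usual.
        Err(payload) => resume_unwind(payload),
        Ok(()) => {}
    }
}
```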

We should probably change Salsa to use a different mechanism for propagating panics between dependent threads, because a panic is a logic error, whereas a pending write is not.

This PR fixes the hang by repeating the Cancelled::catch inside each spawned thread. This has the advantage that we propagate the real panic from thread B and never the placeholder Cancelled::PropagatedPanic from thread A (which doesn't contain any useful debug information). More specifically, we now catch panics inside check_file_impl and create a diagnostic for them, for example:

error: panic: Panicked while checking `/Users/micha/astral/ecosystem/hydpy/hydpy/core/devicetools.py`: `dependency graph cycle querying try_metaclass_(Id(250b7)); set cycle_fn/cycle_initial to fixpoint iterate`
info: This indicates a bug in Red Knot.
info: If you could open an issue at https://github.com/astral-sh/ruff/issues/new?title=%5BRed%20Knot%20panic%5D, we'd be very appreciative!

Ideally, we'd capture the backtrace too but that's a) more complicated and b) mostly empty in production builds.
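
As a rough, self-contained sketch of that recovery pattern (names and messages are made up; the actual fix goes through Cancelled::catch inside each spawned thread rather than a bare catch_unwind):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

// Hypothetical stand-in for the real per-file check; it panics for one
// input so the recovery path below has something to catch.
fn check_file_impl(path: &str) -> Vec<String> {
    if path.ends_with("devicetools.py") {
        panic!("dependency graph cycle querying try_metaclass_");
    }
    vec![]
}

// Run the check behind a panic boundary and turn a panic into a
// diagnostic-style message instead of letting it unwind further.
fn check_file(path: &str) -> Vec<String> {
    match catch_unwind(AssertUnwindSafe(|| check_file_impl(path))) {
        Ok(diagnostics) => diagnostics,
        Err(payload) => {
            // Panic payloads are usually a `&str` or a `String`.
            let message = payload
                .downcast_ref::<&str>()
                .map(|s| s.to_string())
                .or_else(|| payload.downcast_ref::<String>().cloned())
                .unwrap_or_else(|| "unknown panic payload".to_string());
            vec![format!(
                "error: panic: Panicked while checking `{path}`: `{message}`"
            )]
        }
    }
}

fn main() {
    for diagnostic in check_file("hydpy/core/devicetools.py") {
        eprintln!("{diagnostic}");
    }
}
```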

Test Plan

I ran Red Knot on hydpy for a couple of minutes and couldn't reproduce the hang anymore (it normally reproduces after 30s or so).

@MichaReiser added the ty (Multi-file analysis & type inference) label on Apr 25, 2025
github-actions bot (Contributor) commented Apr 25, 2025

mypy_primer results

No ecosystem changes detected ✅

@MichaReiser marked this pull request as ready for review on April 25, 2025 16:31
@AlexWaygood removed their request for review on April 25, 2025 16:38
@MichaReiser marked this pull request as draft on April 25, 2025 16:41
@MichaReiser force-pushed the micha/cli-hang branch 2 times, most recently from 8e3a07b to c85b945, on April 25, 2025 17:13
@MichaReiser marked this pull request as ready for review on April 25, 2025 17:13
@AlexWaygood changed the title from "[red-knot] Fix CLI hang when a dependend query panics" to "[red-knot] Fix CLI hang when a dependent query panics" on Apr 25, 2025
@MichaReiser force-pushed the micha/cli-hang branch 3 times, most recently from c53a283 to b34bcd6, on April 25, 2025 17:48
carljm (Contributor) left a comment

Nice! Thank you so much for tracking this down.

The fix looks good, and also provides a better user experience for panics.

Is my understanding correct that this means new ecosystem panics will now again not show up as a failure in the ecosystem job, and instead just as a diagnostic output diff? I think that's OK, but it does mean we need to be careful to check ecosystem output on our diffs.

"This indicates a bug in Red Knot.",
));

let report_message = "If you could open an issue at https://github.com/astral-sh/ruff/issues/new?title=%5BRed%20Knot%20panic%5D, we'd be very appreciative!";

nit: today we use the [red-knot] prefix on all our issues and PRs; can we stay consistent with that? e.g. [red-knot] panic: maybe?

Of course we'll have to change this again soon :)

carljm (Contributor) commented Apr 25, 2025

Looks like clippy is not happy?

MichaReiser (Member, Author)

That's correct. But panics come first with Andrew's new sorting and should be easy to discover because of that (unless the output gets truncated).

@MichaReiser enabled auto-merge (squash) on April 26, 2025 06:26
@MichaReiser merged commit cfa1505 into main on Apr 26, 2025 (32 checks passed)
@MichaReiser deleted the micha/cli-hang branch on April 26, 2025 06:28
sharkdp (Contributor) commented Apr 28, 2025

Is my understanding correct that this means new ecosystem panics will now again not show up as a failure in the ecosystem job, and instead just as a diagnostic output diff? I think that's OK, but it does mean we need to be careful to check ecosystem output on our diffs.

Maybe we could still exit with a code different from 1 (type checking failed) and 2 (some other Red Knot error), like before? That way, it would still be easy to detect panics (requiring no changes in mypy_primer).

sharkdp (Contributor) commented Apr 28, 2025

Maybe we could still exit with a code different from 1 (type checking failed) and 2 (some other Red Knot error), like before? That way, it would still be easy to detect panics (requiring no changes in mypy_primer).

Oh, you already proposed that change in #17640. In that case, mypy_primer CI runs will still fail in case of panics.

Edit: well, not quite, still uses error code 2

MichaReiser (Member, Author)

Edit: well, not quite, still uses error code 2

We could change the error code to something else but 2 is what Ruff/Red Knot already used for other errors.
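
Purely as a hypothetical sketch of how a dedicated exit code for panics could sit next to the existing ones (this is not the actual Red Knot CLI code, and the specific code 3 is made up):

```rust
#[allow(dead_code)]
enum CheckOutcome {
    Clean,
    TypeCheckingFailed,
    OtherError,
    PanicDiagnosticEmitted,
}

fn exit_code(outcome: CheckOutcome) -> i32 {
    match outcome {
        CheckOutcome::Clean => 0,
        // 1: type checking found errors; 2: some other error (as today).
        CheckOutcome::TypeCheckingFailed => 1,
        CheckOutcome::OtherError => 2,
        // A distinct code would let mypy_primer flag panics without
        // having to parse the diagnostic output.
        CheckOutcome::PanicDiagnosticEmitted => 3,
    }
}

fn main() {
    std::process::exit(exit_code(CheckOutcome::Clean));
}
```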


Successfully merging this pull request may close these issues.

[red-knot] Panic leads to hang
