Display diffs for `ruff format --check` and add support for different output formats by ntBre · Pull Request #20443 · astral-sh/ruff

ntBre · 2025-09-16T22:08:46Z

Summary

This PR uses the new Diagnostic type for rendering formatter diagnostics. This allows the formatter to inherit all of the output formats already implemented in the linter and ty. For example, here's the new full output format, with the formatting diff displayed using the same infrastructure as the linter:

Resolved TODOs

~~There are several limitations/todos here still, especially around the OutputFormat type~~:

- A few literal todo!s for the remaining OutputFormats without matching DiagnosticFormats
- The default output format is full instead of something more concise like the current output
- Some of the output formats (namely JSON) have information that doesn't make much sense for these diagnostics

The first of these is definitely resolved, and I think the other two are as well, based on discussion on the design document. In brief, we're okay inheriting the default OutputFormat and can separate the global option into lint.output-format and format.output-format in the future, if needed; and we're okay including redundant information in the non-human-readable output formats.

My last major concern is with the performance of the new code, as discussed in the Benchmarks section below.

A smaller question is whether we should use Diagnostics for formatting errors too. I think the answer to this is yes, in line with changes we're making in the linter too. I still need to implement that here.

Benchmarks

The values in the table are from a large benchmark on the CPython 3.10 code
base, which involves checking 2011 files, 1872 of which need to be reformatted.
stable corresponds to the same code used on main, while preview-full and
preview-concise use the new Diagnostic code gated behind --preview for the
full and concise output formats, respectively. stable-diff uses the
--diff flag to compare the two diff rendering approaches. See the full hyperfine
command below for more details. For a sense of scale, the stable output format
produces 1873 lines on stdout, compared to 855,278 for preview-full and
857,798 for stable-diff.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `stable` | 201.2 ± 6.8 | 192.9 | 220.6 | 1.00 |
| `preview-full` | 9113.2 ± 31.2 | 9076.1 | 9152.0 | 45.29 ± 1.54 |
| `preview-concise` | 214.2 ± 1.4 | 212.0 | 217.6 | 1.06 ± 0.04 |
| `stable-diff` | 3308.6 ± 20.2 | 3278.6 | 3341.8 | 16.44 ± 0.56 |

In summary, the preview-concise diagnostics are ~6% slower than the stable
output format, increasing the average runtime from 201.2 ms to 214.2 ms. The
full preview diagnostics are much more expensive, taking 9113.2 ms on average to
complete, which is ~3x more expensive even than the stable diffs produced by the
--diff flag.
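
For reference, the Relative column and the "~3x" figure are just ratios of the mean runtimes. A minimal sketch of that arithmetic, using the means from the table above (the `relative` helper is only illustrative and not part of ruff or hyperfine):

```rust
/// Ratio of a command's mean runtime to a baseline mean runtime.
fn relative(mean_ms: f64, baseline_ms: f64) -> f64 {
    mean_ms / baseline_ms
}

fn main() {
    // Mean runtimes in milliseconds, copied from the benchmark table.
    let stable = 201.2;
    let preview_full = 9113.2;
    let preview_concise = 214.2;
    let stable_diff = 3308.6;

    // Each line prints one ratio rounded to two decimal places.
    println!("preview-concise vs stable: {:.2}x", relative(preview_concise, stable));
    println!("preview-full vs stable: {:.2}x", relative(preview_full, stable));
    println!("preview-full vs stable-diff: {:.2}x", relative(preview_full, stable_diff));
}
```

The last ratio comes out at roughly 2.75, which is what "~3x" rounds from.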

My main takeaways here are:

- Rendering Edits is much more expensive than rendering the diffs from --diff
- Constructing Edits actually isn't too bad

Constructing `Edit`s

I also took a closer look at Edit construction by modifying the code and
repeating the preview-concise benchmark and found that the main issue is
constructing a SourceFile for use in the Edit rendering. Commenting out the
Edit construction itself has basically no effect:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `stable` | 197.5 ± 1.6 | 195.0 | 200.3 | 1.00 |
| `no-edit` | 208.9 ± 2.2 | 204.8 | 212.2 | 1.06 ± 0.01 |

However, also omitting the source text from the SourceFile construction
resolves the slowdown compared to stable. So it seems that copying the full
source text into a SourceFile is the main cause of the slowdown for non-full
diagnostics.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `stable` | 202.4 ± 2.9 | 197.6 | 207.9 | 1.00 |
| `no-source-text` | 202.7 ± 3.3 | 196.3 | 209.1 | 1.00 ± 0.02 |

Rendering diffs

The main difference between stable-diff and preview-full seems to be the diffing strategy we use from similar. Both versions use the same algorithm, but in the existing CodeDiff rendering for the --diff flag, we only do line-level diffing, whereas for Diagnostics we use TextDiff::iter_inline_changes to highlight word-level changes too. Skipping the word diff for Diagnostics closes most of the gap:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `stable-diff` | 3.323 ± 0.015 | 3.297 | 3.341 | 1.00 |
| `preview-full` | 3.654 ± 0.019 | 3.618 | 3.682 | 1.10 ± 0.01 |

(In some repeated runs, I've seen as small as a ~5% difference, down from 10% in the table)

This doesn't actually change any of our snapshots, but it would obviously change the rendered result in a terminal since we wouldn't highlight the specific words that changed within a line.
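
To make the line-level versus word-level distinction concrete, here is a rough std-only sketch. It is not how `similar` works internally: the word-level step is a simplified stand-in that only trims common leading and trailing tokens, and both function names are made up for illustration. The point is that the word-level pass is extra work done for every changed line.

```rust
/// Line-level view: indices of lines that differ.
/// Simplification: assumes both texts have the same number of lines.
fn changed_lines(old: &str, new: &str) -> Vec<usize> {
    old.lines()
        .zip(new.lines())
        .enumerate()
        .filter(|(_, (a, b))| a != b)
        .map(|(i, _)| i)
        .collect()
}

/// Word-level view for one changed line pair: the whitespace-separated tokens
/// that remain after trimming the common leading and trailing tokens.
fn changed_words<'a>(old: &'a str, new: &'a str) -> (Vec<&'a str>, Vec<&'a str>) {
    let a: Vec<&str> = old.split_whitespace().collect();
    let b: Vec<&str> = new.split_whitespace().collect();
    let prefix = a.iter().zip(&b).take_while(|(x, y)| x == y).count();
    // Cap the suffix so it never overlaps the prefix in the shorter line.
    let max_suffix = a.len().min(b.len()) - prefix;
    let suffix = a
        .iter()
        .rev()
        .zip(b.iter().rev())
        .take_while(|(x, y)| x == y)
        .take(max_suffix)
        .count();
    (
        a[prefix..a.len() - suffix].to_vec(),
        b[prefix..b.len() - suffix].to_vec(),
    )
}

fn main() {
    let old = "a = 1\nname = 'ruff'\n";
    let new = "a = 1\nname = \"ruff\"\n";
    for i in changed_lines(old, new) {
        let (o, n) = (old.lines().nth(i).unwrap(), new.lines().nth(i).unwrap());
        println!("line {}: {:?}", i + 1, changed_words(o, n));
    }
}
```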

Another much smaller change that we can try is removing the deadline from the iter_inline_changes call. It looks like there's a fair amount of overhead from the default 500 ms deadline for computing these, and using iter_inline_changes(op, None) (None for the optional deadline argument) improves the runtime quite a bit:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `stable-diff` | 3.322 ± 0.013 | 3.298 | 3.341 | 1.00 |
| `preview-full` | 5.296 ± 0.030 | 5.251 | 5.366 | 1.59 ± 0.01 |

hyperfine command

cargo build --release --bin ruff && hyperfine --ignore-failure --warmup 10 --export-markdown /tmp/table.md \
  -n stable -n preview-full -n preview-concise -n stable-diff \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache" \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache --preview --output-format=full" \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache --preview --output-format=concise" \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache --diff"

Test Plan

Some new CLI tests and manual testing

github-actions · 2025-09-16T22:21:43Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

MichaReiser · 2025-09-17T07:16:59Z

More broadly, it seems pretty expensive to add the entire file's contents as an Edit. I'm guessing that might show up in the benchmarks on this PR. On a related note, this is a pretty shallow conversion, only constructing the Diagnostics right before rendering. There might be a better way to use the new infrastructure more. We could also use them for rendering FormatCommandErrors, not just FormatPathResults.

A solution here could be to add a diff field to Diagnostic so that the "edit" is computed lazily or the diff is rendered directly instead of using the edit rendering.
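
One way to picture the lazy variant is sketched below. `LazyDiff` and its methods are hypothetical and do not exist in ruff; the struct borrows both texts instead of copying them and only renders on first use, so a concise output format would never pay for the rendering. The rendering itself is a placeholder rather than a real diff.

```rust
use std::cell::OnceCell;

/// Hypothetical payload: borrows both texts and renders a diff only on demand.
struct LazyDiff<'a> {
    unformatted: &'a str,
    formatted: &'a str,
    rendered: OnceCell<String>,
}

impl<'a> LazyDiff<'a> {
    fn new(unformatted: &'a str, formatted: &'a str) -> Self {
        Self { unformatted, formatted, rendered: OnceCell::new() }
    }

    /// Render on first call and cache the result for later calls.
    fn rendered(&self) -> &str {
        self.rendered
            .get_or_init(|| {
                // Placeholder rendering; a real implementation would run a proper diff.
                format!("-{}\n+{}", self.unformatted, self.formatted)
            })
            .as_str()
    }

    /// Whether the (potentially expensive) rendering has happened yet.
    fn is_rendered(&self) -> bool {
        self.rendered.get().is_some()
    }
}

fn main() {
    let diff = LazyDiff::new("x=1", "x = 1");
    println!("rendered before use: {}", diff.is_rendered());
    println!("{}", diff.rendered());
    println!("rendered after use: {}", diff.is_rendered());
}
```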

> The default output format is full instead of something more concise like the current output

Hmm, that's an interesting find. This needs some design work.

ntBre · 2025-09-24T22:33:52Z

I think this is mostly ready for review now. I did a lot of squashing today to make it easier to review commit-by-commit.

The first 7 commits are all very small, standalone refactors used in later steps. The 8th commit is the biggest, converting FormatPathResults into diagnostics. The 9th commit is an incremental improvement on that to split notebook Edits by cell so that they can actually be rendered in the full output format (otherwise they never satisfy this check). Finally, the last commit emits errors as diagnostics too.
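
As a rough illustration of the per-cell splitting in the 9th commit: given the start offset of each notebook cell in the concatenated source, an edit range can be clipped against each cell's range. This is a std-only sketch with plain `usize` byte offsets, and `cell_ranges` and `split_by_cell` are hypothetical names rather than the actual `CellOffsets` API.

```rust
use std::ops::Range;

/// Turn sorted cell start offsets plus the total text length into per-cell ranges.
fn cell_ranges(starts: &[usize], text_len: usize) -> Vec<Range<usize>> {
    let mut ranges = Vec::with_capacity(starts.len());
    for (i, &start) in starts.iter().enumerate() {
        // A cell ends where the next one starts, or at the end of the text.
        let end = starts.get(i + 1).copied().unwrap_or(text_len);
        ranges.push(start..end);
    }
    ranges
}

/// Keep the part of an edit range that falls inside each cell.
/// Simplification: empty intersections are dropped, so pure insertions are ignored here.
fn split_by_cell(edit: Range<usize>, cells: &[Range<usize>]) -> Vec<Range<usize>> {
    cells
        .iter()
        .filter_map(|cell| {
            let start = edit.start.max(cell.start);
            let end = edit.end.min(cell.end);
            (start < end).then_some(start..end)
        })
        .collect()
}

fn main() {
    let cells = cell_ranges(&[0, 10, 25], 40);
    println!("cells: {:?}", cells);
    println!("edit 5..30 split by cell: {:?}", split_by_cell(5..30, &cells));
}
```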

I still have a bunch of TODO comments on the errors. I did my best to match up the error variants with DiagnosticIds, including adding two new DiagnosticIds, but I think there's still room for improvement. I should probably also add a test for some or all of these.

I haven't tried to truncate the lines in the diff yet either. I know that came up in the design discussion, so I'm happy to tackle it now or leave it for a follow up if desired.

Hopefully it's not too glaring, but in the screenshot in the summary you can see that the | for line numbers doesn't line up with the middle of the filename --> like it does for lint diagnostics. I don't think there's really a good way around this since the arrow alignment comes from annotate-snippets and only gets indented if annotate-snippets renders line numbers too. I'm sure we could hack another option into annotate-snippets, but it would likely make #20411 harder still.

as shown in the snapshot changes, this allows us to manually align what annotate-snippets calls the header sigil (the `-->` arrow in the diagnostic header) with our diff line number separators. these were aligned automatically in the linter because we were also emitting snippets with annotate-snippets with the same line numbers, but since the formatter diagnostics only have diffs and not snippets, the default line number width was zero. note that this still isn't perfect because we align with the highest line number in the entire file, not necessarily the highest _rendered_ line number, as shown in the updated notebook snapshot. still, it should get us closer to aligned than not having an offset at all and also end up being correct in most(?) cases.
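
The alignment idea can be sketched as follows: indent the header arrow by the width of the line-number gutter, so the middle of `-->` sits above the `|` separator. The function names here are hypothetical and this is not the annotate-snippets implementation; it also uses the highest line number passed in, which mirrors the caveat above about the width not necessarily matching the highest rendered line.

```rust
/// Number of decimal digits needed to print `line`.
fn digits(mut line: usize) -> usize {
    let mut n = 1;
    while line >= 10 {
        line /= 10;
        n += 1;
    }
    n
}

/// Hypothetical header renderer: indent the arrow by the gutter width.
fn header(path: &str, max_line: usize) -> String {
    format!("{}--> {}", " ".repeat(digits(max_line)), path)
}

/// Hypothetical gutter renderer: right-align the line number, then the separator.
fn gutter(line: usize, max_line: usize) -> String {
    format!("{:>width$} |", line, width = digits(max_line))
}

fn main() {
    let max_line = 120;
    println!("{}", header("a.py", max_line));
    for line in [7, 120] {
        println!("{}", gutter(line, max_line));
    }
}
```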

github-actions · 2025-09-25T21:29:47Z

Diagnostic diff on typing conformance tests

No changes detected when running ty on typing conformance tests ✅

github-actions · 2025-09-25T21:30:48Z

`mypy_primer` results

No ecosystem changes detected ✅
No memory usage changes detected ✅

ntBre · 2025-09-25T22:02:57Z

I'm marking this ready for review. I pushed a few more commits resolving the main issues I noted above:

I added a lineno_offset/header_offset field for improving the --> alignment:

It can still go wrong and makes it harder to un-fork annotate-snippets, so I'm happy to revert this, but I think it's a visual improvement in general.
I added tests for all of the FormatCommandError variants and fixed a couple of oversights in their formatting, including factoring out and using some of ty's panic rendering. There are still TODOs on a few of these, but I think I could use some input on the best DiagnosticIds to use. The tests at least show what they look like to help us iterate on them.

I think the performance hit is acceptable. It's only ~10 ms (~5%) for the concise output on a large project, and the larger discrepancy between full output and --diff seems justified given the additional work of computing more granular diffs. I am seeing slightly worse performance today compared to the last time I ran the benchmarks, so it may be worth a bit more profiling, but I think the point still stands.

New benchmark table

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `stable` | 200.0 ± 2.3 | 196.6 | 203.5 | 1.00 |
| `preview-full` | 9134.8 ± 32.1 | 9108.0 | 9197.5 | 45.67 ± 0.55 |
| `preview-concise` | 232.6 ± 4.2 | 227.0 | 240.7 | 1.16 ± 0.02 |
| `stable-diff` | 3354.8 ± 15.9 | 3333.7 | 3383.6 | 16.77 ± 0.21 |

stable is basically identical to last time, but both preview versions are ~20 ms slower. --diff is almost 50 ms slower, which really doesn't make sense.

Similarly, I think we may want to truncate large diffs at some point (Zanie even mentioned possibly making the limit configurable), but I think we can hold off on that for now.

Oh, one other TODO is that it would be nice to emit a warning if the output-format is set without preview, but I don't think that's so easy to do since output-format is a global option and will have already been unwrap_or_defaulted by the time we see it.
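
The difficulty can be illustrated with a small sketch: once an `Option` has been defaulted, "explicitly set to the default" and "not set at all" are indistinguishable, so the warning would have to be decided while the value is still an `Option`. The `OutputFormat` enum below is a two-variant stand-in, and `warn_if_ignored` and its message are hypothetical, not ruff code.

```rust
/// Stand-in for the real output format type, reduced to two variants.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
enum OutputFormat {
    #[default]
    Full,
    Concise,
}

/// Hypothetical check that only works while the setting is still optional.
fn warn_if_ignored(configured: Option<OutputFormat>, preview: bool) -> Option<&'static str> {
    match (configured, preview) {
        // Hypothetical wording; the point is that `Some(_)` is still visible here.
        (Some(_), false) => Some("`output-format` has no effect on `ruff format` without `--preview`"),
        _ => None,
    }
}

fn main() {
    let explicit = Some(OutputFormat::Full);
    let unset: Option<OutputFormat> = None;

    // Before defaulting, the two cases can still be told apart.
    println!("explicit: {:?}", warn_if_ignored(explicit, false));
    println!("unset: {:?}", warn_if_ignored(unset, false));

    // After defaulting, both collapse to the same value.
    println!("same after defaulting: {}", explicit.unwrap_or_default() == unset.unwrap_or_default());
    println!("other variant: {:?}", OutputFormat::Concise);
}
```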

## Summary

Addresses #20443 (comment) by factoring out the `match` on the ruff output format in a way that should be reusable by the formatter. I didn't think this was going to work at first, but the fact that the config holds options that apply only to certain output formats works in our favor here. We can set up a single config for all of the output formats and then use `try_from` to convert the `OutputFormat` to a `DiagnosticFormat` later.

## Test Plan

Existing tests, plus a few new ones to make sure relocating the `SHOW_FIX_SUMMARY` rendering worked, that was untested before. I deleted a bunch of test code along with the `text` module, but I believe all of it is now well-covered by the `full` and `concise` tests in `ruff_db`. I also merged this branch into #20443 locally and made sure that the API actually helps. `render_diagnostics` dropped in perfectly and passed the tests there too.

ntBre · 2025-09-29T19:55:22Z

Thank you for the reviews!

I think this could use one more look. I was over-complicating the range calculations for a while today, but they felt kind of tricky, at least when I was trying to use zip like the start calculation¹. The new ModifiedRange type seems to be working well now, though, unless I missed any edge cases.

The other changes seemed relatively straightforward, and I also merged the changes from #20595.

I'll timebox trying to move our diff rendering to annotate-snippets to one day later this week.

¹ The tricky part was that zipping and finding the first different character could easily fail if one of the snippets was shorter than the other; at that point it's not clear which one was shorter and caused the failure. We also want an exclusive range, so we've gone one character too far by finding the first different character rather than the last common character. Now I just loop from the end and track the length of the common suffix, which we can subtract from both text_lens despite the actual offsets likely being different.

MichaReiser · 2025-09-30T07:03:05Z

crates/ruff/src/commands/format.rs

+                (fix, line_count)
+            } else {
+                let formatted_code = &formatted.source_code()[modified_range.formatted];
+                let edit = if formatted_code.is_empty() {


It shouldn't matter, so I think it's fine leaving it as is, but there's a third case where modified_range.unformatted is empty (e.g. when adding blank lines between two classes), in which case it's an insertion. We could add an Edit::from_text_and_range(new_text, range) (with a better name) that does this dance.

I think I'll leave this for now. I only split out deletions because I hit the debug_assert! that content is not empty in Edit::range_replacement. Insertions seem okay to group with full replacements since content is Some in both cases, and we already have a TextRange, which Edit::insertion would otherwise construct.
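
The three cases discussed here can be spelled out in a small sketch. `EditKind` and `classify` are hypothetical and only mirror the order of the checks described above (deletion first, because empty content is what trips the `debug_assert!`); the real code builds `Edit` values rather than an enum.

```rust
use std::ops::Range;

#[derive(Debug, PartialEq)]
enum EditKind {
    /// The formatted text for the modified range is empty.
    Deletion,
    /// The unformatted range is empty, e.g. blank lines added between two classes.
    Insertion,
    /// Both sides are non-empty.
    Replacement,
}

/// Hypothetical classifier over the unformatted byte range and the formatted text.
fn classify(unformatted: Range<usize>, formatted_code: &str) -> EditKind {
    if formatted_code.is_empty() {
        EditKind::Deletion
    } else if unformatted.is_empty() {
        EditKind::Insertion
    } else {
        EditKind::Replacement
    }
}

fn main() {
    println!("{:?}", classify(3..5, ""));
    println!("{:?}", classify(3..3, "\n\n"));
    println!("{:?}", classify(3..5, "ab"));
}
```

In the PR as described, insertions are simply grouped with replacements, since both carry non-empty content and a ready-made range.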

MichaReiser · 2025-09-30T07:07:51Z

crates/ruff/src/commands/format.rs

+        let start = unformatted
+            .char_indices()
+            .zip(formatted.chars())
+            .find_map(|((offset, old), new)| {
+                (old != new).then_some(TextSize::try_from(offset).unwrap())
+            })
+            // Fall back on the shorter text length if one of the strings is a strict prefix of the
+            // other (i.e. the zip iterator ended before finding a difference).
+            .unwrap_or_else(|| unformatted.text_len().min(formatted.text_len()));


I wonder if a regular loop would have been easier here (similar to what you have below):

let mut prefix_length = TextSize::ZERO;
for (unformatted, formatted) in unformatted.chars().zip(formatted.chars()) {
    if unformatted != formatted {
        break;
    }
    prefix_length += unformatted.text_len();
}

That is nicer, thanks. I used the guarded break for the suffix too instead of the if-else.
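
Putting the prefix and suffix loops together, a std-only sketch of the range narrowing might look like this. It uses `usize` byte offsets instead of `TextSize`, the name `modified_ranges` is made up, and it may differ in details from the `ModifiedRange` code in the PR. The suffix loop only looks at what is left after the prefix, so the two never overlap when one text is a strict prefix of the other.

```rust
use std::ops::Range;

/// Byte ranges of the differing middle of two strings, after trimming the
/// common prefix and the common suffix.
fn modified_ranges(unformatted: &str, formatted: &str) -> (Range<usize>, Range<usize>) {
    // Common prefix length in bytes, advanced one whole character at a time.
    let mut prefix = 0;
    for (old, new) in unformatted.chars().zip(formatted.chars()) {
        if old != new {
            break;
        }
        prefix += old.len_utf8();
    }

    // Common suffix length in bytes, restricted to the text after the prefix.
    let mut suffix = 0;
    let old_rest = unformatted[prefix..].chars().rev();
    let new_rest = formatted[prefix..].chars().rev();
    for (old, new) in old_rest.zip(new_rest) {
        if old != new {
            break;
        }
        suffix += old.len_utf8();
    }

    // The same suffix length is subtracted from both lengths, even though the
    // resulting end offsets usually differ.
    (
        prefix..unformatted.len() - suffix,
        prefix..formatted.len() - suffix,
    )
}

fn main() {
    println!("{:?}", modified_ranges("x = 'a'\n", "x = \"a\"\n"));
    println!("{:?}", modified_ranges("x = 1\n\n\ny = 2\n", "x = 1\ny = 2\n"));
}
```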

kaddkaka · 2025-10-02T20:27:28Z

Does this PR take any steps closer to solving #14452 ?

ntBre · 2025-10-02T21:03:48Z

No, I don't think so. This PR was for the format subcommand, so it shouldn't affect ruff check (the lint subcommand) at all. Micha's comment about Ruff only reporting leftover diagnostics after fixes have been applied is still accurate, as far as I know.

pygarap · 2025-10-03T02:25:33Z

@ntBre

So ruff format --check has now become like ruff format --diff, but with more features?

So, is ruff format --diff deprecated for preview mode now?

And why don't the docs mention it?

      --check
          Avoid writing any formatted files back; instead, exit with a non-zero
          status code if any files would have been modified, and zero otherwise
      --diff
          Avoid writing any formatted files back; instead, exit with a non-zero
          status code and the difference between the current file and how the
          formatted file would look like

It looks like the --check CLI flag didn't change in the docs, but it did change.

ntBre · 2025-10-03T03:59:14Z

--diff produces a standalone diff that can still be useful for applying as a patch, for example, so it's not deprecated. The "diff" shown by the default format --check output uses the same format that's in preview for lint rules, which is a bit different.

I also think the --check help message is still accurate. The new information is in the new --output-format entry in the CLI help.

joukewitteveen · 2025-10-13T09:04:55Z

This would be even more powerful if ruff format --check (and ty check) also supported --output-file. Currently, only ruff check supports that. Redirection (ruff format --check [...] > DIR/OUTFILE) doesn't fully solve the issue, since the directory containing the output file is required to exist.

…21021)

## Summary

I spun this out from #21005 because I thought it might be helpful separately. It just renders a nice `Diagnostic` for syntax errors pointing to the source of the error. This seemed a bit more helpful to me than just the byte offset when working on #21005, and we had most of the code around after #20443 anyway.

## Test Plan

This doesn't actually affect any passing tests, but here's an example of the additional output I got when I broke the spacing after the `in` token:

```
error[internal-error]: Expected 'in', found name
  --> /home/brent/astral/ruff/crates/ruff_python_formatter/resources/test/fixtures/black/cases/cantfit.py:50:79
   |
48 | need_more_to_make_the_line_long_enough,
49 | )
50 | del ([], name_1, name_2), [(), [], name_4, name_3], name_1[[name_2 for name_1 inname_0]]
   |                                                                               ^^^^^^^^
51 | del ()
   |
```

I just appended this to the other existing output for now.

kaddkaka · 2026-01-08T05:29:37Z

> No, I don't think so. This PR was for the format subcommand, so it shouldn't affect ruff check (the lint subcommand) at all. Micha's comment about Ruff only reporting leftover diagnostics after fixes have been applied is still accurate, as far as I know.

Which comment are you referring to?

ntBre · 2026-01-08T16:34:10Z

> Which comment are you referring to?

I think I was referring to this comment: #14452 (comment)

ntBre force-pushed the brent/formatter-diagnostics branch 5 times, most recently from 29cd7fa to 5996dc3 Compare September 24, 2025 22:04

ntBre added the preview, diagnostics, and formatter labels Sep 24, 2025

ntBre added 13 commits September 25, 2025 15:17

add DiagnosticId::Unformatted

4620604

add --output-format to the ruff format cli

92208ab

allow tempdir_filter to take any AsRef<Path>

041b35d

this is convenient for passing in the result of tempdir.join calls and matches the version in lint.rs

respect show_fix_status for full output

9bb4374

this wasn't a problem for the linter because we always want to show the fix status, but we don't need the visual clutter of marking every formatter diagnostic as fixable

suppress the URL for all non-lint DiagnosticIds

9cffdd8

instead of just syntax errors

fix a couple of Rome references

5bfe3f2

add CellOffsets::ranges helper

76e63b3

convert FormatPathResults to Diagnostics just before rendering

7e48156

build a real notebook index and render cell diffs

a5ec195

convert errors to diagnostics too

4ea16b7

factor out PanicError::to_diagnostic_message

a43fa93

add error tests, set_file_level, and improve panic formatting

959f55a

ntBre force-pushed the brent/formatter-diagnostics branch from 5996dc3 to 959f55a Compare September 25, 2025 21:27

ntBre marked this pull request as ready for review September 25, 2025 22:03

ntBre requested a review from carljm as a code owner September 25, 2025 22:03

ntBre added 9 commits September 29, 2025 09:01

debug assert FormatResult::Diff

1864018

narrow edit range to modified lines

aa532d7

account for context lines when computing the line number width

cbf7d1f

mark FormatModuleError::InvalidSyntax as an internal error

7b8b442

Merge branch 'main' into brent/formatter-diagnostics

57db031

use render_diagnostics helper

74de549

fix range narrowing for strict prefixes

6a72314

range -> modified_range, restore previous range references

793f526

only restrict the end of the script line counting

bf48b9b

Using `modified_range.formatted` was incorrect because it would fail to count unchanged lines at the start of the file, so the computed line number would be too low

MichaReiser approved these changes Sep 30, 2025

use loop for prefix_length too, use guarded break for suffix

bedfc6f

ntBre mentioned this pull request Sep 30, 2025

Move diff rendering into annotate-snippets #20648

Open

ntBre changed the title ~~Use Diagnostics for rendering formatting results~~ Display diffs for ruff format --check and add support for different output formats Sep 30, 2025

ntBre merged commit 2b1d3c6 into main Sep 30, 2025
39 checks passed

ntBre deleted the brent/formatter-diagnostics branch September 30, 2025 16:00

ntBre mentioned this pull request Sep 30, 2025

[Feature request] Support Github output for ruff format #10430

Closed

BrewTestBot mentioned this pull request Oct 2, 2025

ruff 0.13.3 Homebrew/homebrew-core#246675

Merged

ntBre mentioned this pull request Oct 7, 2025

Stabilize format --check output formats #20755

Open

Renkai mentioned this pull request Oct 18, 2025

Refactor format tests to use CliTest helper #20953

Merged

ntBre mentioned this pull request Oct 21, 2025

Render a diagnostic for syntax errors introduced in formatter tests #21021

Merged

5 participants
