Display diffs for ruff format --check and add support for different output formats#20443

Merged: ntBre merged 37 commits into main from brent/formatter-diagnostics on Sep 30, 2025

@ntBre ntBre commented Sep 16, 2025

Summary

This PR uses the new Diagnostic type for rendering formatter diagnostics. This allows the formatter to inherit all of the output formats already implemented in the linter and ty. For example, here's the new full output format, with the formatting diff displayed using the same infrastructure as the linter:

image
Resolved TODOs

There are still several limitations/TODOs here, especially around the OutputFormat type:

  • A few literal todo!s for the remaining OutputFormats without matching DiagnosticFormats
  • The default output format is full instead of something more concise like the current output
  • Some of the output formats (namely JSON) have information that doesn't make much sense for these diagnostics

The first of these is definitely resolved, and I think the other two are as well, based on discussion on the design document. In brief, we're okay inheriting the default OutputFormat and can separate the global option into lint.output-format and format.output-format in the future, if needed; and we're okay including redundant information in the non-human-readable output formats.

My last major concern is with the performance of the new code, as discussed in the Benchmarks section below.

A smaller question is whether we should use Diagnostics for formatting errors too. I think the answer to this is yes, in line with changes we're making in the linter too. I still need to implement that here.

Benchmarks

The values in the table are from a large benchmark on the CPython 3.10 code
base, which involves checking 2011 files, 1872 of which need to be reformatted.
stable corresponds to the same code used on main, while preview-full and
preview-concise use the new Diagnostic code gated behind --preview for the
full and concise output formats, respectively. stable-diff uses the
--diff flag to compare the two diff rendering approaches. See the full hyperfine
command below for more details. For a sense of scale, the stable output format
produces 1873 lines on stdout, compared to 855,278 for preview-full and
857,798 for stable-diff.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| stable | 201.2 ± 6.8 | 192.9 | 220.6 | 1.00 |
| preview-full | 9113.2 ± 31.2 | 9076.1 | 9152.0 | 45.29 ± 1.54 |
| preview-concise | 214.2 ± 1.4 | 212.0 | 217.6 | 1.06 ± 0.04 |
| stable-diff | 3308.6 ± 20.2 | 3278.6 | 3341.8 | 16.44 ± 0.56 |

In summary, the preview-concise diagnostics are ~6% slower than the stable
output format, increasing the average runtime from 201.2 ms to 214.2 ms. The
full preview diagnostics are much more expensive, taking 9113.2 ms on average to
complete, which is ~2.75x more expensive than even the stable diffs produced by
the --diff flag (3308.6 ms).
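
For anyone who wants to re-derive the relative figures from the mean runtimes in the table, here is a minimal sketch of the arithmetic. The helper names `relative` and `percent_slower` are made up for illustration and are not part of ruff or hyperfine.

```rust
/// Ratio of a mean runtime to a baseline mean runtime (both in ms).
/// Hypothetical helper, only used to re-check the table above.
fn relative(mean_ms: f64, baseline_ms: f64) -> f64 {
    mean_ms / baseline_ms
}

/// Slowdown relative to the baseline, expressed as a percentage.
fn percent_slower(mean_ms: f64, baseline_ms: f64) -> f64 {
    (relative(mean_ms, baseline_ms) - 1.0) * 100.0
}
```

With the means from the table, `percent_slower(214.2, 201.2)` comes out at roughly 6.5, and `relative(9113.2, 3308.6)` at roughly 2.75, which is where the "~6%" and the full-vs-`--diff` comparison come from.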

My main takeaways here are:

  1. Rendering Edits is much more expensive than rendering the diffs from --diff
  2. Constructing Edits actually isn't too bad

Constructing Edits

I also took a closer look at Edit construction by modifying the code and
repeating the preview-concise benchmark and found that the main issue is
constructing a SourceFile for use in the Edit rendering. Commenting out the
Edit construction itself has basically no effect:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| stable | 197.5 ± 1.6 | 195.0 | 200.3 | 1.00 |
| no-edit | 208.9 ± 2.2 | 204.8 | 212.2 | 1.06 ± 0.01 |

However, also omitting the source text from the SourceFile construction
resolves the slowdown compared to stable. So it seems that copying the full
source text into a SourceFile is the main cause of the slowdown for non-full
diagnostics.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| stable | 202.4 ± 2.9 | 197.6 | 207.9 | 1.00 |
| no-source-text | 202.7 ± 3.3 | 196.3 | 209.1 | 1.00 ± 0.02 |

Rendering diffs

The main difference between stable-diff and preview-full seems to be the diffing strategy we use from similar. Both versions use the same algorithm, but in the existing CodeDiff rendering for the --diff flag, we only do line-level diffing, whereas for Diagnostics we use TextDiff::iter_inline_changes to highlight word-level changes too. Skipping the word diff for Diagnostics closes most of the gap:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| stable-diff | 3.323 ± 0.015 | 3.297 | 3.341 | 1.00 |
| preview-full | 3.654 ± 0.019 | 3.618 | 3.682 | 1.10 ± 0.01 |

(In some repeated runs, I've seen as small as a ~5% difference, down from 10% in the table)

This doesn't actually change any of our snapshots, but it would obviously change the rendered result in a terminal since we wouldn't highlight the specific words that changed within a line.

Another much smaller change that we can try is removing the deadline from the iter_inline_changes call. It looks like there's a fair amount of overhead from the default 500 ms deadline for computing these, and using iter_inline_changes(op, None) (None for the optional deadline argument) improves the runtime quite a bit:

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| stable-diff | 3.322 ± 0.013 | 3.298 | 3.341 | 1.00 |
| preview-full | 5.296 ± 0.030 | 5.251 | 5.366 | 1.59 ± 0.01 |

hyperfine command
cargo build --release --bin ruff && hyperfine --ignore-failure --warmup 10 --export-markdown /tmp/table.md \
  -n stable -n preview-full -n preview-concise -n stable-diff \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache" \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache --preview --output-format=full" \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache --preview --output-format=concise" \
  "./target/release/ruff format --check ./crates/ruff_linter/resources/test/cpython/ --no-cache --diff"

Test Plan

Some new CLI tests and manual testing

github-actions bot commented Sep 16, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@MichaReiser
Member

More broadly, it seems pretty expensive to add the entire file's contents as an Edit. I'm guessing that might show up in the benchmarks on this PR. On a related note, this is a pretty shallow conversion, only constructing the Diagnostics right before rendering. There might be a better way to use the new infrastructure more. We could also use them for rendering FormatCommandErrors, not just FormatPathResults.

A solution here could be to add a diff field to Diagnostic so that the "edit" is computed lazily or the diff is rendered directly instead of using the edit rendering.

> The default output format is full instead of something more concise like the current output

Hmm, that's an interesting find. This needs some design work.

@ntBre ntBre force-pushed the brent/formatter-diagnostics branch 5 times, most recently from 29cd7fa to 5996dc3 Compare September 24, 2025 22:04

ntBre commented Sep 24, 2025

I think this is mostly ready for review now. I did a lot of squashing today to make it easier to review commit-by-commit.

The first 7 commits are all very small, standalone refactors used in later steps. The 8th commit is the biggest, converting FormatPathResults into diagnostics. The 9th commit is an incremental improvement on that to split notebook Edits by cell so that they can actually be rendered in the full output format (otherwise they never satisfy this check). Finally, the last commit emits errors as diagnostics too.

I still have a bunch of TODO comments on the errors. I did my best to match up the error variants with DiagnosticIds, including adding two new DiagnosticIds, but I think there's still room for improvement. I should probably also add a test for some or all of these.

I haven't tried to truncate the lines in the diff yet either. I know that came up in the design discussion, so I'm happy to tackle it now or leave it for a follow up if desired.

Hopefully it's not too glaring, but in the screenshot in the summary you can see that the | for line numbers doesn't line up with the middle of the filename --> like it does for lint diagnostics. I don't think there's really a good way around this since the arrow alignment comes from annotate-snippets and only gets indented if annotate-snippets renders line numbers too. I'm sure we could hack another option into annotate-snippets, but it would likely make #20411 harder still.

@ntBre ntBre added preview Related to preview mode features diagnostics Related to reporting of diagnostics. formatter Related to the formatter labels Sep 24, 2025
this is convenient for passing in the result of tempdir.join calls and matches
the version in lint.rs
this wasn't a problem for the linter because we always want to show the fix
status, but we don't need the visual clutter of marking every formatter
diagnostic as fixable
as shown in the snapshot changes, this allows us to manually align what
annotate-snippets calls the header sigil (the `-->` arrow in the diagnostic
header) with our diff line number separators.

these were aligned automatically in the linter because we were also emitting
snippets with annotate-snippets with the same line numbers, but since the
formatter diagnostics only have diffs and not snippets, the default line number
width was zero.

note that this still isn't perfect because we align with the highest line number
in the entire file, not necessarily the highest _rendered_ line number, as shown
in the updated notebook snapshot. still, it should get us closer to aligned than
not having an offset at all and also end up being correct in most(?) cases.
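
As a rough illustration of the caveat above (not the actual annotate-snippets code), the offset boils down to the number of digits in a line number. `digit_width` is a hypothetical helper.

```rust
/// Number of columns needed to print `line_number` in decimal.
/// Hypothetical helper, not part of annotate-snippets or ruff.
fn digit_width(line_number: usize) -> usize {
    line_number.to_string().len()
}
```

For a 1200-line file whose diff only touches lines up to 42, `digit_width(1200)` is 4 while `digit_width(42)` is 2, so an offset based on the whole file would be two columns wider than the rendered line numbers need.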
@ntBre ntBre force-pushed the brent/formatter-diagnostics branch from 5996dc3 to 959f55a Compare September 25, 2025 21:27

github-actions bot commented Sep 25, 2025

Diagnostic diff on typing conformance tests

No changes detected when running ty on typing conformance tests ✅

github-actions bot commented Sep 25, 2025

mypy_primer results

No ecosystem changes detected ✅
No memory usage changes detected ✅

ntBre commented Sep 25, 2025

I'm marking this ready for review. I pushed a few more commits resolving the main issues I noted above:

  • I added a lineno_offset/header_offset field for improving the --> alignment:

    image

    It can still go wrong and makes it harder to un-fork annotate-snippets, so I'm happy to revert this, but I think it's a visual improvement in general.

  • I added tests for all of the FormatCommandError variants and fixed a couple of oversights in their formatting, including factoring out and using some of ty's panic rendering. There are still TODOs on a few of these, but I think I could use some input on the best DiagnosticIds to use. The tests at least show what they look like to help us iterate on them.

I think the performance hit is acceptable. It's only ~10 ms (~5%) for the concise output on a large project, and the larger discrepancy between full output and --diff seems justified given the additional work of computing more granular diffs. I am seeing slightly worse performance today compared to the last time I ran the benchmarks, so it may be worth a bit more profiling, but I think the point still stands.

New benchmark table

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| stable | 200.0 ± 2.3 | 196.6 | 203.5 | 1.00 |
| preview-full | 9134.8 ± 32.1 | 9108.0 | 9197.5 | 45.67 ± 0.55 |
| preview-concise | 232.6 ± 4.2 | 227.0 | 240.7 | 1.16 ± 0.02 |
| stable-diff | 3354.8 ± 15.9 | 3333.7 | 3383.6 | 16.77 ± 0.21 |

stable is basically identical to last time, but both preview versions are ~20 ms slower. --diff is almost 50 ms slower, which really doesn't make sense.

Similarly, I think we may want to truncate large diffs at some point (Zanie even mentioned possibly making the limit configurable), but I think we can hold off on that for now.

Oh, one other TODO is that it would be nice to emit a warning if the output-format is set without preview, but I don't think that's so easy to do since output-format is a global option and will have already been unwrap_or_defaulted by the time we see it.

@ntBre ntBre marked this pull request as ready for review September 25, 2025 22:03
@ntBre ntBre requested a review from carljm as a code owner September 25, 2025 22:03
ntBre added a commit that referenced this pull request Sep 29, 2025
## Summary

Addresses
#20443 (comment) by
factoring out the `match` on the ruff output format in a way that should
be reusable by the formatter.

I didn't think this was going to work at first, but the fact that the
config holds options that apply only to certain output formats works in
our favor here. We can set up a single config for all of the output
formats and then use `try_from` to convert the `OutputFormat` to a
`DiagnosticFormat` later.
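
A minimal sketch of what such a fallible conversion could look like. The enums below are simplified stand-ins, not ruff's real `OutputFormat` and `DiagnosticFormat` definitions, and `Grouped` is only assumed here to be a format without a `DiagnosticFormat` counterpart.

```rust
use std::convert::TryFrom;

/// Simplified stand-in for the CLI-level output format (assumed variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputFormat {
    Full,
    Concise,
    Json,
    Grouped,
}

/// Simplified stand-in for the shared diagnostic renderer's format.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DiagnosticFormat {
    Full,
    Concise,
    Json,
}

impl TryFrom<OutputFormat> for DiagnosticFormat {
    /// Hand the unsupported format back so the caller can fall back to
    /// its own rendering path for it.
    type Error = OutputFormat;

    fn try_from(format: OutputFormat) -> Result<Self, Self::Error> {
        match format {
            OutputFormat::Full => Ok(DiagnosticFormat::Full),
            OutputFormat::Concise => Ok(DiagnosticFormat::Concise),
            OutputFormat::Json => Ok(DiagnosticFormat::Json),
            // Assumption: no shared renderer exists for this one.
            other => Err(other),
        }
    }
}
```

The point of the design is that the caller can build one config up front and only branch on the `Err` case for formats that still need bespoke rendering.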

## Test Plan

Existing tests, plus a few new ones to make sure relocating the
`SHOW_FIX_SUMMARY` rendering worked, that was untested before. I deleted
a bunch of test code along with the `text` module, but I believe all of
it is now well-covered by the `full` and `concise` tests in `ruff_db`.

I also merged this branch into
#20443 locally and made sure that
the API actually helps. `render_diagnostics` dropped in perfectly and
passed the tests there too.

ntBre commented Sep 29, 2025

Thank you for the reviews!

I think this could use one more look. I was over-complicating the range calculations for a while today, but they felt kind of tricky, at least when I was trying to use zip like the start calculation (see footnote 1). The new ModifiedRange type seems to be working well now, though, unless I missed any edge cases.

The other changes seemed relatively straightforward, and I also merged the changes from #20595.

I'll timebox trying to move our diff rendering to annotate-snippets to one day later this week.

Footnotes

  1. The tricky part was that zipping and finding the first different character could easily fail if one of the snippets was shorter than the other; at that point it's not clear which one was shorter and caused the failure. We also want an exclusive range, so we've gone one character too far by finding the first different character rather than the last common character. Now I just loop from the end and track the length of the common suffix, which we can subtract from both text_lens despite the actual offsets likely being different.
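
The prefix/suffix idea from the footnote can be sketched roughly as follows. This is not the code from the PR: `ModifiedRange` here is a simplified stand-in using plain `usize` byte offsets instead of `TextSize`, and restricting the suffix scan to the text after the common prefix is an assumption made in this sketch to keep the two from overlapping.

```rust
use std::ops::Range;

/// Simplified stand-in for the PR's `ModifiedRange` (byte offsets as `usize`).
#[derive(Debug, PartialEq, Eq)]
struct ModifiedRange {
    unformatted: Range<usize>,
    formatted: Range<usize>,
}

fn modified_range(unformatted: &str, formatted: &str) -> ModifiedRange {
    // Byte length of the common prefix, advancing one `char` at a time so
    // the offset always lands on a UTF-8 boundary in both strings.
    let mut prefix = 0;
    for (old, new) in unformatted.chars().zip(formatted.chars()) {
        if old != new {
            break;
        }
        prefix += old.len_utf8();
    }

    // Byte length of the common suffix, scanning from the end. Only the
    // text after the prefix is considered (an assumption of this sketch),
    // so the prefix and suffix can never overlap.
    let mut suffix = 0;
    let old_rest = unformatted[prefix..].chars().rev();
    let new_rest = formatted[prefix..].chars().rev();
    for (old, new) in old_rest.zip(new_rest) {
        if old != new {
            break;
        }
        suffix += old.len_utf8();
    }

    // The same suffix length is subtracted from both text lengths, even
    // though the resulting end offsets usually differ.
    ModifiedRange {
        unformatted: prefix..unformatted.len() - suffix,
        formatted: prefix..formatted.len() - suffix,
    }
}
```

For example, comparing `"x=1\n"` with `"x = 1\n"` yields the unformatted range `1..2` (the `=`) and the formatted range `1..4` (the ` = `). A pure insertion such as a blank line added between two lines yields an empty unformatted range.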

(fix, line_count)
} else {
let formatted_code = &formatted.source_code()[modified_range.formatted];
let edit = if formatted_code.is_empty() {
Member

It shouldn't matter, so I think it's fine leaving as is, but there's a third case where modified_range.unformatted is empty (e.g. when adding blank lines between two classes), in which case it's an insertion. We could add an Edit::from_text_and_range(new_text, range) (with a better name) that does this dance.

Contributor Author

I think I'll leave this for now. I only split out deletions because I hit the debug_assert! that content is not empty in Edit::range_replacement. Insertions seem okay to group with full replacements since content is Some in both cases, and we already have a TextRange, which Edit::insertion would otherwise construct.
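
To make the case split concrete, here is a small sketch under simplified assumptions. `SketchEdit` and `edit_for` are hypothetical and do not mirror ruff's actual `Edit` API; they only show why an empty replacement needs its own branch while insertions can share the replacement branch.

```rust
use std::ops::Range;

/// Hypothetical, simplified edit type (not ruff's `Edit`).
#[derive(Debug, PartialEq, Eq)]
enum SketchEdit {
    /// Remove the text covered by `range`.
    Deletion { range: Range<usize> },
    /// Replace `range` with non-empty `content`. An empty `range` makes
    /// this a pure insertion.
    Replacement { range: Range<usize>, content: String },
}

fn edit_for(range: Range<usize>, formatted_code: &str) -> SketchEdit {
    if formatted_code.is_empty() {
        // A replacement with empty content would trip an assertion like
        // the one described above, so deletions get their own variant.
        SketchEdit::Deletion { range }
    } else {
        // Insertions (empty `range`) and full replacements both carry
        // content, so they can be grouped together.
        SketchEdit::Replacement {
            range,
            content: formatted_code.to_string(),
        }
    }
}
```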

Comment on lines 1035 to 1043
let start = unformatted
.char_indices()
.zip(formatted.chars())
.find_map(|((offset, old), new)| {
(old != new).then_some(TextSize::try_from(offset).unwrap())
})
// Fall back on the shorter text length if one of the strings is a strict prefix of the
// other (i.e. the zip iterator ended before finding a difference).
.unwrap_or_else(|| unformatted.text_len().min(formatted.text_len()));
Member

I wonder if a regular loop would have been easier here (similar to what you have below):

let mut prefix_length = TextSize::ZERO;

for (unformatted, formatted) in unformatted.chars().zip(formatted.chars()) {
    if unformatted != formatted {
        break;
    }

    prefix_length += unformatted.text_len();
}

Contributor Author

That is nicer, thanks. I used the guarded break for the suffix too instead of the if-else.

@ntBre ntBre changed the title Use Diagnostics for rendering formatting results Display diffs for ruff format --check and add support for different output formats Sep 30, 2025
@ntBre ntBre merged commit 2b1d3c6 into main Sep 30, 2025
39 checks passed
@ntBre ntBre deleted the brent/formatter-diagnostics branch September 30, 2025 16:00

kaddkaka commented Oct 2, 2025

Does this PR take any steps closer to solving #14452 ?

ntBre commented Oct 2, 2025

No, I don't think so. This PR was for the format subcommand, so it shouldn't affect ruff check (the lint subcommand) at all. Micha's comment about Ruff only reporting leftover diagnostics after fixes have been applied is still accurate, as far as I know.

pygarap commented Oct 3, 2025

@ntBre

So ruff format --check has now become like ruff format --diff, but with more features?

Is ruff format --diff deprecated in preview mode now?

And why don't the docs mention it?

      --check
          Avoid writing any formatted files back; instead, exit with a non-zero
          status code if any files would have been modified, and zero otherwise
      --diff
          Avoid writing any formatted files back; instead, exit with a non-zero
          status code and the difference between the current file and how the
          formatted file would look like

It looks like the description of the --check CLI flag didn't change in the docs, but its behavior did.

ntBre commented Oct 3, 2025

--diff produces a standalone diff that can still be useful for applying as a patch, for example, so it's not deprecated. The "diff" shown by the default format --check output uses the same format that's in preview for lint rules, which is a bit different.

I also think the --check help message is still accurate. The new information is in the new --output-format entry in the CLI help.

joukewitteveen commented Oct 13, 2025

This would be even more powerful if ruff format --check (and ty check) also supported --output-file. Currently, only ruff check supports that. Redirection (ruff format --check [...] > DIR/OUTFILE) doesn't fully solve the issue, since the directory containing the output file is required to exist.

ntBre added a commit that referenced this pull request Oct 21, 2025
…21021)

## Summary

I spun this out from #21005 because I thought it might be helpful
separately. It just renders a nice `Diagnostic` for syntax errors
pointing to the source of the error. This seemed a bit more helpful to
me than just the byte offset when working on #21005, and we had most of
the code around after #20443 anyway.

## Test Plan

This doesn't actually affect any passing tests, but here's an example of
the additional output I got when I broke the spacing after the `in`
token:

```
    error[internal-error]: Expected 'in', found name
      --> /home/brent/astral/ruff/crates/ruff_python_formatter/resources/test/fixtures/black/cases/cantfit.py:50:79
       |
    48 |     need_more_to_make_the_line_long_enough,
    49 | )
    50 | del ([], name_1, name_2), [(), [], name_4, name_3], name_1[[name_2 for name_1 inname_0]]
       |                                                                               ^^^^^^^^
    51 | del ()
       |
```

I just appended this to the other existing output for now.

kaddkaka commented Jan 8, 2026

> No, I don't think so. This PR was for the format subcommand, so it shouldn't affect ruff check (the lint subcommand) at all. Micha's comment about Ruff only reporting leftover diagnostics after fixes have been applied is still accurate, as far as I know.

Which comment are you referring to?

ntBre commented Jan 8, 2026

> Which comment are you referring to?

I think I was referring to this comment: #14452 (comment)


Labels

diagnostics Related to reporting of diagnostics. formatter Related to the formatter preview Related to preview mode features
