Deduplicate input paths #20105

Merged
ntBre merged 9 commits into astral-sh:main from TaKO8Ki:deduplicate-input-files
Sep 19, 2025

Conversation

@TaKO8Ki
Contributor

@TaKO8Ki TaKO8Ki commented Aug 26, 2025

Summary

Fixes #20035, fixes #19395

This PR deduplicates input paths so that the same file is not processed more than once.

This is my first contribution, so I apologize if I have missed something. Please let me know if anything else is needed for this feature.

Test Plan

I added a test, find_python_files_deduplicated, in https://github.com/TaKO8Ki/ruff/blob/eee1020e322e693bf977d91bf3edd03b45420254/crates/ruff_workspace/src/resolver.rs#L1017. This pull request changes WalkPythonFilesState::finish, which is used in python_files_in_path, so it affects commands such as analyze, format, and check. I will add snapshot tests for them if necessary.

I’ve already confirmed that the same thing happens with ruff check as well.

$ echo "x   = 1" > example/foo.py
$ uvx ruff check example example/foo.py
I002 [*] Missing required import: `from __future__ import annotations`
--> /path/to/example/foo.py:1:1
help: Insert required import: `from __future__ import annotations`

I002 [*] Missing required import: `from __future__ import annotations`
--> /path/to/example/foo.py:1:1
help: Insert required import: `from __future__ import annotations`

Found 2 errors.
[*] 2 fixable with the `--fix` option.

@TaKO8Ki
Contributor Author

TaKO8Ki commented Aug 27, 2025

I need to handle this case: "Explicitly pass test.py, should be linted regardless of it being excluded by lint.exclude". I will do so.

@TaKO8Ki TaKO8Ki force-pushed the deduplicate-input-files branch 2 times, most recently from 377c218 to 1210376 Compare August 27, 2025 04:48
@TaKO8Ki TaKO8Ki changed the title from "Deduplicate input paths" to "[ruff] Deduplicate input paths" Aug 27, 2025
@TaKO8Ki
Contributor Author

TaKO8Ki commented Aug 28, 2025

@ntBre Thank you for reviewing another pull request of mine. Could you approve the workflow and review this pull request if possible?

@ntBre
Contributor

ntBre commented Aug 28, 2025

No problem! I'll get to this soon, it's on my todo list :) I'll kick off the workflow now

@ntBre ntBre added the bug (Something isn't working) and cli (Related to the command-line interface) labels Aug 28, 2025
@ntBre ntBre self-requested a review August 28, 2025 01:29
@github-actions
Contributor

github-actions bot commented Aug 28, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

@TaKO8Ki
Contributor Author

TaKO8Ki commented Aug 28, 2025

@ntBre Thank you. I fixed clippy errors on CI.

Contributor

@ntBre ntBre left a comment

Thanks for working on this! I left more detailed comments inline, but I think it would be good to add CLI tests for this, and I'm also quite interested in an approach where we deduplicate as we visit the files.

Comment on lines 534 to 538
/// This ensures that when the same file is found both as a directly specified
/// input (Root) and discovered through directory traversal (Nested), the Root
/// version takes precedence. This behavior is important for explicit exclusion
/// handling, where explicitly passed files should override directory-based
/// discovery rules.
Contributor

Did you run into this situation when testing this out? This is my first time looking at this code, but it seems like we will have already applied exclusion rules in python_files_in_path here:

// Check if the paths themselves are excluded.
if resolver.force_exclude() {
    paths.retain(|path| !is_file_excluded(path, &resolver));
    if paths.is_empty() {
        return Ok((vec![], resolver));
    }
}

and in PythonFilesVisitorBuilder:

let file_path = Candidate::new(path);
let file_basename = Candidate::new(file_name);
if match_candidate_exclusion(
    &file_path,
    &file_basename,
    &settings.file_resolver.exclude,
) {
    debug!("Ignored path via `exclude`: {path:?}");

which will both run before WalkPythonFilesState::finish.

It makes sense to me to favor Root nodes over Nested ones when deduplicating, but I'm not sure if it actually affects exclusions. It might be good to add a test for that case, if it does.

Contributor Author

Yes. They don't deduplicate input paths in this situation.

Member

I'm not entirely sure I follow why this logic is important. If the paths are identical, why does it matter that we return the root path first?

Can you add an example why this is important? If it isn't important, should we use a Set instead of a Vec to avoid this collecting step altogether? Or could we do better and deduplicate the input paths instead?

Contributor Author

@TaKO8Ki TaKO8Ki Sep 16, 2025

It's important and related to:

// Explicitly pass test.py, should be linted regardless of it being excluded by lint.exclude

Root has to be prioritized because ResolvedFile::Root means “explicitly passed on the CLI,” and excludes only apply to non‑root entries unless --force-exclude is set.

Dropping the root entry means that the explicitly passed path may be unintentionally ignored, since it is treated as nested and can be excluded despite being requested.

Concretely, with lint.exclude = ["foo.py"] and ruff check . foo.py, we must keep Root(foo.py) and drop Nested(foo.py) so foo.py is linted as the user requested.
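
The Root-over-Nested rule described above can be sketched in a few lines. This is a minimal illustration, not the code from this PR: ResolvedFile is a simplified stand-in, the rank helper is hypothetical, and the Result wrapper around each entry is left out for brevity. It sorts by path first and by rank second, so entries for the same path end up adjacent with the Root one in front, and then keeps the first entry for each path.

```rust
use std::path::{Path, PathBuf};

/// Simplified stand-in for the `ResolvedFile` discussed in this thread.
#[derive(Clone, Debug, PartialEq, Eq)]
enum ResolvedFile {
    /// Path passed explicitly on the command line.
    Root(PathBuf),
    /// Path discovered while walking a directory.
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &Path {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }

    /// Hypothetical helper: `Root` ranks before `Nested`.
    fn rank(&self) -> u8 {
        match self {
            ResolvedFile::Root(_) => 0,
            ResolvedFile::Nested(_) => 1,
        }
    }
}

/// Sort by path, then by rank, and keep the first entry for each path.
fn deduplicate_files(mut files: Vec<ResolvedFile>) -> Vec<ResolvedFile> {
    files.sort_by(|a, b| a.path().cmp(b.path()).then(a.rank().cmp(&b.rank())));
    // `dedup_by` removes `later` when the closure returns true, so the
    // earlier (lower-rank, i.e. `Root`) entry is the one that survives.
    files.dedup_by(|later, earlier| later.path() == earlier.path());
    files
}
```

Given Nested(foo.py), Root(foo.py), and Nested(bar.py), this returns Nested(bar.py) followed by Root(foo.py), so the explicitly passed foo.py keeps its Root status.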

Contributor Author

@TaKO8Ki TaKO8Ki Sep 16, 2025

I have added this explanation to the comment.

let mut seen_paths = FxHashSet::default();
let mut deduplicated_files = Vec::new();

for file_result in files {
Contributor

Would it be simpler just to sort the files and then deduplicate them? Something like this:

fn deduplicate_files(mut files: ResolvedFiles) -> ResolvedFiles {
    files.sort();
    files.dedup_by_key(|result| result.map(|file| file.path()));
    files
}

If we derive PartialOrd on ResolvedFile, I think this will automatically sort Root files before Nested files, and then the dedup call will take the first element with a given path.

I think this would also deduplicate the Errors, which could be a good thing or a bad thing. I'm not really sure.

Another idea, which seems like it could be cheaper overall, would be to filter out duplicates as we're walking the file system. Did you give that a try? It seems a bit more intuitive to me and avoids having to sort or filter anything at the end. I think if PythonFilesVisitor::local_files were an FxHashMap of path -> Result<ResolvedFile>, it might be pretty straightforward.
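
For comparison, the "filter while walking" idea might look roughly like the sketch below. It is only an illustration under several assumptions: SeenFiles and its methods are hypothetical names, std's HashMap stands in for FxHashMap, ResolvedFile is simplified, and errors are not stored at all. Each file is recorded as it is visited, and a Nested entry is upgraded to Root if the same path is later seen as an explicit input.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::path::PathBuf;

/// Simplified stand-in for the `ResolvedFile` discussed in this thread.
#[derive(Clone, Debug, PartialEq, Eq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

/// Hypothetical collector that stands in for a visitor's file list.
#[derive(Default)]
struct SeenFiles {
    by_path: HashMap<PathBuf, ResolvedFile>,
}

impl SeenFiles {
    /// Record a visited file, keeping at most one entry per path.
    fn record(&mut self, file: ResolvedFile) {
        let path = match &file {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path.clone(),
        };
        match self.by_path.entry(path) {
            Entry::Vacant(slot) => {
                slot.insert(file);
            }
            Entry::Occupied(mut slot) => {
                // Upgrade an existing entry to `Root`; never downgrade to `Nested`.
                if matches!(file, ResolvedFile::Root(_)) {
                    slot.insert(file);
                }
            }
        }
    }

    /// Consume the collector. The order of the returned files is unspecified.
    fn into_files(self) -> Vec<ResolvedFile> {
        self.by_path.into_values().collect()
    }
}
```

This avoids the final sort, at the cost of hashing every path and losing the visit order, so a caller that needs deterministic output would still have to sort the result.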

Contributor Author

I will try using FxHashMap for local_files and confirm if it does not cause any side effects.

@TaKO8Ki TaKO8Ki requested a review from ntBre September 1, 2025 22:23
@TaKO8Ki
Contributor Author

TaKO8Ki commented Sep 2, 2025

@ntBre Thank you for the review. I have addressed your comments.

@TaKO8Ki TaKO8Ki requested a review from MichaReiser September 16, 2025 13:39
Contributor

@ntBre ntBre left a comment

Thanks, I just had a couple more potential simplification suggestions. I'm also still interested in deduplicating the files as we traverse the file system instead of sorting and filtering at the end, as I mentioned in #20105 (comment), but if Micha is happy with this then I am too! This is probably the easier approach if it's not too expensive.

impl Ord for ResolvedFile {
    fn cmp(&self, other: &Self) -> Ordering {
        self.path().cmp(other.path())
        match self.path().cmp(other.path()) {
Contributor

I'm pretty sure this is equivalent to the implementation that would be derived, unless I'm missing some subtlety here. Could we replace the manual Ord and PartialOrd implementations with derive(Ord, PartialOrd)?

Contributor Author

I have added derive(Ord, PartialOrd).

@TaKO8Ki
Contributor Author

TaKO8Ki commented Sep 16, 2025

@ntBre

I'm also still interested in deduplicating the files as we traverse the file system instead of sorting and filtering at the end, as I mentioned in #20105 (comment), but if Micha is happy with this then I am too! This is probably the easier approach if it's not too expensive.

I tested the approach and confirmed it seems feasible to implement. However, changing the type of local_files would directly affect python_files_in_path, which is used by various commands, so the impact area is fairly large. It’s possible to contain the change within PythonFilesVisitorBuilder, but doing so would require modifying the current implementation to discard the ignore::Error that’s currently passed to python_files_in_path. Given the scope, this looks like a significant change—would it be alright if I handle it in a separate follow-up pull request?

@TaKO8Ki TaKO8Ki requested a review from ntBre September 17, 2025 16:26
Contributor

@ntBre ntBre left a comment

Thank you for looking into the other approach! I think this version is probably fine then, no need to follow up as long as it's okay with @MichaReiser too. Avoiding the work of checking the same files multiple times should more than offset the sorting, I would guess.

I just had two more small suggestions about tests, but I think this is good to go otherwise.

@TaKO8Ki TaKO8Ki requested a review from ntBre September 18, 2025 14:29
Contributor

@ntBre ntBre left a comment

Thank you! I tried out my hash map idea locally, and I think I agree with you that this is a nicer way to go.

I think we could get around the local_files type issues (#20105 (comment)) by using some kind of wrapper type (either wrapping a Vec or converting to a Vec at the end). I was playing with something like this locally:

struct LocalFiles(Vec<Result<ResolvedFile, ignore::Error>>);

and doing the deduplication in LocalFiles::push, but the validation is still tricky. I think what you have is good for now.
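
One way such a wrapper could work is sketched below. This is an assumption-laden illustration, not the experiment described above: String stands in for ignore::Error so the example has no dependencies, the positions index is a hypothetical addition, and the validation concern mentioned above is not addressed. The wrapper keeps the Vec of results in first-seen order, passes errors through untouched, and replaces a stored Nested entry when the same path is pushed as Root.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Simplified stand-in for the `ResolvedFile` discussed in this thread.
#[derive(Clone, Debug, PartialEq, Eq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &Path {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }
}

/// Assumption: `String` replaces `ignore::Error` to keep the sketch dependency-free.
type WalkResult = Result<ResolvedFile, String>;

#[derive(Default)]
struct LocalFiles {
    entries: Vec<WalkResult>,
    /// Hypothetical index: position in `entries` of the file stored for each path.
    positions: HashMap<PathBuf, usize>,
}

impl LocalFiles {
    fn push(&mut self, result: WalkResult) {
        let file = match result {
            Ok(file) => file,
            Err(error) => {
                // Errors are not deduplicated; they pass through unchanged.
                self.entries.push(Err(error));
                return;
            }
        };
        match self.positions.get(file.path()).copied() {
            Some(index) => {
                // Same path seen before: only a `Root` may replace the stored entry.
                if matches!(file, ResolvedFile::Root(_)) {
                    self.entries[index] = Ok(file);
                }
            }
            None => {
                self.positions
                    .insert(file.path().to_path_buf(), self.entries.len());
                self.entries.push(Ok(file));
            }
        }
    }

    fn into_vec(self) -> Vec<WalkResult> {
        self.entries
    }
}
```

Because into_vec hands back a plain Vec, callers that expect the existing list type would not need to change, which is the appeal of the wrapper over switching local_files to a map.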

@ntBre
Contributor

ntBre commented Sep 19, 2025

I updated the summary to close #19395 too!

@ntBre ntBre changed the title from "[ruff] Deduplicate input paths" to "Deduplicate input paths" Sep 19, 2025
@ntBre ntBre merged commit bd5b3e4 into astral-sh:main Sep 19, 2025
35 checks passed

Labels

bug (Something isn't working), cli (Related to the command-line interface)

Development

Successfully merging this pull request may close these issues.

ruff format can double count formatted files
Formatting an overlapping set of files can lead to errors

3 participants