Deduplicate input paths by TaKO8Ki · Pull Request #20105 · astral-sh/ruff

TaKO8Ki · 2025-08-26T19:42:01Z

Summary

Fixes #20035, fixes #19395

This PR deduplicates input paths so that the same file is not processed more than once.

This is my first contribution, so I'm sorry if I missed something. Please tell me if anything else is needed for this feature.

Test Plan

I added a test, find_python_files_deduplicated, in https://github.com/TaKO8Ki/ruff/blob/eee1020e322e693bf977d91bf3edd03b45420254/crates/ruff_workspace/src/resolver.rs#L1017. This pull request changes WalkPythonFilesState::finish, which is used in python_files_in_path, so it affects commands such as analyze, format, and check. I will add snapshot tests for them if necessary.

I’ve already confirmed that the same thing happens with ruff check as well.

$ echo "x   = 1" > example/foo.py
$ uvx ruff check example example/foo.py
I002 [*] Missing required import: `from __future__ import annotations`
--> /path/to/example/foo.py:1:1
help: Insert required import: `from __future__ import annotations`

I002 [*] Missing required import: `from __future__ import annotations`
--> /path/to/example/foo.py:1:1
help: Insert required import: `from __future__ import annotations`

Found 2 errors.
[*] 2 fixable with the `--fix` option.

TaKO8Ki · 2025-08-27T03:17:24Z

I need to handle the case described by the existing test comment "Explicitly pass test.py, should be linted regardless of it being excluded by lint.exclude", so I will.

TaKO8Ki · 2025-08-28T00:39:58Z

@ntBre Thank you for reviewing another pull request of mine. Could you approve the workflow and review this pull request if possible?

ntBre · 2025-08-28T01:28:58Z

No problem! I'll get to this soon, it's on my todo list :) I'll kick off the workflow now

github-actions · 2025-08-28T01:40:03Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

TaKO8Ki · 2025-08-28T02:27:07Z

@ntBre Thank you. I fixed clippy errors on CI.

ntBre

Thanks for working on this! I left more detailed comments inline, but I think it would be good to add CLI tests for this, and I'm also quite interested in an approach where we deduplicate as we visit the files.

ntBre · 2025-08-28T19:02:22Z

crates/ruff_workspace/src/resolver.rs

+/// This ensures that when the same file is found both as a directly specified
+/// input (Root) and discovered through directory traversal (Nested), the Root
+/// version takes precedence. This behavior is important for explicit exclusion
+/// handling, where explicitly passed files should override directory-based
+/// discovery rules.


Did you run into this situation when testing this out? This is my first time looking at this code, but it seems like we will have already applied exclusion rules in python_files_in_path here:

ruff/crates/ruff_workspace/src/resolver.rs

Lines 470 to 476 in e42006e

    // Check if the paths themselves are excluded.
    if resolver.force_exclude() {
        paths.retain(|path| !is_file_excluded(path, &resolver));
        if paths.is_empty() {
            return Ok((vec![], resolver));
        }
    }

and in PythonFilesVisitorBuilder:

ruff/crates/ruff_workspace/src/resolver.rs

Lines 574 to 581 in e42006e

    let file_path = Candidate::new(path);
    let file_basename = Candidate::new(file_name);
    if match_candidate_exclusion(
        &file_path,
        &file_basename,
        &settings.file_resolver.exclude,
    ) {
        debug!("Ignored path via `exclude`: {path:?}");

which will both run before WalkPythonFilesState::finish.

It makes sense to me to favor Root nodes over Nested ones when deduplicating, but I'm not sure if it actually affects exclusions. It might be good to add a test for that case, if it does.

TaKO8Ki · 2025-08-30

Yes. Those exclusion checks don't deduplicate input paths in this situation.

MichaReiser · 2025-09-15

I'm not entirely sure I follow why this logic is important. If the paths are identical, why does it matter that we return the root path first?

Can you add an example why this is important? If it isn't important, should we use a Set instead of a Vec to avoid this collecting step altogether? Or could we do better and deduplicate the input paths instead?

TaKO8Ki · 2025-09-16

It's important and related to:

ruff/crates/ruff/tests/lint.rs

Line 252 in 9edbeb4

// Explicitly pass test.py, should be linted regardless of it being excluded by lint.exclude

Root has to be prioritized because ResolvedFile::Root means “explicitly passed on the CLI,” and excludes only apply to non‑root entries unless --force-exclude is set.

Dropping the root entry means that the explicitly passed path may be unintentionally ignored, since it is treated as nested and can be excluded despite being requested.

Concretely, with lint.exclude = ["foo.py"] and ruff check . foo.py, we must keep Root(foo.py) and drop Nested(foo.py) so foo.py is linted as the user requested.
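The rule described above can be sketched in a few lines. This is a simplified illustration, not ruff's implementation: the `ResolvedFile` enum here is a stand-in, `is_linted` is a hypothetical helper, and exclusion is reduced to exact file-name matching rather than ruff's glob handling.

```rust
use std::path::PathBuf;

/// Stand-in for ruff's `ResolvedFile` (assumption: the real type may differ).
enum ResolvedFile {
    /// Passed explicitly on the command line.
    Root(PathBuf),
    /// Discovered while walking a directory.
    Nested(PathBuf),
}

/// Hypothetical helper: an exclude entry drops a file only when the file was
/// discovered by traversal, or when force-exclude is enabled.
fn is_linted(file: &ResolvedFile, excluded: &[&str], force_exclude: bool) -> bool {
    let (path, is_root) = match file {
        ResolvedFile::Root(path) => (path, true),
        ResolvedFile::Nested(path) => (path, false),
    };
    // Simplification: compare the file name literally instead of using globs.
    let matches_exclude = path
        .file_name()
        .and_then(|name| name.to_str())
        .is_some_and(|name| excluded.iter().any(|entry| *entry == name));
    !(matches_exclude && (!is_root || force_exclude))
}
```

Under this model, keeping `Nested(foo.py)` and dropping `Root(foo.py)` would turn a linted file into an excluded one, which is why the Root entry has to win during deduplication.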

I have added this explanation to the comment.

crates/ruff_workspace/src/resolver.rs

ntBre · 2025-08-28T19:20:06Z

crates/ruff_workspace/src/resolver.rs

+    let mut seen_paths = FxHashSet::default();
+    let mut deduplicated_files = Vec::new();
+
+    for file_result in files {


Would it be simpler just to sort the files and then deduplicate them? Something like this:

    fn deduplicate_files(mut files: ResolvedFiles) -> ResolvedFiles {
        files.sort();
        files.dedup_by_key(|result| result.map(|file| file.path()));
        files
    }

If we derive PartialOrd on ResolvedFile, I think this will automatically sort Root files before Nested files, and then the dedup call will take the first element with a given path.

I think this would also deduplicate the Errors, which could be a good thing or a bad thing. I'm not really sure.
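A self-contained sketch of the sort-then-dedup idea follows. It is not the code from this pull request: the `ResolvedFile` stand-in, the `rank` helper, and the use of `String` in place of `ignore::Error` are all assumptions made so the example compiles on its own. It sorts by path first and by Root-before-Nested second, and it leaves errors untouched rather than deduplicating them.

```rust
use std::path::{Path, PathBuf};

/// Stand-in for ruff's `ResolvedFile` (assumption: the real type may differ).
#[derive(Debug, Clone, PartialEq, Eq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &Path {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }

    /// Hypothetical helper: lower rank sorts first, so Root precedes Nested.
    fn rank(&self) -> u8 {
        match self {
            ResolvedFile::Root(_) => 0,
            ResolvedFile::Nested(_) => 1,
        }
    }
}

/// `String` stands in for `ignore::Error` to keep the sketch dependency-free.
type WalkResult = Result<ResolvedFile, String>;

fn deduplicate_files(files: Vec<WalkResult>) -> Vec<WalkResult> {
    let mut resolved = Vec::new();
    let mut errors = Vec::new();
    for result in files {
        match result {
            Ok(file) => resolved.push(file),
            Err(error) => errors.push(error),
        }
    }
    // Equal paths become adjacent, with the Root entry first.
    resolved.sort_by(|a, b| a.path().cmp(b.path()).then_with(|| a.rank().cmp(&b.rank())));
    // `dedup_by` removes the later element of each matching pair, so Root survives.
    resolved.dedup_by(|later, earlier| later.path() == earlier.path());
    // Errors are appended unchanged after the deduplicated files.
    resolved.into_iter().map(Ok).chain(errors.into_iter().map(Err)).collect()
}
```

Whether errors should also be deduplicated is left open here, as it is in the comment above.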

Another idea, which seems like it could be cheaper overall, would be to filter out duplicates as we're walking the file system. Did you give that a try? It seems a bit more intuitive to me and avoids having to sort or filter anything at the end. I think if PythonFilesVisitor::local_files were an FxHashMap of path -> Result<ResolvedFile>, it might be pretty straightforward.
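The "deduplicate while walking" idea could look roughly like the sketch below. It uses the standard `HashMap` instead of `FxHashMap`, ignores errors entirely, and the `Visited` type with its `visit` and `finish` methods is hypothetical; none of these names come from ruff.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Stand-in for ruff's `ResolvedFile` (assumption: the real type may differ).
#[derive(Debug, Clone, PartialEq, Eq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &Path {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }

    fn is_root(&self) -> bool {
        matches!(self, ResolvedFile::Root(_))
    }
}

/// Hypothetical collector keyed by path, so duplicates never accumulate.
#[derive(Default)]
struct Visited {
    files: HashMap<PathBuf, ResolvedFile>,
}

impl Visited {
    /// Record a file, replacing an existing Nested entry with a Root one.
    fn visit(&mut self, file: ResolvedFile) {
        let keep_existing = self
            .files
            .get(file.path())
            .is_some_and(|seen| seen.is_root() || !file.is_root());
        if !keep_existing {
            self.files.insert(file.path().to_path_buf(), file);
        }
    }

    /// Return the collected files in a deterministic, path-sorted order.
    fn finish(self) -> Vec<ResolvedFile> {
        let mut files: Vec<_> = self.files.into_values().collect();
        files.sort_by(|a, b| a.path().cmp(b.path()));
        files
    }
}
```

A map has no stable iteration order, so this variant still sorts once at the end if deterministic output is wanted.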

TaKO8Ki · 2025-09-16

I will try using FxHashMap for local_files and confirm that it does not cause any side effects.

TaKO8Ki · 2025-09-02T17:13:35Z

@ntBre Thank you for the review. I have addressed your comments.

ntBre

Thanks, I just had a couple more potential simplification suggestions. I'm also still interested in deduplicating the files as we traverse the file system instead of sorting and filtering at the end, as I mentioned in #20105 (comment), but if Micha is happy with this then I am too! This is probably the easier approach if it's not too expensive.

ntBre · 2025-09-16T14:46:05Z

crates/ruff_workspace/src/resolver.rs

 impl Ord for ResolvedFile {
    fn cmp(&self, other: &Self) -> Ordering {
-        self.path().cmp(other.path())
+        match self.path().cmp(other.path()) {


I'm pretty sure this is equivalent to the implementation that would be derived, unless I'm missing some subtlety here. Could we replace the manual Ord and PartialOrd implementations with derive(Ord, PartialOrd)?
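For reference, this is how `derive(Ord, PartialOrd)` behaves on a two-variant enum in general Rust terms. The enum below is a stand-in, and declaring `Root` before `Nested` is an assumption of the sketch; the thread does not show the declaration order. Derived ordering compares the variant first (in declaration order) and the payload second.

```rust
use std::path::PathBuf;

/// Stand-in enum; the variant order here is an assumption of this sketch.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

/// Sort using the derived ordering: variant first, then the contained path.
fn sorted(mut files: Vec<ResolvedFile>) -> Vec<ResolvedFile> {
    files.sort();
    files
}
```

With this declaration order every `Root` value sorts before every `Nested` value regardless of path, so how duplicates end up positioned relative to each other depends on the comparison used for the final deduplication step, which is not shown in this thread.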

TaKO8Ki · 2025-09-16

I have added derive(Ord, PartialOrd).

crates/ruff_workspace/src/resolver.rs

TaKO8Ki · 2025-09-16T16:24:11Z

@ntBre

I'm also still interested in deduplicating the files as we traverse the file system instead of sorting and filtering at the end, as I mentioned in #20105 (comment), but if Micha is happy with this then I am too! This is probably the easier approach if it's not too expensive.

I tested the approach and confirmed it seems feasible to implement. However, changing the type of local_files would directly affect python_files_in_path, which is used by various commands, so the impact area is fairly large. It’s possible to contain the change within PythonFilesVisitorBuilder, but doing so would require modifying the current implementation to discard the ignore::Error that’s currently passed to python_files_in_path. Given the scope, this looks like a significant change—would it be alright if I handle it in a separate follow-up pull request?

ntBre

Thank you for looking into the other approach! I think this version is probably fine then, no need to follow up as long as it's okay with @MichaReiser too. Avoiding the work of checking the same files multiple times should more than offset the sorting, I would guess.

I just had two more small suggestions about tests, but I think this is good to go otherwise.

crates/ruff_workspace/src/resolver.rs

crates/ruff/tests/lint.rs

ntBre

Thank you! I tried out my hash map idea locally, and I think I agree with you that this is a nicer way to go.

I think we could get around the local_files type issues (#20105 (comment)) by using some kind of wrapper type (either wrapping a Vec or converting to a Vec at the end). I was playing with something like this locally:

struct LocalFiles(Vec<Result<ResolvedFile, ignore::Error>>);

and doing the deduplication in LocalFiles::push, but the validation is still tricky. I think what you have is good for now.
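One possible shape for such a wrapper is sketched below. It departs from the tuple struct quoted above by adding an index from path to position, uses `String` in place of `ignore::Error`, and does not attempt the validation mentioned in the comment; all of that is an assumption of this sketch rather than what was tried locally.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Stand-in for ruff's `ResolvedFile` (assumption: the real type may differ).
#[derive(Debug, Clone, PartialEq, Eq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &Path {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }

    fn is_root(&self) -> bool {
        matches!(self, ResolvedFile::Root(_))
    }
}

/// Hypothetical wrapper: a Vec of results plus an index of paths already stored.
#[derive(Default)]
struct LocalFiles {
    entries: Vec<Result<ResolvedFile, String>>,
    index: HashMap<PathBuf, usize>,
}

impl LocalFiles {
    /// Push a result, deduplicating `Ok` entries by path and preferring Root.
    fn push(&mut self, item: Result<ResolvedFile, String>) {
        let file = match item {
            Ok(file) => file,
            Err(error) => {
                // Errors are kept as-is; they carry no path to deduplicate on here.
                self.entries.push(Err(error));
                return;
            }
        };
        match self.index.get(file.path()).copied() {
            // Path already stored: only a Root entry may overwrite it.
            Some(position) => {
                if file.is_root() {
                    self.entries[position] = Ok(file);
                }
            }
            None => {
                self.index.insert(file.path().to_path_buf(), self.entries.len());
                self.entries.push(Ok(file));
            }
        }
    }

    /// Convert back to the plain Vec that callers expect.
    fn into_vec(self) -> Vec<Result<ResolvedFile, String>> {
        self.entries
    }
}
```

Keeping the Vec as the backing store preserves insertion order and lets callers keep receiving a `Vec` of results, which is the type concern raised earlier in the thread.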

ntBre · 2025-09-19T18:39:47Z

I updated the summary to close #19395 too!

deduplicate input files

4f11df0

Prioritize Root files over Nested files in deduplication

1210376

TaKO8Ki force-pushed the deduplicate-input-files branch 2 times, most recently from 377c218 to 1210376 August 27, 2025 04:48

TaKO8Ki changed the title from "Deduplicate input paths" to "[ruff] Deduplicate input paths" Aug 27, 2025

ntBre added the bug ("Something isn't working") and cli ("Related to the command-line interface") labels Aug 28, 2025

ntBre self-requested a review August 28, 2025 01:29

fix clippy errors

00d961e

ntBre reviewed Aug 28, 2025


TaKO8Ki added 2 commits September 2, 2025 07:19

add unit tests

b0ba6ba

sort and dedup files

3d9c667

TaKO8Ki requested a review from ntBre September 1, 2025 22:23

add an example to the comment

a2010e6

TaKO8Ki requested a review from MichaReiser September 16, 2025 13:39

ntBre reviewed Sep 16, 2025


derive Ord, PartialOrd

482a1da

TaKO8Ki requested a review from ntBre September 17, 2025 16:26

ntBre reviewed Sep 17, 2025

crates/ruff_workspace/src/resolver.rs (outdated)

crates/ruff/tests/lint.rs

TaKO8Ki added 2 commits September 18, 2025 23:27

add exclude option

5c9304d

remove unnecessary tests

c8717ca

TaKO8Ki requested a review from ntBre September 18, 2025 14:29

ntBre approved these changes Sep 19, 2025

ntBre changed the title from "[ruff] Deduplicate input paths" to "Deduplicate input paths" Sep 19, 2025

ntBre merged commit bd5b3e4 into astral-sh:main Sep 19, 2025
35 checks passed

BrewTestBot mentioned this pull request Sep 25, 2025

ruff 0.13.2 Homebrew/homebrew-core#245692

Merged

