Conversation
force-pushed from 377c218 to 1210376
@ntBre Thank you for reviewing another pull request of mine. Could you approve the workflow and review this pull request if possible?
No problem! I'll get to this soon, it's on my todo list :) I'll kick off the workflow now |
@ntBre Thank you. I fixed the clippy errors on CI.
ntBre left a comment:
Thanks for working on this! I left more detailed comments inline, but I think it would be good to add CLI tests for this, and I'm also quite interested in an approach where we deduplicate as we visit the files.
```rust
/// This ensures that when the same file is found both as a directly specified
/// input (Root) and discovered through directory traversal (Nested), the Root
/// version takes precedence. This behavior is important for explicit exclusion
/// handling, where explicitly passed files should override directory-based
/// discovery rules.
```
Did you run into this situation when testing this out? This is my first time looking at this code, but it seems like we will have already applied exclusion rules in python_files_in_path here:
ruff/crates/ruff_workspace/src/resolver.rs, lines 470 to 476 (e42006e)
and in PythonFilesVisitorBuilder:
ruff/crates/ruff_workspace/src/resolver.rs, lines 574 to 581 (e42006e)
which will both run before WalkPythonFilesState::finish.
It makes sense to me to favor Root nodes over Nested ones when deduplicating, but I'm not sure if it actually affects exclusions. It might be good to add a test for that case, if it does.
Yes. Those code paths don't deduplicate input paths in this situation.
I'm not entirely sure I follow why this logic is important. If the paths are identical, why does it matter that we return the root path first?
Can you add an example why this is important? If it isn't important, should we use a Set instead of a Vec to avoid this collecting step altogether? Or could we do better and deduplicate the input paths instead?
It's important and related to:
ruff/crates/ruff/tests/lint.rs, line 252 (9edbeb4)
Root has to be prioritized because ResolvedFile::Root means "explicitly passed on the CLI," and excludes only apply to non-root entries unless --force-exclude is set.
Dropping the root entry means that the explicitly passed path may be unintentionally ignored, since it is treated as nested and can be excluded despite being requested.
Concretely, with lint.exclude = ["foo.py"] and ruff check . foo.py, we must keep Root(foo.py) and drop Nested(foo.py) so foo.py is linted as the user requested.
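Here's a minimal, self-contained sketch of that behavior (using a simplified stand-in for ResolvedFile, not ruff's actual resolver types):

```rust
use std::path::PathBuf;

// Simplified stand-in for ruff's ResolvedFile, for illustration only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum ResolvedFile {
    Root(PathBuf),   // explicitly passed on the CLI
    Nested(PathBuf), // discovered by directory traversal
}

impl ResolvedFile {
    fn path(&self) -> &PathBuf {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }

    fn is_root(&self) -> bool {
        matches!(self, ResolvedFile::Root(_))
    }
}

fn main() {
    // `ruff check . foo.py` can produce both entries for the same path.
    let mut files = vec![
        ResolvedFile::Nested(PathBuf::from("foo.py")),
        ResolvedFile::Root(PathBuf::from("foo.py")),
        ResolvedFile::Nested(PathBuf::from("bar.py")),
    ];

    // Sort by path, breaking ties so Root comes before Nested, then keep
    // only the first entry per path: Root(foo.py) survives, Nested(foo.py)
    // is dropped, and foo.py is linted as the user requested.
    files.sort_by(|a, b| {
        a.path()
            .cmp(b.path())
            .then_with(|| b.is_root().cmp(&a.is_root()))
    });
    files.dedup_by(|a, b| a.path() == b.path());

    assert_eq!(
        files,
        vec![
            ResolvedFile::Nested(PathBuf::from("bar.py")),
            ResolvedFile::Root(PathBuf::from("foo.py")),
        ]
    );
}
```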
I have added this explanation to the comment.
```rust
let mut seen_paths = FxHashSet::default();
let mut deduplicated_files = Vec::new();

for file_result in files {
```
Would it be simpler just to sort the files and then deduplicate them? Something like this:
```rust
fn deduplicate_files(mut files: ResolvedFiles) -> ResolvedFiles {
    files.sort();
    files.dedup_by_key(|result| result.map(|file| file.path()));
    files
}
```

If we derive PartialOrd on ResolvedFile, I think this will automatically sort Root files before Nested files, and then the dedup call will take the first element with a given path.
I think this would also deduplicate the Errors, which could be a good thing or a bad thing. I'm not really sure.
Another idea, which seems like it could be cheaper overall, would be to filter out duplicates as we're walking the file system. Did you give that a try? It seems a bit more intuitive to me and avoids having to sort or filter anything at the end. I think if PythonFilesVisitor::local_files were an FxHashMap of path -> Result<ResolvedFile>, it might be pretty straightforward.
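For illustration, a rough sketch of that shape, assuming a simplified ResolvedFile and std's HashMap standing in for FxHashMap (insert_file is a hypothetical helper, not ruff's API):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::path::PathBuf;

// Simplified stand-in for ruff's ResolvedFile, for illustration only.
#[derive(Debug, PartialEq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &PathBuf {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }
}

// Hypothetical helper: record a file as the walker visits it, letting a
// Root entry replace a Nested one for the same path, never the reverse.
fn insert_file(files: &mut HashMap<PathBuf, ResolvedFile>, file: ResolvedFile) {
    match files.entry(file.path().clone()) {
        Entry::Occupied(mut entry) => {
            if matches!(file, ResolvedFile::Root(_)) {
                entry.insert(file);
            }
        }
        Entry::Vacant(entry) => {
            entry.insert(file);
        }
    }
}

fn main() {
    let mut files = HashMap::new();
    insert_file(&mut files, ResolvedFile::Nested(PathBuf::from("foo.py")));
    insert_file(&mut files, ResolvedFile::Root(PathBuf::from("foo.py")));
    insert_file(&mut files, ResolvedFile::Nested(PathBuf::from("foo.py")));

    // Regardless of visit order, the Root entry wins and no duplicate remains.
    assert_eq!(files.len(), 1);
    assert_eq!(
        files[&PathBuf::from("foo.py")],
        ResolvedFile::Root(PathBuf::from("foo.py"))
    );
}
```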
I will try using FxHashMap for local_files and confirm that it does not cause any side effects.
@ntBre Thank you for the review. I have addressed your comments.
ntBre left a comment:
Thanks, I just had a couple more potential simplification suggestions. I'm also still interested in deduplicating the files as we traverse the file system instead of sorting and filtering at the end, as I mentioned in #20105 (comment), but if Micha is happy with this then I am too! This is probably the easier approach if it's not too expensive.
```diff
 impl Ord for ResolvedFile {
     fn cmp(&self, other: &Self) -> Ordering {
-        self.path().cmp(other.path())
+        match self.path().cmp(other.path()) {
```
I'm pretty sure this is equivalent to the implementation that would be derived, unless I'm missing some subtlety here. Could we replace the manual Ord and PartialOrd implementations with derive(Ord, PartialOrd)?
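As a self-contained check of what a derived implementation does (a derived Ord on an enum compares the variant order first and only compares the payload when the variants match):

```rust
use std::path::PathBuf;

// Minimal enum shaped like ResolvedFile, for illustration only.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

fn main() {
    let root = ResolvedFile::Root(PathBuf::from("z.py"));
    let nested = ResolvedFile::Nested(PathBuf::from("a.py"));
    // Root sorts first even though "z.py" > "a.py", because the derived
    // implementation compares the variant before the path.
    assert!(root < nested);
}
```

Note that a derive therefore orders every Root before every Nested regardless of path, while the manual implementation above compares paths first.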
I have added derive(Ord, PartialOrd).
I tested the approach and confirmed it seems feasible to implement. However, changing the type of local_files introduced some type issues.
ntBre left a comment:
Thank you for looking into the other approach! I think this version is probably fine then, no need to follow up as long as it's okay with @MichaReiser too. Avoiding the work of checking the same files multiple times should more than offset the sorting, I would guess.
I just had two more small suggestions about tests, but I think this is good to go otherwise.
ntBre left a comment:
Thank you! I tried out my hash map idea locally, and I think I agree with you that this is a nicer way to go.
I think we could get around the local_files type issues (#20105 (comment)) by using some kind of wrapper type (either wrapping a Vec or converting to a Vec at the end). I was playing with something like this locally:
```rust
struct LocalFiles(Vec<Result<ResolvedFile, ignore::Error>>);
```

and doing the deduplication in LocalFiles::push, but the validation is still tricky. I think what you have is good for now.
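For concreteness, here is a rough sketch of how that wrapper idea might look, assuming simplified types (std collections and a String error standing in for ignore::Error; this push is a hypothetical method, not the PR's code):

```rust
use std::collections::HashMap;
use std::path::PathBuf;

// Simplified stand-in for ruff's ResolvedFile, for illustration only.
#[derive(Debug, PartialEq)]
enum ResolvedFile {
    Root(PathBuf),
    Nested(PathBuf),
}

impl ResolvedFile {
    fn path(&self) -> &PathBuf {
        match self {
            ResolvedFile::Root(path) | ResolvedFile::Nested(path) => path,
        }
    }
}

#[derive(Default)]
struct LocalFiles {
    files: Vec<Result<ResolvedFile, String>>,
    // Path -> index into `files`, so a duplicate can be upgraded in place
    // without disturbing the order of entries already collected.
    seen: HashMap<PathBuf, usize>,
}

impl LocalFiles {
    fn push(&mut self, result: Result<ResolvedFile, String>) {
        match result {
            Ok(file) => match self.seen.get(file.path()) {
                // A Root entry upgrades an existing Nested one; any other
                // duplicate is simply dropped.
                Some(&index) => {
                    if matches!(file, ResolvedFile::Root(_)) {
                        self.files[index] = Ok(file);
                    }
                }
                None => {
                    self.seen.insert(file.path().clone(), self.files.len());
                    self.files.push(Ok(file));
                }
            },
            // Errors are appended untouched; whether to deduplicate them is
            // the separate question raised earlier in the review.
            Err(error) => self.files.push(Err(error)),
        }
    }

    fn into_vec(self) -> Vec<Result<ResolvedFile, String>> {
        self.files
    }
}

fn main() {
    let mut files = LocalFiles::default();
    files.push(Ok(ResolvedFile::Nested(PathBuf::from("foo.py"))));
    files.push(Ok(ResolvedFile::Root(PathBuf::from("foo.py"))));
    files.push(Ok(ResolvedFile::Nested(PathBuf::from("foo.py"))));

    // Only the Root entry for foo.py remains.
    assert_eq!(
        files.into_vec(),
        vec![Ok(ResolvedFile::Root(PathBuf::from("foo.py")))]
    );
}
```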
I updated the summary to close #19395 too!
Summary
Fixes #20035, fixes #19395
This PR deduplicates input paths to avoid processing the same file multiple times.
This is my first contribution, so I'm sorry if I missed something. Please tell me if anything more is needed for this feature.
Test Plan
I just added a test, find_python_files_deduplicated, in https://github.com/TaKO8Ki/ruff/blob/eee1020e322e693bf977d91bf3edd03b45420254/crates/ruff_workspace/src/resolver.rs#L1017. This pull request changes WalkPythonFilesState::finish, which is used in python_files_in_path, so it affects commands such as analyze, format, check, and so on. I will add snapshot tests for them if necessary. I've already confirmed that the same thing happens with ruff check as well.