Use a HashSet to store inputs and avoid duplicates #1781
mre merged 3 commits into lycheeverse:master
Conversation
|
Thanks for the PR. Looks good! I can't see the new test which checks if duplicate Inputs get removed. 🤔 Edit: ah, you already wrote that you manually tested for it, but we can add a test for that. |
|
Thanks for the review @mre. I was considering testing |
Yeah, you could add an integration test in |
80a0e03 to b3eac94
|
Hi again @mre, I've added a test to verify that:

    #[test]
    fn test_dump_inputs_does_not_include_duplicates() -> Result<()> {
        let pattern = fixtures_path().join("dump_inputs/markdown.md");
        let mut cmd = main_command();
        cmd.arg("--dump-inputs")
            .arg(&pattern)
            .arg(&pattern)
            .assert()
            .success()
            .stdout(contains("fixtures/dump_inputs/markdown.md").count(1));
        Ok(())
    }

However, tests involving globs would fail,

    #[test]
    fn test_dump_inputs_does_not_include_duplicates() -> Result<()> {
        let pattern1 = fixtures_path().join("**/markdown.*");
        let pattern2 = fixtures_path().join("**/*.md");
        let mut cmd = main_command();
        cmd.arg("--dump-inputs")
            .arg(pattern1)
            .arg(pattern2)
            .assert()
            .success()
            .stdout(contains("fixtures/dump_inputs/markdown.md").count(1));
        Ok(())
    }

even though the equivalent command works fine in the shell. The reason is that, in a shell environment, globs get expanded before arguments reach lychee. For example:

    lychee -v --format=markdown "**/*.md" "**/markdown.*"

gets expanded to something like:

    lychee -v --format=markdown ".../dump_inputs/markdown.md" ".../dump_inputs/markdown.md"

But in tests (or any non-shell context), this expansion doesn't happen automatically. Instead, in the code, globs are parsed as

I see two possible solutions:

wdyt?
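To illustrate the point in the comment above, here is a minimal sketch of why deduplicating the raw arguments is not enough once globs are involved. Everything in it is hypothetical: `toy_matches` and `expand_all` are illustration-only helpers (not lychee code), the matcher understands just the two pattern shapes from the discussion, and `other.txt` is an invented file name. The two patterns are distinct strings, so a `HashSet` keeps both, yet each one resolves to the same file.

```rust
use std::collections::HashSet;

/// Toy matcher, for illustration only: it understands just two pattern
/// shapes ("**/*.EXT" and "**/STEM.*"), not real glob syntax.
fn toy_matches(pattern: &str, path: &str) -> bool {
    // Compare against the final path segment only.
    let file_name = path.rsplit('/').next().unwrap_or(path);
    match pattern.strip_prefix("**/") {
        Some(rest) => {
            if let Some(ext) = rest.strip_prefix("*.") {
                // "**/*.md": any file with the given extension
                file_name.ends_with(format!(".{ext}").as_str())
            } else if let Some(stem) = rest.strip_suffix(".*") {
                // "**/markdown.*": any file with the given stem
                file_name.starts_with(format!("{stem}.").as_str())
            } else {
                file_name == rest
            }
        }
        None => path == pattern,
    }
}

/// Expand every pattern against a fixed file list, without deduplicating
/// the results. Mirrors the failing scenario from the discussion.
fn expand_all(patterns: &HashSet<&str>, files: &[&str]) -> Vec<String> {
    let mut out = Vec::new();
    for pattern in patterns {
        for file in files {
            if toy_matches(pattern, file) {
                out.push(file.to_string());
            }
        }
    }
    out
}
```

Calling `expand_all` with the patterns `**/*.md` and `**/markdown.*` over a list containing `fixtures/dump_inputs/markdown.md` yields that path twice, even though the pattern set itself holds no duplicates.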
b3eac94 to a15f450
|
Apologies for the late response. 😅

Looking at your two options, I'd go with Option 1: Deduplicate when collecting sources. I think this is the easier approach. We already use dashmap, so you could use The performance overhead is really minimal here.

Option 2 would require a much bigger refactor, I guess. We'd need to change how the entire input parsing system works, and we would potentially break other parts of the code that rely on the current

With Option 1, we make a small, focused change that solves the problem without touching the core input handling logic, and deduplication happens where the actual file discovery occurs, not during the initial parsing phase.

Now the question is whether you'd like to make that change in your current PR, or mark the test with ignore and add a comment to look into it in another PR. Up to you. I'm fine with either way. 😊
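As a rough sketch of what "deduplicate when collecting sources" could look like, the helper below flattens several expansion results and keeps only the first occurrence of each path. It is an assumption-laden illustration, not the code merged in this PR: `collect_unique_paths` is a hypothetical name, and it uses the standard library's `HashSet` rather than anything from dashmap so that it stays self-contained.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Flatten the results of several glob expansions, keeping only the first
/// occurrence of each path. The order of first appearance is preserved.
/// Hypothetical helper for illustration; not part of lychee.
fn collect_unique_paths<I>(expansions: I) -> Vec<PathBuf>
where
    I: IntoIterator<Item = Vec<PathBuf>>,
{
    let mut seen: HashSet<PathBuf> = HashSet::new();
    let mut unique = Vec::new();
    for path in expansions.into_iter().flatten() {
        // `HashSet::insert` returns false when the value was already present,
        // so repeated paths are skipped here.
        if seen.insert(path.clone()) {
            unique.push(path);
        }
    }
    unique
}
```

Because the check happens where paths are gathered, overlapping patterns no longer produce repeated entries. Note that this compares paths as written: two different spellings of the same file (for example, relative versus absolute) would still count as distinct unless they are normalized first.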
3c21f09 to 7e592b2
|
Hi again @mre 👋. I added another commit to handle glob expansion. Please take a look when you get a chance.
|
Yup, that looks about right. Any reason for not using |
|
@mre i see that edit: pushed the change. |
|
This was really solid work, Aleksandar. Thanks so much!
Description
This PR addresses #1660 by ensuring that input sources are deduplicated before processing. It also guarantees that paths resulting from glob expansion do not contain duplicates.
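For readers unfamiliar with the approach named in the title, the sketch below shows the general idea of storing inputs in a `HashSet`. The `InputSource` enum and `unique_inputs` function are hypothetical stand-ins invented for this example, not lychee's actual types; the only requirement being illustrated is that the element type implements `Hash` and `Eq`.

```rust
use std::collections::HashSet;
use std::path::PathBuf;

/// Hypothetical stand-in for an input type; not lychee's actual definition.
/// Deriving `Hash` and `Eq` is what allows values to live in a `HashSet`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum InputSource {
    FsPath(PathBuf),
    RemoteUrl(String),
}

/// Collect raw inputs into a set; equal inputs collapse into one entry.
fn unique_inputs(raw: Vec<InputSource>) -> HashSet<InputSource> {
    raw.into_iter().collect()
}
```

Passing the same path twice, as the first test in the conversation does, leaves a single entry in the set. A `HashSet` does not preserve insertion order, which matters only if the order of inputs is significant to later stages.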
Changes
Testing