Skip to content

Use a HashSet to store inputs and avoid duplicates#1781

Merged
mre merged 3 commits intolycheeverse:masterfrom
rodic:store-inputs-in-hash-set
Aug 14, 2025
Merged

Use a HashSet to store inputs and avoid duplicates#1781
mre merged 3 commits intolycheeverse:masterfrom
rodic:store-inputs-in-hash-set

Conversation

@rodic
Copy link
Contributor

@rodic rodic commented Aug 1, 2025

Description
This PR addresses #1660 by ensuring that input sources are deduplicated before processing. It also guarantees that paths resulting from glob expansion do not contain duplicates.

Changes

  • Replace Vec with HashSet for storing raw input sources.
  • Process each input only once, preventing redundant checks and improving performance when duplicates are present.
  • Deduplicate collected sources from glob expansion.

Testing

  • All existing tests pass.
  • Adding integration test for duplicated inputs
  • Adding integration tests for globs

@mre
Copy link
Member

mre commented Aug 1, 2025

Thanks for the PR. Looks good!

I can't see the new test which checks if duplicate Inputs get removed. 🤔

Edit: ah, you already wrote that you manually tested for it, but we can add a test for that.

@rodic
Copy link
Contributor Author

rodic commented Aug 1, 2025

Thanks for the review @mre. I was considering testing input on LycheeOptions, but we'd essentially be asserting that a HashSet doesn't contain duplicates—which is guaranteed by its definition. Do you have a better test in mind?

@mre
Copy link
Member

mre commented Aug 1, 2025

Do you have a better test in mind?

Yeah, you could add an integration test in cli.rs. You can check that there's no duplicates when calling --dump-inputs.

@rodic rodic force-pushed the store-inputs-in-hash-set branch from 80a0e03 to b3eac94 Compare August 2, 2025 06:55
@rodic
Copy link
Contributor Author

rodic commented Aug 2, 2025

Hi again @mre,

I've added a test to verify that --dump-inputs doesn't include duplicates, and it works for non globs:

#[test]
fn test_dump_inputs_does_not_include_duplicates() -> Result<()> {
    let pattern = fixtures_path().join("dump_inputs/markdown.md");

    let mut cmd = main_command();
    cmd.arg("--dump-inputs")
        .arg(&pattern)
        .arg(&pattern)
        .assert()
        .success()
        .stdout(contains("fixtures/dump_inputs/markdown.md").count(1));

    Ok(())
}

However, tests involving globs would fail,

#[test]
fn test_dump_inputs_does_not_include_duplicates() -> Result<()> {
    let pattern1 = fixtures_path().join("**/markdown.*");
    let pattern2 = fixtures_path().join("**/*.md");

    let mut cmd = main_command();
    cmd.arg("--dump-inputs")
        .arg(pattern1)
        .arg(pattern2)
        .assert()
        .success()
        .stdout(contains("fixtures/dump_inputs/markdown.md").count(1));

    Ok(())
}

even though the equivalent command works fine in the shell. The reason is that, in a shell environment, globs get expanded before arguments reach lychee. For example:

lychee -v --format=markdown "**/*.md" "**/markdown.*"

gets expanded to something like:

lychee -v --format=markdown ".../dump_inputs/markdown.md" ".../dump_inputs/markdown.md"

But in tests (or any non-shell context), this expansion doesn't happen automatically. Instead, in the code, globs are parsed as InputSource::FsGlob and only get expanded later.

I see two possible solutions:

  • Deduplicate when collecting sources
    This would require a thread-safe hash set to track seen paths since we collect sources in parallel. However, that introduces some performance overhead.

  • Expand globs earlier
    Get rid of InputSource::FsGlob and expand a glob during input creation—when we encounter it.

wdyt?

@rodic rodic force-pushed the store-inputs-in-hash-set branch from b3eac94 to a15f450 Compare August 2, 2025 08:07
@mre
Copy link
Member

mre commented Aug 7, 2025

Apologies for the late response. 😅

Looking at your two options, I'd go with Option 1: Deduplicate when collecting sources.

I think this is the easier approach. We already use dashmap, so you could use DashSet to track which paths you've already seen during collection. When you encounter a path that's already been processed, just skip it.

The performance overhead is really minimal here.

Option 2 would require a much bigger refactor, I guess. We'd need to change how the entire input parsing system works and potentially break other parts of the code that rely on the current InputSource::FsGlob design. It's great that globs are handled lazily in the current architecture.

With Option 1, we make a small, focused change that solves the problem without touching the core input handling logic and deduplication happens where the actual file discovery occurs, not during the initial parsing phase.

Now the question is if you'd like to make that change in your current PR or if you mark the test with ignore and add a comment to look into that in another PR. Up to you. I'm fine with either way. 😊

@rodic rodic force-pushed the store-inputs-in-hash-set branch from 3c21f09 to 7e592b2 Compare August 12, 2025 12:47
@rodic
Copy link
Contributor Author

rodic commented Aug 12, 2025

hi again @mre 👋. I added another commit to handle the globs expansion. Please give it a look when you get a chance.

@mre
Copy link
Member

mre commented Aug 13, 2025

Yup, that looks about right. Any reason for not using DashSet? It's lock-free and the API is a bit easier to use than the Arc<Mutex<Has set>>. We use it in other places already so it wouldn't be an additional dependency either.

@rodic
Copy link
Contributor Author

rodic commented Aug 14, 2025

@mre i see that DashMap is a dependency in bin, but not in lib. I'll add it and refactor.

edit: pushed the change.

@mre
Copy link
Member

mre commented Aug 14, 2025

This was really solid work Aleksandar. Thanks so much!

@mre mre merged commit 537efdd into lycheeverse:master Aug 14, 2025
6 checks passed
@mre mre mentioned this pull request Aug 14, 2025
@mre mre mentioned this pull request Aug 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants