
fix: skip fragment check if website URL doesn't contain fragment #1733

Merged
mre merged 5 commits into lycheeverse:master from MichaIng:skip-fragment-check-by-url
Jun 20, 2025
Conversation

@MichaIng
Member

@MichaIng MichaIng commented Jun 16, 2025

Fixes #1709

Part 2, since the same issue happens with remote URLs as well.

EDIT: Tested those builds, and they work as expected. I am too lazy to add corresponding tests to the code 😛.

@MichaIng MichaIng added the bug, regression, and false-positive labels Jun 16, 2025
Signed-off-by: MichaIng <micha@dietpi.com>

Copilot AI left a comment


Pull Request Overview

This PR addresses issue #1709 by modifying the website fragment handling logic to skip the fragment check when the URL does not contain a fragment.

  • The conditional check has been updated to also verify that the URL contains a fragment.
  • The change ensures that HTML fragment checks are only performed when the URL contains a fragment.
Comments suppressed due to low confidence (1)

lychee-lib/src/checker/website.rs:106

  • Consider adding an inline comment explaining that the additional check 'response.url().fragment().is_some()' ensures that HTML fragment processing is only performed when a fragment is present, improving code clarity for future maintainers.
if self.include_fragments
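
For context, the combined guard would then look roughly like this (a sketch; the branch body is abbreviated):

if self.include_fragments && response.url().fragment().is_some() {
    // Only fetch and parse the body for a fragment check
    // when the URL actually carries a fragment.
    ...
}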

@thomas-zahner
Member

Seems to make sense. But yeah, maybe we should add tests before merging; I'll take a closer look in the coming days.

@mre What's up with that Copilot comment 😅 Did you enable this or was this enabled automatically by GitHub? I'd really prefer to disable this. It just reiterates the obvious with a good chance of hallucinating and lying about things.

@mre
Member

mre commented Jun 17, 2025

Yeah, seems to be a new feature. I'll try to disable it.

@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from c3ce5ed to 6bf642a on June 17, 2025 17:52
@MichaIng
Member Author

MichaIng commented Jun 17, 2025

The Copilot "review" was triggered by me. I was curious to see whether I could add it even for PRs on remote repos. It is potentially okay for getting a summary of larger PRs, but so far I have only seen it make generic best-practice recommendations, like here to add a comment to the code, and in other cases to do sanity checks and such. Sorry for the spam 😉.

Okay I added tests, but they do not behave as expected:

[200] file:///home/runner/work/lychee/lychee/fixtures/fragments/zero.bin#fragment
[ERROR] https://raw.githubusercontent.com/MichaIng/lychee/skip-fragment-check-by-url/fixtures/fragments/zero.bin#fragment | Cannot find fragment
  1. Why does the first link succeed, despite the non-existing fragment in the URL?
  2. The second link fails, but not with the expected `error decoding response body`. This is a single-NUL-byte file. I'll test locally in which case it fails to read the body.

EDIT: Okay

     [200] file:///tmp/test2.txt#fragment
   [ERROR] file:///tmp/test.html#fragment | Cannot find fragment
   [WARN ] Skipping fragment check for file:///tmp/zero.bin#fragment due to the following error: Cannot read input content from file `/tmp/zero.bin`
     [200] file:///tmp/zero.bin#fragment
  1. So for local files, it does not fail if reading the content fails; it only throws a warning: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/file.rs#L174-L190
    The warning is not thrown for the NUL file, but only with at least 2 random bytes.
  2. Furthermore, while the content is read for any file type, the fragment checker (silently) succeeds if it is not a Markdown or HTML file: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/utils/fragment_checker.rs#L74-L78

@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from 6bf642a to 1bf8dd7 on June 17, 2025 19:05
Signed-off-by: MichaIng <micha@dietpi.com>
@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from 1bf8dd7 to 04e2450 on June 17, 2025 19:22
`is_some()` is also true if the fragment is present but empty, i.e. a bare `#`. While this is an edge case, skip the fragment checker in the case of an empty fragment as well.

Signed-off-by: MichaIng <micha@dietpi.com>
@MichaIng MichaIng closed this Jun 17, 2025
@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from fb85e63 to ab0da6f on June 17, 2025 19:55
@MichaIng
Member Author

MichaIng commented Jun 17, 2025

Interesting:

[WARN ] Skipping fragment check for file:///home/runner/work/lychee/lychee/fixtures/fragments/zero.bin# due to the following error: Cannot read input content from file `/home/runner/work/lychee/lychee/fixtures/fragments/zero.bin`
[WARN ] Skipping fragment check for file:///home/runner/work/lychee/lychee/fixtures/fragments/zero.bin#fragment due to the following error: Cannot read input content from file `/home/runner/work/lychee/lychee/fixtures/fragments/zero.bin`
[ERROR] https://raw.githubusercontent.com/MichaIng/lychee/skip-fragment-check-by-url/fixtures/fragments/zero.bin#fragment | Cannot find fragment
[200] https://raw.githubusercontent.com/MichaIng/lychee/skip-fragment-check-by-url/fixtures/fragments/zero.bin#

Pretty inconsistent behavior between the file and website checkers:

  • Local `zero.bin#` with an empty fragment is still passed to the fragment checker, despite `uri.url.fragment().is_some()`. An empty string passes this test?
  • Remote `zero.bin#` however throws neither an error nor a warning, so I assumed the content is not passed to the fragment checker. But testing with a larger file shows that it actually is; the checker just, for unknown reasons, does not fail parsing the content. I tested this with some other real image files as well: if they are not too big to cause a parser error, the fragment checker "succeeds" checking for fragments. But if I download the file and check it locally, the parser fails with `Cannot read input content from file`.

It is an edge case, and I do not want to pull all exceptions/early positive returns from the fragment checker into the parent file and website checkers, but `is_some_and(|x| !x.is_empty())` seems simple enough.
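
A minimal demonstration of the difference, using the url crate (URLs made up for illustration):

use url::Url;

let url = Url::parse("https://example.com/page#").unwrap();
// A bare trailing `#` yields Some(""), not None, so `is_some()`
// alone still hands the URL to the fragment checker.
assert_eq!(url.fragment(), Some(""));
assert!(url.fragment().is_some());
assert!(!url.fragment().is_some_and(|f| !f.is_empty()));

// A real fragment passes the stricter check as before.
let url = Url::parse("https://example.com/page#test").unwrap();
assert!(url.fragment().is_some_and(|f| !f.is_empty()));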

Generally, the fragment checker adds additional processing, and the parent checkers fetch and pass the whole content to it. So it makes sense to avoid invoking the fragment checker in the first place if there is no fragment to check, or no way the check can succeed. Strictly speaking, this includes #top as well. But on the other hand, the fragment checker itself checks for an empty fragment and #top too, and it is not great to do the same exception checks in three places in the code. Maybe a public function in the fragment checker could be used by the parent checkers file.rs and website.rs as well, so they know whether to fetch and pass the full content or not, while sharing the same code for the exceptions?
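
Such a shared helper could look roughly like this (a hypothetical sketch, not part of the current code base; the name is made up):

/// Hypothetical helper: returns whether a fragment check can succeed at all,
/// i.e. the fragment is present, non-empty, and not the implicit `#top`.
pub fn fragment_needs_checking(url: &url::Url) -> bool {
    url.fragment().is_some_and(|f| !f.is_empty() && f != "top")
}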

... but better to do that in a new PR. This one is good enough; the last commit is probably already too much. The test file description is not 100% accurate, since the remote binary URL does not fail with a read error (for such a small file) like the local one does, but the result is the same. Next I'll look into content-type based fragment check skipping for remote website.rs checks.

@MichaIng MichaIng reopened this Jun 17, 2025
Signed-off-by: MichaIng <micha@dietpi.com>
@MichaIng
Member Author

Okay, I tested new builds, and `is_some_and(|x| !x.is_empty())` works as expected: a URL of a large remote file with an empty `#` is no longer passed to the fragment checker, and succeeds quickly. The previous builds downloaded the file completely into RAM (or tried to do so) and passed it to the fragment checker, which then failed either way. No change to previous behavior in the case of `#test`, as expected.

To preserve the succeeding tests before I adjust the URLs to point to lycheeverse/master: https://github.com/lycheeverse/lychee/actions/runs/15716307029

Ready to be merged from my end.

Signed-off-by: MichaIng <micha@dietpi.com>
@thomas-zahner
Member

thomas-zahner commented Jun 18, 2025

The Copilot "review" was triggered by me.

Ah I see, no problem. I'm just really skeptical that it will be of any use to us, which is why I'd like to disable the feature. In my opinion, there is something wrong with the PR, documentation, or commit history if we need generative AI to summarise a PR. And code reviews should be done by humans. We don't want to end up like this 😅

Thanks for adding the tests. The PR looks good to me. Did I understand correctly that the CI tests now only fail because you've updated the URLs to point to master? So they will pass after merging?

... but better to do that in a new PR.

I agree. It probably makes sense to create a new issue for that.

@MichaIng
Member Author

In my opinion there is something wrong with the PR, documentation or commit history if we need generative AI to summarise a PR.

True, but I do see PRs that are too big/complex to understand, even when knowing the code base well. If AI fails to create a good summary, that is a good hint that the PR needs to be split 😅, and maybe AI already makes such suggestions, before a human reviewer needs to do a review iteration 😉.

And code reviews should be done by humans. We don't want to end up like this 😅

Also true, it cannot be a replacement, but it can be a potential time saver as an initial reviewer, like CI. It always needs to be taken with care, but I think it is worth experimenting with, to see what it can do and how it develops over the next years. In the examples you linked, someone always explicitly assigned Copilot to an issue to (attempt to) solve it. So it does not look like AI is driving employees insane; the employees are experimenting themselves 😄: dotnet/runtime#115649 (comment)
Looking at the history, at least ~half of its PRs were merged. I am not sure how much (human) time and how many iterations it took to get all those PRs right; maybe a single human developer could have done it in much less time overall, or better: https://github.com/dotnet/runtime/pulls?q=is%3Apr+author%3Aapp%2Fcopilot-swe-agent+is%3Aclosed
But it is a tool that is developing 👀.


Back to topic:

Did I understand correctly that the CI tests now only fail because you've updated the URLs to point to master? So they will pass after merging?

Exactly. Two of the zero.bin checks will succeed after the PR has been merged, matching the expected 28 OK/6 errors. The failure does not happen exactly for the reason described, as the binary seems to need to be much larger to cause an error, probably depending on hardware/network. But adding test binaries to the code immediately reminds me of the liblzma vulnerability, so I'd better keep it at 2 bytes 😅. I'll open a PR for MIME type checking ASAP; then all binary link checks will succeed, and it is generally good to have those checks in place, to ensure fragment checking never chokes on binary files from that point on.

If there is a clever, non-complex way to test HTTP URLs not against a hard-coded repo but against the current one, give me a hint. For example, do Rust tooling or nextest have a tiny embedded webserver to spin up on the current directory?

@thomas-zahner
Member

@MichaIng True, it could be that Copilot gets better in the coming years, but as of now I think it's much more distracting than helpful. The dotnet/runtime PRs you linked are, I think, a perfect example of that. I just had a look at them. Updating the search with is:unmerged, you'll see that more than half of the PRs are unmerged. Most of them were rejected. The ones which were accepted were either really trivial (renaming/boilerplate [probably not clean code]) or involved a lot of manual intervention. If a human were to submit so many useless PRs, I'd consider banning them from my project.

So I really don't think generative AI fits into a developer platform like GitHub. If people want to test or use such tools they can do so locally. If they end up with anything good or useful they can still submit a PR. I can also recommend this little tale: https://endler.dev/2025/paolo/


You can probably use wiremock::MockServer as we do in many other tests.
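
For the binary-file case, a minimal setup could look like this (a sketch; the path and body bytes are made up):

use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};

// Serve two arbitrary bytes at /zero.bin, so the test does not depend
// on a public repository hosting the fixture file.
let mock_server = MockServer::start().await;
Mock::given(method("GET"))
    .and(path("/zero.bin"))
    .respond_with(ResponseTemplate::new(200).set_body_bytes(vec![0u8, 0u8]))
    .mount(&mock_server)
    .await;
let url = format!("{}/zero.bin#fragment", mock_server.uri());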

@MichaIng
Member Author

So I really don't think generative AI fits into a developer platform like GitHub. If people want to test or use such tools they can do so locally.

Generally I think it fits pretty well; it is just that the quality is not yet ready for production, or at least not beyond chore tasks. As said, to me it looks like Microsoft developers explicitly experiment with it. That is their decision, and it makes sense, since Microsoft owns and invests in it. No one forces them to merge bad PRs, but letting it do its thing and letting the Copilot (meta) developers review and use the results to make it better totally makes sense, IMO.

I do understand the skepticism also expressed in the tale, and stories about vibe coding underline it pretty well. But I think it is a mistake to conclude "AI is bad in general" or "AI should never do any coding". As with every new technology, it must be used with brain and care, and requires learning. Especially, we must not let it do things we do not understand. There are already a lot of black boxes for most people in IT; for example, the two of us surely have no idea what exactly the Rust compiler is doing in detail to generate the binary from our code. But AI lifts this to another level, as even AI (meta) developers have only a limited idea of what their AI might come up with, due to its non-programmed nature. But its potential is large, it will be used more and more, it will become more and more useful, and I am sure it will change (and already does change) how we (need to) work, whether we like the change or not. So learning to use it, experimenting with it and its limits, and checking out how far it can be useful for one's own projects, now or in the future, is no waste of time. At least as long as it's only the click of a button, or typing a question into an AI chat. So far, though, I would never think about letting it solve tasks in my own project, where a change in function A may have unrecognized impact elsewhere.

I hope no one is annoyed by us/me bloating this PR with AI discussion 😅.


You can probably use wiremock::MockServer as we do in many other tests.

I checked it out a bit, and it really is essentially for mocking requests, not for serving local content. So one would need to define manually that it serves the content of file A when a request of type X with path A comes in, and that it returns code 200, which somewhat defeats the purpose. For this particular binary file it would be okay; I guess it could be set up to serve fixed or random binary data on the test request, eliminating the need for the test file. But it would be great to have an actual small static webserver, so that all or most tests done on files in fixtures/ (currently all file:// URLs, and hence handled exclusively by the file.rs checker) could additionally be run as http:// requests, exercising website.rs in turn, without relying on a public repo which contains the same file.

Maybe the easiest option (least code for good common webserver behavior) is warp. For now, within this particular test, it could be something like this:

let input = fixtures_path().join("fragments");

// Spin up a webserver to serve the local fixture files via HTTP.
// `async move` is needed so that `input` is moved into the spawned task.
let file_server = tokio::spawn(async move {
    warp::serve(warp::fs::dir(input))
        .run(([127, 0, 0, 1], 1337))
        .await;
});

...

// Stop the webserver and wait for it to finish
file_server.abort();
let _ = file_server.await;

Generally it would be better to start it once before/for all tests, but I am currently not sure how best to achieve this. Maybe a function which checks whether the server is already running, and starts it only if not? Or is there some kind of test init feature?
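
One possible pattern for a start-once server, sketched with tokio::sync::OnceCell (helper name made up):

use tokio::sync::OnceCell;

static FILE_SERVER: OnceCell<()> = OnceCell::const_new();

// Hypothetical helper: starts the fixture webserver on the first call;
// subsequent calls are no-ops.
async fn ensure_file_server() {
    FILE_SERVER
        .get_or_init(|| async {
            tokio::spawn(
                warp::serve(warp::fs::dir(fixtures_path().join("fragments")))
                    .run(([127, 0, 0, 1], 1337)),
            );
        })
        .await;
}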

Member

@mre mre left a comment


From my side, we could merge this. The tests should pass after the merge to master as far as I understood.

@MichaIng
Member Author

From my side, we could merge this. The tests should pass after the merge to master as far as I understood.

Yeah, there is/was another test checking https://github.com/lycheeverse/lychee with a fragment already. Spinning up a local webserver for tests to make them repo-agnostic can be done in a separate PR.

@thomas-zahner
Member

I agree, this can be merged

@mre mre merged commit b970256 into lycheeverse:master Jun 20, 2025
5 of 6 checks passed
@mre
Member

mre commented Jun 20, 2025

@MichaIng, can you check out https://github.com/lycheeverse/lychee/actions/runs/15782696285/job/44492011076?
Looks like there are still some issues with the tests.

@mre mre mentioned this pull request Jun 20, 2025
@MichaIng
Member Author

Oops, I adjusted the repo in the test file, but not in the assertion 😄. Will open a PR to fix it. At least that one will then be green right from the start.


Labels

bug, false-positive, regression


Development

Successfully merging this pull request may close these issues.

Checking fragments in local files results in 'Cannot read input content from' errors and eventually 'Too many open files'
