
fix: skip fragment check if website URL doesn't contain fragment #1733

Merged
mre merged 5 commits into lycheeverse:master from MichaIng:skip-fragment-check-by-url
Jun 20, 2025
Conversation

@MichaIng
Member

@MichaIng MichaIng commented Jun 16, 2025

Fixes #1709

Part 2, since the same issue happens with remote URLs as well.

EDIT: Tested those builds, and they work as expected. I am too lazy to add corresponding tests to the code 😛.

@MichaIng MichaIng added the bug, regression, and false-positive labels Jun 16, 2025
Signed-off-by: MichaIng <micha@dietpi.com>

Copilot AI left a comment


Pull Request Overview

This PR addresses issue #1709 by modifying the website fragment handling logic to skip the fragment check when the URL does not contain a fragment.

  • The conditional check has been updated to also verify that the URL contains a fragment.
  • The change ensures that HTML fragment checks are only performed when the URL contains a fragment.
Comments suppressed due to low confidence (1)

lychee-lib/src/checker/website.rs:106

  • Consider adding an inline comment explaining that the additional check 'response.url().fragment().is_some()' ensures that HTML fragment processing is only performed when a fragment is present, improving code clarity for future maintainers.
if self.include_fragments
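
For context, the combined guard would then look roughly like this (a sketch; the branch body is abbreviated):

if self.include_fragments && response.url().fragment().is_some() {
    // Only fetch and parse the body for a fragment check
    // when the URL actually carries a fragment.
    ...
}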

@thomas-zahner
Member

Seems to make sense. But yeah, maybe we should add tests before merging; I'll take a closer look in the coming days.

@mre What's up with that Copilot comment 😅 Did you enable this or was this enabled automatically by GitHub? I'd really prefer to disable this. It just reiterates the obvious with a good chance of hallucinating and lying about things.

@mre
Member

mre commented Jun 17, 2025

Yeah, seems to be a new feature. I'll try to disable it.

@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from c3ce5ed to 6bf642a on June 17, 2025 17:52
@MichaIng
Member Author

MichaIng commented Jun 17, 2025

The Copilot "review" was triggered by me. I was curious to see whether I could add it even for PRs on remote repos. It is potentially okay for getting a summary of larger PRs, but so far I have only seen it make generic best-practice recommendations, like here to add a comment to the code, and in other cases to do sanity checks and such. Sorry for the spam 😉.

Okay I added tests, but they do not behave as expected:

[200] file:///home/runner/work/lychee/lychee/fixtures/fragments/zero.bin#fragment
[ERROR] https://raw.githubusercontent.com/MichaIng/lychee/skip-fragment-check-by-url/fixtures/fragments/zero.bin#fragment | Cannot find fragment
  1. Why does the first link succeed, despite the non-existing fragment in the URL?
  2. The second link fails, but not with the expected `error decoding response body`. This is a single-NUL-byte file. I'll test locally in which case it fails to read the body.

EDIT: Okay

     [200] file:///tmp/test2.txt#fragment
   [ERROR] file:///tmp/test.html#fragment | Cannot find fragment
   [WARN ] Skipping fragment check for file:///tmp/zero.bin#fragment due to the following error: Cannot read input content from file `/tmp/zero.bin`
     [200] file:///tmp/zero.bin#fragment
  1. So for local files, it does not fail if reading the content fails; it only throws a warning: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/checker/file.rs#L174-L190
    The warning is not thrown for the NUL file, but only with at least 2 random bytes.
  2. Furthermore, while the content is read for any file type, the fragment checker (silently) succeeds if it is not a Markdown or HTML file: https://github.com/lycheeverse/lychee/blob/master/lychee-lib/src/utils/fragment_checker.rs#L74-L78

@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from 6bf642a to 1bf8dd7 on June 17, 2025 19:05
Signed-off-by: MichaIng <micha@dietpi.com>
@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from 1bf8dd7 to 04e2450 on June 17, 2025 19:22
`is_some()` is also true if the fragment is present but empty, i.e. a bare `#`. While this is an edge case, skip the fragment checker in the case of an empty fragment as well.

Signed-off-by: MichaIng <micha@dietpi.com>
@MichaIng MichaIng closed this Jun 17, 2025
@MichaIng MichaIng force-pushed the skip-fragment-check-by-url branch from fb85e63 to ab0da6f on June 17, 2025 19:55
@MichaIng
Member Author

MichaIng commented Jun 17, 2025

Interesting:

[WARN ] Skipping fragment check for file:///home/runner/work/lychee/lychee/fixtures/fragments/zero.bin# due to the following error: Cannot read input content from file `/home/runner/work/lychee/lychee/fixtures/fragments/zero.bin`
[WARN ] Skipping fragment check for file:///home/runner/work/lychee/lychee/fixtures/fragments/zero.bin#fragment due to the following error: Cannot read input content from file `/home/runner/work/lychee/lychee/fixtures/fragments/zero.bin`
[ERROR] https://raw.githubusercontent.com/MichaIng/lychee/skip-fragment-check-by-url/fixtures/fragments/zero.bin#fragment | Cannot find fragment
[200] https://raw.githubusercontent.com/MichaIng/lychee/skip-fragment-check-by-url/fixtures/fragments/zero.bin#

Pretty inconsistent behavior between the file and website checkers:

  • Local `zero.bin#` with an empty fragment is still passed to the fragment checker, despite `uri.url.fragment().is_some()`. An empty string passes this test?
  • Remote `zero.bin#` however throws neither an error nor a warning, so I assumed the content is not passed to the fragment checker. But testing with a larger file shows that it actually is; the checker just, for unknown reasons, does not fail parsing the content. I tested this with some other real image files as well: if they are not too big to cause a parser error, the fragment checker "succeeds" checking for fragments. But if I download the file and check it locally, the parser fails with `Cannot read input content from file`.

It is an edge case, and I do not want to pull all exceptions/early positive returns from the fragment checker into the parent file and website checkers, but `is_some_and(|x| !x.is_empty())` seems simple enough.
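
A minimal demonstration of the difference, using the url crate (URLs made up for illustration):

use url::Url;

let url = Url::parse("https://example.com/page#").unwrap();
// A bare trailing `#` yields Some(""), not None, so `is_some()`
// alone still hands the URL to the fragment checker.
assert_eq!(url.fragment(), Some(""));
assert!(url.fragment().is_some());
assert!(!url.fragment().is_some_and(|f| !f.is_empty()));

// A real fragment passes the stricter check as before.
let url = Url::parse("https://example.com/page#test").unwrap();
assert!(url.fragment().is_some_and(|f| !f.is_empty()));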

Generally, the fragment checker adds additional processing, and the parent checkers fetch and pass the whole content to it. So it makes sense to avoid invoking the fragment checker in the first place if there is no fragment to check, or no way the check can succeed. Strictly speaking, this includes #top as well. But on the other hand, the fragment checker itself checks for an empty fragment and #top too, and it is not great to do the same exception checks in three places in the code. Maybe a public function in the fragment checker could be used by the parent checkers file.rs and website.rs as well, so they know whether to fetch and pass the full content or not, while sharing the same code for the exceptions?
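
Such a shared helper could look roughly like this (a hypothetical sketch, not part of the current code base; the name is made up):

/// Hypothetical helper: returns whether a fragment check can succeed at all,
/// i.e. the fragment is present, non-empty, and not the implicit `#top`.
pub fn fragment_needs_checking(url: &url::Url) -> bool {
    url.fragment().is_some_and(|f| !f.is_empty() && f != "top")
}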

... but better to do that in a new PR. This one is good enough; the last commit is probably already too much. The test file description is not 100% accurate, since the remote binary URL does not fail with a read error (for such a small file) like the local one does, but the result is the same. Next I'll look into content-type based fragment check skipping for remote website.rs checks.

@MichaIng MichaIng reopened this Jun 17, 2025
Signed-off-by: MichaIng <micha@dietpi.com>
@MichaIng
Member Author

Okay, I tested new builds, and `is_some_and(|x| !x.is_empty())` works as expected: a URL of a large remote file with an empty `#` is no longer passed to the fragment checker, and succeeds quickly. The previous builds downloaded the file completely into RAM (or tried to do so) and passed it to the fragment checker, which then failed either way. No change to previous behavior in the case of `#test`, as expected.

To preserve the succeeding tests before I adjust the URLs to point to lycheeverse/master: https://github.com/lycheeverse/lychee/actions/runs/15716307029

Ready to be merged from my end.

Signed-off-by: MichaIng <micha@dietpi.com>
@thomas-zahner
Member

thomas-zahner commented Jun 18, 2025

The Copilot "review" was triggered by me.

Ah I see, no problem. I'm just really skeptical that it will be of any use to us, which is why I'd like to disable the feature. In my opinion, there is something wrong with the PR, documentation, or commit history if we need generative AI to summarise a PR. And code reviews should be done by humans. We don't want to end up like this 😅

Thanks for adding the tests. The PR looks good to me. Did I understand correctly that the CI tests now only fail because you've updated the URLs to point to master? So they will pass after merging?

... but better to do that in a new PR.

I agree. It probably makes sense to create a new issue for that.

@MichaIng
Member Author

In my opinion there is something wrong with the PR, documentation or commit history if we need generative AI to summarise a PR.

True, but I do see PRs that are too big/complex to understand, even when knowing the code base well. If AI fails to create a good summary, that is a good hint that the PR needs to be split 😅, and maybe AI already makes such suggestions, before a human reviewer needs to do a review iteration 😉.

And code reviews should be done by humans. We don't want to end up like this 😅

Also true, it cannot be a replacement, but it can be a potential time saver as an initial reviewer, like CI. It always needs to be taken with care, but I think it is worth experimenting with, to see what it can do and how it develops over the next years. In the examples you linked, someone always explicitly assigned Copilot to an issue to (attempt to) solve it. So it does not look like AI is driving employees insane; the employees are experimenting themselves 😄: dotnet/runtime#115649 (comment)
Looking at the history, at least ~half of its PRs were merged. I am not sure how much (human) time and how many iterations it took to get all those PRs right; maybe a single human developer could have done it in much less time overall, or better: https://github.com/dotnet/runtime/pulls?q=is%3Apr+author%3Aapp%2Fcopilot-swe-agent+is%3Aclosed
But it is a tool that is developing 👀.


Back to topic:

Did I understand correctly that the CI tests now only fail because you've updated the URLs to point to master? So they will pass after merging?

Exactly. Two of the zero.bin checks will succeed after the PR has been merged, matching the expected 28 OK/6 errors. The failure does not happen exactly for the reason described, as the binary seems to need to be much larger to cause an error, probably depending on hardware/network. But adding test binaries to the code immediately reminds me of the liblzma vulnerability, so I'd better keep it at 2 bytes 😅. I'll open a PR for MIME type checking ASAP; then all binary link checks will succeed, and it is generally good to have those checks in place, to ensure fragment checking never chokes on binary files from that point on.

If there is a clever, non-complex way to test HTTP URLs not against a hard-coded repo but against the current one, give me a hint. For example, do Rust tooling or nextest have a tiny embedded webserver to spin up on the current directory?

@thomas-zahner
Member

@MichaIng True, it could be that Copilot gets better in the coming years, but as of now I think it's much more distracting than helpful. The dotnet/runtime PRs you linked are, I think, a perfect example of that. I just had a look at them. Updating the search with is:unmerged, you'll see that more than half of the PRs are unmerged. Most of them were rejected. The ones which were accepted were either really trivial (renaming/boilerplate [probably not clean code]) or involved a lot of manual intervention. If a human were to submit so many useless PRs, I'd consider banning them from my project.

So I really don't think generative AI fits into a developer platform like GitHub. If people want to test or use such tools they can do so locally. If they end up with anything good or useful they can still submit a PR. I can also recommend this little tale: https://endler.dev/2025/paolo/


You can probably use wiremock::MockServer as we do in many other tests.
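
For the binary-file case, a minimal setup could look like this (a sketch; the path and body bytes are made up):

use wiremock::matchers::{method, path};
use wiremock::{Mock, MockServer, ResponseTemplate};

// Serve two arbitrary bytes at /zero.bin, so the test does not depend
// on a public repository hosting the fixture file.
let mock_server = MockServer::start().await;
Mock::given(method("GET"))
    .and(path("/zero.bin"))
    .respond_with(ResponseTemplate::new(200).set_body_bytes(vec![0u8, 0u8]))
    .mount(&mock_server)
    .await;
let url = format!("{}/zero.bin#fragment", mock_server.uri());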

@MichaIng
Member Author

So I really don't think generative AI fits into a developer platform like GitHub. If people want to test or use such tools they can do so locally.

Generally I think it fits pretty well; it is just that the quality is not yet ready for production, or at least not beyond chore tasks. As said, to me it looks like Microsoft developers explicitly experiment with it. That is their decision, and it makes sense, since Microsoft owns and invests in it. No one forces them to merge bad PRs, but letting it do its thing and letting the Copilot (meta) developers review and use the results to make it better totally makes sense, IMO.

I do understand the skepticism also expressed in the tale, and stories about vibe coding underline it pretty well. But I think it is a mistake to conclude "AI is bad in general" or "AI should never do any coding". As with every new technology, it must be used with brain and care, and requires learning. Especially, we must not let it do things we do not understand. There are already a lot of black boxes for most people in IT; for example, the two of us surely have no idea what exactly the Rust compiler is doing in detail to generate the binary from our code. But AI lifts this to another level, as even AI (meta) developers have only a limited idea of what their AI might come up with, due to its non-programmed nature. But its potential is large, it will be used more and more, it will become more and more useful, and I am sure it will change (and already does change) how we (need to) work, whether we like the change or not. So learning to use it, experimenting with it and its limits, and checking out how far it can be useful for one's own projects, now or in the future, is no waste of time. At least as long as it's only the click of a button, or typing a question into an AI chat. So far, though, I would never think about letting it solve tasks in my own project, where a change in function A may have unrecognized impact elsewhere.

I hope no one is annoyed by us/me bloating this PR with AI discussion 😅.


You can probably use wiremock::MockServer as we do in many other tests.

I checked it out a bit, and it really is essentially for mocking requests, not for serving local content. So one would need to define manually that it serves the content of file A when a request of type X with path A comes in, and that it returns code 200, which somewhat defeats the purpose. For this particular binary file it would be okay; I guess it could be set up to serve fixed or random binary data on the test request, eliminating the need for the test file. But it would be great to have an actual small static webserver, so that all or most tests done on files in fixtures/ (currently all file:// URLs, and hence handled exclusively by the file.rs checker) could additionally be run as http:// requests, exercising website.rs in turn, without relying on a public repo which contains the same file.

Maybe the easiest option (least code for good common webserver behavior) is warp. For now, within this particular test, it could be something like this:

let input = fixtures_path().join("fragments");

// Spin up a webserver to serve the local fixture files via HTTP.
// `async move` is needed so that `input` is moved into the spawned task.
let file_server = tokio::spawn(async move {
    warp::serve(warp::fs::dir(input))
        .run(([127, 0, 0, 1], 1337))
        .await;
});

...

// Stop the webserver and wait for it to finish
file_server.abort();
let _ = file_server.await;

Generally it would be better to start it once before/for all tests, but I am currently not sure how best to achieve this. Maybe a function which checks whether the server is already running, and starts it only if not? Or is there some kind of test init feature?
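
One possible pattern for a start-once server, sketched with tokio::sync::OnceCell (helper name made up):

use tokio::sync::OnceCell;

static FILE_SERVER: OnceCell<()> = OnceCell::const_new();

// Hypothetical helper: starts the fixture webserver on the first call;
// subsequent calls are no-ops.
async fn ensure_file_server() {
    FILE_SERVER
        .get_or_init(|| async {
            tokio::spawn(
                warp::serve(warp::fs::dir(fixtures_path().join("fragments")))
                    .run(([127, 0, 0, 1], 1337)),
            );
        })
        .await;
}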

Member

@mre mre left a comment


From my side, we could merge this. The tests should pass after the merge to master as far as I understood.

@MichaIng
Member Author

From my side, we could merge this. The tests should pass after the merge to master as far as I understood.

Yeah, there is/was another test checking https://github.com/lycheeverse/lychee with a fragment already. Spinning up a local webserver for tests to make them repo-agnostic can be done in a separate PR.

@thomas-zahner
Member

I agree, this can be merged

@mre mre merged commit b970256 into lycheeverse:master Jun 20, 2025
5 of 6 checks passed
@mre
Member

mre commented Jun 20, 2025

@MichaIng, can you check out https://github.com/lycheeverse/lychee/actions/runs/15782696285/job/44492011076?
Looks like there are still some issues with the tests.

@mre mre mentioned this pull request Jun 20, 2025
@MichaIng
Member Author

Oops, I adjusted the repo in the test file, but not in the assertion 😄. Will open a PR to fix it. At least that one will then be green right from the start.


Labels

bug, false-positive, regression


Development

Successfully merging this pull request may close these issues.

Checking fragments in local files results in 'Cannot read input content from' errors and eventually 'Too many open files'
