Extract spans #1870

Merged
thomas-zahner merged 6 commits into master from extract-spans
Oct 10, 2025

@thomas-zahner
Member

Replaces #1806. Rebased on master and tidied up a few small things. This moves us one big step closer to completing #1304.

Thank you very much @Akida31

Akida31 and others added 3 commits October 9, 2025 14:02
This can be used in the future to add line numbers and columns in
output.

Note that the html5ever extractor does not output column information,
since currently there is no way to retrieve it.
@thomas-zahner
Member Author

@Akida31 as you can see in CI, test_include_verbatim fails with:

thread 'extract::html::html5ever::tests::test_include_verbatim' panicked at lychee-lib/src/extract/html/html5ever.rs:349:9:
    assertion `left == right` failed
      left: [RawUri { text: "https://example.com/", element: None, attribute: None, span: RawUriSpan { line: 4, column: None } }, RawUri { text: "https://example.org/", element: Some("a"), attribute: Some("href"), span: RawUriSpan { line: 4, column: None } }, RawUri { text: "https://foo.com/", element: None, attribute: None, span: RawUriSpan { line: 10, column: Some(9) } }, RawUri { text: "http://bar.com/some/path", element: None, attribute: None, span: RawUriSpan { line: 10, column: Some(29) } }, RawUri { text: "https://baz.org/", element: Some("a"), attribute: Some("href"), span: RawUriSpan { line: 9, column: None } }]
     right: [RawUri { text: "https://example.com/", element: None, attribute: None, span: RawUriSpan { line: 4, column: None } }, RawUri { text: "https://example.org/", element: Some("a"), attribute: Some("href"), span: RawUriSpan { line: 4, column: None } }, RawUri { text: "https://foo.com/", element: None, attribute: None, span: RawUriSpan { line: 7, column: None } }, RawUri { text: "http://bar.com/some/path", element: None, attribute: None, span: RawUriSpan { line: 7, column: None } }, RawUri { text: "https://baz.org/", element: Some("a"), attribute: Some("href"), span: RawUriSpan { line: 9, column: None } }]

So the line numbers and columns for https://foo.com/ and http://bar.com/some/path are no longer correct.
This is caused by rebasing onto master, where html5ever was updated from version 0.31.0 to 0.35.0. I debugged the tokens in process_token and noticed the following difference between the old and new versions:

<     Tendril<UTF8>(shared: "        Some random text"),
< )
< [lychee-lib/src/extract/html/html5ever.rs:51:15] token = CharacterTokens(
<     Tendril<UTF8>(inline: "\n"),
< )
< [lychee-lib/src/extract/html/html5ever.rs:51:15] token = CharacterTokens(
<     Tendril<UTF8>(shared: "        https://foo.com and http://bar.com/some/path"),
< )
< [lychee-lib/src/extract/html/html5ever.rs:51:15] token = CharacterTokens(
<     Tendril<UTF8>(inline: "\n"),
< )
< [lychee-lib/src/extract/html/html5ever.rs:51:15] token = CharacterTokens(
<     Tendril<UTF8>(shared: "        Something else"),
< )
< [lychee-lib/src/extract/html/html5ever.rs:51:15] token = CharacterTokens(
<     Tendril<UTF8>(inline: "\n"),
< )
< [lychee-lib/src/extract/html/html5ever.rs:51:15] token = CharacterTokens(
<     Tendril<UTF8>(inline: "        "),
---
>     Tendril<UTF8>(shared: "        Some random text\n        https://foo.com and http://bar.com/some/path\n        Something else\n        "),

So in summary, the newer html5ever version produces a single Tendril containing multiple lines, whereas the older version produced one Tendril per line. I haven't yet been able to figure out how to adapt to this new behaviour. The problem might have to do with SourceSpanProvider::from_input. Do you know how to fix this?

@Akida31
Contributor

Akida31 commented Oct 10, 2025

I think the problem is that the line number given by html5ever is the number of the last line of the multi-line tendril. So I think the following patch should fix that by giving the span provider the line number of the first line of the tendril. I didn't test this patch, so I don't know if this fixes the issue.

diff --git a/lychee-lib/src/extract/html/html5ever.rs b/lychee-lib/src/extract/html/html5ever.rs
index d7fa67bbf1..2c70d00865 100644
--- a/lychee-lib/src/extract/html/html5ever.rs
+++ b/lychee-lib/src/extract/html/html5ever.rs
@@ -58,6 +58,14 @@
                     return TokenSinkResult::Continue;
                 }
                 if self.include_verbatim {
+                    // offset line number by line breaks included in the raw text
+                    let line_number = line_number.saturating_sub(
+                        raw.chars()
+                            .filter(|c| *c == '\n')
+                            .count()
+                            .try_into()
+                            .unwrap(),
+                    );
                     self.links
                         .borrow_mut()
                         .extend(extract_raw_uri_from_plaintext(
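
To make the arithmetic in the patch concrete, here is a minimal standalone sketch of the same idea. It assumes, as suggested above, that the line number reported for a multi-line text chunk is that of its last line. The function name first_line_of_chunk and the u64 type are hypothetical and chosen only for illustration; they are not part of lychee or html5ever.

```rust
/// Hypothetical helper (not lychee's actual code): given the line number
/// reported for a text chunk and the chunk's raw text, estimate the line
/// on which the chunk starts by subtracting the line breaks it contains.
/// Assumption: `reported_line` refers to the chunk's last line.
fn first_line_of_chunk(reported_line: u64, raw: &str) -> u64 {
    // Count the '\n' bytes contained in the chunk.
    let newlines = raw.bytes().filter(|b| *b == b'\n').count();
    // Convert without panicking; clamp if the count does not fit in u64.
    let newlines = u64::try_from(newlines).unwrap_or(u64::MAX);
    // Saturate at 0 rather than underflow if the count exceeds the line.
    reported_line.saturating_sub(newlines)
}
```

With illustrative numbers, a chunk containing three line breaks and reported at line 10 would start at line 7. A chunk without line breaks keeps its reported line, and the subtraction saturates at 0 if the chunk contains more line breaks than the reported line number.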

If you have questions about how my larger changes work, I'm happy to answer them. Sadly, documenting code isn't a strength of mine currently.

@thomas-zahner
Member Author

This indeed fixed the problem 👍
I'm now finally merging this. Note that lychee-bin doesn't make use of the spans yet, which is what #1304 requests.
But I think the hardest part in getting there is completed with this PR so thanks again @Akida31!

@thomas-zahner thomas-zahner merged commit 1f97165 into master Oct 10, 2025
6 checks passed
@thomas-zahner thomas-zahner deleted the extract-spans branch October 10, 2025 09:21
@mre mre mentioned this pull request Oct 10, 2025
This was referenced Oct 21, 2025