
Does it supply line&column numbers for the parsed tokens? #492

Open
hoijui opened this issue Jan 17, 2023 · 7 comments

Comments


hoijui commented Jan 17, 2023

I searched the sources and found _line_number in a few places, but overall I had the impression that this information is not available to client code. Am I wrong?

Member

jdm commented Jan 18, 2023

Changes in line numbers are available to client code in the tree builder through the following method:

/// Called whenever the line number changes.
fn set_current_line(&mut self, _line_number: u64) {}

We haven't had a reason to expose column number data in Servo so far, so we didn't bother looking into it.

Member

jdm commented Jan 18, 2023

Similarly, the tokenizer's sink receives a line number with each token:

/// Process a token.
fn process_token(&mut self, token: Token, line_number: u64) -> TokenSinkResult<Self::Handle>;
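To illustrate how a sink could use that line number, here is a minimal, self-contained sketch. The `Token` enum, the simplified `TokenSink` trait, and `LinkCollector` are hypothetical stand-ins written for this example; they only mirror the shape of the signature quoted above and are not html5ever's real types.

```rust
// Hypothetical stand-ins for illustration; these are NOT html5ever's real types.
#[derive(Debug)]
enum Token {
    StartTag { name: String, href: Option<String> },
    Text(String),
}

/// Simplified sink: receives each token together with the line it was seen on,
/// mirroring the shape of `process_token(token, line_number)`.
trait TokenSink {
    fn process_token(&mut self, token: Token, line_number: u64);
}

/// Records the line number of every `<a href=...>` start tag it receives.
#[derive(Default)]
struct LinkCollector {
    links: Vec<(String, u64)>,
}

impl TokenSink for LinkCollector {
    fn process_token(&mut self, token: Token, line_number: u64) {
        if let Token::StartTag { name, href: Some(href) } = token {
            if name == "a" {
                self.links.push((href, line_number));
            }
        }
    }
}

fn main() {
    let mut sink = LinkCollector::default();
    // A real tokenizer would drive these calls; here they are fed by hand.
    sink.process_token(Token::Text("intro".into()), 1);
    sink.process_token(
        Token::StartTag { name: "a".into(), href: Some("https://example.com".into()) },
        3,
    );
    for (href, line) in &sink.links {
        println!("{href} found on line {line}");
    }
}
```

This gives line-level positions only; columns would need extra bookkeeping that the quoted signature does not provide.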

Author

hoijui commented Jan 18, 2023

Thank you, @jdm! :-)
I am working on some code that checks links in documents and tells the user which ones are valid and which are not (anymore). For this, I have to be able to tell the user exactly where these links are in the document, so they can fix them.
I am currently using a very shady, ueber-simple, self-made HTML parser, because none of the HTML parsing libraries seem to supply line and column info. I understand that tracking these for every little detail makes no sense in 99% of the use cases for these libraries, so I am not suggesting adding this. I would be glad for some hints about how to go about this.
Will I need to maintain a fork of one of these libraries (e.g. html5ever)?
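One possible hint, sketched under assumptions: if whatever parser is used can report a byte offset for each link, the line and column can be derived from the source text afterwards, independent of the parser. The `LineIndex` type below is a hypothetical helper written for this example, and the offset is found with a plain string search rather than a real HTML parser.

```rust
/// Byte offsets at which each line of the source starts.
struct LineIndex {
    line_starts: Vec<usize>,
}

impl LineIndex {
    fn new(source: &str) -> Self {
        let mut line_starts = vec![0];
        for (i, b) in source.bytes().enumerate() {
            if b == b'\n' {
                line_starts.push(i + 1);
            }
        }
        LineIndex { line_starts }
    }

    /// Returns a 1-based (line, column) pair for a byte offset.
    /// The column counts characters, not bytes. The offset must lie on a
    /// character boundary of `source`, otherwise the slice below panics.
    fn line_col(&self, source: &str, offset: usize) -> (usize, usize) {
        // Index of the last line start that is <= offset.
        let line = self.line_starts.partition_point(|&start| start <= offset) - 1;
        let col = source[self.line_starts[line]..offset].chars().count();
        (line + 1, col + 1)
    }
}

fn main() {
    let html = "<p>\n  <a href=\"https://example.com\">x</a>\n</p>\n";
    let index = LineIndex::new(html);
    // Stand-in for an offset reported by a parser.
    let offset = html.find("<a ").expect("sample contains a link");
    let (line, col) = index.line_col(html, offset);
    println!("link at line {line}, column {col}"); // prints "link at line 2, column 3"
}
```

Building the index once keeps each lookup to a binary search plus a scan of a single line, which matters when a document contains many links.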


RXminuS commented Jul 28, 2023

Yeah the line number on its own is kind of useless for certain applications. For my own project I'm having to resort to https://github.com/y21 just to get the exact byte positions of each DOM node.

Positions for DOM nodes were also recently added to JSoup, and they seem to be available in HTML parsers in other major languages too, so I think it would make sense to figure out a way for html5ever to provide the same. There have also been several issues over the years asking for similar features.

One thing I tried, but couldn't quite get working, is to provide a byte stream whose offset I can read as tokens are emitted from html5ever. However, since tokens are actually consumed ahead of time, it doesn't quite give the right positions. This could maybe be fixed by providing something that's Peekable, but tbh I didn't really like that direction anyway.

Are there any better ideas for how this could be added in such a way that the performance penalty is opt-in?
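One conceivable shape for an opt-in design, offered only as a sketch: make the tokenizer generic over a position tracker that it updates itself at the moment it consumes input, so read-ahead in the input stream does not skew the positions. Everything below (`PositionTracker`, `NoTracking`, `ByteLineCol`, and the toy `tag_starts` tokenizer) is hypothetical and is not part of html5ever.

```rust
/// Hypothetical hook a tokenizer would call as it consumes input.
trait PositionTracker {
    fn advance(&mut self, consumed: &str);
}

/// Default choice: records nothing, so the optimizer can typically remove the calls.
#[derive(Debug, Clone)]
struct NoTracking;

impl PositionTracker for NoTracking {
    #[inline]
    fn advance(&mut self, _consumed: &str) {}
}

/// Opt-in choice: running byte offset plus 0-based line and column.
#[derive(Debug, Default, Clone, PartialEq)]
struct ByteLineCol {
    offset: usize,
    line: u64,
    column: u64,
}

impl PositionTracker for ByteLineCol {
    fn advance(&mut self, consumed: &str) {
        for ch in consumed.chars() {
            self.offset += ch.len_utf8();
            if ch == '\n' {
                self.line += 1;
                self.column = 0;
            } else {
                self.column += 1;
            }
        }
    }
}

/// Toy tokenizer: finds `<...>` runs and snapshots the tracker where each begins.
fn tag_starts<T: PositionTracker + Clone>(input: &str, mut tracker: T) -> Vec<(String, T)> {
    let mut out = Vec::new();
    let mut rest = input;
    while let Some(lt) = rest.find('<') {
        // Account for the text before the tag, then snapshot the position.
        tracker.advance(&rest[..lt]);
        let end = rest[lt..].find('>').map_or(rest.len(), |gt| lt + gt + 1);
        out.push((rest[lt..end].to_string(), tracker.clone()));
        tracker.advance(&rest[lt..end]);
        rest = &rest[end..];
    }
    out
}

fn main() {
    let html = "<p>\n<a href=\"x\">link</a>";
    for (tag, pos) in tag_starts(html, ByteLineCol::default()) {
        println!("{tag} at {pos:?}");
    }
    // Same tokenizer with tracking disabled.
    let untracked = tag_starts(html, NoTracking);
    println!("{} tags, no positions recorded", untracked.len());
}
```

Because the tracker is a type parameter, callers who pick `NoTracking` get a separately compiled tokenizer with empty tracking calls; whether that fully removes the cost in a real tokenizer would need measuring.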

Author

hoijui commented Jul 28, 2023

hey @RXminuS :-)
... you resorted to https://github.com/y21/tl?
why is it not optimal?


RXminuS commented Jul 28, 2023

It's not actively maintained, and you need to do some hacky things, such as replacing script/style/noscript content; otherwise the ranges will be off, since it still matches tokens inside those elements (i.e. there is no state switching).


domenic commented Feb 15, 2024

For anyone else running into this problem, in whatwg/html-build#291 I'm creating a RcDomWithLineNumbers which overrides the two methods necessary to at least track line numbers in the errors recorded. I'm very much a Rust beginner so it's just kind of been a process of flailing around until I got something working, and the fact that Rust makes you delegate all methods of TreeSink just to override set_current_line (to record the current line) and parse_error (to augment the recorded error with the current line) seems bonkers. But it seems to work so far.

Column numbers, of course, are not so easy.
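The delegation pattern described above can be sketched in miniature. The `Sink` trait here is a hypothetical, much smaller stand-in for `TreeSink`, and `BasicSink` and `WithLineNumbers` are invented for this example; this is not the code from whatwg/html-build#291.

```rust
/// Hypothetical, much smaller stand-in for a tree-sink trait.
trait Sink {
    fn parse_error(&mut self, msg: String);
    fn append_text(&mut self, text: &str);
    /// Called whenever the line number changes (default: ignore).
    fn set_current_line(&mut self, _line_number: u64) {}
}

/// Plain sink that stores errors and text, unaware of line numbers.
#[derive(Default)]
struct BasicSink {
    errors: Vec<String>,
    text: String,
}

impl Sink for BasicSink {
    fn parse_error(&mut self, msg: String) {
        self.errors.push(msg);
    }
    fn append_text(&mut self, text: &str) {
        self.text.push_str(text);
    }
}

/// Wrapper that remembers the current line and prefixes errors with it.
struct WithLineNumbers<S: Sink> {
    inner: S,
    current_line: u64,
}

impl<S: Sink> Sink for WithLineNumbers<S> {
    // Overridden: record the line, then pass the call on.
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
        self.inner.set_current_line(line_number);
    }
    // Overridden: augment the error with the recorded line.
    fn parse_error(&mut self, msg: String) {
        self.inner.parse_error(format!("line {}: {}", self.current_line, msg));
    }
    // Every other method has to be forwarded by hand.
    fn append_text(&mut self, text: &str) {
        self.inner.append_text(text);
    }
}

fn main() {
    let mut sink = WithLineNumbers { inner: BasicSink::default(), current_line: 1 };
    // A real parser would drive these calls; here they are fed by hand.
    sink.append_text("hello");
    sink.set_current_line(4);
    sink.parse_error("unexpected end tag".to_string());
    println!("{:?}", sink.inner.errors); // prints ["line 4: unexpected end tag"]
}
```

With only three methods the forwarding is trivial; the boilerplate complained about above comes from a trait with many more methods, each needing its own one-line forwarder.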
