WHATWG-compliant? #34

Open
untitaker opened this issue Mar 20, 2022 · 2 comments


@untitaker

Does this parser attempt to follow the spec that browsers follow?

Rust crates that follow WHATWG (I think this list is complete):

  • lol-html
  • html5ever
  • html5gum

In its benchmarking suite, tl compares itself against both kinds of parsers: ones that attempt to comply with the WHATWG spec and ones that don't. Since WHATWG defines error recovery etc. very precisely, that limits what kind of optimizations one can do, and explains why html5ever is slow.
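To make the point about optimizations concrete, here is a minimal sketch of one place where the spec constrains a parser. It is not code from tl, html5ever, or any other crate; both function names are hypothetical and the input is assumed to start right after the `<`. In the WHATWG "tag name" tokenizer state, ASCII uppercase letters are lowercased and U+0000 is replaced with U+FFFD, so a parser following that state cannot always hand back a borrowed slice of the input, while a parser that ignores those rules can.

```rust
/// Non-compliant fast path (hypothetical): return the tag name as a
/// borrowed slice of the input, with no normalization and no allocation.
/// Assumes `input` starts immediately after the `<`.
fn tag_name_fast(input: &str) -> &str {
    let end = input
        .find(|c: char| matches!(c, '\t' | '\n' | '\x0C' | ' ' | '/' | '>'))
        .unwrap_or(input.len());
    &input[..end]
}

/// Spec-style handling (hypothetical, covers only the tag name state):
/// lowercase ASCII uppercase letters and replace NUL with U+FFFD.
/// The result is built in an owned String because it can differ from the input.
fn tag_name_spec(input: &str) -> String {
    let mut name = String::new();
    for c in input.chars() {
        match c {
            // These characters end the tag name state.
            '\t' | '\n' | '\x0C' | ' ' | '/' | '>' => break,
            // NUL is a parse error; the spec appends U+FFFD instead.
            '\0' => name.push('\u{FFFD}'),
            // Everything else is appended, with ASCII uppercase lowercased.
            _ => name.push(c.to_ascii_lowercase()),
        }
    }
    name
}
```

For an input like `DIV class=a>`, the first function yields `DIV` as a slice of the original buffer and the second yields an owned `div`. A real compliant tokenizer has many more states and recovery rules than this one; the sketch only shows why the two kinds of parsers do different amounts of work.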

@y21
Owner

y21 commented Mar 20, 2022

Thanks for bringing this up, it's a good point and I think it's important to add this to the README. It seems a little unfair to compare crates like html5ever to this crate and make them look "bad" in the benchmarks when the reason is probably that they closely follow the spec, as you say. This is mentioned in the separate benchmark repo (linked in the README), but it's kind of hidden behind a wall of text (and not mentioned here), which is unfortunate.
Currently, this crate doesn't follow the full spec. When I made this crate, I needed a fast library for a project, something that can parse "sane" HTML documents very quickly (it doesn't need to be spec-compliant, it just needs to be able to parse the typical document) and that provides a simple API to interact with the parsed tree. Hopefully in the future we can work towards being WHATWG-compliant without losing too much performance. Also html5gum looks cool ;)

bors bot pushed a commit that referenced this issue Mar 21, 2022
There's nothing in tl right now that documents the spec compliance status (whether it tries to follow the spec or not, see #34).
This PR adds some information about this to the README and changes the benchmark section. It should now be a lot clearer which parsers attempt to comply with the specification and which ones don't. It should also be clear that this is more of a theoretical benchmark and doesn't necessarily say something meaningful (at least when comparing tl to html5ever and lol-html), since the performance of a parser that attempts to follow the specification can't be compared to one that doesn't (the spec really limits what one can do).
@y21
Owner

y21 commented Mar 21, 2022

I added a few things to the README, hoping that it makes the goals of this crate clearer. I've also fixed up the benchmarks section a bit. The table now has a "follows spec" column (whether compliance with the spec is a goal) and a "note" saying that it's important to understand what difference it makes for performance if one isn't bound to a specification.
