-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
It is unlikely that the spec's two definitions of valid URL strings are equal #704
Comments
The HTML Standard does the same thing for the HTML syntax. Mainly because it serves different audiences. I liked that idea and based this on that. I do agree that there is definitely a chance for issues and it would be nice to automatically check them both. |
This is an excellent point and really cuts, in a way, to the core of what #479 was originally about. Thank you! Given that this is a well-explored area of theory, I think we can do better than a fuzzer and use a formally verified implementation of the parser. Then the spec could be arranged as follows, perhaps:
There could potentially be other benefits:
I do have thoughts on the "named validation errors" issue but I think those belong in the other thread. |
There is a lot of useful information in those comments. People are not just complaining, they are taking the time to think about this and they are providing constructive feedback.
I find this really nice. When a specification has both a formal grammar and an algorithm that matches it. And the courage to make the claim that they are equivalent. Multiple perspectives on the same thing help a lot of you want to really understand it. |
(Background note: valid URL strings are completely different from parseable URL strings. See https://url.spec.whatwg.org/#writing and https://url.spec.whatwg.org/#urls , especially the table in the latter, for background.)
The spec currently gives two ways of determining whether an input string is considered "valid":
<dfn>
of "valid URL string".)The latter definition is not precisely specified anywhere; the only thing you can really infer from the spec is that running the parser on some strings sometimes gives as a side effect, one or more validation errors. There's no actual bridging text to a concept like "valid URL string", that I can find.
But I think the intent is for these two to be equivalent.
The problem is, there's absolutely no guarantee this is the case. One instance of a mismatch was found in #437, but I suspect there are many more. As I commented there, I think it would be a fun project to first write an implementation of the parser which tracks validation errors, and then write an implementation of the #url-writing section, and then compare what results those two pieces of code get on a bunch of different (URL string, base URL) inputs. (Sounds like a job for a fuzzer!)
We could solve this in one of two ways:
The latter sounds very silly and redundant to me at first glance. However:
The parser-produced validation errors have the nice property that they can be gotten almost for free as a side effect of parsing. (The hardest part is implementing "is a URL code point", which a non-validating parser doesn't need at all.) They also could easily be given names that allow you to track down exactly what went wrong; see Named validation errors #406. Whereas the grammar-ish #url-writing section just gives you a boolean predicate. So that argues for keeping the parser-produced validation errors...
On the other hand, Provide a grammar for the URL parser #479 indicates people do seem to like the idea of a short grammar-ish algorithm for determining validity. It's not 100% clear whether they are distinguishing validity from parseability; in particular, the fact that #url-writing has existed forever but people have spent 77 comments complaining about the spec lacking a grammar, indicates maybe #url-writing is not that helpful and what they're really complaining about is how the URL parser is complicated to implement. But I'm not sure.
So, I'm not sure where that really leaves us. But I wanted to log this issue so we had a canonical place to record this unfortunate fact about the current standard.
See also jsdom/whatwg-url#156.
The text was updated successfully, but these errors were encountered: