It is unlikely that the spec's two definitions of valid URL strings are equal #704

domenic · 2022-09-02T02:45:08Z

(Background note: valid URL strings are completely different from parseable URL strings. See https://url.spec.whatwg.org/#writing and https://url.spec.whatwg.org/#urls , especially the table in the latter, for background.)

The spec currently gives two ways of determining whether an input string is considered "valid":

A string is a valid URL string if it follows the psuedo-grammar given in https://url.spec.whatwg.org/#url-writing . (This is the actual <dfn> of "valid URL string".)
A string is supposedly a valid URL string if running the URL parser on it never results in a step that invokes the validation error concept.

The latter definition is not precisely specified anywhere; the only thing you can really infer from the spec is that running the parser on some strings sometimes gives as a side effect, one or more validation errors. There's no actual bridging text to a concept like "valid URL string", that I can find.

But I think the intent is for these two to be equivalent.

The problem is, there's absolutely no guarantee this is the case. One instance of a mismatch was found in #437, but I suspect there are many more. As I commented there, I think it would be a fun project to first write an implementation of the parser which tracks validation errors, and then write an implementation of the #url-writing section, and then compare what results those two pieces of code get on a bunch of different (URL string, base URL) inputs. (Sounds like a job for a fuzzer!)

We could solve this in one of two ways:

Delete one of the two definitions of valid URL string
Make sure they align, and maybe try to maintain that via automated fuzzing somehow.

The latter sounds very silly and redundant to me at first glance. However:

The parser-produced validation errors have the nice property that they can be gotten almost for free as a side effect of parsing. (The hardest part is implementing "is a URL code point", which a non-validating parser doesn't need at all.) They also could easily be given names that allow you to track down exactly what went wrong; see Named validation errors #406. Whereas the grammar-ish #url-writing section just gives you a boolean predicate. So that argues for keeping the parser-produced validation errors...
On the other hand, Provide a grammar for the URL parser #479 indicates people do seem to like the idea of a short grammar-ish algorithm for determining validity. It's not 100% clear whether they are distinguishing validity from parseability; in particular, the fact that #url-writing has existed forever but people have spent 77 comments complaining about the spec lacking a grammar, indicates maybe #url-writing is not that helpful and what they're really complaining about is how the URL parser is complicated to implement. But I'm not sure.

So, I'm not sure where that really leaves us. But I wanted to log this issue so we had a canonical place to record this unfortunate fact about the current standard.

See also jsdom/whatwg-url#156.

The text was updated successfully, but these errors were encountered:

annevk · 2022-09-02T07:30:34Z

The HTML Standard does the same thing for the HTML syntax. Mainly because it serves different audiences. I liked that idea and based this on that. I do agree that there is definitely a chance for issues and it would be nice to automatically check them both.

alercah · 2022-09-02T11:58:01Z

This is an excellent point and really cuts, in a way, to the core of what #479 was originally about. Thank you!

Given that this is a well-explored area of theory, I think we can do better than a fuzzer and use a formally verified implementation of the parser. Then the spec could be arranged as follows, perhaps:

Make a formal grammar defining the valid URLs.
Add a section with a "web algorithm" which parsea URLs and determines validity.
Write an implementation of the algorithm in a language with formal verification (Idris would be my choice but only because or familiarity) that proves equivalence of the two approaches.

There could potentially be other benefits:

Proving the equivalence of various portions of the URL as parsed by the algorithm with the equivalent grammar productions, validating the use of other algorithms to produce the same result, at least on valid input.
Exploring whether there is a more succinct way to describe a "parseable" URL and the corresponding valid one.
With a formally verified implementation validating against a grammar, it would be possible to propose changes to the structure of the algorithm to improve clarity or fix bugs with confidence that they do not have unexpected consequences, which would make the spec easier to maintain.

I do have thoughts on the "named validation errors" issue but I think those belong in the other thread.

alwinb · 2022-09-02T17:03:10Z

in particular, the fact that #url-writing has existed forever but people have spent 77 comments complaining about the spec lacking a grammar.

There is a lot of useful information in those comments. People are not just complaining, they are taking the time to think about this and they are providing constructive feedback.

The HTML Standard does the same thing for the HTML syntax. Mainly because it serves different audiences. I liked that idea and based this on that.

I find this really nice. When a specification has both a formal grammar and an algorithm that matches it. And the courage to make the claim that they are equivalent. Multiple perspectives on the same thing help a lot of you want to really understand it.

domenic mentioned this issue Sep 2, 2022

Provide a grammar for the URL parser #479

Open

annevk added the topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing) label Sep 2, 2022

domenic mentioned this issue Nov 27, 2024

Clarify using CDATA in HTML context whatwg/html#10801

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

It is unlikely that the spec's two definitions of valid URL strings are equal #704

It is unlikely that the spec's two definitions of valid URL strings are equal #704

domenic commented Sep 2, 2022

annevk commented Sep 2, 2022

alercah commented Sep 2, 2022 •

edited

Loading

alwinb commented Sep 2, 2022

It is unlikely that the spec's two definitions of valid URL strings are equal #704

It is unlikely that the spec's two definitions of valid URL strings are equal #704

Comments

domenic commented Sep 2, 2022

annevk commented Sep 2, 2022

alercah commented Sep 2, 2022 • edited Loading

alwinb commented Sep 2, 2022

alercah commented Sep 2, 2022 •

edited

Loading