-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Base URL inheritance gives unintuitive results #179
Comments
I gave up on adopting URLPattern largely due to the behaviors described here, so I would be very happy to see them change. |
I think a second change is required to kill the requirement for new URLPattern('https://example.com/wp-login.php').test('https://example.com/wp-login.php?a=1')
// false We would have to also make these default to wildcard, or default to wildcard if an earlier component is present (functionally, this distinction only really applies to the case of a URL pattern which is just a hash), in order to avoid this. Authors would then have to explicitly write Also, is there a compelling use case for |
Right, this isn't really an "assumption" as much as just what the URL parser does. It produces an empty string for these components. We can deviate from that, but just be aware that we are doing so. |
@wanderview seemed to think it was important, in the last paragraph of WICG/nav-speculation#259 (comment). I admit I can't think of any use case myself. |
Somehow I didn't realize this. I swear I saw some version of the string format defaulting to What do you think of making the wildcard-for-unset-components apply the same way uniformly? Regardless of base URL vs. no base URL, string format vs. object format. Then I don't know if that's a good idea, since we haven't run into much confusion for patterns of that sort. But it is more elegant, maybe? |
I'm not opposed to it necessarily. The example of I think the tricky part is when you extend this to strings that will parse as URLs. We've now deviated from the URL parser behavior which just seems risky to me. But again, not opposed in general. Have you given thought how to make these backward incompatible changes? Is the intent to measure volume of usage to see if its safe to just change them? Or do you expect to create some new API surface while deprecating this one? |
Any change in this direction does lose the kinda nice property that you can just escape a URL properly and it becomes a URL pattern that matches exactly that URL. Unfortunately I think that property is probably less useful than matching expectations in other ways. Very anecdotally I think for standard URLs I would expect to need to specify the protocol, hostname and path, as those are the parts that are always visible in a URL string. So I don't think it's that bad to expect to write By contrast, in One difference between doing any-earlier-specified vs wildcard-if-omitted for the string format stuff is the behavior of things like If we want to make a change of this type, I think we would ideally evaluate any cases in the wild that would be adversely affected, hopefully there are few-to-none, and then just make a breaking change. |
So for the string format, we can compare the following three policies: (a) status quo Some examples with no base URL:
(c) is probably not too bad, though (b) is a smaller behavior change. |
I would extend this same logic to
For (c), this differs from
I tend to think this should not match. |
re. username and password, I agree they shouldn't be ordinarily showing up, but if I'm an author and I say "okay, you can can prefetch everything except |
For (c), we can of course define it, but the most direct interpretation of what you suggested would seem to imply that specifying the (Tangentially, it's nifty to learn that more stuff is optional in URLs than I realized. |
The Chromium use counters for this are in stable now, and are both registering virtually zero (let's say <0.01%). Therefore I think this change is web-compatible. |
This change seems reasonable overall, but given @jeremyroman's example above around |
A similar username/password question arises for the base URL inheritance side of things, I discovered while tweaking Chromium unit tests in a draft CL.
Incidentally, to my mild surprise, the username and password do contribute to We could leave this to authors as a case they need to consider when using URL patterns for blocking from suspicious content, or we could extend the defaults to always wildcard username and password (at a cost of making it necessary to explicitly make them empty by writing It is possible that, conversely, an author is using patterns to allow something specific but since servers typically ignore usernames/passwords they aren't expecting (AFAIK) that seems less risky. And it's consistent with them defaulting to I agree that writing out the interesting variants (ideally all of them, but I don't know how large that combinatorial explosion is) would be an interesting way of looking for additional surprises. |
I tend toward treating username/password in similar ways for both types of defaulting: base URL inheritance, and wildcard-if-something-earlier-is-wildcard. I think that suggests in your example, they would still be inherited, since the wildcard only occurs in the path ( I don't think this is the highest-stakes decision though, because URLs using usernames/passwords are non-conformant, and (I think?) on a path toward removal. (Mike West, purposefully not pinging him, has expressed cheerful sentiments about removing support for fetching such URLs in the past.) I would, in fact, be very OK with encouraging URLPattern-using specs to turn themselves off, or at least turn URL pattern matching off, when used on documents with such URLs. |
I think it really depends on the use case what ends up being bad:
I think since username and password are so unexpected, wildcarding them unless we explicitly know intent (i.e., they are explicitly set) seems very reasonable and probably even necessary. |
Without any additional change to username/password, here's what Chromium does today vs with proposed changes, including every variation of including/excluding a typical fixed part, and then the patterns from the WPT test suite. https://gist.github.com/jeremyroman/c712860aed8080b5524feb8118a5105a Edit: some dictionary-style examples were incorrect right but have been corrected |
Similar diffs for wildcarding username/password unless they're explicit (i.e., never inherit them from the base URL, and in the string format wildcard them unless specified). https://gist.github.com/jeremyroman/8d99c4667490800dbb3c46d3d26b4885 Even though this adds more logic/rules, I do think it seems likely to make for better defaults. If other behavior is desirable we can always add options in the future, but I'm hopeful they won't prove that necessary in practice. |
The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d
The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d
The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197}
…nd constructor string parsing The following changes apply to patterns which are constructed using a base URL, the constructor string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. For example: 1. `"https://example.com/*"` also matches with any username, password, search, and hash. Previously this would be written `"https://*:*@example.com/*\\?*#*"`. 2. `new URLPattern({ pathname: "/login" }, "https://example.com/?q=hello")` accepts any query string and hash, not only `"?q=hello"` and `""`. 3. `"https://*:*"` or `{protocol: "https"}` means "any HTTPS URL, on any port", and `"https://*"` means "any HTTPS URL on the default port (443)". These have the same meaning whether or not a base URL is provided, since specifying the protocol prohibits inheritance of other components. This makes patterns more expansive than before, in cases where wildcards are likely to be desirable. The logic of inheriting components from a base URL dictionary is also similarly changed in a way that may make it _not_ match where it did before, but more consistently with the above and with how relative URL strings are resolved. For example, `new URLPattern("https://example.com/foo?q=1#hello").test({pathname: "/foo", hash: "hello", baseURL: "https://example.com/bar?q=1"})` previously returned `true` but will now be `false`, since the search component is not inherited when the pathname is specified. This is analogous to how `new URL("/foo#hello", "https://example.com/bar?q=1")` works. The reverse is also possible; in both cases this is quite niche. Fixes #179.
The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197}
The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197}
…atwg/urlpattern#179., a=testonly Automatic update from web-platform-tests URLPattern: Apply syntax changes from whatwg/urlpattern#179. The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197} -- wpt-commits: d3e55612911b00cb53271476de610e75a8603ae7 wpt-pr: 42738
…atwg/urlpattern#179., a=testonly Automatic update from web-platform-tests URLPattern: Apply syntax changes from whatwg/urlpattern#179. The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197} -- wpt-commits: d3e55612911b00cb53271476de610e75a8603ae7 wpt-pr: 42738
…atwg/urlpattern#179., a=testonly Automatic update from web-platform-tests URLPattern: Apply syntax changes from whatwg/urlpattern#179. The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197} -- wpt-commits: d3e55612911b00cb53271476de610e75a8603ae7 wpt-pr: 42738
…atwg/urlpattern#179., a=testonly Automatic update from web-platform-tests URLPattern: Apply syntax changes from whatwg/urlpattern#179. The following changes apply to patterns which are constructed using a base URL, the string syntax, or both -- but not any pattern which explicitly specifies components separately without a base URL. * Components are not inherited from a base URL if an "earlier" component is explicitly specified. * In the string format, unspecified "later" components are implicitly wildcarded, rather than required to be empty (with the exception of the port, which is always taken to be specified when the hostname is). * Username and password are never implicitly specified or inherited. This means that a pattern like "https://example.com/*" also matches with any username, password, search, and hash. Previously this would be written "https://*:*@example.com/*\\?*#*". Bug: 1468446 Change-Id: Ie0d7a80e36e89e05a0c634f7565c3365909edb2d Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/4975595 Reviewed-by: Dominic Farolino <[email protected]> Commit-Queue: Jeremy Roman <[email protected]> Cr-Commit-Position: refs/heads/main@{#1221197} -- wpt-commits: d3e55612911b00cb53271476de610e75a8603ae7 wpt-pr: 42738
When Chrome recently updated, customers started reporting issues with a webapp that uses URLPattern and it was due to this change. Hash is important in SPAs and I don't think it should be "wildcarded" like this by default. In my case, I have a tabbed UI which sets the hash as tabs are clicked. When users load the app the first time (without a hash), I have a redirect that will add the hash causing the first tab to display. With this change it causes an infinite-loop, since test now returns |
Motivating this change, we noticed that several developers were surprised that, e.g., In your case I hope that adapting to this change would be relatively straightforward in a way that is backward-compatible with any prior implementations of URLPattern, such as:
If your code looks like (1), it's possible it might have previously not worked correctly if a search query was present and not stripped (and some pages receive these unexpectedly, like the |
Thanks @jeremyroman I had tried adding the IMHO, it's more surprising that it would match things I'm not explicit about but I see the other perspective as well. In my view with SPAs the url is part of the state and |
All has minor changes - floating-ui - react-error-boundary (https://github.com/bvaughn/react-error-boundary/releases/tag/5.0.0) - urlpattern polyfill (whatwg/urlpattern#179) - use-debounce (https://github.com/xnimorz/use-debounce/releases/tag/10.0.0) - react-aria - react-hot-toast (https://github.com/timolins/react-hot-toast/releases/tag/v2.5.0)
All has minor changes - floating-ui - react-error-boundary (https://github.com/bvaughn/react-error-boundary/releases/tag/5.0.0) - urlpattern polyfill (whatwg/urlpattern#179) - use-debounce (https://github.com/xnimorz/use-debounce/releases/tag/10.0.0) - react-aria - react-hot-toast (https://github.com/timolins/react-hot-toast/releases/tag/v2.5.0)
(For background see WICG/nav-speculation#259; in particular this is attempting to formalize the proposal at WICG/nav-speculation#259 (comment). cc @jeremyroman)
Problem
When constructing a
URLPattern
from a base URL, we have found it surprising how many components get inherited. The particularly problematic ones aresearch
andhash
. Consider code such as:Can you spot the bug? There are actually two!
document.baseURL
is something likehttps://example.com/
, then this sort of check gets defeated by auserInput
derived from a relative URL such as/wp-login.php?bypass
or/wp-login.php#bypass
.document.baseURL
is something likehttps://example.com/?utm_source=blah
, then this sort of check fails even foruserInput
derived from/wp-login.php
.This is because these
URLPattern
instances inherited the empty-stringsearch
andhash
components from the base URL:Non-path/search/hash cases
Another instance that some might find unexpected is the following:
This pattern inherits all non-
protocol
components from the base URL. Whereas, some might expect that since we overrode such an "early" component, we'd only get matching onprotocol
, not the rest.This is especially problematic in environments where the base URL argument is implicit, e.g. speculation rules, service worker static routing, or any framework which attempts to specialize itself to "the current page", such as
In such environments, you might provide a URL pattern such as
or
and expect this to match all
https://
URLs. But instead, since it inherits all non-protocol components from the base URL, it essentially only matches the base URL itself.Proposed solution
URLPattern
input has a given component, do not (by default) inherit any components from the base URL that are "later" than that component. "Later" means mostly the usual order:protocol
,hostname
,port
,pathname
,search
,hash
. For example:protocol
component, do not inherit any componentspathname
component, do not inherit thesearch
andhash
componentsusername
andpassword
, we define those to be "later" thanprotocol
,hostname
, andport
(and definepassword
to be "later" thanusername
). For example:hostname
component, do not inheritport
,pathname
,search
,hash
,username
, andpassword
username
component, do not inheritpathname
,search
,hash
,password
baseURLInheritance
, which lets you control this behavior"auto"
"all"
gives the current behavior["port", "search"]
. We can leave that out for now minus compelling use cases.Then:
Considerations
Compat
There might be some compat impact here. I suspect it is low, as we'd need to:
URLPattern
"using
URLPattern
" is swinging between 0.02% and 0.05% of page views, so we're in a good place to start with. We plan to add use counters to see how often this conjunction happens in the wild.Alternatives considered
We could make this new behavior opt-in instead of opt-out. I would strongly prefer to make it opt-out, because: I think it's more sensible and less likely to lead to bugs; and, I think it's the best default for various web platform features, such as speculation rules or service worker scope matching, which hope to base themselves on URL patterns.
We could just tell people to always write their patterns like
"/wp-login.php?*#*
. I think that would be a sad place to end up.We could restrict this logic to just
pathname
->search
->hash
. I think this is reasonable as those are the most problematic components. This would mean not solving the{ protocol: "https" }
problem, but maybe that's OK.Server-side considerations
From my experience, most server-side uses of
URLPattern
are pathname-focused. They probably aren't overly impacted by this. Any confirmation from the server-side community would be helpful.Who does the work?
Jeremy and I are happy to volunteer on the implementation/spec/test work. Jeremy is working on the use counter.
The text was updated successfully, but these errors were encountered: