Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Review slashes after file:/// and file path normalization #405

Closed
alwinb opened this issue Jul 23, 2018 · 25 comments · Fixed by #544
Closed

Review slashes after file:/// and file path normalization #405

alwinb opened this issue Jul 23, 2018 · 25 comments · Fixed by #544

Comments

@alwinb
Copy link
Contributor

alwinb commented Jul 23, 2018

The current standard asks to remove leading empty directories from file paths.
This seems to have been motivated by the desire to ignore slashes after file:/// whilst avoiding reparsing issues. However,

  • It introduces new behavior that has no precedence in browsers.
  • It is inconsistent with path normalization of other special URLs, and less expressive.

Examples, tested in Chrome canary, Safari preview and Firefox nightly:

Input Currently specified IE8 upto Edge18 Firefox and Safari Chrome
1 /..//localhost//pig against file://lion/ file://lion/localhost//pig file://lion//localhost//pig file:////localhost//pig file://lion//localhost//pig
2 file://localhost//a//../..// file:/// file:///.// file:///// file://localhost///
3 file://localhost//a//../..//foo file:///foo file:///.//foo file://///foo file://localhost///foo
4 file://localhost////foo file:///foo file://foo/ file://////foo file://localhost////foo
5 file:////foo file:///foo file://foo/ file:////foo file:///foo

Notes:

  • The spec agrees only with Chrome, only in the fourth case.
  • Browsers agree on the pathName except for the fourth case.
    (Dropping the host in the first seems off).
  • Firefox and Safari do not ignore slashes and thereby avoid reparsing issues.
  • Chrome only ignores slashes immediately following file:/// and keeps localhost, avoiding a reparsing issue.
@alwinb
Copy link
Contributor Author

alwinb commented Jul 23, 2018

As I wrote in #232 maybe there are other solutions,

Some ideas:

  1. Always replace the empty host with localhost.
  2. Do so only in ambiguous cases.
  3. Prefix ambiguous pathNames with ./.
  4. Don't collapse slashes at all.
  5. Come up with something else?

No 4. seems most strict/ traditional but does not unify file:////foo with file:///foo. If that is a hard requirement, then perhaps 1. or 2. can be considered.

@annevk
Copy link
Member

annevk commented May 10, 2020

FWIW, I'm open to revisiting this as it's clearly not implemented as-is (apart from perhaps in Node.js). It would be good to test IE6 and especially browsers on Windows in general. And we need to ensure that file:foo and file:/foo and such continue to work and not present idempotence issues.

Is this something you're still interested in and willing to help out with perhaps?

@alwinb
Copy link
Contributor Author

alwinb commented May 13, 2020

I don't know if I can help, but I'm willing to give it a try.
I'll try and get some info on the older browsers using browserstack first (hoping that'll work) and post back here.

@alwinb
Copy link
Contributor Author

alwinb commented May 13, 2020

I'm still trying to get results on IE6/7.
Meanwhile here are the results for IE 8 upto Edge 18 added to the table. (I've updated the table in the original issue also.)

Input Currently specified IE8 upto Edge18 Firefox and Safari Chrome
1 /..//localhost//pig against file://lion/ file://lion/localhost//pig file://lion//localhost//pig file:////localhost//pig file://lion//localhost//pig
2 file://localhost//a//../..// file:/// file:///.// file:///// file://localhost///
3 file://localhost//a//../..//foo file:///foo file:///.//foo file://///foo file://localhost///foo
4 file://localhost////foo file:///foo file://foo/ file://////foo file://localhost////foo
5 file:////foo file:///foo file://foo/ file:////foo file:///foo

I tested a bunch more examples but I need some more time to make sense of them.

@alwinb
Copy link
Contributor Author

alwinb commented May 20, 2020

I ran another batch of examples. An attempt to summarize:

  • Safari and Firefox behave according to the RFC and don't collapse slashes in file URLs.
  • Firefox always sets the host to the empty string.
  • Chrome/Mac does remove the leading empty path segments.
    • However, it does parse e.g. //one as having a host.
  • Both Edge and Chrome/Windows ignore slashes after file:/// and interpret the first nonempty path segment as the host.
    • However, file://one and file:///one are parsed as just that.

Given file:/// as a base URL:

Input Edge Chrome/Windows Chrome/Mac Safari Firefox
file:///one/two file:///one/two file:///one/two file:///one/two file:///one/two file:///one/two
file:////one/two file://one/two file://one/two file:///one/two file:////one/two file:////one/two
//one/two file://one/two file://one/two file://one/two file://one/two file:///two
///one/two file:///.///one/two file:///one/two file:///one/two file:///one/two file:///one/two
////one/two file:///.////one/two file://one/two file:///one/two file:////one/two file:////one/two

Tested on Edge 17 and 18 and Chrome 71 on Windows (via browserstack), Safari 13, Chrome 81 and Firefox 76 on Mac.

@domenic
Copy link
Member

domenic commented Sep 16, 2020

I'll add another case (which came up in https://bugs.chromium.org/p/chromium/issues/detail?id=1128999): file:///.// becomes file://// (four slashes) in browsers, but file:/// (three slashes) in the spec.

@domenic
Copy link
Member

domenic commented Sep 16, 2020

@alwinb (or anyone else): would you be able to propose a spec patch, or just a general algorithm for this? Then we can prototype your algorithm in jsdom/whatwg-url, run it against all the test cases you have compiled, and see how well it works and how well it matches browsers. If it's idempotent and closer to browser behavior than the current spec, then we can update the spec.

I'm happy to help with the prototyping/testing/spec patch process, but I would need an algorithm to start from.

@domenic
Copy link
Member

domenic commented Sep 16, 2020

Reading through the above in a bit more detail, it looks like you were asking for feedback on what approach would be most reasonable. I'll check with coworkers to see if they have any stronger opinions, but for my part, my requirements are very minimal. This is a crusty legacy area and just getting to interop is the important thing. So:

  1. (Must have) Idempotency: i.e., parsing-serializing-parsing should produce the same result.

  2. (Must have) Platform-independence: no differing behavior on Windows/Linux/Mac.

  3. (Important) Match the maximum number of modern browser engines in the maximum number of cases as possible.

  4. (Nice to have) Make the algorithm relatively simple.

  5. (Nice to have) Make the mental model relatively simple. E.g. if we could explain to developers something like "localhost always goes away in this particular fashion" or "N slashes will always collapse to 4 slashes" or "slashes will never collapse" or "file URLs are just like non-file URLs in most ways" then that'd be nice.

  6. (Nice to have) also try to solve related issues like Why is the host dropped if the path contains a Windows drive letter? #302 and "D:\foo" should be parsed as "file:///D:/foo" #271 in ways that are satisfactory to the folks opening them.

  7. (Nice to have) match legacy IE/Edge versions, since those often reflect "what enterprises expect".

Trading (3) off against the nice to haves is also reasonable. E.g. if there's a much simpler version of the algorithm that only matches 1/3 modern browser engines, then we should probably go with that. (But if it matches 0 engines, then it's not good.)

@masinter
Copy link

Platform independence is difficule since there are many file system types (case sensitive, case insensitive, and many more with unicode and different kinds of unicode normalization) and different behavior for one platform. The platform-independence of box, dropbox, onedrive, icloud, google drive when mounted might be a more achievable goal.

@domenic
Copy link
Member

domenic commented Sep 16, 2020

The URL parser, i.e. the algorithm we're discussing here, doesn't need to worry about those things. There may be further troubles translating a parsed URL into a file on the filesystem, but that's out of scope for this repository, and more of the providence of e.g. HTTP server software.

@alwinb
Copy link
Contributor Author

alwinb commented Sep 18, 2020

This turns out to be a very complicated problem.

The only viable approach I've been able to come up with is to not ignore multiple slashes in file URLs. At least this agrees with Safari and Firefox and the RFCs, and it prevents messy interactions with #302 that break idempotency. (It even agrees with IE/Edge in some cases).

Then for #302 I would propose to preserve the host in file URLs (also in the presence of drive letters) except for localhost, which would be replaced with the empty string. That then disagrees with Firefox but is more in line with the others (considering just the host).

Together these changes will cause quite a bit of disagreement though. But I don't see any better alternative.

Is this clear enough? I can have a look at the state machine to see where to make the changes, but I'd like to be sure that there's some consensus about the approach first.


Somewhat off-topic, and I'm hesitant to mention this, but I decided to go ahead because it could be useful for the standardisation process.

I have tried to build an URL library that is compatible with the standard, but has separate parsing, resolving and normalisation phases. It uses a simple 'theory' of URLs, modelling them as a sequence of tokens. They're similar, but different from the parse trees as per the RFCs. I wrote it down in the Theory section of the README.

It can be useful to describe the different browser behaviours as operations on such token lists. And in general it could help with points 4 and 5 that @domenic mentions above, amongst other things. – Just in case.

@domenic
Copy link
Member

domenic commented Sep 18, 2020

Is this clear enough? I can have a look at the state machine to see where to make the changes, but I'd like to be sure that there's some consensus about the approach first.

This makes sense to me, and I'm happy to say that Chrome is in favor of it at a high level. (In practice, Chrome is very slow to make changes to its URL parser, so this support doesn't necessarily translate into action. But it should be enough to help proceed with consensus-gathering.)

It sounds like your proposal mostly matches Safari, so probably if we can get @annevk's high-level sign-off for Mozilla, then we're ready to move on to the next step. We can perform a second round of checks with all the browser representatives once we have a spec patch/test results patch/complete listing of mismatches with current browsers.

@annevk
Copy link
Member

annevk commented Sep 21, 2020

If it works for Chrome and Safari, it'll likely work for Firefox. @jasnell for thoughts from Node.js.

cc @valenting

@valenting
Copy link
Collaborator

I'm all for it. Changing the Firefox implementation should be relatively straight-forward.

@achristensen07
Copy link
Collaborator

alwinb's proposal sounds good to me. It sounds very close to if not exactly what I implemented in WebKit.

@ghost ghost mentioned this issue Sep 21, 2020
@alwinb
Copy link
Contributor Author

alwinb commented Sep 22, 2020

Excellent!

I will start with updating the tests (urltestdata.json), then try and make the changes to whatwg-url, and post back here.

I may need some time to learn the process (forks and merge requests and the like), I have not done that before, but I suppose it'll become clear to me as I go along.

@alwinb
Copy link
Contributor Author

alwinb commented Sep 23, 2020

I have updated the tests and made the few changes needed to to jsdom/whatwg-url.
My forks are here alwinb/whatwg-url, alwinb/wpt.

I'm not sure how to proceed now. I assume these have to be reviewed before I open any merge request? It seems inappropriate to do so before the standard has been updated.

I can have a look at making those changes to the standard as well (i.e. this repo).

Thanks,

@annevk
Copy link
Member

annevk commented Sep 23, 2020

@alwinb nice! You should feel free to open PRs and we can label them "pending specification change" (paraphrased). It's not a problem. If you can make the change to the specification too that'd be great. Note that you'll need to sign https://participate.whatwg.org/agreement as well. Hope that helps.

@domenic
Copy link
Member

domenic commented Sep 23, 2020

Yes, this is amazing work! Please do open pull requests; it's only merging of them that we'll block until the specification is updated. We'll also get an extra benefit, where if you open the web platform tests pull request, a bot will run the tests on all browsers so we can see the results.

The changes to whatwg-url are promising, in that they make the algorithm simpler. Hooray!

@domenic
Copy link
Member

domenic commented Sep 23, 2020

Looking at https://wpt.fyi/results/url/url-constructor.html?label=pr_head&max-count=1&pr=25716 (which is generated from web-platform-tests/wpt#25716 (comment)) most things look good, but there is a section of tests which fail in more browsers more than would be ideal:

  • Parsing: </..//localhost//pig> against <file://lion/>, Parsing: </rooibos> against <file://tea/>, and Parsing: </?chai> against <file://tea/> fail in Firefox and Safari, passing only in Chrome.

  • Parsing: <C|> against <file://host/dir/file> through Parsing: <C|\> against <file://host/dir/file> fail in all three browsers.

  • Parsing: <C|a> against <file://host/dir/file> fails in Chrome and Firefox, passing in Safari.

  • Parsing: </c:/foo/bar> against <file://host/path> fails in Firefox and Safari, passing in Chrome.

It looks like most of the cases with 2 failures were also 2 failures before (referencing https://wpt.fyi/results/url/url-constructor.html?label=master&label=experimental&product=chrome&product=firefox&product=safari&aligned), albeit sometimes a different 2. So that's not a regression. Indeed, in cases where all 3 browsers give different results, that's the best we can hope for.

But the conversion of some of them from 2-failures to 3-failures seems suboptimal. It seems to be concentrated around C|-style drive letter handling. Do you think there's something that could be tweaked there to match at least one browser?

@alwinb
Copy link
Contributor Author

alwinb commented Sep 24, 2020

Hmm. This is actually about a separate issue – the handling of C| style drive letters.

Apparently, Chrome and Firefox do not handle the C| style drive letters according to the standard (they interpret it as an ordinary path component).

What is interesting is that IE/Edge prior to version 18 also did not interpret C| as a drive letter in a lot of cases. I discovered that whilst making browsershots for this issue and for #302.

I just checked. I have no data on e.g. file:C|, file:/C| and file://C|, nor file://host/C|, But

  • c|/dir/abc against file:/// gave file:///c%7C/dir/abc.
  • file:///c|/dir/abc did result in file:///c:/dir/abc.

This has changed since version 18, which does interpret c| as a drive letter.

As an aside, Safari fails the mentioned tests because it (sometimes? or always?) does not copy the host from the base URL when parsing host-less URLs. I suspect that this is not intentional; it seems inconsistent with the way in which it parses absolute file URLs. It does however handle the C| style drive letters according to the standard.

@alwinb
Copy link
Contributor Author

alwinb commented Sep 24, 2020

Wait, no there's more going on. I think that Chrome on windows will pass those drive letter tests. If I remember correctly, Chrome disables drive-letter handling on non-windows platforms.

@annevk
Copy link
Member

annevk commented Sep 24, 2020

Yeah, wpt.fyi does not run Chrome on Windows and what Chrome on Windows does is somewhat important for file: URLs.

@domenic
Copy link
Member

domenic commented Sep 24, 2020

Ah, OK. Yeah, for drive letter normalization I'd rather align with Windows browsers. Since we have two to choose from these days, aligning with Chrome Windows seems reasonable to me. (Firefox seems to not treat pipe characters specially.)

OK, then yeah, let's proceed with the spec update! Let us know if we can help with that; it should be mostly mechanical given the changes you made to whatwg-url.

@alwinb
Copy link
Contributor Author

alwinb commented Sep 25, 2020

Ok, I made the changes, I hope it all went ok!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging a pull request may close this issue.

6 participants