
canonicalize_url use of safe_url_string breaks when an encoded hash character is encountered #91

Closed
jvanasco opened this issue May 19, 2017 · 8 comments · Fixed by #198

@jvanasco
Contributor

jvanasco commented May 19, 2017

canonicalize_url will decode all percent-encoded elements in a string.

If a hash (#) is present as a percent-encoded entity (%23), it will be decoded. However, it shouldn't be: # is an RFC delimiter, and decoding it fundamentally changes the URL structure by turning the subsequent characters into a fragment. One effect is that a canonical URL with a safely encoded hash will point to another URL; another is that running canonicalize_url on the output will return a different URL.

example:

>>> import w3lib.url
>>> url = "https://example.com/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017#bash"
>>> canonical = w3lib.url.canonicalize_url(url)
>>> print(canonical)
https://example.com/path/to/foo%20bar:%20biz%20#76,%20bang%202017
>>> canonical2 = w3lib.url.canonicalize_url(canonical)
>>> print(canonical2)
https://example.com/path/to/foo%20bar:%20biz%20

What is presented as a fragment in "canonical" (#76,%20bang%202017) is part of the valid URL, not a fragment, and is discarded when canonicalize_url is run again.

references:

@redapple
Contributor

This offending unquoting happens in w3lib.url._unquotepath.
It only treats / and ? as characters to keep percent-encoded.

@redapple redapple added the bug label May 19, 2017
@redapple
Contributor

What I commented earlier is not really relevant.

The issue is not so much unquoting %23 as not percent-encoding # afterwards when re-building the URI. (And maybe ? should not even be in _unquotepath. Since the URL is already parsed into parts before the path is unquoted, it should be correct to percent-decode everything but %2F (/).)

In RFC 3986, pchars, valid characters for each path segment, are:

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

So, as / is used as path segment delimiter, and % is there for untouched percent-encoded chars, I believe safe characters (not needing percent-encoding) in the path component are:

unreserved / sub-delims / ":" / "@" / "/" / "%"

>>> print(w3lib.url._pchar_safe_chars.decode('ascii'))
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:@/%

w3lib.url.canonicalize_url() currently uses these _safe_chars for quoting the path:

>>> print(w3lib.url._safe_chars.decode('ascii'))
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()

The difference being:

>>> set(w3lib.url._safe_chars.decode('ascii')) - set(w3lib.url._pchar_safe_chars.decode('ascii'))
{'?', '#', '|'}

So indeed, # should be quoted. ? is currently "protected" (not decoded) in _unquotepath. And | is the subject of #25 and #80.
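The same conclusion can be reached by building the path-safe set directly from the RFC 3986 grammar quoted above (a quick sanity check, independent of the w3lib internals):

```python
import string

# RFC 3986 character classes for a path segment (pchar).
unreserved = set(string.ascii_letters + string.digits + "-._~")
sub_delims = set("!$&'()*+,;=")
pchar = unreserved | sub_delims | {":", "@"}
# "/" delimits segments; "%" marks already-percent-encoded chars.
path_safe = pchar | {"/", "%"}

# '#', '?' and '|' are all outside the safe set, so they need quoting:
print({"#", "?", "|"} & path_safe)  # -> set()
```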

I may have missed something, and reading RFC 3986 always confuses me at some point or another. So I guess we'll need to work on more tests and clean canonicalize_url and safe_url_string to match browsers better.
And maybe @kmike's #25 already does the right thing :)

@kmike
Member

kmike commented May 19, 2017

Sorry, I haven't looked at this issue in detail yet, but I wanted to note that canonicalize_url is not guaranteed to preserve the URL. URLs produced by canonicalize_url should be used for deduplication and similar use cases. Of course, if there's an issue (it seems there is), we should make canonicalize_url work in a more reasonable way, but keep in mind that downloading the results of canonicalize_url is usually a mistake.

@jvanasco
Contributor Author

@redapple that RFC has driven me crazy for years. I didn't dig into the function and how the vars you referenced are used, but the % is pseudo-safe -- it is valid to appear in the URL, but must be percent-encoded as %25.

@kmike Yes, I understand - canonicalize_url really only gives a fingerprint of the URL.

Character sets and the utility of outputs aside, I think a better explanation (and what a unit test could check) is the concept that the output of canonicalize_url(url) should be final/terminal: it cannot be canonicalized any further, and should "point to itself" if the canonicalize function is run on it again. The current behavior with #/%23 breaks this concept, and you wind up with a canonicalized URL that points to a completely different resource if the function is applied to it.

 canonicalize_url(url) == canonicalize_url(canonicalize_url(url)) 
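This fixed-point property can be expressed as a generic check (using a toy canonicalizer here so the example is self-contained; it is not the w3lib implementation):

```python
def is_idempotent(f, value):
    # A canonicalization function should be a fixed point after one
    # application: f(f(x)) == f(x).
    once = f(value)
    return f(once) == once

def toy_canonicalize(url: str) -> str:
    # Hypothetical stand-in: lowercase the URL and drop any fragment.
    return url.split("#", 1)[0].lower()

assert is_idempotent(toy_canonicalize, "HTTP://Example.com/Path#frag")
```

The bug reported above is precisely a failure of this check for canonicalize_url on URLs containing %23.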

@jvanasco
Contributor Author

I created two PRs (and a testcase) with some potential approaches to address this:

#93 - handles the _safe_chars differently, more in line with the above notes by @redapple
#94 - applies a transformation in _unquotepath, whose input should only ever be a path (i.e. not a fragment).

@jvanasco
Contributor Author

jvanasco commented May 2, 2019

Any chance someone has input on the above approaches? If either looks like a candidate for inclusion, I would be excited to generate a new PR against master that passes Travis on py2 and py3.

@Gallaecio
Member

Gallaecio commented May 3, 2019

@jvanasco I like your #93 approach, but you’ll have to rewrite it due to conflicts with later changes (e.g. #25).

I would strongly suggest making your change as simple as possible. I think you could simply add the following two lines after the current _safe_chars definition:

# see https://github.com/scrapy/w3lib/issues/91
_safe_chars = _safe_chars.replace(b'#', b'')

With that and your tests I think #93 should be ready to merge.
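The effect of dropping # from the safe set can be seen with urllib.parse.quote (the safe strings below are illustrative, not the actual w3lib _safe_chars):

```python
from urllib.parse import quote

safe_with_hash = "/:@&=+$,;!~*'()%#"
safe_without_hash = safe_with_hash.replace("#", "")

path = "/path/to/foo bar: biz #76"
# With '#' in the safe set it survives unescaped, creating a bogus fragment:
print(quote(path, safe=safe_with_hash))     # -> /path/to/foo%20bar:%20biz%20#76
# Without it, '#' is re-encoded as %23 and the path round-trips correctly:
print(quote(path, safe=safe_without_hash))  # -> /path/to/foo%20bar:%20biz%20%2376
```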

@jvanasco
Contributor Author

jvanasco commented May 3, 2019

Thanks for the fast response, @Gallaecio. It looks like you're right! Just updating _safe_chars works and passes all tests. I wasn't too familiar with the inner workings of this library, which is why I originally created a separate _safe_chars_component variable.

I'm going to ensure the tests cover everything against some production code, then make another PR.

jvanasco added a commit to jvanasco/w3lib that referenced this issue May 9, 2019
* issue scrapy#91
* don't decode `%23` to `#` when it appears in a url
* tests: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash
* tests pass in: py27, py36
* notes: adjustment to _safe_chars suggested by @Gallaecio
jvanasco added a commit to jvanasco/w3lib that referenced this issue May 9, 2019
* issue scrapy#91
* description: don't decode `%23` to `#` when it appears in a url
* tests-new: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash
* tests-pass: py27, py36
* notes: adjustment to _safe_chars suggested by @Gallaecio
Gallaecio pushed a commit to Gallaecio/w3lib that referenced this issue Oct 18, 2019
* issue scrapy#91
* description: don't decode `%23` to `#` when it appears in a url
* tests-new: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash
* tests-pass: py27, py36
* notes: adjustment to _safe_chars suggested by @Gallaecio