
canonicalize_url use of safe_url_string breaks when an encoded hash character is encountered #91

Closed
jvanasco opened this issue May 19, 2017 · 8 comments · Fixed by #198

@jvanasco
Contributor

jvanasco commented May 19, 2017

canonicalize_url will decode all percent-encoded elements in a string.

If a hash (#) is present as a percent-encoded entity (%23), it will be decoded. However, it shouldn't be: # is an RFC delimiter, and decoding it fundamentally changes the URL structure by turning the subsequent characters into a fragment. One effect is that a canonical URL with a safely encoded hash will point to another URL; another is that running canonicalize_url on the output will return a different URL.

example:

>>> import w3lib.url
>>> url = "https://example.com/path/to/foo%20bar%3a%20biz%20%2376%2c%20bang%202017#bash"
>>> canonical = w3lib.url.canonicalize_url(url)
>>> print(canonical)
https://example.com/path/to/foo%20bar:%20biz%20#76,%20bang%202017
>>> canonical2 = w3lib.url.canonicalize_url(canonical)
>>> print(canonical2)
https://example.com/path/to/foo%20bar:%20biz%20

What is presented as a fragment in "canonical" (#76,%20bang%202017) is part of the valid URL, not a fragment, and is discarded when canonicalize_url is run again.

references:

@redapple
Contributor

This offending unquoting happens in w3lib.url._unquotepath.
It only treats / and ? as characters to keep percent-encoded.

@redapple redapple added the bug label May 19, 2017
@redapple
Contributor

What I commented earlier is not really relevant.

The issue is not so much unquoting %23 as not percent-encoding # afterwards when re-building the URI. (And maybe ? should not even be in _unquotepath. Since the URL is already parsed into parts before the path is unquoted, it should be correct to percent-decode everything but %2F (/).)

In RFC 3986, pchars, valid characters for each path segment, are:

unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

So, as / is used as path segment delimiter, and % is there for untouched percent-encoded chars, I believe safe characters (not needing percent-encoding) in the path component are:

unreserved / sub-delims / ":" / "@" / "/" / "%"

>>> print(w3lib.url._pchar_safe_chars.decode('ascii'))
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~!$&'()*+,;=:@/%

w3lib.url.canonicalize_url() currently uses these _safe_chars for quoting the path:

>>> print(w3lib.url._safe_chars.decode('ascii'))
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_.-%;/?:@&=+$|,#-_.!~*'()

The difference being:

>>> set(w3lib.url._safe_chars.decode('ascii')) - set(w3lib.url._pchar_safe_chars.decode('ascii'))
{'?', '#', '|'}

So indeed, # should be quoted. ? is currently "protected" (not decoded) in _unquotepath. And | is the subject of #25 and #80.
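The same conclusion can be reached by building the path-safe set directly from the RFC 3986 grammar quoted above (a quick sanity check, independent of the w3lib internals):

```python
import string

# RFC 3986 character classes for a path segment (pchar).
unreserved = set(string.ascii_letters + string.digits + "-._~")
sub_delims = set("!$&'()*+,;=")
pchar = unreserved | sub_delims | {":", "@"}
# "/" delimits segments; "%" marks already-percent-encoded chars.
path_safe = pchar | {"/", "%"}

# '#', '?' and '|' are all outside the safe set, so they need quoting:
print({"#", "?", "|"} & path_safe)  # -> set()
```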

I may have missed something, and reading RFC 3986 always confuses me at some point or another. So I guess we'll need to work on more tests and clean canonicalize_url and safe_url_string to match browsers better.
And maybe @kmike's #25 already does the right thing :)

@kmike
Member

kmike commented May 19, 2017

Sorry, I haven't looked at this issue in detail yet, but I wanted to note that canonicalize_url is not guaranteed to preserve the URL. URLs produced by canonicalize_url should be used for deduplication and similar use cases. Of course, if there's an issue (it seems there is), we should make canonicalize_url work in a more reasonable way, but keep in mind that downloading the results of canonicalize_url is usually a mistake.

@jvanasco
Contributor Author

@redapple that RFC has driven me crazy for years. I didn't dig into the function and how the vars you referenced are used, but the % is pseudo-safe -- it is valid to appear in the URL, but must be percent-encoded as %25.

@kmike Yes, I understand - canonicalize_url really only gives a fingerprint of the URL.

Character sets and the utility of outputs aside, I think a better explanation (and what a unit test could check) is the concept that the output of canonicalize_url(url) should be final/terminal: it cannot be canonicalized any further, and should "point to itself" if the canonicalize function is run on it again. The current behavior with #/%23 breaks this concept, and you wind up with a canonicalized URL that points to a completely different resource if the function is applied to it.

 canonicalize_url(url) == canonicalize_url(canonicalize_url(url)) 
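This fixed-point property can be expressed as a generic check (using a toy canonicalizer here so the example is self-contained; it is not the w3lib implementation):

```python
def is_idempotent(f, value):
    # A canonicalization function should be a fixed point after one
    # application: f(f(x)) == f(x).
    once = f(value)
    return f(once) == once

def toy_canonicalize(url: str) -> str:
    # Hypothetical stand-in: lowercase the URL and drop any fragment.
    return url.split("#", 1)[0].lower()

assert is_idempotent(toy_canonicalize, "HTTP://Example.com/Path#frag")
```

The bug reported above is precisely a failure of this check for canonicalize_url on URLs containing %23.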

@jvanasco
Contributor Author

I created two PRs (and a testcase) with some potential approaches to address this:

#93 - handles the _safe_chars differently, more in line with the above notes by @redapple
#94 - applies a transformation in _unquotepath, whose input should only ever be a path (i.e. not a fragment).

@jvanasco
Contributor Author

jvanasco commented May 2, 2019

Any chance someone has input on the above approaches? If either looks like a candidate for inclusion, I would be excited to generate a new PR against master that passes Travis on py2 and py3.

@Gallaecio
Member

Gallaecio commented May 3, 2019

@jvanasco I like your #93 approach, but you’ll have to rewrite it due to conflicts with later changes (e.g. #25).

I would strongly suggest making your change as simple as possible. I think you could simply add the following two lines after the current _safe_chars definition:

# see https://github.com/scrapy/w3lib/issues/91
_safe_chars = _safe_chars.replace(b'#', b'')

With that and your tests I think #93 should be ready to merge.
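The effect of dropping # from the safe set can be seen with urllib.parse.quote (the safe strings below are illustrative, not the actual w3lib _safe_chars):

```python
from urllib.parse import quote

safe_with_hash = "/:@&=+$,;!~*'()%#"
safe_without_hash = safe_with_hash.replace("#", "")

path = "/path/to/foo bar: biz #76"
# With '#' in the safe set it survives unescaped, creating a bogus fragment:
print(quote(path, safe=safe_with_hash))     # -> /path/to/foo%20bar:%20biz%20#76
# Without it, '#' is re-encoded as %23 and the path round-trips correctly:
print(quote(path, safe=safe_without_hash))  # -> /path/to/foo%20bar:%20biz%20%2376
```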

@jvanasco
Contributor Author

jvanasco commented May 3, 2019

Thanks for the fast response, @Gallaecio. It looks like you're right! Just updating _safe_chars works and passes all tests. I wasn't too familiar with the inner workings of this library, which is why I originally created a separate _safe_chars_component variable.

I'm going to ensure the tests cover everything against some production code, then make another PR.

jvanasco added a commit to jvanasco/w3lib that referenced this issue May 9, 2019
* issue scrapy#91
* don't decode `%23` to `#` when it appears in a url
* tests: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash
* tests pass in: py27, py36
* notes: adjustment to _safe_chars suggested by @Gallaecio
jvanasco added a commit to jvanasco/w3lib that referenced this issue May 9, 2019
* issue scrapy#91
* description: don't decode `%23` to `#` when it appears in a url
* tests-new: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash
* tests-pass: py27, py36
* notes: adjustment to _safe_chars suggested by @Gallaecio
Gallaecio pushed a commit to Gallaecio/w3lib that referenced this issue Oct 18, 2019
* issue scrapy#91
* description: don't decode `%23` to `#` when it appears in a url
* tests-new: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash
* tests-pass: py27, py36
* notes: adjustment to _safe_chars suggested by @Gallaecio