canonicalize_url use of safe_url_string breaks when an encoded hash character is encountered #91
Comments
This offending unquoting happens in |
What I commented earlier is not really relevant. The issue is not so much unquoting In RFC 3986,
So, as
The difference being:
So indeed, I may have missed something, and reading RFC 3986 always confuses me at some point or another. So I guess we'll need to work on more tests and clean |
Sorry, I haven't looked at this issue in detail yet, but wanted to note that |
@redapple that RFC has driven me crazy for years. I didn't jump into the function and how the vars you referenced are used, but the @kmike Yes, I understand - Characters sets and the utility of outputs aside, I think a better explanation (and what unit test could check against) is a concept that the output of
|
Any chance someone has input on the above approaches? If either looks like a candidate for inclusion, I would be excited to generate a new PR against master that passes Travis on py2 and py3. |
@jvanasco I like your #93 approach, but you’ll have to rewrite it due to conflicts with later changes (e.g. #25). I would strongly suggest making your change as simple as possible. I think you could simply add the following two lines after the current
    # see https://github.com/scrapy/w3lib/issues/91
    _safe_chars = _safe_chars.replace(b'#', b'')
With that and your tests I think #93 should be ready to merge. |
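A minimal sketch of why dropping `#` from the safe set helps, using only the standard library. It assumes the canonicalization decodes the path and then re-quotes it against a safe-character set; the `safe_chars` value and the `requote` helper below are hypothetical stand-ins, not the library's actual definitions.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical stand-in for the library's safe-character set (not its real value).
safe_chars = b"/:?=&#"
# The suggested adjustment: '#' is no longer treated as safe.
safe_chars_fixed = safe_chars.replace(b"#", b"")


def requote(path, safe):
    """Decode every percent-escape, then re-encode any byte outside `safe`."""
    return quote(unquote_to_bytes(path), safe)


path = "/a%20b%23c"
with_hash = requote(path, safe_chars)           # '#' left bare after the round trip
without_hash = requote(path, safe_chars_fixed)  # '#' re-encoded as %23

print(with_hash)
print(without_hash)
```

With `#` in the safe set, the round trip leaves a bare `#` in the path, which a URL parser would then read as the start of a fragment; with `#` removed, the encoded form survives.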
Thanks for the fast response, @Gallaecio. It looks like you're right! just updating I'm going to ensure the tests cover everything against some production code, then make another PR. |
* issue scrapy#91 * don't decode `%23` to `#` when it appears in a url * tests: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash * tests pass in: py27, py36 * notes: adjustment to _safe_chars suggested by @Gallaecio
* issue scrapy#91 * description: don't decode `%23` to `#` when it appears in a url * tests-new: test_url.CanonicalizeUrlTest.test_preserve_nonfragment_hash * tests-pass: py27, py36 * notes: adjustment to _safe_chars suggested by @Gallaecio
`canonicalize_url` will decode all percent-encoded elements in a string. If a hash (`#`) is present as a percent-encoded entity (`%23`), it will be decoded. However, it shouldn't be, as it is an RFC delimiter and decoding it fundamentally changes the URL structure, turning the subsequent characters into a fragment. One effect is that a canonical URL with a safely encoded hash will point to another URL; another is that running `canonicalize_url` on the output will return a different URL.
Example:
What is presented as a fragment in "canonical", `#76,%20bang%202017`, is part of the valid URL, not a fragment, and is discarded when `canonicalize_url` is run again.
References:
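The structural change described in this report can be illustrated with the standard library alone. This is a sketch, not the library's code path: the URL is hypothetical, and `unquote` stands in for a canonicalization step that decodes every percent-escape.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical URL: "%23" is an encoded hash that belongs to the path.
original = "http://example.com/items/%2376,%20bang%202017"

# Stand-in for a canonicalization step that decodes every percent-escape.
decoded = unquote(original)

before = urlsplit(original)
after = urlsplit(decoded)

print(repr(before.path), repr(before.fragment))
print(repr(after.path), repr(after.fragment))
```

Before decoding, the whole segment is part of the path and the fragment is empty. After decoding, everything following the now-bare `#` is parsed as a fragment, so a second pass that drops fragments would discard it.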