
[MRG] use constants from RFC3986 #25

Merged (4 commits) on Jun 16, 2017

Conversation

@kmike (Member) commented Aug 3, 2014

@nramirezuy noticed that "|" is not in RFC 3986: scrapy/scrapy#508 (comment). I've checked the RFC and updated the code to use constants from this RFC:

  • "|" is removed;
  • "[" and "]" are added.

@dangra (Member) commented Aug 4, 2014

Are you sure about removing "|"? Chrome quotes it if you type it in the URL bar, and so does GitHub's Markdown processor.. :)

@kmike (Member, Author) commented Aug 4, 2014

So does the updated safe_url_string: "|" is removed from safe_chars in this PR, so safe_url_string now quotes it.

safe_url_string doesn't quote symbols that are never quoted (ASCII letters, digits, a couple of other symbols); it also doesn't quote symbols that have special meaning in a URL (separators and %), assuming they are already quoted properly. It is essentially a hack: quoting rules depend on which part of the URL a character appears in, so using a single set of characters to escape for the whole URL is not going to work properly in 100% of cases. As I understand it, safe_url_string assumes the URL is already almost quoted and tries to fix some easy mistakes common in web pages created by humans.

I'm not 100% sure the PR is correct because the whole safe_url_string business is tricky. But at least it is more consistent with the RFC, the references to the RFC become correct, and the quoting is more like Chrome's (as your example with "|" shows). Chrome doesn't quote [ and ] though.
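A rough illustration of that heuristic using only the standard library (this is a sketch, not w3lib's actual implementation; the safe set below is an assumption built from RFC 3986's reserved characters plus "%"):

```python
from urllib.parse import quote

# Assumed safe set for illustration: RFC 3986 reserved characters
# (gen-delims and sub-delims) plus "%", so existing %XX escapes and
# URL separators are left untouched. ASCII letters, digits, and
# "-_.~" are never quoted by quote() anyway.
RFC3986_SAFE = "-_.~!*'();:@&=+$,/?#[]%"

def rough_safe_url(url: str) -> str:
    # quote() percent-encodes every byte not in `safe`.
    return quote(url, safe=RFC3986_SAFE)

print(rough_safe_url("http://localhost:8001/[]|"))
# -> http://localhost:8001/[]%7C
# "|" is outside the safe set, so it becomes %7C; "[" and "]" survive.
```

With this PR's change, "|" falls outside the safe set while "[" and "]" fall inside it, which is what the example shows.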

@dangra (Member) commented Aug 4, 2014

I'm OK to merge it. Do you think we need a test case for the removed/added chars? I wonder where the current list came from.

@kmike (Member, Author) commented Aug 4, 2014

@dangra I tried to track where this list came from; it has been there forever (I can't find where it was introduced in either the w3lib or the Scrapy git history). As for tests, I'm not sure it's worth it, because the fix is clear but the correct behavior is not :)

Also, since Chrome doesn't quote [ and ], we may want to leave them unquoted as well, but I think a departure from the RFC constants should be justified by some real-world example.

@dangra (Member) commented Aug 4, 2014

Firefox does the same as Chrome. I tend to prefer browser behavior over the RFC, to be honest.
See #25 (comment)

@kmike (Member, Author) commented Aug 4, 2014

How did you check what Firefox is doing?

@kmike (Member, Author) commented Aug 4, 2014

Browsers behaving differently is a valid reason to depart from RFCs.

@dangra (Member) commented Aug 4, 2014

> How did you check what Firefox is doing?

I was looking at the URL bar.

More experiments, clicking the link at http://files.scrapinghub.com/w3lib-issue25.html

<!DOCTYPE html>
<html>
  <body>
    <p>run in a console: <pre>nc -l -p 8001</pre> and <a href="http://localhost:8001/[]|">click here</a></p>
  </body>
</html>
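If nc isn't available, a minimal Python stand-in (a hypothetical helper, not part of this PR) can capture the raw request line the same way:

```python
import socket

def capture_one_request(port: int = 8001) -> str:
    """Accept one connection on localhost:<port>, answer with an empty
    200 response, and return the raw request line as the browser sent it."""
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("localhost", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(65536).decode("latin-1")
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    # First line of the request, e.g. "GET /[]%7C HTTP/1.1"
    return data.splitlines()[0]
```

Run it, click the link in the test page, and the returned string shows exactly how the browser percent-encoded the path.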

Chrome

GET /[]%7C HTTP/1.1
Host: localhost:8001
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
DNT: 1
Referer: http://files.scrapinghub.com/w3lib-issue25.html
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,es-419;q=0.6,es;q=0.4

Firefox

GET /%5B%5D| HTTP/1.1
Host: localhost:8001
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://files.scrapinghub.com/w3lib-issue25.html
Connection: keep-alive

So, let's follow the RFC and merge this PR ASAP... now I am confused.

@dangra (Member) commented Aug 4, 2014

w3lib v1.8.0 follows FF behavior:

>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://localhost:8001/[]|')
'http://localhost:8001/%5B%5D|'

@dangra (Member) commented Aug 4, 2014

and after this change, w3lib follows RFC and Chrome behavior:

>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://localhost:8001/[]|')
'http://localhost:8001/[]%7C'

@dangra (Member) commented Aug 4, 2014

curl has some issues interpreting [] in URLs:

$ curl 'http://localhost:8001/[]|' 
curl: (3) [globbing] bad range specification in column 24

wget follows the rfc.

@dangra (Member) commented Aug 4, 2014

Damn.. curl's treatment of [ and ] is weird for users copy-pasting URLs in the console, and at least my gnome-terminal doesn't recognize the full URL as clickable.

@dangra (Member) commented Aug 4, 2014

> ...and at least my gnome-terminal doesn't recognize the full URL as clickable.

Just a fun fact: it doesn't recognize the unquoted | either.

@kmike (Member, Author) commented Aug 4, 2014

A good investigation!

curl handles [] differently because of its "globbing" option; when used with --globoff curl doesn't escape anything.

To be clear: this issue is not about following an RFC, because AFAIK there is no RFC for such "loose" URL escaping, and safe_url_string doesn't follow any RFC. It implements a heuristic that uses constants from the RFC and adds its own, namely %.

I'd prefer to follow Chrome because its behavior looks easier to explain and because if we ever start testing w3lib to work the same as browsers we'll likely use some WebKit wrapper.

A better docstring for safe_url_string is worth adding.

@nyov (Contributor) commented Feb 26, 2016

So if both styles are common (Firefox and Chrome), how about merging both and adding an optional argument to safe_url_string to emulate either browser behavior?
quote_chars=[firefox|chrome|rfc|both]?

Always good to have the ability to escape some browser detection methods ;)
And for now I know that faking a Chrome/WebKit UA here can be detected...
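A sketch of what such an option might look like (hypothetical API: the quote_style parameter and the per-style safe sets below are invented for illustration, modeled on the behaviors observed earlier in this thread, where Firefox 31 quoted [ and ] but left | alone, and Chrome did the opposite):

```python
from urllib.parse import quote

# Hypothetical per-style safe sets, derived from the experiments above.
# "rfc"/"chrome": leave "[" and "]" unquoted, quote "|" (this PR's behavior).
# "firefox": leave "|" unquoted, quote "[" and "]" (Firefox 31's behavior).
_STYLE_SAFE = {
    "rfc": "-_.~!*'();:@&=+$,/?#[]%",
    "chrome": "-_.~!*'();:@&=+$,/?#[]%",
    "firefox": "-_.~!*'();:@&=+$,/?#|%",
}

def safe_url_string_styled(url: str, quote_style: str = "rfc") -> str:
    """Percent-encode `url` using the safe set of the chosen style."""
    return quote(url, safe=_STYLE_SAFE[quote_style])

print(safe_url_string_styled("http://localhost:8001/[]|", "firefox"))
# -> http://localhost:8001/%5B%5D|
print(safe_url_string_styled("http://localhost:8001/[]|", "chrome"))
# -> http://localhost:8001/[]%7C
```

This is only a toy: as kmike notes above, correct quoting really depends on which part of the URL a character sits in, not just on a flat character set.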

@redapple (Contributor) commented Apr 4, 2016

FWIW, here's what Firefox (45.0, Ubuntu) and Chrome (Version 49.0.2623.110, 64-bit) requested for the "unwise" characters from RFC 2396 in the path part,

unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

for a link with "href" attribute of:

http://localhost:8001/unwise{,},|,\,^,[,],`?|=[]&[]=|

(with local HTTP test server):

  • Chrome GET /unwise%7B,%7D,%7C,/,%5E,[,],%60?|=[]&[]=| HTTP/1.1
  • Firefox GET /unwise%7B,%7D,|,%5C,%5E,[,],%60?|=[]&[]=| HTTP/1.1

Tabular view:

Input:    {    ,  }    ,  |    ,  \    ,  ^    ,  [  ,  ]  ,  `
Chrome:   %7B  ,  %7D  ,  %7C  ,  /    ,  %5E  ,  [  ,  ]  ,  %60
Firefox:  %7B  ,  %7D  ,  |    ,  %5C  ,  %5E  ,  [  ,  ]  ,  %60

I'm surprised to see Chrome transforming \ into /.

@nyov (Contributor) commented Apr 5, 2016

That must be new in Firefox then, since @dangra specifically noted that Firefox 31 would turn [] into %5B%5D.
So the differences seem to be diminishing? Or is it platform-dependent behavior?
Thanks for the test; now I wonder how other browsers might "identify" there.

@redapple (Contributor) commented Jun 2, 2017

Today's tests:

For this link

http://localhost:8001/unwise{,},|,\,^,[,],`?|=[]&[]=|

this is what is captured:

Chrome/58.0.3029.110            GET /unwise%7B,%7D,%7C,/,%5E,[,],%60?|=[]&[]=| HTTP/1.1
Firefox/53.0                    GET /unwise%7B,%7D,|,/,%5E,[,],%60?|=[]&[]=| HTTP/1.1
Scrapy/1.4.0/w3lib/1.17.0       GET /unwise%7B,%7D,|,%5C,%5E,%5B,%5D,%60?|=%5B%5D&%5B%5D=| HTTP/1.1

In tabular view:

                          {   }   |   \   ^   [   ]   `   ?   |   =   [   ]   &   [   ]   =   |
Chrome/58.0.3029.110      %7B %7D %7C /   %5E [   ]   %60 ?   |   =   [   ]   &   [   ]   =   |
Firefox/53.0              %7B %7D |   /   %5E [   ]   %60 ?   |   =   [   ]   &   [   ]   =   |
Scrapy/1.4.0/w3lib/1.17.0 %7B %7D |   %5C %5E %5B %5D %60 ?   |   =   %5B %5D &   %5B %5D =   |

@kmike (Member, Author) commented Jun 16, 2017

@redapple https://stackoverflow.com/questions/10438008/different-behaviours-of-treating-backslash-in-the-url-by-firefox-and-chrome/39860198#39860198 suggests that \ handling was made compatible with Scrapy's in Chrome > 53, but it seems your test doesn't confirm that.

To me it looks like changing the handling of [ and ] while leaving | and \ aside is the way to go.

We can leave | as-is for now, because according to your tests the replacement should be smarter: even Chrome only escapes it in the path, not in the query. Replacing \ is crazy; I'm also surprised browsers do that, though the Stack Overflow answer provides motivation: it is for Windows users who are used to \ as a path separator.

What do you think?
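The point about part-dependent quoting can be sketched with the standard library: split the URL and apply a different safe set to the path and the query. The safe sets here are illustration-only assumptions modeled on the Chrome captures in this thread (| quoted in the path, untouched in the query), not w3lib's actual behavior:

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Assumed, illustration-only safe sets: quote "|" in the path
# (as Chrome does) but leave it alone in the query.
PATH_SAFE = "-_.~!*'();:@&=+$,/[]%"
QUERY_SAFE = PATH_SAFE + "?|"

def quote_by_part(url: str) -> str:
    """Percent-encode the path and query of `url` with different safe sets."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(
        path=quote(parts.path, safe=PATH_SAFE),
        query=quote(parts.query, safe=QUERY_SAFE),
    ))

print(quote_by_part("http://localhost:8001/[]|?|=[]&[]=|"))
# -> http://localhost:8001/[]%7C?|=[]&[]=|
# This matches Chrome's request line captured above.
```

A flat character set, by contrast, must either quote | everywhere or nowhere, which is exactly why the single-set heuristic can't match any browser in all cases.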

@redapple (Contributor) commented:

I agree with not escaping [ and ] anymore and leaving it at that for the time being.

@codecov-io commented Jun 16, 2017

Codecov Report

Merging #25 into master will increase coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #25      +/-   ##
==========================================
+ Coverage   94.84%   94.88%   +0.03%     
==========================================
  Files           7        7              
  Lines         466      469       +3     
  Branches       95       95              
==========================================
+ Hits          442      445       +3     
  Misses         16       16              
  Partials        8        8
Impacted Files Coverage Δ
w3lib/url.py 96.89% <100%> (+0.04%) ⬆️


@kmike changed the title from "use constants from RFC3986" to "[MRG] use constants from RFC3986" on Jun 16, 2017
@redapple (Contributor) left a review:

Thanks @kmike !

@dangra (Member) commented Jan 3, 2019

FTR, an update from the ever-changing world of URL encoding (WIP):

Firefox

GET /[]|?|=[]&[]=| HTTP/1.1
Host: localhost:8001
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://files.scrapinghub.com/w3lib-issue25.html
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1

Chrome

GET /[]%7C?|=[]&[]=| HTTP/1.1
Host: localhost:8001
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://files.scrapinghub.com/w3lib-issue25.html
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,es-419;q=0.8,es;q=0.7

Safari

GET /[]|?|=[]&[]=| HTTP/1.1
Host: localhost:8001
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.15
Referer: http://files.scrapinghub.com/w3lib-issue25.html
DNT: 1
Accept-Language: en-us

Scrapy (master version 5e65e52e, pre1.6)

GET /unwise%7B,%7D,|,%5C,%5E,%5B,%5D,%60?|=%5B%5D&%5B%5D=| HTTP/1.1
Host: localhost:8001
Accept-Language: en
Accept-Encoding: gzip,deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Scrapy/1.5.1 (+https://scrapy.org)
