
[MRG] use constants from RFC3986 #25

Merged (4 commits) on Jun 16, 2017

Conversation

@kmike (Member) commented Aug 3, 2014

@nramirezuy noticed that "|" is not in RFC 3986: scrapy/scrapy#508 (comment). I've checked the RFC and updated the code to use constants from this RFC:

  • "|" is removed;
  • "[" and "]" are added.

@dangra (Member) commented Aug 4, 2014

Are you sure about removing "|"? Chrome quotes it if you type it in the URL bar, and so does GitHub's Markdown processor.. :)

@kmike (Member, Author) commented Aug 4, 2014

So does the updated safe_url_string: "|" is removed from safe_chars in this PR, so safe_url_string now quotes it.

safe_url_string doesn't quote symbols that are never quoted (ASCII letters, digits, a couple of other symbols); it also doesn't quote symbols that have special meaning in a URL (separators and %), assuming they are already quoted properly. It is essentially a hack: quoting rules depend on which part of the URL a character appears in, so using a single set of characters to escape for the whole URL is not going to work properly in 100% of cases. As I understand it, safe_url_string assumes the URL is already almost quoted and tries to fix some easy mistakes common in web pages created by humans.

I'm not 100% sure the PR is correct because the whole safe_url_string business is tricky. But at least it is more consistent with the RFC, the references to the RFC become correct, and the quoting is more like Chrome's (as your example with "|" shows). Chrome doesn't quote [ and ] though.
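A rough illustration of that heuristic using only the standard library (this is a sketch, not w3lib's actual implementation; the safe set below is an assumption built from RFC 3986's reserved characters plus "%"):

```python
from urllib.parse import quote

# Assumed safe set for illustration: RFC 3986 reserved characters
# (gen-delims and sub-delims) plus "%", so existing %XX escapes and
# URL separators are left untouched. ASCII letters, digits, and
# "-_.~" are never quoted by quote() anyway.
RFC3986_SAFE = "-_.~!*'();:@&=+$,/?#[]%"

def rough_safe_url(url: str) -> str:
    # quote() percent-encodes every byte not in `safe`.
    return quote(url, safe=RFC3986_SAFE)

print(rough_safe_url("http://localhost:8001/[]|"))
# -> http://localhost:8001/[]%7C
# "|" is outside the safe set, so it becomes %7C; "[" and "]" survive.
```

With this PR's change, "|" falls outside the safe set while "[" and "]" fall inside it, which is what the example shows.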

@dangra (Member) commented Aug 4, 2014

I'm OK to merge it. Do you think we need a test case for the removed/added chars? I wonder where the current list came from.

@kmike (Member, Author) commented Aug 4, 2014

@dangra I tried to track where this list came from; it has been there forever (I can't find where it was introduced in either the w3lib or the Scrapy git history). As for tests, I'm not sure it's worth it, because the fix is clear but the correct behavior is not :)

Also, since Chrome doesn't quote [ and ], we may want to leave them unquoted as well, but I think a departure from the RFC constants should be justified by some real-world example.

@dangra (Member) commented Aug 4, 2014

Firefox does the same as Chrome. I tend to prefer browser behavior over the RFC, to be honest.
See #25 (comment)

@kmike (Member, Author) commented Aug 4, 2014

How did you check what Firefox is doing?

@kmike (Member, Author) commented Aug 4, 2014

Browsers behaving differently is a valid reason to depart from RFCs.

@dangra (Member) commented Aug 4, 2014

> How did you check what Firefox is doing?

I was looking at the URL bar.

More experiments, clicking the link at http://files.scrapinghub.com/w3lib-issue25.html

<!DOCTYPE html>
<html>
  <body>
    <p>run in a console: <pre>nc -l -p 8001</pre> and <a href="http://localhost:8001/[]|">click here</a></p>
  </body>
</html>
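If nc isn't available, a minimal Python stand-in (a hypothetical helper, not part of this PR) can capture the raw request line the same way:

```python
import socket

def capture_one_request(port: int = 8001) -> str:
    """Accept one connection on localhost:<port>, answer with an empty
    200 response, and return the raw request line as the browser sent it."""
    with socket.socket() as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("localhost", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(65536).decode("latin-1")
            conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\n\r\n")
    # First line of the request, e.g. "GET /[]%7C HTTP/1.1"
    return data.splitlines()[0]
```

Run it, click the link in the test page, and the returned string shows exactly how the browser percent-encoded the path.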

Chrome

GET /[]%7C HTTP/1.1
Host: localhost:8001
Connection: keep-alive
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
DNT: 1
Referer: http://files.scrapinghub.com/w3lib-issue25.html
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8,es-419;q=0.6,es;q=0.4

Firefox

GET /%5B%5D| HTTP/1.1
Host: localhost:8001
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://files.scrapinghub.com/w3lib-issue25.html
Connection: keep-alive

So, let's follow the RFC and merge this PR ASAP... now I am confused.

@dangra (Member) commented Aug 4, 2014

w3lib v1.8.0 follows FF behavior:

>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://localhost:8001/[]|')
'http://localhost:8001/%5B%5D|'

@dangra (Member) commented Aug 4, 2014

and after this change, w3lib follows RFC and Chrome behavior:

>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://localhost:8001/[]|')
'http://localhost:8001/[]%7C'

@dangra (Member) commented Aug 4, 2014

curl has some issues interpreting [] in URLs:

$ curl 'http://localhost:8001/[]|' 
curl: (3) [globbing] bad range specification in column 24

wget follows the rfc.

@dangra (Member) commented Aug 4, 2014

Damn.. curl's treatment of [ and ] is weird for users copy-pasting URLs in the console, and at least my gnome-terminal doesn't recognize the full URL as clickable.

@dangra (Member) commented Aug 4, 2014

> ...and at least my gnome-terminal doesn't recognize the full URL as clickable.

Just a fun fact: it doesn't recognize the unquoted | either.

@kmike (Member, Author) commented Aug 4, 2014

A good investigation!

curl handles [] differently because of its "globbing" option; when used with --globoff curl doesn't escape anything.

To be clear: this issue is not about following an RFC, because AFAIK there is no RFC for such "loose" URL escaping, and safe_url_string doesn't follow any RFC. It implements a heuristic that uses constants from the RFC and adds its own, namely %.

I'd prefer to follow Chrome because its behavior looks easier to explain and because if we ever start testing w3lib to work the same as browsers we'll likely use some WebKit wrapper.

A better docstring for safe_url_string is worth adding.

@nyov (Contributor) commented Feb 26, 2016

So if both styles are common (Firefox and Chrome), how about merging both and adding an optional argument to safe_url_string to emulate either browser behavior?
quote_chars=[firefox|chrome|rfc|both]?

Always good to have the ability to escape some browser detection methods ;)
And for now I know that faking a Chrome/WebKit UA here can be detected...
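A sketch of what such an option might look like (hypothetical API: the quote_style parameter and the per-style safe sets below are invented for illustration, modeled on the behaviors observed earlier in this thread, where Firefox 31 quoted [ and ] but left | alone, and Chrome did the opposite):

```python
from urllib.parse import quote

# Hypothetical per-style safe sets, derived from the experiments above.
# "rfc"/"chrome": leave "[" and "]" unquoted, quote "|" (this PR's behavior).
# "firefox": leave "|" unquoted, quote "[" and "]" (Firefox 31's behavior).
_STYLE_SAFE = {
    "rfc": "-_.~!*'();:@&=+$,/?#[]%",
    "chrome": "-_.~!*'();:@&=+$,/?#[]%",
    "firefox": "-_.~!*'();:@&=+$,/?#|%",
}

def safe_url_string_styled(url: str, quote_style: str = "rfc") -> str:
    """Percent-encode `url` using the safe set of the chosen style."""
    return quote(url, safe=_STYLE_SAFE[quote_style])

print(safe_url_string_styled("http://localhost:8001/[]|", "firefox"))
# -> http://localhost:8001/%5B%5D|
print(safe_url_string_styled("http://localhost:8001/[]|", "chrome"))
# -> http://localhost:8001/[]%7C
```

This is only a toy: as kmike notes above, correct quoting really depends on which part of the URL a character sits in, not just on a flat character set.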

@redapple (Contributor) commented Apr 4, 2016

FWIW, here's what Firefox (45.0, Ubuntu) and Chrome (Version 49.0.2623.110, 64-bit) requested for the "unwise" characters from RFC 2396 in the path part,

unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

for a link with "href" attribute of:

http://localhost:8001/unwise{,},|,\,^,[,],`?|=[]&[]=|

(with local HTTP test server):

  • Chrome GET /unwise%7B,%7D,%7C,/,%5E,[,],%60?|=[]&[]=| HTTP/1.1
  • Firefox GET /unwise%7B,%7D,|,%5C,%5E,[,],%60?|=[]&[]=| HTTP/1.1

Tabular view:

Input:    {    ,  }    ,  |    ,  \    ,  ^    ,  [  ,  ]  ,  `
Chrome:   %7B  ,  %7D  ,  %7C  ,  /    ,  %5E  ,  [  ,  ]  ,  %60
Firefox:  %7B  ,  %7D  ,  |    ,  %5C  ,  %5E  ,  [  ,  ]  ,  %60

I'm surprised to see Chrome transforming \ into /.

@nyov (Contributor) commented Apr 5, 2016

That must be new in Firefox then, since @dangra specifically noted that Firefox 31 would turn [] into %5B%5D.
So the differences seem to be diminishing? Or is it platform-dependent behavior?
Thanks for the test; now I wonder how other browsers might "identify" there.

@redapple (Contributor) commented Jun 2, 2017

Today's tests:

For this link

http://localhost:8001/unwise{,},|,\,^,[,],`?|=[]&[]=|

this is what is captured:

Chrome/58.0.3029.110            GET /unwise%7B,%7D,%7C,/,%5E,[,],%60?|=[]&[]=| HTTP/1.1
Firefox/53.0                    GET /unwise%7B,%7D,|,/,%5E,[,],%60?|=[]&[]=| HTTP/1.1
Scrapy/1.4.0/w3lib/1.17.0       GET /unwise%7B,%7D,|,%5C,%5E,%5B,%5D,%60?|=%5B%5D&%5B%5D=| HTTP/1.1

In tabular view:

                          {   }   |   \   ^   [   ]   `   ?   |   =   [   ]   &   [   ]   =   |
Chrome/58.0.3029.110      %7B %7D %7C /   %5E [   ]   %60 ?   |   =   [   ]   &   [   ]   =   |
Firefox/53.0              %7B %7D |   /   %5E [   ]   %60 ?   |   =   [   ]   &   [   ]   =   |
Scrapy/1.4.0/w3lib/1.17.0 %7B %7D |   %5C %5E %5B %5D %60 ?   |   =   %5B %5D &   %5B %5D =   |

@kmike (Member, Author) commented Jun 16, 2017

@redapple https://stackoverflow.com/questions/10438008/different-behaviours-of-treating-backslash-in-the-url-by-firefox-and-chrome/39860198#39860198 suggests that \ handling was made compatible with Scrapy's in Chrome > 53, but it seems your test doesn't confirm that.

To me it looks like changing the handling of [ and ] while leaving | and \ aside is the way to go.

We can leave | as-is for now, because according to your tests the replacement should be smarter: even Chrome only escapes it in the path, not in the query. Replacing \ is crazy; I'm also surprised browsers do that, though the Stack Overflow answer provides motivation: it is for Windows users who are used to \ as a path separator.

What do you think?
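The point about part-dependent quoting can be sketched with the standard library: split the URL and apply a different safe set to the path and the query. The safe sets here are illustration-only assumptions modeled on the Chrome captures in this thread (| quoted in the path, untouched in the query), not w3lib's actual behavior:

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Assumed, illustration-only safe sets: quote "|" in the path
# (as Chrome does) but leave it alone in the query.
PATH_SAFE = "-_.~!*'();:@&=+$,/[]%"
QUERY_SAFE = PATH_SAFE + "?|"

def quote_by_part(url: str) -> str:
    """Percent-encode the path and query of `url` with different safe sets."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(
        path=quote(parts.path, safe=PATH_SAFE),
        query=quote(parts.query, safe=QUERY_SAFE),
    ))

print(quote_by_part("http://localhost:8001/[]|?|=[]&[]=|"))
# -> http://localhost:8001/[]%7C?|=[]&[]=|
# This matches Chrome's request line captured above.
```

A flat character set, by contrast, must either quote | everywhere or nowhere, which is exactly why the single-set heuristic can't match any browser in all cases.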

@redapple (Contributor) commented:

I agree with not escaping [ and ] anymore and leaving it at that for the time being.

@codecov-io commented Jun 16, 2017

Codecov Report

Merging #25 into master will increase coverage by 0.03%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #25      +/-   ##
==========================================
+ Coverage   94.84%   94.88%   +0.03%     
==========================================
  Files           7        7              
  Lines         466      469       +3     
  Branches       95       95              
==========================================
+ Hits          442      445       +3     
  Misses         16       16              
  Partials        8        8
Impacted Files Coverage Δ
w3lib/url.py 96.89% <100%> (+0.04%) ⬆️


@kmike changed the title from "use constants from RFC3986" to "[MRG] use constants from RFC3986" on Jun 16, 2017
@redapple (Contributor) left a review:

Thanks @kmike !

@dangra (Member) commented Jan 3, 2019

FTR, an update from the ever-changing world of URL encoding (WIP):

Firefox

GET /[]|?|=[]&[]=| HTTP/1.1
Host: localhost:8001
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:64.0) Gecko/20100101 Firefox/64.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://files.scrapinghub.com/w3lib-issue25.html
DNT: 1
Connection: keep-alive
Upgrade-Insecure-Requests: 1

Chrome

GET /[]%7C?|=[]&[]=| HTTP/1.1
Host: localhost:8001
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://files.scrapinghub.com/w3lib-issue25.html
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,es-419;q=0.8,es;q=0.7

Safari

GET /[]|?|=[]&[]=| HTTP/1.1
Host: localhost:8001
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.1 Safari/605.1.15
Referer: http://files.scrapinghub.com/w3lib-issue25.html
DNT: 1
Accept-Language: en-us

Scrapy (master version 5e65e52e, pre1.6)

GET /unwise%7B,%7D,|,%5C,%5E,%5B,%5D,%60?|=%5B%5D&%5B%5D=| HTTP/1.1
Host: localhost:8001
Accept-Language: en
Accept-Encoding: gzip,deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
User-Agent: Scrapy/1.5.1 (+https://scrapy.org)
