[MRG] use constants from RFC3986 #25
|
So does the updated
I'm not 100% sure the PR is correct because the whole |
I'm OK to merge it. Do you think we need a test case for the removed/added chars? I wonder how we got the current list. |
@dangra I tried to track down where this list came from; it has been there forever (I can't find where it was introduced in either the w3lib or the scrapy git history). As for the tests, I'm not sure it's worth it, because the fix is clear but the correct behavior is not :)
|
|
How did you check what Firefox is doing? |
Browsers behaving differently is a valid reason to depart from RFCs. |
I was looking at the URL bar. More experiments, clicking the link at http://files.scrapinghub.com/w3lib-issue25.html:
<!DOCTYPE>
<html>
<body>
<p>run in a console: <pre>nc -l -p 8001</pre> and <a href="http://localhost:8001/[]|">click here</a></p>
</body>
</html>
Chrome
Firefox
|
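For anyone repeating the experiment without `nc`, here is a minimal sketch of a capture server built only on the Python standard library. The names `CaptureHandler` and `capture_one` are made up for this illustration; the sketch relies on `http.server` exposing the request target in `self.path` without unquoting it. `capture_one` plays both roles (server and a raw-socket client), so it can be checked without a browser.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # request targets, stored as received on the wire


class CaptureHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: records the request target and replies 200."""

    def do_GET(self):
        # http.server does not unquote the target, so this shows what the
        # client actually sent (e.g. "/[]%7C" vs "/%5B%5D|").
        captured.append(self.path)
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # silence the default stderr access log


def capture_one(raw_target):
    """Serve one request on a free local port and return the captured target."""
    server = HTTPServer(("127.0.0.1", 0), CaptureHandler)
    worker = threading.Thread(target=server.handle_request)
    worker.start()
    # Raw socket client, so nothing re-encodes the target on the way out.
    with socket.create_connection(server.server_address) as sock:
        request = f"GET {raw_target} HTTP/1.0\r\nHost: localhost\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        sock.recv(1024)
    worker.join()
    server.server_close()
    return captured[-1]


if __name__ == "__main__":
    print(capture_one("/[]|"))
```

To test a real browser instead, bind `HTTPServer` to port 8001, call `serve_forever()`, and print `self.path` in `do_GET`.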
w3lib v1.8.0 follows FF behavior:
>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://localhost:8001/[]|')
'http://localhost:8001/%5B%5D|'
|
and after this change, w3lib follows RFC and Chrome behavior:
>>> from w3lib.url import safe_url_string
>>> safe_url_string('http://localhost:8001/[]|')
'http://localhost:8001/[]%7C'
|
curl has some issues interpreting
wget follows the rfc. |
damn.. curl treatment of |
just a fun fact, it doesn't recognize the unquoted |
A good investigation! curl handles To be clear: this issue it is not about following RFC because AFAIK there is no RFC for such "loose" URL escaping, and I'd prefer to follow Chrome because its behavior looks easier to explain and because if we ever start testing w3lib to work the same as browsers we'll likely use some WebKit wrapper. A better docstring for |
So if both styles are common (firefox and chrome), then how about merging both and adding an optional argument to Always good to have the ability to escape some browser detection methods ;) |
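A rough sketch of what such an optional argument could look like, built on `urllib.parse.quote` rather than on w3lib itself. Everything here is hypothetical: the function name `escape_like`, the `style` parameter, and the two safe-character sets, which are inferred only from the two `safe_url_string` outputs quoted earlier in this thread and do not claim to model either browser fully.

```python
from urllib.parse import quote

# Characters kept as-is in both styles (assumption: RFC 3986 delimiters
# other than "[" and "]", plus "%" so existing escapes are not re-encoded).
_COMMON_SAFE = "%:/?#@!$&'()*+,;="

# Hypothetical per-style additions, taken from the outputs in this thread.
_SAFE_BY_STYLE = {
    "chrome": _COMMON_SAFE + "[]",   # keeps "[" and "]", escapes "|"
    "firefox": _COMMON_SAFE + "|",   # keeps "|", escapes "[" and "]"
}


def escape_like(url, style="chrome"):
    """Percent-encode unsafe characters using the chosen style's safe set."""
    try:
        safe = _SAFE_BY_STYLE[style]
    except KeyError:
        raise ValueError(f"unknown style: {style!r}") from None
    return quote(url, safe=safe)


if __name__ == "__main__":
    for style in ("firefox", "chrome"):
        print(style, escape_like("http://localhost:8001/[]|", style=style))
```

The cost of this design is that every style becomes a behavior to keep in sync with a moving target, since the browsers themselves change over time.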
FWIW, here's what Firefox (45.0, Ubuntu) and Chrome (Version 49.0.2623.110 (64-bit)) requested for the "unwise" characters from RFC 2396 in the path part,
for a link with "href" attribute of:
(with local HTTP test server):
Tabular view:
I'm surprised to see Chrome transforming |
That must be new in FF then, since @dangra especially noted that firefox 31 would turn |
Today's tests: For this link
this is what is captured:
In tabular view:
|
@redapple https://stackoverflow.com/questions/10438008/different-behaviours-of-treating-backslash-in-the-url-by-firefox-and-chrome/39860198#39860198 suggests that For me it looks like changing We can leave What do you think? |
I agree with not escaping |
* "|" is removed;
* "[" and "]" are added.
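To illustrate why the RFC 3986 character classes lead to exactly this change, here is a small standalone sketch. The constant names and the helper `escape_with_rfc3986` are illustrative and not necessarily what the PR uses; the character classes themselves come from RFC 3986 sections 2.2 and 2.3. Keeping "%" safe is an extra assumption so that already-escaped sequences are not encoded twice.

```python
import string
from urllib.parse import quote

# RFC 3986, section 2.2: reserved characters.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = GEN_DELIMS + SUB_DELIMS

# RFC 3986, section 2.3: unreserved characters.
UNRESERVED = string.ascii_letters + string.digits + "-._~"

# Assumption for this sketch: "%" stays safe to avoid double-encoding.
SAFE_CHARS = RESERVED + UNRESERVED + "%"


def escape_with_rfc3986(url):
    """Percent-encode every character outside the RFC 3986 classes above."""
    return quote(url, safe=SAFE_CHARS)


if __name__ == "__main__":
    # "[" and "]" are gen-delims, so they stay; "|" is in neither class.
    print(escape_with_rfc3986("http://localhost:8001/[]|"))
```

Since "|" appears in neither class it gets percent-encoded, while "[" and "]" are gen-delims and pass through, which matches the two bullet points above.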
Codecov Report
@@            Coverage Diff             @@
##           master      #25      +/-   ##
==========================================
+ Coverage   94.84%   94.88%   +0.03%
==========================================
  Files           7        7
  Lines         466      469       +3
  Branches       95       95
==========================================
+ Hits          442      445       +3
  Misses         16       16
  Partials        8        8
|
Thanks @kmike !
FTR, an update on the "URL encoding changes all the time" world (WIP):
Firefox
Chrome
Safari
Scrapy (master version 5e65e52e, pre1.6)
|
@nramirezuy noticed that "|" is not in RFC3986: scrapy/scrapy#508 (comment). I've checked the RFC and updated the code to use constants from this RFC: