Pipe symbol ("|") is not percent encoded #80
Comments
There is a stalled PR to address that: #25
Yes @kmike, that PR addresses exactly this issue. If a general solution for safe symbols is impossible, maybe it is worth giving the user optional control, for example through an additional parameter in request.meta?
@kmike with Firefox the site also answers with code 400 if there is a pipe symbol in the URL.
It seems there is still no consensus between browsers on how to handle different characters in the URL path (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1064700). This means that a website which works in one browser may break in another, and we can't create an escaping method which works everywhere. #25 is merged, but we've removed
I'm not opposed to changing the way But even if we do it, we still won't cover all cases, because behavior differs between browsers. @nyov proposed an option to specify which browser we should emulate (#25 (comment)). I think it may require work to maintain, because browsers change, so these rules are not set in stone. They already changed between the experiments @dangra and @redapple were running. I wonder if a more future-proof (though less user-friendly) way to tackle this is to fix scrapy/scrapy#833.
I think Firefox’s approach is the right one in light of https://url.spec.whatwg.org/, which should be considered the latest URL standard. However, until adoption grows, I wonder if we should, as you @kmike suggest, update
Well, if we focus specifically on the logic of whether or not to escape a given code point, I think escaping it if any major browser escapes it would be a valid, safe approach. In fact, we may want to decide which characters to escape based not so much on which characters web browsers escape, but on which characters servers out there need escaped. Over-escaping should not be a problem, so aiming to support as many servers as possible by escaping any character that some server may need escaped may be the safest approach here.
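To make the over-escaping idea concrete, here is a minimal sketch using only the standard library. It assumes that "safest" means escaping every character that RFC 3986 does not allow unencoded in a path; that is one possible reading of the proposal, not what w3lib does. `PATH_SAFE` and `escape_path_conservatively` are hypothetical names.

```python
from urllib.parse import quote

# Characters left unescaped in a path: RFC 3986 sub-delims, plus ":", "@" and "/",
# plus "%" so that already-encoded sequences are not encoded twice.
# Letters, digits and "-._~" are never escaped by quote().
PATH_SAFE = "!$&'()*+,;=:@/%"


def escape_path_conservatively(path: str) -> str:
    """Percent-encode everything outside the RFC 3986 path character set."""
    return quote(path, safe=PATH_SAFE)


print(escape_path_conservatively("/a|b^c"))   # pipe and caret get escaped
print(escape_path_conservatively("/a%7Cb"))   # existing escape is left alone
```

The trade-off is that a stray literal "%" that is not part of an escape sequence is passed through unchanged.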
I think this makes sense.
Pipe symbol ("|") is in the reserved symbols list in url.py https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L67 and is not percent-encoded by safe_url_string, which is used by Scrapy to download URLs.
The RFC mentioned in url.py, https://www.ietf.org/rfc/rfc3986.txt, doesn't contain "|" in its reserved symbols:
And I've found a site (in the Alexa top 20, possibly using the Play framework) which has such links with pipes, and it answers with HTTP code 400 (Bad Request) if "|" is not percent-encoded in the URL.
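The reserved and unreserved sets from RFC 3986 (sections 2.2 and 2.3) can be written out and checked directly. This is a small standalone sketch; the constant names are illustrative and are not taken from url.py.

```python
import string

# RFC 3986, section 2.2: reserved = gen-delims / sub-delims
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = set(GEN_DELIMS + SUB_DELIMS)

# RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def may_appear_unencoded(ch: str) -> bool:
    """True if the RFC lists the character as reserved or unreserved."""
    return ch in RESERVED or ch in UNRESERVED


print("|" in RESERVED)            # False
print(may_appear_unencoded("|"))  # False
```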
Is this a bug? How can I avoid it properly? For now I just removed "|" from url.py itself.
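A less invasive workaround than editing url.py might be to percent-encode the pipes before the URL reaches the downloader. The sketch below uses only the standard library; `encode_pipes` is a hypothetical helper, and it assumes that later escaping leaves existing percent-escapes such as `%7C` untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def encode_pipes(url: str) -> str:
    """Replace literal "|" with "%7C" in the path and query of a URL."""
    parts = urlsplit(url)
    path = parts.path.replace("|", "%7C")
    query = parts.query.replace("|", "%7C")
    # Scheme, host and fragment are passed through unchanged.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(encode_pipes("http://example.com/a|b?x=1|2"))
# http://example.com/a%7Cb?x=1%7C2
```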