Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pipe symbol ("|") is not percent encoded #80

Closed
Tracked by #221 ...
odinplus opened this issue Dec 12, 2016 · 8 comments · Fixed by #203
Closed
Tracked by #221 ...

Pipe symbol ("|") is not percent encoded #80

odinplus opened this issue Dec 12, 2016 · 8 comments · Fixed by #203

Comments

@odinplus
Copy link

odinplus commented Dec 12, 2016

Pipe symbol ("|") is in reserved symbols list in url.py https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L67 and is not percent encoded by safe_url_string which used by scrapy to download urls.
RFC mentioned in url.py https://www.ietf.org/rfc/rfc3986.txt doesn't contain "|" in reserved symbols:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

And I've found a site (in top 20 of Alexa, possible using play framework) which has such links with pipes, and it is answering with http code 400 (bad request) if "|" is not percent encoded in url.
Is this a bug? How can I avoid it properly? For now I just removed "|" from url.py itself.

@kmike
Copy link
Member

kmike commented Dec 12, 2016

There is a stalled PR to address that: #25

@odinplus
Copy link
Author

odinplus commented Dec 13, 2016

Yes @kmike, that PR addressing exactly this issue. Maybe if it is impossible to make general solution for safe symbols then it is worth to give optional control to the user? With some additional parameter in request.meta for example.

@kmike
Copy link
Member

kmike commented Jun 16, 2017

@odinplus I wonder how this site works with Firefox, as according to @redapple's test Firefox doesn't encode | as well.

@odinplus
Copy link
Author

odinplus commented Jun 20, 2017

@kmike with Firefox it is answering with same code 400 if there is a pipe symbol in url.

@kmike
Copy link
Member

kmike commented Jun 20, 2017

It seems there is still no consensus between browsers how to handle different characters in URL path (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1064700). This means that a website which works in one browser may break in another, and we can't create escaping method which works everywhere. #25 is merged, but we've removed | handling from it, so #25 does not fix this particular issue. | handling was removed for these reasons:

  1. it makes changes smaller, more focused and less controversial;
  2. Firefox handles | the same way as w3lib, so it is not that | handling is incorrect per se;
  3. Chrome handles | differently in path and in query, while safe_url_string doesn't make this distinction currently, using the same set of chars - likely it should though.

I'm not opposed to change the way | is handled; we can do it in a separate PR.

But even if we do it, we still won't cover all cases, because behavior differs in browsers. @nyov proposed to have an option to specify which browser should we emulate (#25 (comment)). I think it may require work to maintain, because browsers change, so these rules are not set in stone. They already changed between experiments @dangra and @redapple were making.

I wonder if a more future-proof (though less user-friendly) way to tackle this is to fix scrapy/scrapy#833.

@Gallaecio
Copy link
Member

I think Firefox’s approach is the right one in light of https://url.spec.whatwg.org/, which should be considered the latest URL standard.

However, until adoption grows, I wonder if we should, as you @kmike suggest, update safe_url_string to be “safer”.

@Gallaecio
Copy link
Member

Gallaecio commented Nov 3, 2022

we can't create escaping method which works everywhere

Well, if we focus specifically on the logic of whether or not to escape a given code point, I think escaping it if any major browser escapes it would be a valid, safe approach. In fact, we may want to decide which characters to escape not so much based on what characters web browser escape, but what characters servers out there need escaped. Over-escaping should not be a problem, so aiming to support as many servers as possible by escaping any characters that some server may need escaped may be the safest approach here.

@wRAR
Copy link
Member

wRAR commented Nov 4, 2022

I think this makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants