Pipe symbol ("|") is not percent encoded #80

odinplus · 2016-12-12T20:40:18Z

The pipe symbol ("|") is in the reserved symbols list in url.py https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L67 and is not percent-encoded by safe_url_string, which Scrapy uses when downloading URLs.
The RFC mentioned in url.py, https://www.ietf.org/rfc/rfc3986.txt, doesn't list "|" among its reserved characters:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

I've also found a site (in the Alexa top 20, possibly using the Play framework) which has such links with pipes, and it answers with HTTP code 400 (Bad Request) if "|" is not percent-encoded in the URL.
Is this a bug? How can I avoid it properly? For now I have just removed "|" from the list in url.py itself.
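
To make the report concrete, here is a minimal sketch using only the standard library. It checks the pipe against the RFC 3986 reserved characters quoted above and shows a blunt workaround that encodes only "|" before the URL is requested. The helper name encode_pipes and the example.com URL are made up for illustration; this is not w3lib's or Scrapy's API.

```python
from urllib.parse import quote

# RFC 3986 reserved characters, copied from the grammar quoted above.
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RFC3986_RESERVED = GEN_DELIMS + SUB_DELIMS


def encode_pipes(url: str) -> str:
    """Hypothetical helper: percent-encode only the pipe, leave the rest as-is."""
    return url.replace("|", "%7C")


# The pipe is not part of the RFC 3986 reserved set.
print("|" in RFC3986_RESERVED)  # False

# quote() escapes the pipe when only the RFC 3986 reserved set is marked safe.
print(quote("a|b", safe=RFC3986_RESERVED))  # a%7Cb

# Workaround applied to a whole URL before it is passed on for download.
print(encode_pipes("http://example.com/a|b?x=1|2"))
# http://example.com/a%7Cb?x=1%7C2
```

Replacing the character directly avoids touching any other part of the URL, so existing percent-escapes are not double-encoded.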


kmike · 2016-12-12T20:52:23Z

There is a stalled PR to address that: #25

odinplus · 2016-12-13T05:55:12Z

Yes @kmike, that PR addresses exactly this issue. If a general solution for safe symbols is impossible, maybe it is worth giving optional control to the user? With an additional parameter in request.meta, for example.

kmike · 2017-06-16T14:16:46Z

@odinplus I wonder how this site works with Firefox, as according to @redapple's test Firefox doesn't encode | either.

odinplus · 2017-06-20T03:14:43Z

@kmike with Firefox it answers with the same code 400 if there is a pipe symbol in the URL.

kmike · 2017-06-20T15:18:43Z

It seems there is still no consensus between browsers on how to handle different characters in the URL path (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1064700). This means that a website which works in one browser may break in another, and we can't create an escaping method which works everywhere. #25 is merged, but we've removed | handling from it, so #25 does not fix this particular issue. | handling was removed for these reasons:

it makes changes smaller, more focused and less controversial;
Firefox handles | the same way as w3lib, so it is not that | handling is incorrect per se;
Chrome handles | differently in path and in query, while safe_url_string doesn't make this distinction currently, using the same set of chars - likely it should though.
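
The path/query distinction mentioned in the last point could be sketched roughly as below, using only the standard library. The two safe sets, and the choice to escape "|" in the path but not in the query, are assumptions picked purely for illustration. They are not a statement of what Chrome or any other browser does, and quote_by_component is a hypothetical name, not part of w3lib.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Hypothetical safe sets, chosen only to illustrate per-component handling.
# "%" is kept safe so that existing percent-escapes are not double-encoded.
PATH_SAFE = "/:@!$&'()*+,;=%"
QUERY_SAFE = PATH_SAFE + "?|"  # assumption: leave "|" alone in the query


def quote_by_component(url: str) -> str:
    """Quote the path and the query with different safe sets."""
    parts = urlsplit(url)
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(quote_by_component("http://example.com/a|b?x=1|2"))
# http://example.com/a%7Cb?x=1|2
```

Splitting the URL first is what makes a per-component policy possible; a single shared safe set cannot express it.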

I'm not opposed to changing the way | is handled; we can do it in a separate PR.

But even if we do it, we still won't cover all cases, because behavior differs between browsers. @nyov proposed having an option to specify which browser we should emulate (#25 (comment)). I think it may require work to maintain, because browsers change, so these rules are not set in stone. They already changed between the experiments @dangra and @redapple were making.

I wonder if a more future-proof (though less user-friendly) way to tackle this is to fix scrapy/scrapy#833.

Gallaecio · 2022-11-01T10:04:23Z

I think Firefox’s approach is the right one in light of https://url.spec.whatwg.org/, which should be considered the latest URL standard.

However, until adoption grows, I wonder if we should, as you @kmike suggest, update safe_url_string to be “safer”.

Gallaecio · 2022-11-03T22:33:12Z

> we can't create escaping method which works everywhere

Well, if we focus specifically on the logic of whether or not to escape a given code point, I think escaping it if any major browser escapes it would be a valid, safe approach. In fact, we may want to decide which characters to escape based not so much on what characters web browsers escape, but on what characters servers out there need escaped. Over-escaping should not be a problem, so aiming to support as many servers as possible by escaping any character that some server may need escaped may be the safest approach here.
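
As a rough sketch of the "escape it if anyone escapes it" idea: take the union of the characters that each client (or server) is known to need escaped, and remove that union from the safe list. The per-client sets below are invented placeholders, not measured browser behavior, and BASE_SAFE is only an assumed baseline that leaves "|" alone, as the original report describes.

```python
from urllib.parse import quote

# Placeholder data: which characters each hypothetical client escapes.
ESCAPED_BY = {
    "client_a": set(' "<>`'),
    "client_b": set(' "<>`|'),
}

# Escape a character if at least one client escapes it.
MUST_ESCAPE = set().union(*ESCAPED_BY.values())

# Assumed baseline safe list that includes the pipe, minus the union above.
BASE_SAFE = ":/?#[]@!$&'()*+,;=%|"
SAFE = "".join(c for c in BASE_SAFE if c not in MUST_ESCAPE)


def escape_conservatively(s: str) -> str:
    """Quote s, escaping anything that any listed client would escape."""
    return quote(s, safe=SAFE)


print(escape_conservatively("/a|b c"))  # /a%7Cb%20c
```

Keeping the per-client data separate from the derived safe list means the policy stays the same when the observations change; only the ESCAPED_BY table would need updating.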

wRAR · 2022-11-04T11:54:40Z

I think this makes sense.

redapple mentioned this issue May 19, 2017

canonicalize_url use of safe_url_string breaks when an encoded hash character is encountered #91

Closed

Gallaecio added enhancement discuss labels May 9, 2019

Gallaecio mentioned this issue Nov 3, 2022

RFC-2396-encode URLs zytedata/python-zyte-api#37

Merged

Gallaecio mentioned this issue Nov 8, 2022

Make safe_url_string safer #201

Closed

10 tasks

Gallaecio mentioned this issue Nov 23, 2022

safe_url_string: escape additional characters #203

Merged

kmike closed this as completed in #203 Nov 24, 2022

Gallaecio mentioned this issue Feb 13, 2024

Implement a safe_url based on all standards #221

Draft

11 tasks
