-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC-2396-encode URLs #37
Conversation
@kmike brought up that it is not RFC-3986, but java.net.URI’s flavor of RFC-2396, that we need to support. After looking at the standards, and Java’s deviations, I think for our purposes updating the lists of characters to escape in different URL parts to those of the older standard is enough, and Java’s deviations do not seem to affect those, other than they make it clear that (escaped) UTF-8 characters are supported. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #37 +/- ##
==========================================
+ Coverage 37.15% 40.32% +3.17%
==========================================
Files 9 9
Lines 323 372 +49
Branches 45 54 +9
==========================================
+ Hits 120 150 +30
- Misses 203 217 +14
- Partials 0 5 +5
|
zyte_api/utils.py
Outdated
path_encoding: str = "utf8", | ||
quote_path: bool = True, | ||
) -> str: | ||
"""Fork of ``w3lib.url.safe_url_string`` that enforces `RFC-3986`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @Gallaecio! Do you know if there is any way to prevent double-encoding in this function?
In [1]: from w3lib.url import safe_download_url, safe_url_string, canonicalize_url
In [2]: safe_url_string("https://www.example.com/product/hometales-pack-of-2-sky/626857878294#bcrumbSearch:pillow%20cover")
Out[2]: 'https://www.example.com/product/hometales-pack-of-2-sky/626857878294#bcrumbSearch:pillow%20cover'
vs
In [1]: from zyte_api.utils import _safe_url_string
In [2]: _safe_url_string("https://www.example.com/product/hometales-pack-of-2-sky/626857878294#bcrumbSearch:pillow%20cover")
Out[2]: 'https://www.example.com/product/hometales-pack-of-2-sky/626857878294#bcrumbSearch:pillow%2520cover'
The issue is that sometimes input URLs are encoded, and sometimes they are not. w3lib's safe_url_string
tries support both. It seems that after this change we might break the client library for users who send the percent-escaped URLs (e.g. the website itself may already provide encoded versions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, let me fix that and add a test for it.
Rather than fixing my implementation here, I am thinking I should clean up and extend the corresponding tests on w3lib, then update w3lib to support the URL limitations of as many servers as possible (including Zyte API’s), and then use upstream |
Thanks @Gallaecio! |
No description provided.