add_or_replace_parameter lacks control of what is encoded #106

Open
rennerocha opened this issue Apr 17, 2018 · 1 comment

@rennerocha
Contributor

add_or_replace_parameter uses the urlencode function to convert the parameter value into a percent-encoded ASCII string:
https://github.com/scrapy/w3lib/blob/master/w3lib/url.py#L229

If the value of the parameter is already percent-encoded (in my scenario, the value came from the JSON response of an API), there is no way to specify safe characters that should not be encoded again. The urlencode signature already provides a safe argument for this:
https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlencode

My suggestion is to add a safe argument to add_or_replace_parameter to give the caller more control over how values are encoded:
Current: w3lib.url.add_or_replace_parameter(url, name, new_value)
Proposed: w3lib.url.add_or_replace_parameter(url, name, new_value, safe='')

In [1]: from urllib.parse import urlencode
   ...: from w3lib.url import add_or_replace_parameter
   ...: # This came from an API response already encoded
   ...: full_hierarchy = 'Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'
   ...: url = 'http://example.com'

In [2]: urlencode({'hierarchy': full_hierarchy})
Out[2]: 'hierarchy=Appliances_Air%2BPurifiers%2B%2526%2BDehumidifiers_Air%2BPurifiers'

In [3]: urlencode({'hierarchy': full_hierarchy}, safe='+%')
Out[3]: 'hierarchy=Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'

In [4]: add_or_replace_parameter(url, 'hierarchy', full_hierarchy)
Out[4]: 'http://example.com?hierarchy=Appliances_Air%2BPurifiers%2B%2526%2BDehumidifiers_Air%2BPurifiers'
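
To make the proposal concrete, here is a minimal standard-library sketch of what a safe argument could look like. The function name add_or_replace_parameter_safe is hypothetical, and this is not w3lib's actual implementation; it only assumes that the query string is rebuilt with urlencode, as described above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_or_replace_parameter_safe(url, name, new_value, safe=''):
    """Hypothetical variant of add_or_replace_parameter with a `safe` argument.

    Illustration only: w3lib does not provide this function.
    """
    parts = urlsplit(url)
    # parse_qsl decodes the existing parameters into raw values.
    params = parse_qsl(parts.query, keep_blank_values=True)

    new_params = []
    replaced = False
    for key, value in params:
        if key == name:
            # Replace the first occurrence and drop any later duplicates.
            if not replaced:
                new_params.append((key, new_value))
                replaced = True
        else:
            new_params.append((key, value))
    if not replaced:
        new_params.append((name, new_value))

    # `safe` is forwarded to urlencode, so it applies to every parameter,
    # not only to the one being added or replaced.
    query = urlencode(new_params, safe=safe)
    return urlunsplit(parts._replace(query=query))


full_hierarchy = 'Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'
print(add_or_replace_parameter_safe(
    'http://example.com', 'hierarchy', full_hierarchy, safe='+%'))
# -> http://example.com?hierarchy=Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers
```

One caveat of this sketch: because safe is applied to the whole rebuilt query, an existing parameter such as a=1%2B1 would come back as a=1+1 when safe='+%' is passed, which changes its meaning.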
@kmike
Member

kmike commented Apr 18, 2018

Hey @rennerocha!

Do you have any other use cases for overriding safe characters?

It looks like in your case a workaround could be to unescape the parameters before passing them to add_or_replace_parameter, since this function works with raw, unescaped parameter values, i.e. the values the server gets after decoding.
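
A minimal sketch of that workaround, using only the standard library. It assumes the API value was encoded in form style (+ for spaces), so unquote_plus is the matching decoder; urlencode stands in here for the encoding step that add_or_replace_parameter performs, as described earlier in the issue.

```python
from urllib.parse import unquote_plus, urlencode

# Value as returned by the API, already percent-encoded.
full_hierarchy = 'Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers'

# Decode it back to the raw value the server would see.
raw_hierarchy = unquote_plus(full_hierarchy)
print(raw_hierarchy)
# -> Appliances_Air Purifiers & Dehumidifiers_Air Purifiers

# Encoding the raw value once reproduces the original string,
# with no double escaping.
query = urlencode({'hierarchy': raw_hierarchy})
print(query)
# -> hierarchy=Appliances_Air+Purifiers+%26+Dehumidifiers_Air+Purifiers
```

With w3lib, the same raw_hierarchy value would be passed as new_value to add_or_replace_parameter in place of full_hierarchy.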

Allowing callers to override the safe characters can be seen as a performance optimization, a hack to avoid unescaping and re-escaping, unless you see other use cases for it. While it may solve the problem, I find the resulting API unintuitive. For example, does the safe parameter affect only the new parameter, or does it affect the encoding of the original URL as well? What should its value be when you want to pass already-escaped parameters, or just use something as-is?
