Improve uri.parseQuery to never raise an error #16647

mildred · 2021-01-09T00:26:07Z

When a malformed query string contains = inside a value, handle
this character as part of the value instead of throwing an error.

The following query string should no longer crash a program:

key=value&key2=x=1

It will be interpreted as:

[("key", "value"), ("key2", "x=1")]

timotheecour · 2021-01-09T01:54:30Z

probably correct but please give some links (eg spec, stackoverflow etc) in your PR description to double check;

tests need fixing

lib/pure/uri.nim

dom96

This is a breaking change. But also whether to raise the error should be customisable.

Please introduce a strict_parsing flag like in Python's urllib: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.parse_qs (it should be true by default to keep backwards compatibility)

timotheecour · 2021-01-09T19:43:17Z

This is a breaking change

which code would break?
if you're talking about doAssertRaises(parseQuery("key=value&key2=x=1")) I wouldn't categorize it as a breaking change, it'd be something used only in some test suite, not actual code

dom96 · 2021-01-09T20:07:30Z

@timotheecour As an example, there may be code out there which relies on being able to check for validity of the query string and explicitly checks for a ValueError.

Are there any downsides to keeping backwards compatibility here?

timotheecour · 2021-01-09T20:23:26Z

if the spec (or most libraries) allows =, then I'd much prefer doing a -d:nimLegacyParseQueryStrict and being correct by default than legacy behavior by default. The point is to avoid accumulating non-standard behavior over time.

mildred · 2021-01-11T16:07:55Z

I found some references for parsing query strings (and checked them):

WhatWG's HTML5 urlencoded parser accepts literal = characters in value
Go: https://play.golang.org/p/8Yug9Kql2CF no parsing error and = interpreted verbatim in value
Python urllib.parse.parse_qs: same behaviour
Ruby CGI::parse accepts = too
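
The lenient rule shared by the references above can be sketched in a few lines. This is an illustration in Python, not the Nim implementation in this PR; `parse_urlencoded` is a hypothetical helper name, and the sketch follows only the main steps of the WhatWG urlencoded parser (skip empty segments, split on the first =, turn + into a space, percent-decode).

```python
from urllib.parse import unquote_plus


def parse_urlencoded(query):
    """Hypothetical helper: lenient parsing in the spirit of the WhatWG rules."""
    pairs = []
    for segment in query.split("&"):
        if not segment:
            continue  # empty segments ("a=1&&b=2") are skipped, not errors
        # partition() splits on the first '=' only; without '=' the value is "".
        name, _, value = segment.partition("=")
        # unquote_plus turns '+' into a space and decodes %XX escapes.
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


print(parse_urlencoded("key=value&key2=x=1"))
```

Under these rules no input raises an error: a second = simply ends up in the value, and a segment without = yields an empty value.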

mildred · 2021-01-11T16:18:39Z

Updated the code with the changes @timotheecour requested, and added a compilation flag to restore the old behaviour, as suggested. I too prefer the new behaviour by default: it is unlikely to break existing apps, and I believe it is better. It also conforms to the standards and to the general principle of network programming: be liberal in what you accept.

On the contrary, web apps written currently in Nim are probably not handling this specific case of query string parsing, and the new behavior can avoid an error that could in some cases provoke a denial of service.

timotheecour · 2021-01-11T17:55:48Z

Rebase error? Did you rebase against master instead of devel or something?

mildred · 2021-01-11T19:44:55Z

Yep, sorry, updated

timotheecour · 2021-01-11T19:58:00Z

lib/pure/uri.nim

+      else:
+        if c == sep: break
+        else: add(field, data[result])


Suggested change

-      else:
-        if c == sep: break
-        else: add(field, data[result])
+      elif c == sep: break
+      else: add(field, data[result])

This is the else of the case statement, I'm not convinced that elif is a valid option for it.

it's valid, and it works, i verified

lib/pure/uri.nim

timotheecour

LGTM after addressing comments + fixing test failures in tests/stdlib/turi.nim

dom96

Hmm. Looking into this a bit more, the specific example we are getting to work here isn't to do with strict parsing, this is in fact fixing a bug in the current parser. Python will parse it fine even with strict parsing enabled:

>>> import urllib.parse
>>> urllib.parse.parse_qs('foo=1&bar=2=3')
{'foo': ['1'], 'bar': ['2=3']}
>>> urllib.parse.parse_qs('foo=1&bar=2=3', strict_parsing=True)
{'foo': ['1'], 'bar': ['2=3']}

But then why are you changing it to "never raise an error"? Just fix that bug and don't get rid of all errors please. That way this isn't a breaking change.

timotheecour · 2021-01-11T22:32:36Z

don't get rid of all errors please

which other errors? can you show an input where you expect an error?

mildred · 2021-01-11T22:44:30Z

@dom96: I don't understand what you are suggesting... Should we (A) fix the parser to accept the = character and never raise an error, without bothering with a compatibility define flag, or (B) keep the current behaviour and do nothing, telling library users to fix their software to avoid the = character in the value field?

If we want to be compliant with the HTML5 spec, we need to change the behaviour, and this breaks library users who expect an exception in this special case.

If we want to keep raising an exception, we can't be compliant with the HTML spec.

Note: the = character we talk about is the only case when an error is raised. There is no other case where an error is raised as far as I know reading the code.

mildred · 2021-01-11T23:00:39Z

Updated the code.

I'm not convinced that the CI errors come from my code; the errors seem to come from fidget and PackageFileParsed (whatever those are), and I don't really see the problem looking at the CI logs...

dom96 · 2021-01-11T23:01:12Z

Looking at the Python implementation it appears this is a case where an exception will be raised with strict parsing on:

>>> urllib.parse.parse_qs('&q', strict_parsing=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/urllib/parse.py", line 676, in parse_qs
    max_num_fields=max_num_fields)
  File "/usr/lib/python3.6/urllib/parse.py", line 729, in parse_qsl
    raise ValueError("bad query field: %r" % (name_value,))
ValueError: bad query field: ''
>>> urllib.parse.parse_qs('&q', strict_parsing=False)
{}

And you get an exception with decodeQuery for this too.
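
As a minimal sketch of the distinction, the two inputs discussed so far can be compared under Python's strict mode. `is_strictly_valid` is a hypothetical helper written for this illustration; only `urllib.parse.parse_qs` and its `strict_parsing` parameter come from the standard library.

```python
from urllib.parse import parse_qs


def is_strictly_valid(query):
    """Hypothetical helper: True if parse_qs accepts the query in strict mode."""
    try:
        parse_qs(query, strict_parsing=True)
        return True
    except ValueError:
        return False


print(is_strictly_valid("foo=1&bar=2=3"))  # extra '=' inside a value
print(is_strictly_valid("&q"))             # empty field, and a field without '='
```

An extra = inside a value passes strict mode, while an empty field or a field without = does not, so the two questions (accepting = in values, and offering a strict mode) are independent.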

Another thing to keep in mind here is that cgi makes use of this code, and explicitly checks for exceptions, see the diff that moved this iterator: 6895040. Are we sure this isn't breaking CGI?

@xflywind any thoughts?

mildred · 2021-01-11T23:08:49Z

@dom96 You're right, cgi makes use of the same code (and I originally wanted to fix the cgi.decodeData iterator). I am removing the error check in the code as it is no longer necessary, and updating the changelog.

As for strict checking, I believe this is not the same thing. Is there a use case for strict checking? Not all languages advertise such a feature. In any case I believe this would belong to another PR and should be clearly defined before any implementation is attempted.

dom96 · 2021-01-11T23:24:00Z

Okay, so yes you can make these changes. This code is actually brand new and hasn't been released yet in a Nim version (it was only added 16 days ago), so you don't need to add it to the changelog, nor do you need those nimLegacyParseQueryStrict defines.

The only concern I have is with any breakage in cgi, I can see that a test was added to check that the exception is raised correctly in 6895040.

As for strict_parsing, yes we can add it separately. We should really just copy the Python implementation IMO (it includes other things like support for ; as separators) but that can come in follow up PRs.

mildred · 2021-01-11T23:35:25Z

Well, actually the code in uri right now comes from cgi, it was just moved over. Perhaps the compatibility option should be renamed to refer to cgi instead. I'm not entirely sure it is needed though.

Also, I'm not sure that there is a need for strict parsing, as this does not seem to be a well-defined behavior. There is no common implementation of such a strict mode across all languages, and I'm not sure copying the Python behavior is of any use on its own. It looks like unneeded complexity to me.

timotheecour · 2021-01-11T23:46:21Z

changelog.md

@@ -96,6 +96,13 @@
 with other backends. see #9125. Use `-d:nimLegacyJsRound` for previous behavior.
 - Added `socketstream` module that wraps sockets in the stream interface

+- Changed the behavior of `uri.decodeQuery` when there are unencoded `=`


I'm not convinced that the CI errors come from my code; the errors seem to come from fidget and PackageFileParsed (whatever those are), and I don't really see the problem looking at the CI logs...

https://dev.azure.com/nim-lang/255dfe86-e590-40bb-a8a2-3c0295ebdeb1/_apis/build/builds/12009/logs/84

2021-01-11T23:28:40.1530530Z FAIL: tests/stdlib/turi.nim c
...
2021-01-11T23:28:40.1541230Z ../../lib/system/fatal.nim(53, 5) Error: unhandled exception: expected raising 'UriParseError', instead nothing was raised by:
2021-01-11T23:28:40.1542060Z discard toSeq(decodeQuery("a=1&b=2c=6")) [AssertionDefect]

it comes from this PR, the fix is trivial:
in tests/stdlib/turi.nim, adapt:

doAssertRaises(UriParseError): discard toSeq(decodeQuery("a=1&b=2c=6"))

accordingly

Thank you, I was looking at the other CI runs where this did not appear, my bad.

no problem; please click "resolve conversation" for any resolved comment (unless it contains pushback)

timotheecour · 2021-01-11T23:53:00Z

Also, I'm not sure that there is a need for strict parsing, as this does not seem to be a well-defined behavior.

which of the languages you mentioned in #16647 (comment) define some notion of strict parsing? I'm fine without having such a "strict parsing" option, but if lots of languages have such a notion, then maybe we can reconsider (in future work)

mildred · 2021-01-12T09:56:30Z

which of the languages you mentioned in #16647 (comment) define some notion of strict parsing?

Of those languages, only Python seems to have strict parsing. Ruby does not, and neither does Go. The same goes for the extra options, which are only available in Python.

In case of a malformed query string where there is `=` in the value, handle this character as part of the value instead of throwing an error.

The following query string should no longer crash a program:

key=value&key2=x=1

It will be interpreted as [("key", "value"), ("key2", "x=1")]

This is correct according to the latest WhatWG HTML5 specification regarding the urlencoded parser: https://url.spec.whatwg.org/#concept-urlencoded-parser

Older behavior can be restored using the -d:nimLegacyParseQueryStrict flag.

timotheecour · 2021-01-12T10:03:17Z

Of those languages, only Python seems to have strict parsing. Ruby does not, and neither does Go. The same goes for the extra options, which are only available in Python.

ok, that's convincing. No need to support a notion of strict parsing then.

Araq · 2021-01-12T12:41:51Z

Looks really good this way. Follow-up PRs can address potential shortcomings that I've missed.

ringabout · 2022-01-02T16:32:31Z

lib/pure/cgi.nim

  ## Reads and decodes CGI data and yields the (name, value) pairs the
  ## data consists of.


decodeData should document nimLegacyParseQueryStrict option too.

And even with -d:nimLegacyParseQueryStrict, it never raises CgiError now. It actually raises UriParseError.

ref #19308 (comment)

timotheecour reviewed Jan 9, 2021

dom96 requested changes Jan 9, 2021

mildred force-pushed the fix-uri-parse-query branch from 0382591 to 28b1e9c Compare January 11, 2021 16:19

mildred force-pushed the fix-uri-parse-query branch from 28b1e9c to a296b4f Compare January 11, 2021 19:44