Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve uri.parseQuery to never raise an error #16647

Merged
merged 1 commit into from
Jan 12, 2021

Conversation

mildred
Copy link
Contributor

@mildred mildred commented Jan 9, 2021

In case of malformed query string where there is = on the value, handle
this character as part of the value instead of throwing an error.

The following query string should no longer crash a program:

key=value&key2=x=1

It will be interpreted as:

[("key", "value"), ("key2", "x=1")]

@timotheecour
Copy link
Member

probably correct but please give some links (eg spec, stackoverflow etc) in your PR description to double check;

tests need fixing

lib/pure/uri.nim Outdated Show resolved Hide resolved
Copy link
Contributor

@dom96 dom96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. But also whether to raise the error should be customisable.

Please introduce a strict_parsing flag like in Python's urllib: https://docs.python.org/3/library/urllib.parse.html#urllib.parse.parse_qs (it should be true by default to keep backwards compatibility)

@timotheecour
Copy link
Member

This is a breaking change

which code would break?
if you're talking about doAssertRaises(parseQuery("key=value&key2=x=1")) I wouldn't categorize it as a breaking change, it'd be something used only in some test suite, not actual code

@dom96
Copy link
Contributor

dom96 commented Jan 9, 2021

@timotheecour As an example, there may be code out there which relies on being able to check for validity of the query string and explicitly checks for a ValueError.

Are there any downsides to keeping backwards compatibility here?

@timotheecour
Copy link
Member

timotheecour commented Jan 9, 2021

if the spec (or most libraries) allows =, then I'd much prefer doing a -d:nimLegacyParseQueryStrict and being correct by default than legacy behavior by default. The point is to avoid accumulating non-standard behavior over time.

@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

I found some references for parsing query strings (and checked them):

@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

Updated code with @timotheecour requested changes. Add compilation flag to restore old behaviour suggested. I too prefer the new behaviour by default, it is unlikely to cause breaks in existing apps and the new behaviour is better I believe. it is also conforming to standards and to the general principle of network programming : be liberal in what you accept.

On the contrary, web apps written currently in Nim are probably not handling this specific case of query string parsing, and the new behavior can avoid an error that could in some cases provoke a denial of service.

@timotheecour
Copy link
Member

Rebase error? Did you rebase against master instead of devel or something?

@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

Yep, sorry, updated

Comment on lines +180 to +183
else:
if c == sep: break
else: add(field, data[result])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else:
if c == sep: break
else: add(field, data[result])
elif c == sep: break
else: add(field, data[result])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the else of the case statement, I'm not convinced that elif is a valid option for it.

Copy link
Member

@timotheecour timotheecour Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's valid, and it works, i verified

lib/pure/uri.nim Outdated Show resolved Hide resolved
lib/pure/uri.nim Outdated Show resolved Hide resolved
lib/pure/uri.nim Outdated Show resolved Hide resolved
Copy link
Member

@timotheecour timotheecour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after addressing comments + fixing test failures in tests/stdlib/turi.nim

Copy link
Contributor

@dom96 dom96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Looking into this a bit more, the specific example we are getting to work here isn't to do with strict parsing, this is in fact fixing a bug in the current parser. Python will parse it fine even with strict parsing enabled:

>>> import urllib.parse
>>> urllib.parse.parse_qs('foo=1&bar=2=3')
{'foo': ['1'], 'bar': ['2=3']}
>>> urllib.parse.parse_qs('foo=1&bar=2=3', strict_parsing=True)
{'foo': ['1'], 'bar': ['2=3']}

But then why are you changing it to "never raise an error"? Just fix that bug and don't get rid of all errors please. That way this isn't a breaking change.

@timotheecour
Copy link
Member

don't get rid of all errors please

which other errors? can you show an input where you expect an error?

@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

@dom96: I don't understand what you are suggesting... Should we (A) fix the parser to accept the = character and never raise an error, not bothering with compatibility define flag or (B) keep the current behaviour and not do anything, suggesting library users to fix their software to avoid the = character in the value field.

If we want to be compliant with HTML5 spec, we need to change the behaviour and this is breaking to library users that expects an exception in this special case.

If we want to keep raising an exception, we can't be compliant to the HTML spec.

Note: the = character we talk about is the only case when an error is raised. There is no other case where an error is raised as far as I know reading the code.

@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

Updated the code.

I'm not convinced that the CI errors comes from my code, errors seems to come from fidget and PackageFileParsed (whatever those are) and I don't really see the problem looking at the CI logs...

lib/pure/uri.nim Outdated Show resolved Hide resolved
@dom96
Copy link
Contributor

dom96 commented Jan 11, 2021

Looking at the Python implementation it appears this is a case where an exception will be raised with strict parsing on:

>>> urllib.parse.parse_qs('&q', strict_parsing=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/urllib/parse.py", line 676, in parse_qs
    max_num_fields=max_num_fields)
  File "/usr/lib/python3.6/urllib/parse.py", line 729, in parse_qsl
    raise ValueError("bad query field: %r" % (name_value,))
ValueError: bad query field: ''
>>> urllib.parse.parse_qs('&q', strict_parsing=False)
{}

And you get an exception with decodeQuery for this too.

Another thing to keep in mind here is that cgi makes use of this code, and explicitly checks for exceptions, see the diff that moved this iterator: 6895040. Are we sure this isn't breaking CGI?

@xflywind any thoughts?

lib/pure/uri.nim Outdated Show resolved Hide resolved
@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

@dom96 You're right, cgi makes use of the same code (and I wanted originally to fix the cgi.decodeData iterator. I am removing the error check in the code as this is no longer necessary and updating the changelog.

As for strict checking, I believe this is not the same thing. Is there a use case for struct checking? Nor all languages advertise such a feature. In any case I believe this would belong to another PR and should be clearly defined before any implementation is attempted.

@mildred mildred force-pushed the fix-uri-parse-query branch 2 times, most recently from 55a5deb to 4cabada Compare January 11, 2021 23:09
@dom96
Copy link
Contributor

dom96 commented Jan 11, 2021

Okay, so yes you can make these changes. This code is actually brand new and hasn't been released yet in a Nim version (it was only added 16 days ago), so you don't need to add it to the changelog, nor do you need those nimLegacyParseQueryStrict defines.

The only concern I have is with any breakage in cgi, I can see that a test was added to check that the exception is raised correctly in 6895040.

As for strict_parsing, yes we can add it separately. We should really just copy the Python implementation IMO (it includes other things like support for ; as separators) but that can come in follow up PRs.

@mildred
Copy link
Contributor Author

mildred commented Jan 11, 2021

Well, actually the code in uri right now comes from cgi, it was just moved over. Perhaps the compatibility option should be renamed to refer to cgi instead. I'm not entirely sure it is needed though.

Also, I'm not sure that there is a need for a strict parsing as this does not seems to be a well defined behavior. There is no common implementation of such a strict mode across all languages and I'm not sure copying the Python behavior is of any use on its own. It looks like unneeded complexity for me.

@@ -96,6 +96,13 @@
with other backends. see #9125. Use `-d:nimLegacyJsRound` for previous behavior.
- Added `socketstream` module that wraps sockets in the stream interface

- Changed the behavior of `uri.decodeQuery` when there are unencoded `=`
Copy link
Member

@timotheecour timotheecour Jan 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that the CI errors comes from my code, errors seems to come from fidget and PackageFileParsed (whatever those are) and I don't really see the problem looking at the CI logs...

https://dev.azure.com/nim-lang/255dfe86-e590-40bb-a8a2-3c0295ebdeb1/_apis/build/builds/12009/logs/84

2021-01-11T23:28:40.1530530Z �[1m�[31mFAIL: �[36mtests/stdlib/turi.nim c�[0m
...
2021-01-11T23:28:40.1541230Z ../../lib/system/fatal.nim(53, 5) Error: unhandled exception: expected raising 'UriParseError', instead nothing was raised by:
2021-01-11T23:28:40.1542060Z discard toSeq(decodeQuery("a=1&b=2c=6")) [AssertionDefect]

it comes from this PR, the fix is trivial:
in tests/stdlib/turi.nim, adapt:

    doAssertRaises(UriParseError):
      discard toSeq(decodeQuery("a=1&b=2c=6"))

accordingly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I was looking at the other CI runs where this did not appear, my bad.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem; please click "resolve conversation" for any resolved comment (unless it contains pushback)

@timotheecour
Copy link
Member

Also, I'm not sure that there is a need for a strict parsing as this does not seems to be a well defined behavior.

which of the languages you mentioned in #16647 (comment) define some notion of strict parsing? I'm fine without having such a "strict parsing" option, but if lots of languages have such a notion, then maybe we can reconsider (in future work)

@mildred
Copy link
Contributor Author

mildred commented Jan 12, 2021

#16647 (comment) which of the languages you mentioned in #16647 (comment) define some notion of strict parsing?

Of those languages, only Python seems to have strict parsing. Ruby does not neither does Go. Same for the extra options which are only available in Python.

In case of malformed query string where there is `=` on the value, handle
this character as part of the value instead of throwing an error.

The following query string should no longer crash a program:

    key=value&key2=x=1

It will be interpreted as [("key", "value"), ("key2", "x=1")]

This is correct according to latest WhatWG's HTML5 specification
recarding the urlencoded parser:
https://url.spec.whatwg.org/#concept-urlencoded-parser

Older behavior can be restored using the -d:nimLegacyParseQueryStrict
flag.
@timotheecour
Copy link
Member

Of those languages, only Python seems to have strict parsing. Ruby does not neither does Go. Same for the extra options which are only available in Python.

ok, that's convincing. No need to support a notion of strict parsing then.

@Araq
Copy link
Member

Araq commented Jan 12, 2021

Looks really good this way. Follup PRs can address potential shortcomings that I've missed.

@Araq Araq merged commit 71db2be into nim-lang:devel Jan 12, 2021
ardek66 pushed a commit to ardek66/Nim that referenced this pull request Mar 26, 2021
In case of malformed query string where there is `=` on the value, handle
this character as part of the value instead of throwing an error.

The following query string should no longer crash a program:

    key=value&key2=x=1

It will be interpreted as [("key", "value"), ("key2", "x=1")]

This is correct according to latest WhatWG's HTML5 specification
recarding the urlencoded parser:
https://url.spec.whatwg.org/#concept-urlencoded-parser

Older behavior can be restored using the -d:nimLegacyParseQueryStrict
flag.
Comment on lines 85 to 86
## Reads and decodes CGI data and yields the (name, value) pairs the
## data consists of.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

decodeData should document nimLegacyParseQueryStrict option too.

And even with -d:nimLegacyParseQueryStrict, it never raises CgiError now. It actually raises UriParseError.

ref #19308 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants