Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

http silently corrupts the request URL when it contains non-Latin-1 codepoints #13296

Closed
Flimm opened this issue May 30, 2017 · 11 comments
Closed
Labels
http Issues or PRs related to the http subsystem.

Comments

@Flimm
Copy link
Contributor

Flimm commented May 30, 2017

I experience the bug on Node commit 399cb25

Darwin Tephs-Mac-Pro.local 16.6.0 Darwin Kernel Version 16.6.0: Fri Apr 14 16:21:16 PDT 2017; root:xnu-3789.60.24~6/RELEASE_X86_64 x86_64 .

Subsystem: http

Issue description

When making a request using http.get with the path set to /café🐶, the server receives /café=6. This is not the URL that was sent, it's not even the precent-encoded version of the URL (with encodeURI), which would be /caf%C3%A9%F0%9F%90%B6. I expected the URL to be passed along without data corruption.

I've created a pull request which contains a test case that currently fails, illustrating this bug. #13297

Flimm added a commit to Flimm/node that referenced this issue May 30, 2017
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue nodejs#13296.
@jasnell
Copy link
Member

jasnell commented May 30, 2017

For sake of performance the impl does not transcode or escape the input into valid ASCII. It assumes the user has done the necessary escaping. That is unlikely to change but can be documented.

@Flimm
Copy link
Contributor Author

Flimm commented May 30, 2017

@jasnell That's great, but that's not what this issue is about. I'm not asking in this issue to transcode or escape the input into valid ASCII. All I'm asking is that silent data corruption not happen. There are solutions for that, one of them is to pass the data without corrupting it. Another would be to throw an exception when Unicode strings are passed in of course. I'll leave it up to core contributors to decide on the best solution to fix the bug that data is getting corrupted silently.

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8, I assume) to the server. No data loss or data corruption happens, even though that URL is not ASCII. It could be that the server can't handle it, but at least no silent data loss or corruption has happened.

So when is the silent data corruption happening?

When /café🐶 is passed as a URL, it gets sent as /café=6, instead of /café🐶. Why is it corrupting the data? This data corruption is helping nobody. No server is going to read =6 and understand that that is meant to be the dog emoji.

(It's worth noting that http.get will check the passed URL for whitespace and throw an exception in the case that it finds a space, despite performance costs.)

@mscdex mscdex added the http Issues or PRs related to the http subsystem. label May 30, 2017
@seishun
Copy link
Contributor

seishun commented May 31, 2017

I agree, this behavior is unnecessarily unintuitive. It should either:

  1. silently percent-encode non-ASCII characters, or
  2. encode as UTF-8, or
  3. throw on non-Latin-1 characters instead of silently breaking them

Options 1 or 2 could be breaking if someone is deliberately using "binary" encoding to send UTF-8 encoded paths, e.g. path: Buffer.from('/wiki/サクラ').toString('binary'). But if anyone is actually doing that, we should just let them specify the path as a Buffer instead of resorting to such hacks.

They could also be breaking if someone actually wants Latin-1 encoded paths (or the server they are talking to is smart enough to recognize it's not UTF-8) and they just happen to use only the Latin-1 range of characters. So path: '/wiki/café' just works by luck. But that scenario seems even less likely than the above.

Here's my proposal to fix this mess:

  1. Allow specifying paths as Buffers.
  2. Runtime-deprecate non-ASCII Latin-1 characters in string paths (a warning notifying that the characters are encoded as Latin-1 and suggesting using Buffer instead).
  3. Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

In a later major version of Node.js, we could consider one of the following:

  1. Encode paths as UTF-8 instead of Latin-1.
  2. Percent-encode non-ASCII characters in paths.
  3. Disallow non-ASCII completely in paths completely.
  4. Leave it as-is.

cc @nodejs/collaborators feel free to criticize.

@bnoordhuis
Copy link
Member

@seishun See #3062 (comment). Allow me to paraphrase myself:

The set of characters to reject depends on the encoding used for the request headers. That in turn is influenced by the encoding of the request body because node.js tries hard to pack the headers and the body into a single outgoing packet.

An example: U+010A ('Ċ') is fine with encoding="utf8"; it decodes to bytes C4 8A. The same codepoint should be rejected with encoding="binary" (or "latin1") because it decodes to byte 0A, a newline.

IOW, the following statement:

Throw on non-ASCII non-Latin-1 characters in string paths (they are broken anyway).

Is not true (or only conditionally true.) As well, UTF-8 URLs are used widely enough that rejecting them outright is probably not going to fly.

I think the first order of business is to untangle the conflation of headers and body somehow. Unfortunately, the naive approach is riddled with performance pitfalls and some backwards compatibility concerns.

@seishun
Copy link
Contributor

seishun commented May 31, 2017

@bnoordhuis
I see, so there are two issues here:

  1. Headers are encoded using the same encoding as the first data chunk.
  2. In the absence of any data chunks, headers are encoded using latin-1.

I think both of these could be fixed without touching the conflation of headers and body.

The set of characters to reject depends on the encoding used for the request headers.

So it boils down to two questions:

  1. Do we want to allow setting the encoding for the headers?
  2. Which encoding do we use/default to for the headers?

I think most would agree that defaulting to/using the encoding of the first data chunk or 'latin1' is broken.

@jasnell
Copy link
Member

jasnell commented May 31, 2017

Indeed. Encoding of the headers should have absolutely nothing to do with the encoding of the payload.

@seishun
Copy link
Contributor

seishun commented Jun 1, 2017

Great, if we agree on that, then the next question is which assumptions we can make.

  • If we assume that (almost) no one relies on the headers being encoded as Latin-1 (i.e. when there is no payload), then we can just always encode the headers as UTF-8 in the next major version.

  • If we assume that (almost) no one relies on the headers being encoded as UTF-8 (i.e. when the first payload chunk is UTF-8), then see my proposal in http silently corrupts the request URL when it contains non-Latin-1 codepoints #13296 (comment).

  • If we can't assume either of the above, then, as @bnoordhuis pointed out, we can't throw on non-Latin-1 codepoints (or we can do that only if it's going to be encoded as Latin-1, which seems pretty complicated). But if we agree that the header encoding should be independent of the payload, then we should introduce a warning for non-ASCII codepoints about the planned change in behavior, and allow specifying the path as a Buffer as an unambiguous alternative.

@seishun seishun changed the title http silently corrupts the request URL when it contains Unicode http silently corrupts the request URL when it contains non-Latin-1 codepoints Jun 1, 2017
jasnell pushed a commit that referenced this issue Jun 2, 2017
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.

PR-URL: #13297
Reviewed-By: James M Snell <[email protected]>
@seishun
Copy link
Contributor

seishun commented Jun 2, 2017

@Flimm

When /café is passed as a URL, it gets sent as /café (encoded in UTF-8) to the server.

If you aren't sending any payload, then it should be encoded in Latin-1. Could you re-check?

jasnell pushed a commit that referenced this issue Jun 5, 2017
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.

PR-URL: #13297
Reviewed-By: James M Snell <[email protected]>
@Flimm
Copy link
Contributor Author

Flimm commented Jun 13, 2017

@seishun Sorry, that was just an assumption on my part. I've edited my previous comment to add "I assume".

MylesBorins pushed a commit that referenced this issue Jul 17, 2017
This test currently fails. It illustrates that Unicode in the URL does
not arrive intact to the server, there is silent data corruption along
the way at some point.

This test is for the issue #13296.

PR-URL: #13297
Reviewed-By: James M Snell <[email protected]>
ryanmahan added a commit to ryanmahan/node that referenced this issue Jan 19, 2018
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

Refs: nodejs#13296
Trott pushed a commit to Trott/io.js that referenced this issue Jan 23, 2018
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: nodejs#18259
Refs: nodejs#13296
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Shingo Inoue <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
evanlucas pushed a commit that referenced this issue Jan 30, 2018
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Shingo Inoue <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Feb 27, 2018
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Shingo Inoue <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
MylesBorins pushed a commit that referenced this issue Feb 27, 2018
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: #18259
Refs: #13296
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Shingo Inoue <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
@Trott
Copy link
Member

Trott commented Apr 30, 2018

@nodejs/http Thoughts on what to do with this? Doc update? Code change? Neither? Something else?

@TimothyGu
Copy link
Member

This has more or less been fixed with #20270, which will be out in v11.x.

MayaLekova pushed a commit to MayaLekova/node that referenced this issue May 8, 2018
assert.strictEqual message argument removed to replace
with default assert message to show the expected vs
actual values

PR-URL: nodejs#18259
Refs: nodejs#13296
Reviewed-By: Luigi Pinca <[email protected]>
Reviewed-By: Shingo Inoue <[email protected]>
Reviewed-By: Jon Moss <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
http Issues or PRs related to the http subsystem.
Projects
None yet
Development

No branches or pull requests

7 participants