Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: do not ignore IDNA conversion error #11549

Closed
wants to merge 3 commits into from
Closed

Conversation

TimothyGu
Copy link
Member

Currently, the ICU-based IDNA conversion methods only return errors on those passed along through a UErrorCode. However, according to ICU's documentation for uidna_nameToASCII(),

If any processing step fails, then pInfo->errors will be non-zero and the result might not be an ASCII string. The domain name might be modified according to the types of errors. Labels with severe errors will be left in (or turned into) their Unicode form.

The UErrorCode indicates an error only in exceptional cases, such as a U_MEMORY_ALLOCATION_ERROR.

In other words, when non-catastrophically invalid domains are passed, ToASCII() and ToUnicode() (and their downstream url.domainToASCII() and url.domainToUnicode()) currently return garbled domain names instead of errors.

This PR makes the C++ binding methods report errors in pInfo->errors in addition to UErrorCode, thereby fixing those aforementioned problems.

Also included in this PR are additional tests for invalid situations as well as documentation clarifications for the user-facing url.domainToASCII() and url.domainToUnicode().

Before vs. after
> url.domainToASCII('\ufffd.com')
'�.com'
> url.domainToUnicode('xn---\x03.com')
'xn---\u0003.com'
> process.binding('icu').toASCII('\ufffd.com')
'�.com'
> process.binding('icu').toUnicode('xn---\x03.com')
'xn---\u0003.com'
> process.binding('icu').toUnicode('xn--- .com')
'xn--- .com'
> url.domainToASCII('\ufffd.com')
''
> url.domainToUnicode('xn---\x03.com')
''
> process.binding('icu').toASCII('\ufffd.com')
Error: Cannot convert name to ASCII
    at repl:1:24
> process.binding('icu').toUnicode('xn---\x03.com')
Error: Cannot convert name to Unicode
    at repl:1:24
> process.binding('icu').toUnicode('xn--- .com')
Error: Cannot convert name to Unicode
    at repl:1:24
Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines
Affected core subsystem(s)

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. labels Feb 25, 2017
@TimothyGu TimothyGu added the whatwg-url Issues and PRs related to the WHATWG URL implementation. label Feb 25, 2017
@TimothyGu
Copy link
Member Author

@TimothyGu
Copy link
Member Author

TimothyGu commented Feb 25, 2017

Hopefully the issue with legacy url parser is fixed.

/cc @nodejs/intl @nodejs/url

New CI: https://ci.nodejs.org/job/node-test-pull-request/6586/

@TimothyGu TimothyGu added the url Issues and PRs related to the legacy built-in url module. label Feb 25, 2017
@TimothyGu TimothyGu changed the title src: bail on IDNA conversion error src: do not ignore IDNA conversion error Feb 25, 2017
@@ -1007,7 +1008,8 @@ the new `URL` implementation but is not part of the WHATWG URL standard.
* `domain` {String}
* Returns: {String}

Returns the Unicode serialization of the `domain`.
Returns the Unicode serialization of the `domain`. If `domain` is an invalid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, should this be deserialization, and mention that it is the inverse of domainToASCII?

Copy link
Member Author

@TimothyGu TimothyGu Feb 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is serialization, since the domain is fully parsed and subsequently serialized from the parsed form. It's just that it uses a different algorithm for deserialization.

src/node_i18n.cc Outdated
@@ -489,8 +492,11 @@ static void ToUnicode(const FunctionCallbackInfo<Value>& args) {
CHECK_GE(args.Length(), 1);
CHECK(args[0]->IsString());
Utf8Value val(env->isolate(), args[0]);
// optional arg
bool lenient = args[1].As<Boolean>()->Value();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the args.Length() check above to use 2? Also, you probably want to add a CHECK(args[1]->IsBoolean()); or do args[1]->BooleanValue() instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't update the check for argument length, since (as the comment is trying to say) it is an optional argument, so that existing usage of toUnicode(str) would still work. V8 automatically returns an Undefined for out-of-range args[] dereference.

Wasn't aware of BooleanValue(). Will use that instead.

src/node_i18n.cc Outdated
@@ -508,8 +514,11 @@ static void ToASCII(const FunctionCallbackInfo<Value>& args) {
CHECK_GE(args.Length(), 1);
CHECK(args[0]->IsString());
Utf8Value val(env->isolate(), args[0]);
// optional arg
bool lenient = args[1].As<Boolean>()->Value();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(ditto)

MaybeStackBuffer<char> buf;
int32_t len = ToASCII(&buf, *val, val.length());
int32_t len = ToASCII(&buf, *val, val.length(), lenient);

if (len < 0) {
return env->ThrowError("Cannot convert name to ASCII");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error part of any non-experimental API? Could we change it to Cannot encode name to ASCII as Punycode?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes for toASCII

> url.parse(`http://${'é'.repeat(230)}.com/`)
Error: Cannot convert name to ASCII

MaybeStackBuffer<char> buf;
int32_t len = ToUnicode(&buf, *val, val.length());
int32_t len = ToUnicode(&buf, *val, val.length(), lenient);

if (len < 0) {
return env->ThrowError("Cannot convert name to Unicode");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error part of any non-experimental API? Could we change it to Cannot decode name as Punycode? (basically the same question I also posted below).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No; in fact the toUnicode JS function isn't used in the code base at all. Maybe we should just remove this method?

/cc @jasnell

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not used, it can be removed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove which function specifically? The `i18n::ToUnicode' function is definitely used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jasnell, the exposed process.binding('icu').toUnicode() JS function.

src/node_i18n.cc Outdated
@@ -493,7 +493,7 @@ static void ToUnicode(const FunctionCallbackInfo<Value>& args) {
CHECK(args[0]->IsString());
Utf8Value val(env->isolate(), args[0]);
// optional arg
bool lenient = args[1].As<Boolean>()->Value();
bool lenient = args[1]->BooleanValue().FromJust();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this compile? Seems like the env->context() argument is missing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addaleax, you are right. Forgot to push fde77b3

MaybeStackBuffer<char> buf;
int32_t len = ToUnicode(&buf, *val, val.length());
int32_t len = ToUnicode(&buf, *val, val.length(), lenient);

if (len < 0) {
return env->ThrowError("Cannot convert name to Unicode");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's not used, it can be removed.

Copy link
Member

@joyeecheung joyeecheung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also fix the missing errors when parsing percent-encoded disallowed characters in hosts(https://github.com/nodejs/node/blob/master/test/fixtures/url-tests.js#L4499) since we are no longer ignoring UIDNA_ERROR_DISALLOWED, you can turn them on in this PR if you like.

@jasnell
Copy link
Member

jasnell commented Feb 27, 2017

@TimothyGu
Copy link
Member Author

@jasnell, did you see #11549 (comment)?

Old behavior can be restored using a special `lenient` mode.
- Split the tests out to a separate file
- Add invalid cases
- Add tests for url.domainTo*()
- Re-enable previously broken WPT URL parsing tests
@TimothyGu
Copy link
Member Author

Test re-enabled per @joyeecheung. Will land tomorrow.

CI: https://ci.nodejs.org/job/node-test-pull-request/6619/

@TimothyGu
Copy link
Member Author

TimothyGu commented Mar 1, 2017

Landed in a520508...7ceea2a. toUnicode() isn't removed for now, since it provides equivalence to the punycode module, and though unused in the code base it is well-tested.

TimothyGu added a commit that referenced this pull request Mar 1, 2017
Old behavior can be restored using a special `lenient` mode, as used in
the legacy URL parser.

PR-URL: #11549
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
- Split the tests out to a separate file
- Add invalid cases
- Add tests for url.domainTo*()
- Re-enable previously broken WPT URL parsing tests

PR-URL: #11549
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
TimothyGu added a commit that referenced this pull request Mar 1, 2017
PR-URL: #11549
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
@TimothyGu TimothyGu closed this Mar 1, 2017
@TimothyGu TimothyGu deleted the idna branch March 1, 2017 02:33
addaleax pushed a commit that referenced this pull request Mar 5, 2017
Old behavior can be restored using a special `lenient` mode, as used in
the legacy URL parser.

PR-URL: #11549
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
addaleax pushed a commit that referenced this pull request Mar 5, 2017
- Split the tests out to a separate file
- Add invalid cases
- Add tests for url.domainTo*()
- Re-enable previously broken WPT URL parsing tests

PR-URL: #11549
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
addaleax pushed a commit that referenced this pull request Mar 5, 2017
PR-URL: #11549
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
@evanlucas evanlucas mentioned this pull request Mar 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. url Issues and PRs related to the legacy built-in url module. whatwg-url Issues and PRs related to the WHATWG URL implementation.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants