Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: use icu's punycode implementation #7355

Closed
wants to merge 2 commits into from

Conversation

jasnell
Copy link
Member

@jasnell jasnell commented Jun 21, 2016

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • the commit message follows commit guidelines
Affected core subsystem(s)

net, intl

Description of change

ICU has a punycode implementation built in. Use it instead of the javascript implementation because it's much faster.

The punycode module is currently only used in one location in core (in the url module). With this change if ICU is present, the ICU implementation would be used with the punycode impl becoming the fallback. Eventually, it would be great to see if we could deprecate the punycode module entirely, with an eye towards eventually removing it.

This includes an update to the small-icu dataset to support the IDNA implementation. The additional data adds < 1 mb of data.

Benchmarks (bigger numbers are better):

net/punycode.js method="punycode" n="1024" val="افغانستا.icom.museum": 27593.19811
net/punycode.js method="punycode" n="1024" val="الجزائر.icom.museum": 30275.51786
net/punycode.js method="punycode" n="1024" val="österreich.icom.museum": 34711.86323
net/punycode.js method="punycode" n="1024" val="বাংলাদেশ.icom.museum": 26096.37590
net/punycode.js method="punycode" n="1024" val="беларусь.icom.museum": 26819.07514
net/punycode.js method="punycode" n="1024" val="belgië.icom.museum": 39787.58249
net/punycode.js method="punycode" n="1024" val="българия.icom.museum": 26728.08628
net/punycode.js method="punycode" n="1024" val="تشادر.icom.museum": 33340.09143
net/punycode.js method="punycode" n="1024" val="中国.icom.museum": 40001.78289
net/punycode.js method="punycode" n="1024" val="القمر.icom.museum": 34623.28045
net/punycode.js method="punycode" n="1024" val="κυπρος.icom.museum": 32423.57551
net/punycode.js method="punycode" n="1024" val="českárepublika.icom.museum": 30496.93111
net/punycode.js method="punycode" n="1024" val="مصر.icom.museum": 38439.50043
net/punycode.js method="punycode" n="1024" val="ελλάδα.icom.museum": 32494.77561
net/punycode.js method="punycode" n="1024" val="magyarország.icom.museum": 35283.22541
net/punycode.js method="punycode" n="1024" val="ísland.icom.museum": 38816.50008
net/punycode.js method="punycode" n="1024" val="भारत.icom.museum": 37098.76096
net/punycode.js method="punycode" n="1024" val="ايران.icom.museum": 31919.15278
net/punycode.js method="punycode" n="1024" val="éire.icom.museum": 36912.57326
net/punycode.js method="punycode" n="1024" val="איקו״ם.ישראל.museum": 15457.49763
net/punycode.js method="punycode" n="1024" val="日本.icom.museum": 40124.44062
net/punycode.js method="punycode" n="1024" val="الأردن.icom.museum": 30370.73983

net/punycode.js method="icu" n="1024" val="افغانستا.icom.museum": 265435.02052
net/punycode.js method="icu" n="1024" val="الجزائر.icom.museum": 327985.94529
net/punycode.js method="icu" n="1024" val="österreich.icom.museum": 300055.76232
net/punycode.js method="icu" n="1024" val="বাংলাদেশ.icom.museum": 235323.31677
net/punycode.js method="icu" n="1024" val="беларусь.icom.museum": 229452.87589
net/punycode.js method="icu" n="1024" val="belgië.icom.museum": 354082.67000
net/punycode.js method="icu" n="1024" val="българия.icom.museum": 256043.71947
net/punycode.js method="icu" n="1024" val="تشادر.icom.museum": 357039.62547
net/punycode.js method="icu" n="1024" val="中国.icom.museum": 385073.29554
net/punycode.js method="icu" n="1024" val="القمر.icom.museum": 351974.70182
net/punycode.js method="icu" n="1024" val="κυπρος.icom.museum": 297908.24193
net/punycode.js method="icu" n="1024" val="českárepublika.icom.museum": 268331.79121
net/punycode.js method="icu" n="1024" val="مصر.icom.museum": 338442.24685
net/punycode.js method="icu" n="1024" val="ελλάδα.icom.museum": 340001.77306
net/punycode.js method="icu" n="1024" val="magyarország.icom.museum": 301679.21012
net/punycode.js method="icu" n="1024" val="ísland.icom.museum": 370667.99875
net/punycode.js method="icu" n="1024" val="भारत.icom.museum": 205594.87089
net/punycode.js method="icu" n="1024" val="ايران.icom.museum": 302680.25733
net/punycode.js method="icu" n="1024" val="éire.icom.museum": 397259.52996
net/punycode.js method="icu" n="1024" val="איקו״ם.ישראל.museum": 202994.92850
net/punycode.js method="icu" n="1024" val="日本.icom.museum": 407493.58416
net/punycode.js method="icu" n="1024" val="الأردن.icom.museum": 326276.22915

cc @srl295 @mscdex @ChALkeR @trevnorris

@jasnell jasnell added wip Issues and PRs that are still a work in progress. net Issues and PRs related to the net subsystem. i18n-api Issues and PRs related to the i18n implementation. performance Issues and PRs related to the performance of Node.js. labels Jun 21, 2016
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. url Issues and PRs related to the legacy built-in url module. labels Jun 21, 2016
@srl295
Copy link
Member

srl295 commented Jun 21, 2016

does the binary size change much? i don't think it should.

@jasnell
Copy link
Member Author

jasnell commented Jun 21, 2016

With the code, no. Not sure about the data files. I need to track down
exactly which additional data files are needed
On Jun 21, 2016 2:25 PM, "Steven R. Loomis" [email protected]
wrote:

does the binary size change much? i don't think it should.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#7355 (comment), or mute
the thread
https://github.com/notifications/unsubscribe/AAa2eQuKo2eIYt7PM1-WSnnxx2cc3Skhks5qOFbMgaJpZM4I7LOq
.

@srl295
Copy link
Member

srl295 commented Jun 21, 2016

@jasnell
Copy link
Member Author

jasnell commented Jun 21, 2016

Yeah, likely both. I'll play around with adding those into the small-icu subset and see what happens ;-)

@sindresorhus
Copy link

// @mathiasbynens


if (status == U_BUFFER_OVERFLOW_ERROR) {
status = U_ZERO_ERROR;
dest = static_cast<char*>(malloc(len));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll at least want to do a CHECK_NE(dest, nullptr)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, use MaybeStackBuffer.

@bnoordhuis
Copy link
Member

High-level comment: is punycode prevalent enough to warrant optimizing for? I'm inclined to say it isn't.

@srl295
Copy link
Member

srl295 commented Jun 22, 2016

@bnoordhuis Maybe this is what you meant, but the function is called every time url.parse() is called, isn't it?

> url.parse('http://mañana.com').host
'xn--maana-pta.com'

@bnoordhuis
Copy link
Member

Let me rephrase: does punycode.toASCII() show up in the profile of a non-contrived benchmark? Maybe I don't call url.parse() enough, but I've never seen it.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

@bnoordhuis ... it's not used extensively by our code, no, but it is used pretty much every time url.parse() is called which is every time someone does something like http.get('http://example.org') (see https://github.com/nodejs/node/blob/master/lib/url.js#L312). This is the only use of the punycode module in core.

Ultimately what I'd really like is for us to deprecate the punycode module and eventually pull it back out of core entirely, relying solely on the faster ICU implementation. What that would mean is not supporting IDNA's in builds that do not include ICU but if those are not including ICU then internationalization is likely not a requirement and it's not likely to be any issue (and if it is, they can rely on the userland module to get it).

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

@bnoordhuis ... to give an idea, below is a comparison of url.parse benchmarks in v6.2.2 against master with this patch applied:

url/url-parse.js type="one" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 249020 ./node: 370670 ... -32.82%
url/url-parse.js type="two" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 299860 ./node: 493920 ... -39.29%
url/url-parse.js type="three" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 238360 ./node: 345490 . -31.01%
url/url-parse.js type="four" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 727730 ./node: 730270 ... -0.35%
url/url-parse.js type="five" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 716050 ./node: 609270 ... 17.52%
url/url-parse.js type="six" n="250000": /Users/james/.nvm/versions/node/v6.2.2/bin/node: 271480 ./node: 421830 ... -35.64%

The "five" case is an exception and I believe that something else unrelated to this change is going on there.. but the perf boost for the typical use cases is significant.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

@bnoordhuis @saghul @trevnorris @srl295 .. updated! PTAL

}

if (U_FAILURE(status)) {
env->ThrowError("Cannot convert name to ASCII");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this give a simple yes/no on failure, or is there additional information that may be useful to the error message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this particular case the possible failures are rare and non-interesting.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you can test for info.errors & UIDNA_ERROR_DOMAIN_NAME_TOO_LONG, maybe other values. The status itself is likely to be something like out of memory, invalid parameters, etc. You could ThrowError including u_errorName(status) to return the ICU status code - that would be good. I don't know if ThrowError takes a parameter.

@trevnorris
Copy link
Contributor

what a pain. so http.get() uses punycode automatically but not encodeURI()? so a request like http://見.香港/見 would properly convert the domain but not the path. At least in a way that node would like. So to be full proof a user would have to do:

// url.parse() automatically runs toASCII() on the
// domain, but isn't documented to do so.
const o = url.parse('http://見.香港/見');
o.pathname = encodeURI(o.pathname);
const url = url.format(o);

Anyway, not directly anything to do with this PR.

In terms of performance, I think punycode.js could be made noticeably faster for node. Since it's not written for node to begin with (as shown by comments like // Avoid split(regex) for IE8 compatibility.). My main question is the need to use full-icu data for this to work, and so how much actual use it will see.

Strictly from the code perspective, the updates LGTM.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

This only needs full-icu currently. There are a couple of minor data files currently missing in the small-icu subset that would need to be added but those are minimal. Those can be added to small-icu without too much impact.

@jasnell
Copy link
Member Author

jasnell commented Jun 22, 2016

And I'm working on a better URL impl to deal with the other issues ;-)

@mathiasbynens
Copy link
Contributor

@trevnorris I think punycode.js could be made noticeably faster for node.

How?

@bnoordhuis
Copy link
Member

@mathiasbynens By playing to V8's strengths while avoiding its weaknesses. :-)

For example, .toASCII() calls String#split() twice and is heavy on the map-over-substrings idiom. It would probably be faster to:

  1. Scan for the split character and only split when found; saves allocating an array in the common case.
  2. Iterate rather than map; V8 is not great at lowering/inlining callbacks.
  3. Try to avoid concatenating strings too much; cons strings eventually have to be flattened.

There is probably more but that is what stood out from a quick look.

}
}

const {toASCII} = importPunycode();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have a spacing rule for destructuring before we start using it in full? // @Trott

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In other places we have spaces around the name.

@jasnell
Copy link
Member Author

jasnell commented Jun 28, 2016

CI had a failure on smartos... green everywhere else. Small tweak and running again: https://ci.nodejs.org/job/node-test-pull-request/3108/

@jasnell
Copy link
Member Author

jasnell commented Jun 28, 2016

@trevnorris @bnoordhuis @srl295 ... CI is good. This now no-longer requires full-icu data. PTAL


if (len < 0) {
env->ThrowError("Cannot convert name to ASCII");
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be able to put these two on the same line. git grep "return.*Throw" shows that's how it's done in 260 other places.

@bnoordhuis
Copy link
Member

LGTM with style nits.

@jasnell
Copy link
Member Author

jasnell commented Jun 28, 2016

@bnoordhuis @trevnorris .. updated to address the nits

ICU has a punycode implementation built in. Use it instead of the
javascript implementation because it's much faster.
@jasnell
Copy link
Member Author

jasnell commented Jun 29, 2016

Final CI before landing: https://ci.nodejs.org/job/node-test-pull-request/3134/

@jasnell
Copy link
Member Author

jasnell commented Jun 30, 2016

CI is green.

jasnell added a commit that referenced this pull request Jun 30, 2016
PR-URL: #7355
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
jasnell added a commit that referenced this pull request Jun 30, 2016
ICU has a punycode implementation built in. Use it instead of the
javascript implementation because it's much faster.

PR-URL: #7355
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
@jasnell
Copy link
Member Author

jasnell commented Jun 30, 2016

Landed in 7de59ef and 3d6a01e

@jasnell jasnell closed this Jun 30, 2016
@mhdawson mhdawson mentioned this pull request Jun 30, 2016
Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016
PR-URL: #7355
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Fishrock123 pushed a commit that referenced this pull request Jul 5, 2016
ICU has a punycode implementation built in. Use it instead of the
javascript implementation because it's much faster.

PR-URL: #7355
Reviewed-By: Trevor Norris <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
@Fishrock123 Fishrock123 mentioned this pull request Jul 5, 2016
@MylesBorins
Copy link
Contributor

@jasnell @srl295 I've set this to don't land as the icu doesn't come packed with v4.x

please feel free to update if I'm incorrect here

@srl295
Copy link
Member

srl295 commented Jul 11, 2016

@thealphanerd Correct and correct.

@srl295 srl295 mentioned this pull request Jul 27, 2016
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. i18n-api Issues and PRs related to the i18n implementation. net Issues and PRs related to the net subsystem. performance Issues and PRs related to the performance of Node.js. url Issues and PRs related to the legacy built-in url module.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants