
Ideas for optimizations #15

Open
walling opened this issue Nov 18, 2013 · 5 comments
walling commented Nov 18, 2013

Just tossing some ideas out there.

The technical report outlines the Quick_Check algorithm for quickly checking whether a given string is in a given normalization form: Detecting Normalization Forms.

These are YES / NO / MAYBE answers, so in theory it should be possible to implement this as two regexps:

  1. Whitelist regexp: If this regexp matches, the answer is YES, otherwise continue.
  2. Blacklist regexp: If this regexp matches, the answer is NO, otherwise MAYBE.
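In principle the flow above could look like this sketch, where both character classes are hand-picked stand-ins for the real generated data (ASCII is always in NFC; U+2126 OHM SIGN has NFC_Quick_Check=No):

```javascript
// Sketch of the Quick_Check flow with two placeholder regexps; the real
// whitelist/blacklist would be generated from the Unicode character data.
var YES_REGEXP = /^[\u0000-\u007F]*$/; // placeholder: ASCII is always NFC
var NO_REGEXP  = /[\u2126]/;           // placeholder: an NFC_QC=No code point

function quickCheckNFC(str) {
  if (YES_REGEXP.test(str)) return 'YES';
  if (NO_REGEXP.test(str))  return 'NO';
  return 'MAYBE'; // fall back to running the full normalization
}
```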

The regexps should be automatically generated. It is complicated by the fact that JavaScript regexps only support UCS-2 and not UTF-16, so we have to manually calculate surrogate pairs (and nested matching groups). See punycode.js.
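Splitting an astral code point into a surrogate pair is mechanical, following the formula in the Unicode standard; a small sketch (the function name is illustrative):

```javascript
// Turn one astral code point (>= U+10000) into the \uXXXX\uXXXX escape text
// for its UCS-2 surrogate pair, ready to embed in a regexp source string.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // high (lead) surrogate
  var low  = 0xDC00 + (offset & 0x3FF); // low (trail) surrogate
  return '\\u' + high.toString(16).toUpperCase() +
         '\\u' + low.toString(16).toUpperCase();
}

// e.g. U+1D160 MUSICAL SYMBOL EIGHTH NOTE becomes \uD834\uDD60
```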

If implemented, we could add test functions, e.g. 'foo'.isNormalization('NFC'). Internally they could be used to speed up any given normalization. Something like this: use the whitelist regexp to match the longest prefix already in this normalization form, then cut that out and normalize the rest recursively. We only need to normalize the parts that are not already normalized, but we have to be careful about the boundaries between normalized and non-normalized parts. Some more strategies are outlined in the technical report: Optimization Strategies.
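A rough sketch of that prefix strategy, with a placeholder whitelist regexp (ASCII here, since ASCII text is always in NFC) and `slowNormalize` standing in for the full routine (e.g. unorm.nfc):

```javascript
// Keep the longest already-normalized prefix verbatim and only run the slow
// normalization on the remainder. The regexp is a placeholder; the real one
// would be generated from the NFC_Quick_Check=Yes data.
function fastNFC(str, slowNormalize) {
  var nfcPrefix = /^[\u0000-\u007F]*/;
  var prefix = nfcPrefix.exec(str)[0];
  // Boundary care: back up one character so a combining mark right after
  // the prefix can still interact with the prefix's last character.
  if (prefix.length > 0 && prefix.length < str.length) {
    prefix = prefix.slice(0, -1);
  }
  return prefix + slowNormalize(str.slice(prefix.length));
}
```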

First and foremost, we should build a benchmark test suite, to actually find out whether these optimizations give a speed boost for long strings. It would also be nice to know how much they add to the size of the library.
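As a starting point, even a framework-free timing harness would give some numbers (Benchmark.js would be a more robust choice later); a minimal sketch:

```javascript
// Tiny timing harness: run `fn` on `input` a number of times and report the
// elapsed wall-clock time. Good enough for a first rough comparison between
// already-normalized (e.g. pure ASCII) and decomposed inputs.
function bench(label, fn, input, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; i++) fn(input);
  var ms = Date.now() - start;
  return label + ': ' + ms + ' ms for ' + iterations + ' runs';
}

// Example inputs: one string already in NFC, one needing composition.
var ascii = new Array(1000).join('abcdefgh');
var mixed = new Array(1000).join('e\u0301');
```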

walling commented Nov 18, 2013

@phadej, any comments? Do you feel this is necessary, nice-to-have, or not important? I'm not sure how soon browsers are going to support native ES6 normalization.

phadej commented Nov 21, 2013

The benchmark suite is definitely something we should start with.

I'm not sure how fast or slow unorm is in the first place. It might be that these kinds of optimisations only make sense on large inputs (>1MB), which might be rare.

walling commented Jan 7, 2014

True, a benchmark suite would be nice.

termi commented Jul 3, 2014

> The regexps should be automatically generated. It is complicated by the fact that JavaScript regexps only support UCS-2 and not UTF-16, so we have to manually calculate surrogate pairs (and nested matching groups).

Can this project help you?

phadej commented Jul 3, 2014

I work from time to time on https://github.com/phadej/tarzan, but it can't go beyond UCS-2 for now. I forgot that this issue is here (and tarzan could be used to generate the regexps). I'll try to find time to implement surrogate pairs in tarzan.
