Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a special-case lookup for ascii grapheme categories. #79

Merged
merged 2 commits into from
Feb 14, 2020
Merged

Implement a special-case lookup for ascii grapheme categories. #79

merged 2 commits into from
Feb 14, 2020

Conversation

cessen
Copy link
Contributor

@cessen cessen commented Feb 14, 2020

This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.

This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.
@cessen cessen changed the title Implement a special-case lookup for ascii grapeheme categories. Implement a special-case lookup for ascii grapheme categories. Feb 14, 2020
@cessen
Copy link
Contributor Author

cessen commented Feb 14, 2020

I took a crack at special-casing the ASCII range as discussed in #77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach.

Here are the bench results, including results for both approaches:

No ascii special-casing:

test graphemes_arabic      ... bench:     893,766 ns/iter (+/- 68,494) = 56 MB/s
test graphemes_english     ... bench:     989,329 ns/iter (+/- 42,635) = 50 MB/s
test graphemes_hindi       ... bench:     933,926 ns/iter (+/- 30,817) = 53 MB/s
test graphemes_japanese    ... bench:     727,820 ns/iter (+/- 37,069) = 69 MB/s
test graphemes_korean      ... bench:   1,083,001 ns/iter (+/- 33,362) = 46 MB/s
test graphemes_mandarin    ... bench:     520,048 ns/iter (+/- 10,724) = 97 MB/s
test graphemes_russian     ... bench:     842,000 ns/iter (+/- 32,343) = 60 MB/s
test graphemes_source_code ... bench:   1,060,247 ns/iter (+/- 42,456) = 47 MB/s

Table:

test graphemes_arabic      ... bench:     692,838 ns/iter (+/- 34,703) = 72 MB/s
test graphemes_english     ... bench:   1,018,363 ns/iter (+/- 48,135) = 48 MB/s
test graphemes_hindi       ... bench:     826,851 ns/iter (+/- 39,035) = 59 MB/s
test graphemes_japanese    ... bench:     739,136 ns/iter (+/- 68,602) = 68 MB/s
test graphemes_korean      ... bench:   1,000,320 ns/iter (+/- 85,934) = 50 MB/s
test graphemes_mandarin    ... bench:     522,754 ns/iter (+/- 24,797) = 96 MB/s
test graphemes_russian     ... bench:     717,681 ns/iter (+/- 25,272) = 71 MB/s
test graphemes_source_code ... bench:   1,008,405 ns/iter (+/- 31,694) = 49 MB/s

If branching:

test graphemes_arabic      ... bench:     663,521 ns/iter (+/- 36,100) = 75 MB/s
test graphemes_english     ... bench:     927,851 ns/iter (+/- 35,652) = 53 MB/s
test graphemes_hindi       ... bench:     801,972 ns/iter (+/- 35,350) = 61 MB/s
test graphemes_japanese    ... bench:     716,374 ns/iter (+/- 40,118) = 70 MB/s
test graphemes_korean      ... bench:     964,766 ns/iter (+/- 47,644) = 51 MB/s
test graphemes_mandarin    ... bench:     503,575 ns/iter (+/- 15,838) = 100 MB/s
test graphemes_russian     ... bench:     665,667 ns/iter (+/- 27,923) = 76 MB/s
test graphemes_source_code ... bench:     959,008 ns/iter (+/- 66,405) = 52 MB/s

Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the slice get() method to roll the bounds check and ascii-range check into one, so the whole thing was just a single branch and an array lookup.

The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories.

@cessen
Copy link
Contributor Author

cessen commented Feb 14, 2020

(Also, sorry for the commit message typo. I can amend the commit if it's important to you.)

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, for graphemes there's not much of a need for tables, good call.

The tables vs if branching might be a bigger deal if we want to do this for word/line/sentence breaking.

src/grapheme.rs Outdated Show resolved Hide resolved
src/grapheme.rs Outdated Show resolved Hide resolved
@Manishearth Manishearth merged commit fbba2a6 into unicode-rs:master Feb 14, 2020
@Manishearth
Copy link
Member

Thanks!

@cessen
Copy link
Contributor Author

cessen commented Feb 15, 2020

Awesome! Thanks for the clean up and merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants