Implement a special-case lookup for ascii grapheme categories. #79

cessen · 2020-02-14T12:24:25Z

This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.

This speeds up processing even for many non-ascii texts, since they often still use ascii-range punctuation and whitespace.

cessen · 2020-02-14T12:35:21Z

I took a crack at special-casing the ASCII range as discussed in #77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach.

Here are the bench results, including results for both approaches:

No ascii special-casing:

test graphemes_arabic      ... bench:     893,766 ns/iter (+/- 68,494) = 56 MB/s
test graphemes_english     ... bench:     989,329 ns/iter (+/- 42,635) = 50 MB/s
test graphemes_hindi       ... bench:     933,926 ns/iter (+/- 30,817) = 53 MB/s
test graphemes_japanese    ... bench:     727,820 ns/iter (+/- 37,069) = 69 MB/s
test graphemes_korean      ... bench:   1,083,001 ns/iter (+/- 33,362) = 46 MB/s
test graphemes_mandarin    ... bench:     520,048 ns/iter (+/- 10,724) = 97 MB/s
test graphemes_russian     ... bench:     842,000 ns/iter (+/- 32,343) = 60 MB/s
test graphemes_source_code ... bench:   1,060,247 ns/iter (+/- 42,456) = 47 MB/s

Table:

test graphemes_arabic      ... bench:     692,838 ns/iter (+/- 34,703) = 72 MB/s
test graphemes_english     ... bench:   1,018,363 ns/iter (+/- 48,135) = 48 MB/s
test graphemes_hindi       ... bench:     826,851 ns/iter (+/- 39,035) = 59 MB/s
test graphemes_japanese    ... bench:     739,136 ns/iter (+/- 68,602) = 68 MB/s
test graphemes_korean      ... bench:   1,000,320 ns/iter (+/- 85,934) = 50 MB/s
test graphemes_mandarin    ... bench:     522,754 ns/iter (+/- 24,797) = 96 MB/s
test graphemes_russian     ... bench:     717,681 ns/iter (+/- 25,272) = 71 MB/s
test graphemes_source_code ... bench:   1,008,405 ns/iter (+/- 31,694) = 49 MB/s

If branching:

test graphemes_arabic      ... bench:     663,521 ns/iter (+/- 36,100) = 75 MB/s
test graphemes_english     ... bench:     927,851 ns/iter (+/- 35,652) = 53 MB/s
test graphemes_hindi       ... bench:     801,972 ns/iter (+/- 35,350) = 61 MB/s
test graphemes_japanese    ... bench:     716,374 ns/iter (+/- 40,118) = 70 MB/s
test graphemes_korean      ... bench:     964,766 ns/iter (+/- 47,644) = 51 MB/s
test graphemes_mandarin    ... bench:     503,575 ns/iter (+/- 15,838) = 100 MB/s
test graphemes_russian     ... bench:     665,667 ns/iter (+/- 27,923) = 76 MB/s
test graphemes_source_code ... bench:     959,008 ns/iter (+/- 66,405) = 52 MB/s

Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the slice get() method to roll the bounds check and ascii-range check into one, so the whole thing was just a single branch and an array lookup.

The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories.

cessen · 2020-02-14T12:39:44Z

(Also, sorry for the commit message typo. I can amend the commit if it's important to you.)

Manishearth

Oh, right, for graphemes there's not much of a need for tables, good call.

The tables vs if branching might be a bigger deal if we want to do this for word/line/sentence breaking.

src/grapheme.rs

Manishearth · 2020-02-14T20:46:23Z

Thanks!

cessen · 2020-02-15T00:53:03Z

Awesome! Thanks for the clean up and merge!

Implement a special-case lookup for ascii grapeheme categories.

c9aa6fa

This speeds up processing even for many non-ascii texts, since they often still use ascii-range punctuation and whitespace.

cessen changed the title ~~Implement a special-case lookup for ascii grapeheme categories.~~ Implement a special-case lookup for ascii grapheme categories. Feb 14, 2020

Manishearth approved these changes Feb 14, 2020

View reviewed changes

src/grapheme.rs Outdated Show resolved Hide resolved

Manishearth reviewed Feb 14, 2020

View reviewed changes

src/grapheme.rs Outdated Show resolved Hide resolved

Apply suggestions from code review

945dbb6

Manishearth merged commit fbba2a6 into unicode-rs:master Feb 14, 2020

Manishearth mentioned this pull request Feb 14, 2020

Try out caching and ASCII special casing for other segmenters #80

Open

cessen deleted the ascii_grapheme_optimization branch February 15, 2020 00:52

zachwolfe mentioned this pull request Jan 14, 2021

Revisit my special case for ASCII in the lexer dusklang/dusk#91

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement a special-case lookup for ascii grapheme categories. #79

Implement a special-case lookup for ascii grapheme categories. #79

cessen commented Feb 14, 2020

cessen commented Feb 14, 2020

cessen commented Feb 14, 2020

Manishearth left a comment

Manishearth commented Feb 14, 2020

cessen commented Feb 15, 2020

Implement a special-case lookup for ascii grapheme categories. #79

Implement a special-case lookup for ascii grapheme categories. #79

Conversation

cessen commented Feb 14, 2020

cessen commented Feb 14, 2020

cessen commented Feb 14, 2020

Manishearth left a comment

Choose a reason for hiding this comment

Manishearth commented Feb 14, 2020

cessen commented Feb 15, 2020