Optimization for grapheme iteration. #77
Conversation
Typical text involves a lot of chars one-after-another that are in a similar scalar value range and in the same category. This commit takes advantage of that to speed up grapheme iteration by caching the range and category of the last char, and always checking against that cache first before doing a binary search on the whole grapheme category table. This results in significant speed ups on many texts, in some cases up to 2x faster.
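The cache-first lookup described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the names (`GraphemeCat`, `CatCache`, `GRAPHEME_TABLE`, `grapheme_category`) and the table contents are hypothetical.

```rust
// Sketch of the cache-first category lookup. Names and table contents
// are illustrative, not unicode-segmentation's real API.
#[derive(Copy, Clone, PartialEq, Debug)]
enum GraphemeCat {
    Other,
    Extend,
    // ... remaining categories elided
}

// Cache of the range and category of the last char looked up.
struct CatCache {
    range: (u32, u32),
    cat: GraphemeCat,
}

// Hypothetical table of (range_start, range_end, category) entries,
// sorted by range_start so it can be binary-searched.
static GRAPHEME_TABLE: &[(u32, u32, GraphemeCat)] = &[
    (0x0000, 0x02FF, GraphemeCat::Other),
    (0x0300, 0x036F, GraphemeCat::Extend),
    // ... full Unicode coverage elided
];

fn grapheme_category(c: char, cache: &mut CatCache) -> GraphemeCat {
    let cp = c as u32;
    // Fast path: consecutive chars usually fall in the same range,
    // so check the cached range before searching.
    if cp >= cache.range.0 && cp <= cache.range.1 {
        return cache.cat;
    }
    // Slow path: binary search the whole table, then refresh the cache.
    let idx = GRAPHEME_TABLE
        .binary_search_by(|&(lo, hi, _)| {
            if cp < lo {
                std::cmp::Ordering::Greater
            } else if cp > hi {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .unwrap_or(0); // sketch-only fallback for chars outside the table
    let (lo, hi, cat) = GRAPHEME_TABLE[idx];
    *cache = CatCache { range: (lo, hi), cat };
    cat
}
```

Runs of chars in the same range hit the early return and pay only two comparisons, which is where the speedup comes from.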
Here are some benchmarks I did on a variety of texts in various languages, most of which are sourced from the wikipedia articles about each respective language (e.g. the Mandarin Chinese text is from the Mandarin Chinese wikipedia page about the Mandarin Chinese language). They were all converted to plain utf8 text (no html markup) and copy-pasted to create roughly 200kb text files. Before:
After:
With the exception of Korean and an artificial worst-case text I created, everything either improves in performance or breaks even, and the performance improvements are often substantial. The worst-case text comprises alternating grapheme categories on every char, completely defeating the optimization and leaving only the overhead. This should give a good indication of the worst-case performance drop that might happen. Korean is a bit of a unique case: due to the nature of Korean text, it is pretty close to a real-world equivalent of the artificial worst-case text. Nevertheless, the drop in performance seems (IMO) reasonable given the improvements for other texts. The zalgo text case includes artificially long combining character sequences (https://en.wikipedia.org/wiki/Combining_character#Zalgo_text) and shouldn't be taken too seriously.
This is great! Could you also check in the benchmarks you wrote? Perhaps make a benches/ folder.
Be sure to exclude that folder from publishing in Cargo.toml.
Also, an additional optimization that can be done is hardcoding the table for ASCII and checking for that first.
Eh, those things can be followups.
@Manishearth Yeah, I've also been thinking about how to hard-code the ASCII range in some way. I think with some bit-fiddling it could be made extremely fast, because the only pure-ASCII case that needs handling is CRLF (I'm pretty sure?). And it would likely introduce very little overhead for other texts, I'm guessing. Originally I was planning to do that optimization just in my own application code, bypassing UnicodeSegmentation altogether for those cases, but if you're interested in having it in-tree I'd be more than happy to take a crack at it in another PR.
Oh, I meant that we could hardcode the categories so that ASCII doesn't hit the binary search.
Ah, okay. I'm sure that would make ASCII text faster, though I think that would be more because it effectively biases the search to check ASCII first than because of the hard coding itself. We could choose to bias other ranges similarly, if we wanted to. So I guess the question is: is the ASCII range special enough to warrant doing that? Edit: to clarify, I mean "special enough" in the general case. For e.g. my own project, a code editor, it definitely would be. But I can write my own ASCII-specific optimizations in that case, as could other applications for their own domains.
@cessen it's special enough because a lot of punctuation/etc is ASCII even if text might not necessarily be. It's going to be a single comparison followed by an indexing operation, very cheap if the alternative is a binary search. It will complement the cache really well since then the cache can continue to deal with the non-ASCII general range of the text whereas spacing and punctuation doesn't invalidate it. Would need to tweak the bsearch method to not return the cache index markers in the ASCII case though.
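The ASCII fast path being discussed could look roughly like this. It's a hedged sketch: the category names, the fallback, and `ascii_category` are all hypothetical, not the crate's actual implementation.

```rust
// Sketch of an ASCII fast path: one range comparison plus a cheap
// per-byte lookup before falling back to the binary search.
#[derive(Copy, Clone, PartialEq, Debug)]
enum GraphemeCat {
    CR,
    LF,
    Control,
    Other,
}

// Category for a single ASCII byte. A real implementation could use a
// precomputed 128-entry table so this is a plain indexing operation.
fn ascii_category(b: u8) -> GraphemeCat {
    match b {
        b'\r' => GraphemeCat::CR,
        b'\n' => GraphemeCat::LF,
        0x00..=0x1F | 0x7F => GraphemeCat::Control,
        _ => GraphemeCat::Other,
    }
}

fn category(c: char) -> GraphemeCat {
    let cp = c as u32;
    if cp < 0x80 {
        // ASCII: skip the binary search entirely.
        ascii_category(cp as u8)
    } else {
        // Non-ASCII: fall through to the binary search over the full
        // grapheme table (elided in this sketch).
        GraphemeCat::Other
    }
}
```

Because the ASCII branch never touches the cache, interleaved spaces and punctuation leave the cached non-ASCII range intact, which is the complementary behavior described above.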
@Manishearth I think that really depends on the text. The CJK languages, for example, have their own punctuation and white space characters, and the only ASCII-range characters they ever use are line endings, which are only occasional in typical texts. But having said that, I think you're right that many other languages still use ASCII punctuation and white space. And then there are things like HTML, too. I'd be happy to take a crack at this and see how it affects the benchmarks and whether it seems worth whatever overhead it might introduce.
Yes, not all languages do this. This was definitely a win for unicode-xid in unicode-rs/unicode-xid#16 ( which doesn't even go the full way with the ASCII table, just reorders the binary search!), which admittedly works with different sets of characters |