Optimization for grapheme iteration. #77
Conversation
Typical text involves a lot of chars one-after-another that are in a similar scalar value range and in the same category. This commit takes advantage of that to speed up grapheme iteration by caching the range and category of the last char, and always checking against that cache first before doing a binary search on the whole grapheme category table. This results in significant speed ups on many texts, in some cases up to 2x faster.
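The cache-first lookup described above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the names (`GraphemeCat`, `CatCache`, `GRAPHEME_TABLE`, `grapheme_category`) and the table contents are hypothetical.

```rust
// Sketch of the cache-first category lookup. Names and table contents
// are illustrative, not unicode-segmentation's real API.
#[derive(Copy, Clone, PartialEq, Debug)]
enum GraphemeCat {
    Other,
    Extend,
    // ... remaining categories elided
}

// Cache of the range and category of the last char looked up.
struct CatCache {
    range: (u32, u32),
    cat: GraphemeCat,
}

// Hypothetical table of (range_start, range_end, category) entries,
// sorted by range_start so it can be binary-searched.
static GRAPHEME_TABLE: &[(u32, u32, GraphemeCat)] = &[
    (0x0000, 0x02FF, GraphemeCat::Other),
    (0x0300, 0x036F, GraphemeCat::Extend),
    // ... full Unicode coverage elided
];

fn grapheme_category(c: char, cache: &mut CatCache) -> GraphemeCat {
    let cp = c as u32;
    // Fast path: consecutive chars usually fall in the same range,
    // so check the cached range before searching.
    if cp >= cache.range.0 && cp <= cache.range.1 {
        return cache.cat;
    }
    // Slow path: binary search the whole table, then refresh the cache.
    let idx = GRAPHEME_TABLE
        .binary_search_by(|&(lo, hi, _)| {
            if cp < lo {
                std::cmp::Ordering::Greater
            } else if cp > hi {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .unwrap_or(0); // sketch-only fallback for chars outside the table
    let (lo, hi, cat) = GRAPHEME_TABLE[idx];
    *cache = CatCache { range: (lo, hi), cat };
    cat
}
```

Runs of chars in the same range hit the early return and pay only two comparisons, which is where the speedup comes from.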
Here are some benchmarks I did on a variety of texts in various languages, most of which are sourced from the wikipedia articles about each respective language (e.g. the Mandarin Chinese text is from the Mandarin Chinese wikipedia page about the Mandarin Chinese language). They were all converted to plain utf8 text (no html markup) and copy-pasted to create roughly 200kb text files. Before:
After:
With the exception of Korean and an artificial worst-case text I created, everything either improves in performance or breaks even, and the performance improvements are often substantial. The worst-case text comprises alternating grapheme categories on every char, completely defeating the optimization and leaving only the overhead. This should give a good indication of the worst-case performance drop that might happen. Korean is a bit of a unique case: due to the nature of Korean text, it is pretty close to a real-world equivalent of the artificial worst-case text. Nevertheless, the drop in performance seems (IMO) reasonable given the improvements for other texts. The zalgo text case includes artificially long combining character sequences (https://en.wikipedia.org/wiki/Combining_character#Zalgo_text) and shouldn't be taken too seriously.
This is great! Could you also check in the benchmarks you wrote? Perhaps make a benches/ folder.
Be sure to exclude that folder from publishing in Cargo.toml.
Also, an additional optimization that can be done is hardcoding the table for ASCII and checking for that first.
Eh, those things can be followups.
@Manishearth Yeah, I've also been thinking about how to hard-code the ASCII range in some way. I think with some bit-fiddling it could be made extremely fast, because the only pure-ASCII case that needs handling is CRLF (I'm pretty sure?). And it would likely introduce very little overhead for other texts, I'm guessing. Originally I was planning to do that optimization just in my own application code, bypassing UnicodeSegmentation altogether for those cases, but if you're interested in having it in-tree I'd be more than happy to take a crack at it in another PR.
Oh, I meant that we could hardcode the categories so that ASCII doesn't hit the binary search.
Ah, okay. I'm sure that would make ASCII text faster, though I think that would be more because it effectively biases the search to check ASCII first than because of the hard coding itself. We could choose to bias other ranges similarly, if we wanted to. So I guess the question is: is the ASCII range special enough to warrant doing that? Edit: to clarify, I mean "special enough" in the general case. For e.g. my own project, a code editor, it definitely would be. But I can write my own ASCII-specific optimizations in that case, as could other applications for their own domains.
@cessen it's special enough because a lot of punctuation/etc is ASCII even if text might not necessarily be. It's going to be a single comparison followed by an indexing operation, very cheap if the alternative is a binary search. It will complement the cache really well since then the cache can continue to deal with the non-ASCII general range of the text whereas spacing and punctuation doesn't invalidate it. Would need to tweak the bsearch method to not return the cache index markers in the ASCII case though.
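The ASCII fast path being discussed could look roughly like this. It's a hedged sketch: the category names, the fallback, and `ascii_category` are all hypothetical, not the crate's actual implementation.

```rust
// Sketch of an ASCII fast path: one range comparison plus a cheap
// per-byte lookup before falling back to the binary search.
#[derive(Copy, Clone, PartialEq, Debug)]
enum GraphemeCat {
    CR,
    LF,
    Control,
    Other,
}

// Category for a single ASCII byte. A real implementation could use a
// precomputed 128-entry table so this is a plain indexing operation.
fn ascii_category(b: u8) -> GraphemeCat {
    match b {
        b'\r' => GraphemeCat::CR,
        b'\n' => GraphemeCat::LF,
        0x00..=0x1F | 0x7F => GraphemeCat::Control,
        _ => GraphemeCat::Other,
    }
}

fn category(c: char) -> GraphemeCat {
    let cp = c as u32;
    if cp < 0x80 {
        // ASCII: skip the binary search entirely.
        ascii_category(cp as u8)
    } else {
        // Non-ASCII: fall through to the binary search over the full
        // grapheme table (elided in this sketch).
        GraphemeCat::Other
    }
}
```

Because the ASCII branch never touches the cache, interleaved spaces and punctuation leave the cached non-ASCII range intact, which is the complementary behavior described above.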
@Manishearth I think that really depends on the text. The CJK languages, for example, have their own punctuation and white space characters, and the only ASCII-range characters they ever use are line endings, which are only occasional in typical texts. But having said that, I think you're right that many other languages still use ASCII punctuation and white space. And then there are things like HTML, too. I'd be happy to take a crack at this and see how it affects the benchmarks and whether it seems worth whatever overhead it might introduce.
Yes, not all languages do this. This was definitely a win for unicode-xid in unicode-rs/unicode-xid#16 ( which doesn't even go the full way with the ASCII table, just reorders the binary search!), which admittedly works with different sets of characters |