-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement a special-case lookup for ascii grapheme categories. #79
Implement a special-case lookup for ascii grapheme categories. #79
Conversation
This speeds up processing even for many non-ascii texts, since they often still use ascii-range punctuation and whitespace.
I took a crack at special-casing the ASCII range as discussed in #77, and it definitely improves performance across the board. I tried two different implementations: one based on a 128-element lookup table, and another based on a few if statements. This PR is for the if statement approach. Here are the bench results, including results for both approaches:
Turns out that the branching approach beats the table approach, at least on my machine. The table implementation wasn't completely naive either: I made sure to avoid duplicate bounds checks (although they probably would have been optimized out anyway) by using the slice The if-branching approach is also just way easier to maintain and check for correctness, because you don't have to wade through an array of 128 grapheme categories. |
(Also, sorry for the commit message typo. I can amend the commit if it's important to you.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, for graphemes there's not much of a need for tables, good call.
The tables vs if branching might be a bigger deal if we want to do this for word/line/sentence breaking.
Thanks! |
Awesome! Thanks for the clean up and merge! |
This speeds up processing even for many non-ascii texts, since
they often still use ascii-range punctuation and whitespace.