Use Unicode line breaking algorithm to find words #313
fdfa47f to 500978b
I'll need to make a small detour before I can continue with this PR — the |
500978b to 6c5220b
The detour has been completed with #331 and the |
6c5220b to bbecb5a
This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from [Unicode Standard Annex #14](https://www.unicode.org/reports/tr14/). We can use this to find words in non-ASCII text. The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on ASCII space. This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.
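To illustrate why splitting on ASCII space falls short, here is a minimal sketch using only the standard library. The helper name `split_on_ascii_space` is hypothetical and is not the code in this PR; it only mimics the space-based approach so its limitation is easy to see.

```rust
/// Hypothetical helper for illustration: split `text` into words at
/// ASCII spaces, keeping trailing spaces attached to the preceding word.
fn split_on_ascii_space(text: &str) -> Vec<&str> {
    let mut words = Vec::new();
    let mut start = 0;
    let mut in_whitespace = false;
    for (idx, ch) in text.char_indices() {
        // A word ends where a run of spaces is followed by a non-space.
        if in_whitespace && ch != ' ' {
            words.push(&text[start..idx]);
            start = idx;
        }
        in_whitespace = ch == ' ';
    }
    // Push whatever remains after the last break.
    if start < text.len() {
        words.push(&text[start..]);
    }
    words
}
```

With this helper, `"Hello brave world"` splits into three words, but a string such as `"你好世界"` contains no ASCII space and comes back as a single unbreakable word. As I understand it, UAX #14 allows breaks between most CJK ideographs, which is the kind of break opportunity the `unicode-linebreak` crate is meant to report.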
bbecb5a to ecbbde4
This adds a new optional dependency on the unicode-linebreak crate, which implements the line breaking algorithm from Unicode Standard Annex #14.
The new dependency is enabled by default since these line breaks are more correct than what you get by splitting on whitespace.
This should help address #220 and #80, though I’m no expert on non-Western languages. More feedback from the community would be needed here.