Tokenization option min_word_length counts length in bytes #5

Open
isaackd opened this issue Apr 9, 2023 · 0 comments
Labels
bug Something isn't working

Comments

isaackd (Owner) commented Apr 9, 2023

The min_word_length option should count actual "characters", not bytes. The relevant code:
https://github.com/isaackd/wcloud-dev/blob/e368d53dd4d6fb7fcef084ed98225dc54a054a29/src/tokenizer.rs#L46-L48
From the Rust documentation for str::len (https://doc.rust-lang.org/std/primitive.str.html#method.len):

This length is in bytes, not chars or graphemes. In other words, it might not be what a human considers the length of the string.
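
A minimal sketch of one possible fix, assuming the check compares each word's length against min_word_length. The helper names `meets_min_length` and `filter_short_words` are hypothetical and are not taken from tokenizer.rs. It counts with `chars().count()`, which gives the number of Unicode scalar values instead of the UTF-8 byte count returned by `len()`:

```rust
/// Hypothetical helper (not from the wcloud source): returns true when
/// `word` contains at least `min_word_length` Unicode scalar values.
/// `str::len` would return the UTF-8 byte count here, which overcounts
/// every non-ASCII character.
fn meets_min_length(word: &str, min_word_length: usize) -> bool {
    word.chars().count() >= min_word_length
}

/// Hypothetical helper: keeps only the words that pass the check above,
/// preserving their original order.
fn filter_short_words<'a>(words: &[&'a str], min_word_length: usize) -> Vec<&'a str> {
    words
        .iter()
        .copied()
        .filter(|w| meets_min_length(w, min_word_length))
        .collect()
}
```

For example, "日本" is 6 bytes but only 2 chars, so with a minimum of 3 a byte-based check would keep it while the check above drops it. Note that `chars()` still counts scalar values, not graphemes: a letter followed by a combining accent counts as two. If grapheme-level counting is what is wanted, that would likely need an external crate such as unicode-segmentation.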

isaackd added the bug label Apr 9, 2023