Clarification on the unicode used on the keys #504
Looking at the Java code, BudouX simply does substring. Java internally stores strings as UTF-16, so I guess these keys are neither codepoints nor graphemes but rather UTF-16 slices. This also explains why there are some garbage keys in the Thai models. I think this should be fixed, as it may also impact accuracy.

https://github.com/Cloudef/zig-budoux
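To illustrate the UTF-16-slice problem described above, here is a small sketch (the example string is hypothetical, not from the BudouX codebase) contrasting Python's codepoint-based indexing with Java-style UTF-16 code-unit slicing:

```python
# A string containing a non-BMP character: U+1D11E needs a surrogate pair in UTF-16.
s = "a\U0001d11eb"  # 'a', MUSICAL SYMBOL G CLEF, 'b'

# Python slices by codepoint, so a 2-codepoint slice is always well formed.
print(s[:2])  # 'a𝄞'

# Emulate Java's String.substring, which counts UTF-16 code units:
# taking the first 2 code units cuts the surrogate pair in half.
utf16 = s.encode("utf-16-be")
broken = utf16[: 2 * 2].decode("utf-16-be", errors="replace")
print(broken)  # 'a' followed by U+FFFD where the lone high surrogate was
```

Keys built from such slices contain unpaired surrogates, which would explain garbage entries in a model trained on scripts or inputs with non-BMP characters.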
How I would fix this:
Grapheme-aware keys would give the most accurate results, as a grapheme is what we humans think of as a character. The runtime would then have to normalize the inputs (or keys) to NFKC as well for the best result. Case folding is something that is usually done before normalization when building indices, but I don't think it's required for BudouX, as I can't think of any language that has upper/lower case and no word boundaries. Related:
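The NFKC normalization step suggested above can be sketched with Python's standard library (the example strings are illustrative, not taken from any BudouX model):

```python
import unicodedata

# NFKC composes combining sequences and folds compatibility characters,
# so visually/semantically equivalent inputs map to identical keys.
decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT (two codepoints)
composed = unicodedata.normalize("NFKC", decomposed)
print(composed == "\u00e9")  # True: the single precomposed 'é'

# Compatibility folding: fullwidth letters become their ASCII equivalents.
fullwidth = "\uff22\uff55\uff44\uff4f\uff55\uff38"  # 'ＢｕｄｏｕＸ'
print(unicodedata.normalize("NFKC", fullwidth))  # 'BudouX'
```

Applying the same normalization to both the training data and the runtime input keeps the key space consistent between them.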
Hi @Cloudef, thanks for your insightful feedback! The keys are meant to be a list of codepoints, rather than grapheme clusters.

**Normalization**

It's a good point. Definitely it would be better to normalize the training data before inputting it into the training pipeline. I believe this should be fairly easy by tweaking our encode_data.py, where we extract keys from the input data. The key extraction is done in Python 3 with str indexing and concatenation, so I believe it's working as intended (i.e. codepoint-aware).

**Thai model**

I'm not an expert in Thai, but given that some diacritical marks combine with a base character, some keys may look garbled when viewed in isolation.

**UTF-16**

For the JavaScript and Java ports, I agree that we need to adopt approaches that make sure characters are treated on a codepoint basis instead of a UTF-16 basis. We should add some test cases with characters of multiple 16-bit code units.

BTW, your zig-budoux work looks awesome! Congrats on the launch :)

@allensu05 Does the ICU port also presume that the model keys are codepoint-based? Do you have any insights on this issue?
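The test cases with multi-code-unit characters suggested above come down to the mismatch between codepoint counts and UTF-16 code-unit counts. A quick sketch of the property such a test would check (the sample string is a hypothetical example, not an existing test case):

```python
# '𩸽' (U+29E3D, a Japanese kanji for a fish) lies outside the BMP,
# so it occupies one codepoint but two UTF-16 code units.
s = "\U00029e3d\u3092\u98df\u3079\u305f"  # '𩸽を食べた'

codepoints = len(s)                            # Python counts codepoints: 5
utf16_units = len(s.encode("utf-16-le")) // 2  # what Java's String.length() reports: 6
print(codepoints, utf16_units)
```

Any port whose two counts disagree on such inputs must iterate by codepoint (e.g. Java's `String.codePointAt`) rather than by code unit when building keys.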
Hello, thank you for your response. Regarding graphemes vs codepoints, as long as the intent is clear there is no problem. Normalization is definitely something that should be considered. There might be a bug in my implementation as well; I took the hexdump of the UTF-16 encoded string of the test in question, which is this:

```zig
var iter = model.iterator("วันนี้อากาศดี");
try std.testing.expectEqualSlices(u8, "วัน", iter.next().?);
try std.testing.expectEqualSlices(u8, "นี้", iter.next().?);
try std.testing.expectEqualSlices(u8, "อากาศ", iter.next().?);
try std.testing.expectEqualSlices(u8, "ดี", iter.next().?);
```

The last condition fails, even though the segmenter seems to perform its checks correctly. I ran this on both the UTF-8 and UTF-16 output and I get the same codepoints. Interestingly, however, this program's output also seems slightly broken, so maybe some care needs to be taken even with Python strings.
The Chinese and Japanese models do not use combining characters at all in their keys, AFAIK.

EDIT: I think I might have found the issue in my implementation.
Unfortunately, while I did find an issue in my implementation regarding the history bookkeeping, it was unrelated to this. Here is how the maps are accessed from n-2..
Ah, never mind. I just realized that the Python / JavaScript / Java implementations always return the last chunk regardless of the score, so that's my dumb mistake 😅 I think the normalization may still be important, so I'll leave this issue open for reference; you can close it if you want. Thanks for BudouX, btw.
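The behavior noted above can be shown with a stripped-down segmentation loop. This is a hypothetical sketch with a toy scoring rule (a per-codepoint weight), not the real BudouX model format or API; the point is only that the trailing chunk is flushed unconditionally:

```python
def segment(text, weights, threshold=0):
    """Toy boundary-scoring segmenter; `weights` maps a codepoint to a score."""
    if not text:
        return []
    chunks = [text[0]]
    for i in range(1, len(text)):
        # Hypothetical scoring: the weight of the codepoint after the
        # candidate boundary decides whether a new chunk starts here.
        if weights.get(text[i], 0) > threshold:
            chunks.append(text[i])      # positive score: start a new chunk
        else:
            chunks[-1] += text[i]       # otherwise extend the current chunk
    # Whatever was accumulated last is always returned, regardless of score.
    return chunks

print(segment("abcd", {"c": 1}))  # ['ab', 'cd']
```

An iterator that only yields chunks on a positive score would silently drop the final chunk, which matches the failing last assertion in the Zig test above.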
Hello, I'm porting BudouX to Zig (and C).
I wonder what exactly the Unicode format of the keys in the model is. Are they normalized?
When constructing a key, should it be built from a range of codepoints or graphemes?
I'm asking because currently all the tests pass except for the Thai model, and this might be the issue I'm hitting.