Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix custom wordbreaker output format for sil.km.gcc #265

Merged
merged 2 commits into from
Aug 26, 2024

Conversation

jahorton
Copy link
Contributor

Fixes keymanapp/keyman#12200.

Recent changes in 18.0-alpha require an update to the custom wordbreaker I originally wrote for this model. We now seek to model whitespaces in the context to ensure we don't accidentally auto-correct away punctuation marks, but those same changes require precise adherence to the wordbreaker output format. Unfortunately, my original version of it played a bit... fast and loose, returning improper values for blank or duplicated tokens.

Comment on lines +63 to +66
let latestIndex = 0;
return tokens.map(function(token) {
const start = str.indexOf(token, latestIndex);
latestIndex = start + token.length;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The key part of the fix: we should track the furthest index reached while iterating over each token and skip past it when determining the original position of the next token to be processed. This fixes handling for any duplicate-string cases.

Comment on lines +19 to +27
if(token.length == 0) {
tokens.splice(i, 1);
i--;
continue;
} else if(token.length == 1 && whitespaceRegex.test(token)) {
tokens.splice(i, 1);
i--;
continue;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prevents empty tokens from appearing in the middle of context, especially should excessive whitespaces exist between words. Something like the quick brown fox should appear to skip all spaces between tokens at once, not in multiple steps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the custom wordbreaker for sil.km.cnd.model.ts need a similar fix?

wordBreaker: function(str: string) {
return str.split(/\s|\u200b/).map(function(token) {
return {
left: str.indexOf(token),
start: str.indexOf(token),
right: str.indexOf(token) + token.length,
end: str.indexOf(token) + token.length,
text: token
}
});
},
punctuation: {
insertAfterWord: "\u200B"

@DavidLRowe
Copy link
Collaborator

@darcywong00 Would you be able to review this PR? I don't think I'm qualified.

@jahorton
Copy link
Contributor Author

jahorton commented Aug 21, 2024

For what it's worth, I've included a copy of the new wordbreaker code in keymanapp/keyman#12229, which is then used in automated tests. See https://github.com/keymanapp/keyman/blob/ec32e21bf4bd6f7d225126e34deaec7ff728e302/common/models/templates/test/test-tokenization.js#L529-L575.

The automated test spec before it tests a mitigation for the old version and includes comments talking about the issue we ran into that motivated this. But yeah, there are some predictive-text internal details that might make a review a bit tricky. Just thought I'd provide the most relevant references in case whoever does review might need 'em.

@darcywong00
Copy link
Contributor

Recent changes in 18.0-alpha require an update to the custom wordbreaker I originally wrote for this model.

So should these changes get in before we rebuild all the lexical-models in #246?
e.g. Should we bump the gcc-custom-breaker version (in the .kps) here or #246?

@jahorton
Copy link
Contributor Author

Recent changes in 18.0-alpha require an update to the custom wordbreaker I originally wrote for this model.

So should these changes get in before we rebuild all the lexical-models in #246? e.g. Should we bump the gcc-custom-breaker version (in the .kps) here or #246?

Either order would likely be fine, but yeah, we'd probably need to bump the version again if done after #246, assuming it includes a version bump. So... it wouldn't hurt to push this in before the "rebuild all".

darcywong00 added a commit to darcywong00/lexical-models that referenced this pull request Aug 26, 2024
@DavidLRowe DavidLRowe merged commit 9fac85c into master Aug 26, 2024
2 checks passed
@jahorton jahorton deleted the fix/sil.km.ggc-custom-breaker branch August 27, 2024 01:21
darcywong00 added a commit to darcywong00/lexical-models that referenced this pull request Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(web): suggestions not appearing for new words in Khmer until two or three letters in
4 participants