fix: fix custom wordbreaker output format for sil.km.gcc #265

jahorton · 2024-08-19T08:36:44Z

Recent changes in 18.0-alpha require an update to the custom wordbreaker I originally wrote for this model. We now model whitespace in the context to ensure we don't accidentally auto-correct away punctuation marks, but those same changes require precise adherence to the wordbreaker output format. Unfortunately, my original version played a bit... fast and loose, returning improper values for blank or duplicated tokens.

jahorton · 2024-08-19T08:41:12Z

release/sil/sil.km.ggc/source/sil.km.ggc.model.ts

+    let latestIndex = 0;
+    return tokens.map(function(token) {
+      const start = str.indexOf(token, latestIndex);
+      latestIndex = start + token.length;


The key part of the fix: we track the furthest index reached while iterating over the tokens and start the search for each subsequent token's original position from there. This fixes handling for any duplicate-string cases, where a plain indexOf would always report the first occurrence.
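To illustrate the duplicate-string case, here is a minimal, self-contained sketch of the same index-tracking idea. The `Span` interface and the `locateTokens` helper are hypothetical names introduced for this example; the span fields simply mirror those used by the wordbreaker code quoted elsewhere in this thread, and the sample sentence is made up.

```typescript
// Hypothetical span shape; field names mirror the wordbreaker output quoted in this PR.
interface Span {
  left: number;
  start: number;
  right: number;
  end: number;
  text: string;
}

// Hypothetical helper: locate each token in `str`, searching only past the
// end of the previously located token so repeated tokens get distinct spans.
function locateTokens(str: string, tokens: string[]): Span[] {
  let latestIndex = 0;
  return tokens.map(function(token) {
    const start = str.indexOf(token, latestIndex);
    latestIndex = start + token.length;
    return { left: start, start: start, right: latestIndex, end: latestIndex, text: token };
  });
}

// "to" and "be" each occur twice; the second occurrences must not map back to index 0 and 3.
const sampleText = "to be or not to be";
const sampleSpans = locateTokens(sampleText, ["to", "be", "or", "not", "to", "be"]);
console.log(sampleSpans.map(s => `${s.text}@${s.start}`).join(" "));
```

With an un-offset `str.indexOf(token)`, both "to" tokens would report the same start position, which is the kind of improper value for duplicated tokens that this change is meant to avoid.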

jahorton · 2024-08-19T08:41:42Z

release/sil/sil.km.ggc/source/sil.km.ggc.model.ts

+      if(token.length == 0) {
+        tokens.splice(i, 1);
+        i--;
+        continue;
+      } else if(token.length == 1 && whitespaceRegex.test(token)) {
+        tokens.splice(i, 1);
+        i--;
+        continue;
+      }


Prevents empty tokens from appearing in the middle of context, especially when there are runs of whitespace between words. Something like "the   quick   brown   fox" (with several spaces between each word) should appear to skip all spaces between tokens at once, not in multiple steps.
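As a rough sketch of the effect, the snippet below splits a string with repeated spaces and then drops empty and single-whitespace tokens. It uses `filter` rather than the in-place `splice` loop from the diff, and the definition of `whitespaceRegex` here is an assumption (the real definition is not shown in this excerpt); `dropBlankTokens` is a hypothetical helper name.

```typescript
// Assumed definition: matches one whitespace character or a zero-width space.
// Deliberately not global, so repeated .test() calls carry no lastIndex state.
const whitespaceRegex = /\s|\u200b/;

// Hypothetical helper: remove tokens that are empty or consist of a single
// whitespace character, so runs of separators collapse in one step.
function dropBlankTokens(tokens: string[]): string[] {
  return tokens.filter(function(token) {
    if (token.length == 0) {
      return false;
    }
    if (token.length == 1 && whitespaceRegex.test(token)) {
      return false;
    }
    return true;
  });
}

// Splitting on single separators yields an empty token for every extra space.
const rawTokens = "the   quick  brown fox".split(whitespaceRegex);
const cleanedTokens = dropBlankTokens(rawTokens);
console.log(rawTokens.length, cleanedTokens);
```

Here the raw split produces empty strings between the words, and only the four real words remain after filtering.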

darcywong00 · 2024-08-26

Does the custom wordbreaker for sil.km.cnd.model.ts need a similar fix?

lexical-models/release/sil/sil.km.cnd/source/sil.km.cnd.model.ts

Lines 13 to 25 in a90fec5

  wordBreaker: function(str: string) {
    return str.split(/\s|\u200b/).map(function(token) {
      return {
        left: str.indexOf(token),
        start: str.indexOf(token),
        right: str.indexOf(token) + token.length,
        end: str.indexOf(token) + token.length,
        text: token
      }
    });
  },
  punctuation: {
    insertAfterWord: "\u200B"

DavidLRowe · 2024-08-20T20:09:08Z

@darcywong00 Would you be able to review this PR? I don't think I'm qualified.

jahorton · 2024-08-21T01:29:23Z

For what it's worth, I've included a copy of the new wordbreaker code in keymanapp/keyman#12229, which is then used in automated tests. See https://github.com/keymanapp/keyman/blob/ec32e21bf4bd6f7d225126e34deaec7ff728e302/common/models/templates/test/test-tokenization.js#L529-L575.

The automated test spec before it tests a mitigation for the old version and includes comments talking about the issue we ran into that motivated this. But yeah, there are some predictive-text internal details that might make a review a bit tricky. Just thought I'd provide the most relevant references in case whoever does review might need 'em.

darcywong00 · 2024-08-26T01:46:06Z

> Recent changes in 18.0-alpha require an update to the custom wordbreaker I originally wrote for this model.

So should these changes get in before we rebuild all the lexical-models in #246?
e.g. Should we bump the gcc-custom-breaker version (in the .kps) here or #246?

jahorton · 2024-08-26T02:32:39Z

> > Recent changes in 18.0-alpha require an update to the custom wordbreaker I originally wrote for this model.
>
> So should these changes get in before we rebuild all the lexical-models in #246? e.g. Should we bump the gcc-custom-breaker version (in the .kps) here or #246?

Either order would likely be fine, but yeah, we'd probably need to bump the version again if done after #246, assuming it includes a version bump. So... it wouldn't hurt to push this in before the "rebuild all".

jahorton added 2 commits August 19, 2024 15:31

fix: fix custom wordbreaker output format

537f664

change: revert variable rename

6256f59

jahorton commented Aug 19, 2024

jahorton marked this pull request as ready for review August 20, 2024 03:45

jahorton mentioned this pull request Aug 20, 2024

fix(web): improve tokenization output when wordbreaker breaks spec for span properties in output keymanapp/keyman#12229

Merged

darcywong00 added a commit to darcywong00/lexical-models that referenced this pull request Aug 26, 2024

[sil.km.ggc] Update HISTORY corresponding to keymanapp#265

6f33d4b

mcdurdin approved these changes Aug 26, 2024

DavidLRowe merged commit 9fac85c into master Aug 26, 2024
2 checks passed

darcywong00 added a commit that referenced this pull request Aug 26, 2024

[sil.km.ggc] Update HISTORY corresponding to #265

38e0305

jahorton deleted the fix/sil.km.ggc-custom-breaker branch August 27, 2024 01:21

darcywong00 added a commit to darcywong00/lexical-models that referenced this pull request Sep 9, 2024

[sil.km.ggc] Update HISTORY corresponding to keymanapp#265

c02f415

This was referenced Sep 9, 2024

chore: Rebuild all models with kmc-17.0.329 to address missing low-frequency words #274

Merged

[sil.km.gcc] Fix custom wordbreaker output (version bump to build) #276

Closed
