-
-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(common/models): fixes quote-adjacent pred-text suggestions #7205
Conversation
User Test ResultsTest specification and instructions
Results Template
|
... fully posted it, then realized part of the first drafted unit test still fails. Still a lot better than before; just the one last nit to get, I guess. |
I notice both web and lm test builds failed. Not sure if that is new to this PR or longstanding instability. We need to get on top of these failures (even if it means disabling tests because we can't get them stable). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am not entirely clear on the rationale behind some of the algorithm changes. It seems to make sense but I don't have a deep enough understanding of the whole module to really be 100% on it. Code itself looks okay, apart from the isWhitespace
function which seems weird. Appreciate the refactoring of the three related functions into new TransformUtils class.
static isWhitespace(transform: Transform): boolean { | ||
// Matches prefixed text + any instance of a character with Unicode general property Z* or the following: CR, LF, and Tab. | ||
let whitespaceRemover = /.*[\u0009\u000A\u000D\u0020\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000]/i; | ||
|
||
// Filter out null-inserts; their high probability can cause issues. | ||
if(transform.insert == '') { // Can actually register as 'whitespace'. | ||
return false; | ||
} | ||
|
||
let insert = transform.insert; | ||
|
||
insert = insert.replace(whitespaceRemover, ''); | ||
|
||
return insert == ''; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a little convoluted.
-
I'm really not sure exactly what we are testing here. It seems like you are looking for
insert.match(/[\u0009...]$/)
? Is it intended to match having a single whitespace character on the end of the text? If so, then the function name is not very clear. If not, then AFAICT the function is not correct. -
i
modifier should not be necessary on the regex -- there's no case to be insensitive about 🤣. -
Can we use a character class for the regex match rather than a set of characters? (e.g.
insert.match(/[\p{Z}\r\n]$/u)
. Note Chrome 50+ though (so we could have the shorter regex as a comment if we can't use it for back-compat reasons). -
Finally, can we have a javadoc comment rather than the in-function comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind that I did not edit this function at all. I simply extracted it from its prior home. From master
:
keyman/common/web/lm-worker/src/model-compositor.ts
Lines 19 to 33 in 80d7ce7
protected isWhitespace(transform: Transform): boolean { | |
// Matches prefixed text + any instance of a character with Unicode general property Z* or the following: CR, LF, and Tab. | |
let whitespaceRemover = /.*[\u0009\u000A\u000D\u0020\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000]/i; | |
// Filter out null-inserts; their high probability can cause issues. | |
if(transform.insert == '') { // Can actually register as 'whitespace'. | |
return false; | |
} | |
let insert = transform.insert; | |
insert = insert.replace(whitespaceRemover, ''); | |
return insert == ''; | |
} |
This all dates back to #1851, and @eddieantonio was the one responsible for that regex: #1851 (comment)
Can't speak to why he didn't use a character class here, but I figure that there was some reason or other. For what it's worth, following that character class link and drilling down to the actual classes: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes
I don't see a compatibility chart there. Now, there are other ways to search for this...
https://caniuse.com/mdn-javascript_builtins_regexp_property_escapes
Note the minimum Chrome version listed there - if desktop Chrome didn't support these property escapes before version 64, I doubt that mobile had it implemented at that time. This, to me, implies a strong chance that using them would break predictive text outright on non-updated Android 5.0 devices, at the least.
Where do you see a $
in the regex string here? Is it implied with the flags on the pattern? It's certainly not present in plain-text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apart from the isWhitespace function which seems weird
So... feel free to make an issue about the function's implementation seeming weird to you, but since there are no edits to this function within this PR, any potential changes should be considered out of scope here. It's been in place since 12.0, I think.
(Using git blame
to double-check...)
(I'm pretty sure that the more-recent 'change' on the empty line was lingering whitespace removal.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, happy to postpone fixing it. Wonder if this function may be causing issues though because it seems weird...
Where do you see a
$
in the regex string here? Is it implied with the flags on the pattern? It's certainly not present in plain-text.
It's implied by return insert == '';
-- so this will only be true if the regex deletes everything in the string, which means it must be the last char in the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep in mind that we're dealing with incoming transforms - in essence, this is used on incoming keystrokes, not existing context. (Hence the classname of TransformUtils
. (Emphasis on Transform)
isWhitespace({
insert: 'a ',
deleteLeft: 0
)
will yield false
- after all, that 'a' will remain.
Meanwhile, this one:
isWhitespace({
insert: '\u{0020}\u{0020}\u{0020}\u{0020}', // four spaces in a row; doesn't format well
deleteLeft: 0 // on web without the \u stuff
)
will yield true
. The design is meant to check if any prediction-triggering keystroke is fully whitespace or not.
Its uses do imply a few assumptions, though - we're not doing any hasWhitespace
checks, which might be more valid for niche, likely-unintuitive scenarios.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? I just ran this (note: stripped down to just pass a string in -- after all, that's the only property of Transform
that is used there):
let foo = [
'a ',
' ',
' ',
' x x ',
'xxx ',
'xxx'
];
for(let f of foo) {
console.log(`"${f}": `, isWhitespace(f));
}
and the result was:
$ node foo.js
"a ": true
" ": true
" ": true
" x x ": true
"xxx ": true
"xxx": false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is supposed to check if a given transform is 100% whitespace and not empty, then:
let whitespace = /^[\u0009\u000A\u000D\u0020\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000]+$/;
return transform.insert.match(whitespace) !== null;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right... regex matching by default gets the portion, rather than checking against the whole. I tend to forget that sometimes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry on the delay. Just got back to this.
Though, I did go ahead and add unit testing for the function while I was at it.
For the LMLayer, it's the test failure addressed by #7037 again. (It happens on lmlayer worker + hosted-in-browser integration.) |
OK. Think we need to get #7037 merged. I'm not excited about it because it's hard to 100% verify, being timing-based, but the noise is causing trouble in dev processes. And I think we need another round on fixing the web failing tests because again, it's really hard to tell the relationship in failed tests. |
Just swapped the approach for something clearer (and less fiddly, to boot!). The change is quite significant, so... @keymanapp-test-bot retest all |
if(transform.insert == '') { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be needed as the regex match will exclude it.
Co-authored-by: Marc Durdin <[email protected]>
Changes in this pull request will be available for download in Keyman version 16.0.64-alpha |
Fixes #6900. The actual internal issue was technically a bit more general, theoretically capable of happening in more than just "quote-adjacent" scenarios, but the behavior has been most evident there.
The predictive-text engine previously assumed that wordbreaking should only occur after whitespace characters (like a space or a newline). However, here's one common case where this assumption is wrong when using the default Unicode wordbreaker: the text
"Hello"
gets word-broken into['"', 'Hello', '"']
. (That is, quotes on the side of a word like that, rather than inside of it, are treated as a different "word" - without whitespace.)To distill what caused the problem: internally... the predictive-text engine was expecting the leading
"
character to be whitespace whenever this occurred, and so assumed it to be whitespace, ignoring reality, causing a one-character context-desync with the main Web engine, and leading to the reported erroneous behavior. (With a couple of other nits here and there on the side, to boot.)User Testing
GROUP_ANDROID: Test using any Android device (or simulated device) supporting Keyman.
GROUP_WEB_HOST: Test using the "Predictive Text: robust testing" Web test page
TEST_QUOTE_O: Using the default English keyboard for your test environment, type
'
followed immediately byo
.'
mark is replaced, *FAIL_ this test.TEST_QUOTE_OP: Using the default English keyboard for your test environment, type
'
followed immediately byo
, thenp
.'
mark is replaced, FAIL this test.TEST_OP: Using the default English keyboard for your test environment, type
you
, add a space, then typeo
, thenp
.op
, FAIL this test.open
toyou op
should result inyou open
for the full context.