Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(common/models): fixes quote-adjacent pred-text suggestions #7205

Merged
merged 20 commits into from
Sep 15, 2022

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented Sep 5, 2022

Fixes #6900. The actual internal issue was technically a bit more general, theoretically capable of happening in more than just "quote-adjacent" scenarios, but the behavior has been most evident there.

The predictive-text engine previously assumed that wordbreaking should only occur after whitespace characters (like a space or a newline). However, here's one common case where this assumption is wrong when using the default Unicode wordbreaker: the text "Hello" gets word-broken into ['"', 'Hello', '"']. (That is, quotes on the side of a word like that, rather than inside of it, are treated as a different "word" - without whitespace.)

To distill what caused the problem: internally... the predictive-text engine was expecting the leading " character to be whitespace whenever this occurred, and so assumed it to be whitespace, ignoring reality, causing a one-character context-desync with the main Web engine, and leading to the reported erroneous behavior. (With a couple of other nits here and there on the side, to boot.)

User Testing

GROUP_ANDROID: Test using any Android device (or simulated device) supporting Keyman.
GROUP_WEB_HOST: Test using the "Predictive Text: robust testing" Web test page

TEST_QUOTE_O: Using the default English keyboard for your test environment, type ' followed immediately by o.

  • If most of the displayed suggestions do not begin with an 'o', FAIL this test.
    • Do not fail if "I" is one of them - it's a super-common word, and 'o' does neighbor 'i'.
  • Apply any of the displayed suggestions. If the ' mark is replaced, *FAIL_ this test.

TEST_QUOTE_OP: Using the default English keyboard for your test environment, type ' followed immediately by o, then p.

  • If most of the displayed suggestions do not begin with an 'op', FAIL this test.
    • Do not fail if "OP" is suggested, despite its capitalization.
  • Apply any of the displayed suggestions. If the ' mark is replaced, FAIL this test.

TEST_OP: Using the default English keyboard for your test environment, type you, add a space, then type o, then p.

  • If most of the displayed suggestions do not begin with an op, FAIL this test.
    • Do not fail if "OP" is suggested, despite its capitalization.
  • Apply any of the displayed suggestions. If unwanted effects occur, FAIL this test.
    • No prior whitespace should be deleted and the suggestion should be cleanly applied.
    • Example: applying open to you op should result in you open for the full context.

@jahorton jahorton added this to the A16S10 milestone Sep 5, 2022
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Sep 5, 2022
@keymanapp-test-bot
Copy link

keymanapp-test-bot bot commented Sep 5, 2022

User Test Results

Test specification and instructions

  • ⬜ GROUP_ANDROID: Test using any Android device (or simulated device) supporting Keyman.

    • TEST_QUOTE_O (OPEN): retest
    • TEST_QUOTE_OP (OPEN): retest
    • TEST_OP (OPEN): retest
  • ⬜ GROUP_WEB_HOST: Test using the "Predictive Text: robust testing" Web test page

    • TEST_QUOTE_O (OPEN): retest
    • TEST_QUOTE_OP (OPEN): retest
    • TEST_OP (OPEN): retest
Results Template
# Test Results

### GROUP_ANDROID: Test using any Android device (or simulated device) supporting Keyman.

* **TEST_QUOTE_O (OPEN):** notes
* **TEST_QUOTE_OP (OPEN):** notes
* **TEST_OP (OPEN):** notes

### GROUP_WEB_HOST: Test using the "Predictive Text: robust testing" Web test page

* **TEST_QUOTE_O (OPEN):** notes
* **TEST_QUOTE_OP (OPEN):** notes
* **TEST_OP (OPEN):** notes

@keymanapp-test-bot keymanapp-test-bot bot added has-user-test user-test-required User tests have not been completed and removed user-test-missing User tests have not yet been defined for the PR labels Sep 5, 2022
@jahorton
Copy link
Contributor Author

jahorton commented Sep 5, 2022

... fully posted it, then realized part of the first drafted unit test still fails. Still a lot better than before; just the one last nit to get, I guess.

@jahorton jahorton marked this pull request as draft September 5, 2022 04:55
@jahorton jahorton marked this pull request as ready for review September 5, 2022 05:35
@bharanidharanj
Copy link

GROUP_ANDROID: Test using any Android device (or simulated device) supporting Keyman.

  • TEST_QUOTE_O (PASSED): Tested this with the attached PR build (keyman 16.0.56-alpha-test-7205)in the Android Mobile device Ver 11.0 and I noticed most of the displayed suggestions begin with an 'o'. Applied suggestions did not replaced by the ' mark.

  • TEST_QUOTE_OP (PASSED): Tested this with the attached PR build in the Android Mobile device Ver 11.0 and I noticed that most of the displayed suggestions begin with an 'op'. Applied suggestions did not replaced by the ' mark.

  • TEST_OP (PASSED): Tested this as per the instructions and most the of the displayed suggestions begin with an op. Applying 'open' to 'you op' results 'you open' for the full context.

@bharanidharanj
Copy link

GROUP_WEB_HOST: Test using the "Predictive Text: robust testing" Web test page

  • TEST_QUOTE_O (PASSED): Tested as per the instructions using the "Predictive Text: robust testing" Web test page and it is working as expected.

  • TEST_QUOTE_OP (PASSED): Tested as per the instructions using the "Predictive Text: robust testing" Web test page and it is working as expected.

  • TEST_OP (PASSED): Tested as per the instructions using the "Predictive Text: robust testing" Web test page and it is working as expected.

@keymanapp-test-bot keymanapp-test-bot bot removed the user-test-required User tests have not been completed label Sep 5, 2022
@mcdurdin
Copy link
Member

mcdurdin commented Sep 7, 2022

I notice both web and lm test builds failed. Not sure if that is new to this PR or longstanding instability. We need to get on top of these failures (even if it means disabling tests because we can't get them stable).

Copy link
Member

@mcdurdin mcdurdin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I am not entirely clear on the rationale behind some of the algorithm changes. It seems to make sense but I don't have a deep enough understanding of the whole module to really be 100% on it. Code itself looks okay, apart from the isWhitespace function which seems weird. Appreciate the refactoring of the three related functions into new TransformUtils class.

common/web/lm-worker/src/correction/context-tracker.ts Outdated Show resolved Hide resolved
common/web/lm-worker/src/correction/context-tracker.ts Outdated Show resolved Hide resolved
common/web/lm-worker/src/model-compositor.ts Outdated Show resolved Hide resolved
Comment on lines 2 to 16
static isWhitespace(transform: Transform): boolean {
// Matches prefixed text + any instance of a character with Unicode general property Z* or the following: CR, LF, and Tab.
let whitespaceRemover = /.*[\u0009\u000A\u000D\u0020\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000]/i;

// Filter out null-inserts; their high probability can cause issues.
if(transform.insert == '') { // Can actually register as 'whitespace'.
return false;
}

let insert = transform.insert;

insert = insert.replace(whitespaceRemover, '');

return insert == '';
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little convoluted.

  • I'm really not sure exactly what we are testing here. It seems like you are looking for insert.match(/[\u0009...]$/)? Is it intended to match having a single whitespace character on the end of the text? If so, then the function name is not very clear. If not, then AFAICT the function is not correct.

  • i modifier should not be necessary on the regex -- there's no case to be insensitive about 🤣.

  • Can we use a character class for the regex match rather than a set of characters? (e.g. insert.match(/[\p{Z}\r\n]$/u). Note Chrome 50+ though (so we could have the shorter regex as a comment if we can't use it for back-compat reasons).

  • Finally, can we have a javadoc comment rather than the in-function comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that I did not edit this function at all. I simply extracted it from its prior home. From master:

protected isWhitespace(transform: Transform): boolean {
// Matches prefixed text + any instance of a character with Unicode general property Z* or the following: CR, LF, and Tab.
let whitespaceRemover = /.*[\u0009\u000A\u000D\u0020\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000]/i;
// Filter out null-inserts; their high probability can cause issues.
if(transform.insert == '') { // Can actually register as 'whitespace'.
return false;
}
let insert = transform.insert;
insert = insert.replace(whitespaceRemover, '');
return insert == '';
}

This all dates back to #1851, and @eddieantonio was the one responsible for that regex: #1851 (comment)

Can't speak to why he didn't use a character class here, but I figure that there was some reason or other. For what it's worth, following that character class link and drilling down to the actual classes: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Character_Classes

I don't see a compatibility chart there. Now, there are other ways to search for this...

https://caniuse.com/mdn-javascript_builtins_regexp_property_escapes

Note the minimum Chrome version listed there - if desktop Chrome didn't support these property escapes before version 64, I doubt that mobile had it implemented at that time. This, to me, implies a strong chance that using them would break predictive text outright on non-updated Android 5.0 devices, at the least.


Where do you see a $ in the regex string here? Is it implied with the flags on the pattern? It's certainly not present in plain-text.

Copy link
Contributor Author

@jahorton jahorton Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apart from the isWhitespace function which seems weird

So... feel free to make an issue about the function's implementation seeming weird to you, but since there are no edits to this function within this PR, any potential changes should be considered out of scope here. It's been in place since 12.0, I think.

(Using git blame to double-check...)

image

(I'm pretty sure that the more-recent 'change' on the empty line was lingering whitespace removal.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, happy to postpone fixing it. Wonder if this function may be causing issues though because it seems weird...

Where do you see a $ in the regex string here? Is it implied with the flags on the pattern? It's certainly not present in plain-text.

It's implied by return insert == ''; -- so this will only be true if the regex deletes everything in the string, which means it must be the last char in the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep in mind that we're dealing with incoming transforms - in essence, this is used on incoming keystrokes, not existing context. (Hence the classname of TransformUtils. (Emphasis on Transform)

isWhitespace({
  insert: 'a ',
  deleteLeft: 0
)

will yield false - after all, that 'a' will remain.

Meanwhile, this one:

isWhitespace({
  insert: '\u{0020}\u{0020}\u{0020}\u{0020}', // four spaces in a row; doesn't format well
  deleteLeft: 0                               // on web without the \u stuff
)

will yield true. The design is meant to check if any prediction-triggering keystroke is fully whitespace or not.

Its uses do imply a few assumptions, though - we're not doing any hasWhitespace checks, which might be more valid for niche, likely-unintuitive scenarios.

Copy link
Member

@mcdurdin mcdurdin Sep 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I just ran this (note: stripped down to just pass a string in -- after all, that's the only property of Transform that is used there):

let foo = [
  'a ',
  ' ',
  '   ',
  ' x x ',
  'xxx   ',
  'xxx'
];

for(let f of foo) {
  console.log(`"${f}": `, isWhitespace(f));
}

and the result was:

$ node foo.js
"a ":  true
" ":  true
"   ":  true
" x x ":  true
"xxx   ":  true
"xxx":  false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is supposed to check if a given transform is 100% whitespace and not empty, then:

let whitespace = /^[\u0009\u000A\u000D\u0020\u00a0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b\u2028\u2029\u202f\u205f\u3000]+$/; 

return transform.insert.match(whitespace) !== null;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right... regex matching by default gets the portion, rather than checking against the whole. I tend to forget that sometimes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry on the delay. Just got back to this.

Though, I did go ahead and add unit testing for the function while I was at it.

common/web/lm-worker/src/transformUtils.ts Show resolved Hide resolved
common/web/lm-worker/src/transformUtils.ts Show resolved Hide resolved
@jahorton
Copy link
Contributor Author

jahorton commented Sep 8, 2022

I notice both web and lm test builds failed. Not sure if that is new to this PR or longstanding instability. We need to get on top of these failures (even if it means disabling tests because we can't get them stable).

For the LMLayer, it's the test failure addressed by #7037 again. (It happens on lmlayer worker + hosted-in-browser integration.)
Web's is similarly unrelated.

@mcdurdin
Copy link
Member

mcdurdin commented Sep 8, 2022

For the LMLayer, it's the test failure addressed by #7037 again. (It happens on lmlayer worker + hosted-in-browser integration.)
Web's is similarly unrelated.

OK. Think we need to get #7037 merged. I'm not excited about it because it's hard to 100% verify, being timing-based, but the noise is causing trouble in dev processes. And I think we need another round on fixing the web failing tests because again, it's really hard to tell the relationship in failed tests.

@jahorton
Copy link
Contributor Author

jahorton commented Sep 8, 2022

Just swapped the approach for something clearer (and less fiddly, to boot!). The change is quite significant, so...

@keymanapp-test-bot retest all

@keymanapp-test-bot keymanapp-test-bot bot added the user-test-required User tests have not been completed label Sep 8, 2022
Comment on lines 7 to 9
if(transform.insert == '') {
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be needed as the regex match will exclude it.

common/web/lm-worker/src/transformUtils.ts Outdated Show resolved Hide resolved
@jahorton jahorton merged commit ec53662 into master Sep 15, 2022
@jahorton jahorton deleted the fix/web/non-whitespace-wordbreaks branch September 15, 2022 01:23
@keyman-server
Copy link
Collaborator

Changes in this pull request will be available for download in Keyman version 16.0.64-alpha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bug(common/models): LM replaces left quote and character(s) with the selected option
5 participants