Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong unicode handling? #22

Closed
emilyyyylime opened this issue Dec 8, 2023 · 6 comments · Fixed by #24 or #26
Closed

Wrong unicode handling? #22

emilyyyylime opened this issue Dec 8, 2023 · 6 comments · Fixed by #24 or #26

Comments

@emilyyyylime
Copy link

I'm using Helix, but nothing like this appears to be happening with other LSPs, so I imagine the issue is with Typos.

When a line has multi-byte codepoints, the highlighting appears in the wrong place; seemingly reporting byte offset when a character offset is expected:
Image showing the word Spanish word 'hace' being recognised as a misspelling of 'have', and highlighting being two characters too far.

@tekumara
Copy link
Owner

Thanks for raising this! Could you share the text from the screenshot, and I'll see if I can reproduce this.

@emilyyyylime
Copy link
Author

uh ¿Qué hace él? I guess

tekumara pushed a commit that referenced this issue Dec 10, 2023
🤖 I have created a release *beep* *boop*
---


##
[0.1.9](v0.1.8...v0.1.9)
(2023-12-10)


### Bug Fixes

* typo start position corrected for multiple code point unicode
([e3d2752](e3d2752)),
closes [#22](#22)

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: potatobot-prime[bot] <132267321+potatobot-prime[bot]@users.noreply.github.com>
@emilyyyylime
Copy link
Author

I believe you may have overcompensated with your fix, using combining characters such as U+0303 Combining Tilde (with 'e' displays as ) — the highlight appears too early.
It seems to me, skimming over the LSP specification1, that the offset is not measured in grapheme clusters as implemented in your fix, but rather by code unit offsets, with the client able to negotiate the representation (UTF-8 / UTF-16 / UTF-32) with the server.

I don't have permission to reopen this issue; I'd appreciate if you could.

Footnotes

  1. https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#position

@tekumara
Copy link
Owner

Ah good catch .. I can see is rendered with a width of two characters in vscode but typos-vscode is counting it as one (i think).

@tekumara tekumara reopened this Dec 17, 2023
@emilyyyylime
Copy link
Author

@tekumara it should render as a single grapheme. However, what's happening here as I understand is that you used unicode-segmentation in your fix to count grapheme clusters, however the LSP protocol specifies that character offsets should be negotiated between the server and the client and reported using the negotiated representation thereafter. So if UTF-16 were negotiated (which is required to be supported by all servers and would thus be fully compatible on its own), you would return the number of UTF-16 units the characters encode into (1 for each character below U+D800 and 2 for characters after that).1 If UTF-8 were negotiated you would return the number of bytes that would be needed to encode the text as UTF-8 (so 1 for the first 128 codepoints, 2 for the next 1920, and so on).2 For UTF-32, you would simply use the number of code points.3

Footnotes

  1. Rust: str.chars().map(char::len_utf16).sum()

  2. Rust: str.len()

  3. Rust: str.chars().count()

@tekumara
Copy link
Owner

Thanks @emilyyyylime for the really helpful explanation. I've pushed a fix which counts utf-16 code units, since that is the default position encoding. The other position encodings aren't supported yet.

tekumara pushed a commit that referenced this issue Dec 26, 2023
🤖 I have created a release *beep* *boop*
---


##
[0.1.10](v0.1.9...v0.1.10)
(2023-12-26)


### Bug Fixes

* count positions as utf-16 code units
([de52345](de52345)),
closes [#22](#22)

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: potatobot-prime[bot] <132267321+potatobot-prime[bot]@users.noreply.github.com>
This was referenced Mar 26, 2024
This was referenced Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment