
Rework TextEdit arrow navigation to handle Unicode graphemes #5812

Merged
lucasmerlin merged 8 commits into emilk:master from
MStarha:unicode-grapheme-navigation
Apr 22, 2025

@MStarha
Contributor

@MStarha MStarha commented Mar 16, 2025

  • I have followed the instructions in the PR template

Previously, navigating text in TextEdit with Ctrl + left/right arrow would jump inside words that contained combining characters (e.g. diacritics). This PR introduces a new dependency on unicode-segmentation to handle grapheme clusters. The new implementation ignores whitespace and other separators such as - (dash) between words, but respects _ (underscore).
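
The failure mode described above can be illustrated with a minimal, std-only sketch. This is not egui's code: `ascii_word_prefix_len` is a hypothetical helper, and both the ASCII-only check and the underscore rule are assumptions about what a naive word scan might look like. It shows that a decomposed accented letter occupies two `char`s, so a scan that only accepts ASCII letters stops between the base letter and its accent.

```rust
/// Hypothetical helper (not part of egui): counts how many leading `char`s
/// a naive, ASCII-only word scan would accept before stopping.
fn ascii_word_prefix_len(text: &str) -> usize {
    text.chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '_')
        .count()
}

fn main() {
    // "résumé" with each 'é' stored as 'e' + U+0301 COMBINING ACUTE ACCENT.
    let decomposed = "re\u{301}sume\u{301}";

    // The user sees 6 characters, but the string holds 8 `char`s.
    println!("chars: {}", decomposed.chars().count());

    // The scan accepts only "re", so a cursor placed there would sit between
    // the base letter and its combining accent.
    println!(
        "naive scan stops after {} chars",
        ascii_word_prefix_len(decomposed)
    );
}
```

The `unicode-segmentation` crate exposes grapheme-cluster and word-boundary iterators that keep such sequences together, which is presumably why it was chosen here.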

@github-actions

Preview available at https://egui-pr-preview.github.io/pr/5812-unicode-grapheme-navigation
Note that it might take a couple seconds for the update to show up after the preview_build workflow has completed.

@emilk
Owner

emilk commented Mar 20, 2025

I did a quick check, and this increases the .wasm size by ~50 kB, which I think is acceptable (it's because of the tables here: https://github.com/unicode-rs/unicode-segmentation/blob/master/src/tables.rs)

Owner

@emilk emilk left a comment

Thank you for working on this!

Does this fully close #62?

Please add some unit tests for this feature so that we know it works, and that it won't break again 🙏

@MStarha
Contributor Author

MStarha commented Mar 20, 2025

I think it does indeed solve #62, I just did not find it (I searched for 'unicode' or 'utf'; the term 'grapheme' did not occur to me). It certainly does nothing for #2432, I doubt it has much effect on #246, and it is only a part of #56.

@MStarha
Contributor Author

MStarha commented Mar 20, 2025

I just reworked the word splitting because I found out it completely fell apart around emojis.

Then I saw the is_word_char() function and got an idea: use the previous implementation, but instead of char::is_ascii_alphanumeric() use char::is_alphanumeric(). This behaves the same way as the new implementation around 'normal words', but slightly differently around emojis. Emojis are a completely different category, whose handling is not thoroughly consistent across editors and browsers, so I would not stress much about them.

The new unicode implementation may be useful, if used at a larger scale in the future (not just for word splitting in text edit). But currently the local-only effect of the dependency may not be worth what it brings compared to allowing non-ASCII characters in the existing implementation.
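
The `char::is_ascii_alphanumeric()` versus `char::is_alphanumeric()` difference mentioned above can be sketched roughly as follows. This is an illustration only, not the code in egui: `next_word_boundary`, `ascii_word_char` and `unicode_word_char` are hypothetical names, positions are counted in `char`s for simplicity, and treating `_` as a word character is an assumption.

```rust
/// Hypothetical helper (not egui's implementation): returns the `char` index
/// where a Ctrl+Right jump starting at `from` would land. It first skips any
/// separators, then the following run of word characters.
fn next_word_boundary(text: &str, from: usize, is_word_char: fn(char) -> bool) -> usize {
    let rest: Vec<char> = text.chars().skip(from).collect();
    let separators = rest.iter().take_while(|&&c| !is_word_char(c)).count();
    let word = rest[separators..]
        .iter()
        .take_while(|&&c| is_word_char(c))
        .count();
    from + separators + word
}

/// ASCII-only predicate: non-ASCII letters such as 'é' end the word.
fn ascii_word_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Unicode-aware predicate: precomposed letters such as 'é' count as word chars.
fn unicode_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

fn main() {
    // 'é' here is the single precomposed code point U+00E9.
    let text = "café au lait";

    // The ASCII predicate stops inside "café", just before the 'é'.
    println!("ascii:   {}", next_word_boundary(text, 0, ascii_word_char));

    // The Unicode predicate consumes the whole word.
    println!("unicode: {}", next_word_boundary(text, 0, unicode_word_char));

    // Emojis are not alphanumeric, so both predicates treat them as separators.
    println!("emoji is word char: {}", unicode_word_char('😀'));
}
```

Note that this per-`char` approach only covers precomposed letters. Sequences built from several code points are the case the grapheme-based implementation targets.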

@valadaptive
Contributor

This may end up covering the same ground as #5784.

@MStarha
Contributor Author

MStarha commented Mar 21, 2025

That's true, though I think this can be finalized and merged and later replaced by #5784.

@emilk
Owner

emilk commented Apr 1, 2025

@valadaptive do you think merging this PR will help or hinder your parley work?

@valadaptive
Contributor

I think I'm going to need to redo it from scratch anyway, so go ahead and merge this.

@lucasmerlin lucasmerlin self-assigned this Apr 22, 2025
# Conflicts:
#	crates/egui/src/text_selection/text_cursor_state.rs
#	crates/egui/src/widgets/text_edit/text_buffer.rs
@lucasmerlin lucasmerlin added the feature and egui labels Apr 22, 2025
@lucasmerlin lucasmerlin merged commit 69b9f0e into emilk:master Apr 22, 2025
47 of 48 checks passed
@lucasmerlin lucasmerlin removed their assignment Apr 22, 2025
darkwater pushed a commit to darkwater/egui that referenced this pull request Aug 24, 2025