From b5f0679fdb269b3e653ca5c758770d60e62da694 Mon Sep 17 00:00:00 2001 From: Adam Krantz Date: Fri, 6 Mar 2026 06:06:34 -0800 Subject: [PATCH] Add Unicode grapheme handling guidance Add guidance to CONTRIBUTING.md and .claude/rules/ about correct grapheme-aware text handling: use string.GetColumns() for width measurement, iterate by grapheme with GraphemeHelper.GetGraphemes() for rendering, and use AddStr instead of AddRune for grapheme clusters. This documents the patterns that prevent the bugs fixed in #4798 and #4800 from recurring. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .claude/rules/unicode-graphemes.md | 48 ++++++++++++++++++++++++++++++ CONTRIBUTING.md | 32 ++++++++++++++++++++ 2 files changed, 80 insertions(+) create mode 100644 .claude/rules/unicode-graphemes.md diff --git a/.claude/rules/unicode-graphemes.md b/.claude/rules/unicode-graphemes.md new file mode 100644 index 0000000000..c1b7ec5b11 --- /dev/null +++ b/.claude/rules/unicode-graphemes.md @@ -0,0 +1,48 @@ +# Unicode and Grapheme Handling + +**Source:** [CONTRIBUTING.md - Unicode and Grapheme Handling](../../CONTRIBUTING.md#unicode-and-grapheme-handling) + +## Core Principle + +**Think in graphemes, not runes.** A grapheme cluster is what the user perceives as a single character, but it may consist of multiple `Rune` values (e.g., base character + combining marks, or ZWJ emoji sequences). + +## Width Measurement + +**Always use `string.GetColumns()`** to measure display width. + +```csharp +// ✅ CORRECT — grapheme-aware, handles ZWJ emoji, combining marks, CJK +int width = text.GetColumns (); + +// ❌ WRONG — sums individual rune widths, inflates multi-rune grapheme clusters +int width = text.EnumerateRunes ().Sum (r => r.GetColumns ()); + +// ❌ WRONG — counts chars, not terminal cells (wrong for CJK, emoji, surrogates) +int width = text.Length; +``` + +**Why:** A ZWJ family emoji like 👨‍👩‍👦‍👦 occupies 2 terminal cells, but `EnumerateRunes().Sum(GetColumns)` returns 8 and `string.Length` returns 11. `string.GetColumns()` correctly returns 2 by iterating grapheme clusters and clamping each to max 2 columns. + +## Text Iteration and Rendering + +**Iterate by grapheme cluster** using `GraphemeHelper.GetGraphemes()` and render with `AddStr`. + +```csharp +// ✅ CORRECT — preserves grapheme cluster integrity +foreach (string grapheme in GraphemeHelper.GetGraphemes (text)) +{ + AddStr (grapheme); +} + +// ❌ WRONG — breaks combining marks, ZWJ sequences +foreach (Rune rune in text.EnumerateRunes ()) +{ + AddRune (rune); +} +``` + +**Why:** Combining marks (e.g., `e` + U+0301 combining acute = `é`) must be sent together. `AddRune` sends them separately, preventing composition. `AddStr` with a complete grapheme string preserves the cluster. + +## When Rune Iteration Is Appropriate + +Rune-level iteration is correct when you need to inspect individual Unicode scalar values rather than render text — for example, counting zero-width runes for vertical text layout, or checking character properties. The key distinction: **rendering and measurement should use graphemes; analysis may use runes.** diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 0b1abf6c57..f7bfc635b6 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -169,6 +169,36 @@ Welcome! This guide provides everything you need to know to contribute effective **⚠️ CRITICAL - These conventions apply to ALL code - production code, test code, examples, documentation, and samples.** +### Unicode and Grapheme Handling + +**Think in graphemes, not runes.** A grapheme cluster is what the user perceives as a single character, but it may consist of multiple `Rune` values (e.g., base character + combining marks, or ZWJ emoji sequences). + +- **Always use `string.GetColumns()`** to measure display width — never `EnumerateRunes().Sum(r => r.GetColumns())` (inflates multi-rune clusters) or `string.Length` (counts chars, not terminal cells) +- **Iterate by grapheme** using `GraphemeHelper.GetGraphemes()` when rendering text — never iterate by `Rune` and call `AddRune` for each (breaks combining marks and ZWJ sequences) +- **Render with `AddStr`** passing complete grapheme strings — `AddRune` with individual runes from a cluster will not compose correctly + +```csharp +// ✅ CORRECT — grapheme-aware width measurement +int width = text.GetColumns (); + +// ❌ WRONG — inflates width for ZWJ emoji (e.g., 👨‍👩‍👦‍👦 → 8 instead of 2) +int width = text.EnumerateRunes ().Sum (r => r.GetColumns ()); + +// ✅ CORRECT — grapheme-aware rendering +foreach (string grapheme in GraphemeHelper.GetGraphemes (text)) +{ + AddStr (grapheme); +} + +// ❌ WRONG — breaks combining marks (é rendered as e + ́ separately) +foreach (Rune rune in text.EnumerateRunes ()) +{ + AddRune (rune); +} +``` + +**Exception:** Rune-level iteration is appropriate when inspecting individual Unicode scalar values (e.g., counting zero-width runes for vertical text layout), not for rendering or measurement. + ## Testing Requirements ### Code Coverage @@ -324,5 +354,7 @@ The workflow will: - ❌ **Don't use `var` for anything but built-in simple types** (use explicit types) - ❌ **Don't use redundant type names with `new`** (**ALWAYS PREFER** target-typed `new ()`) - ❌ **Don't introduce new warnings** (fix warnings in files you modify; exception: `[Obsolete]` warnings) +- ❌ **Don't use `EnumerateRunes().Sum(GetColumns)`** for display width — use `string.GetColumns()` +- ❌ **Don't use `AddRune` in a rune loop** to render text — iterate by grapheme with `GraphemeHelper.GetGraphemes()` and use `AddStr` **Thank you for contributing to Terminal.Gui!** 🎉