Fixed concurrency issue in ReorderingBuffer that corrupted Normalizer2 output#122
Merged
NightOwl888 merged 2 commits intoApr 20, 2026
Merged
Conversation
…2 output under contention ReorderingBuffer had an internal constructor that took a `ref ValueStringBuilder` and assigned it by value into the `str` field. Because ValueStringBuilder is a ref struct that rents a char[] from ArrayPool<char>.Shared (stored in _arrayToReturnToPool), the struct copy aliased the rental between the outer caller and the ReorderingBuffer. When Grow() fired on `str`, the old array was returned to the pool while the outer caller's ValueStringBuilder still held a reference to it and would return it again on Dispose(). Under concurrency two threads could rent the same array simultaneously and corrupt each other's normalization output — reproducible at >10% mismatch rates with 8 threads running NormalizeSecondAndAppend on NFKC_CF input that forces Grow() via ligature expansion. Removed the unsafe ReorderingBuffer(Normalizer2Impl, ref ValueStringBuilder, int) ctor and rewrote the three call sites to use the existing safe constructors that take an initial ReadOnlySpan<char> value plus an owned buffer: - Normalizer2Impl.Decompose(scoped ReadOnlySpan<char>, scoped ref ValueStringBuilder, int) - Norm2AllModes.NormalizeSecondAndAppend(StringBuilder, ReadOnlySpan<char>, bool) - Norm2AllModes.NormalizeSecondAndAppend(scoped ref ValueStringBuilder, ReadOnlySpan<char>, bool) - RuleBasedCollator.MakeFCD call site in the NFD identical-level-run branch Also removes a now-unsound "HACK" line in Decompose that manually transferred the length from the ReorderingBuffer's aliased ValueStringBuilder back to `dest`; the fixed version owns its own buffer and copies the result into `dest` explicitly. Added a regression test (TestNormalizeSecondAndAppendConcurrency) that runs 8 threads x 2000 iterations of NormalizeSecondAndAppend against NFKC_CF with an input that forces Grow() via FB03 "ffi" ligature expansion, and fails loudly if any thread's output diverges from the single-threaded expected result. Before this fix: ~2200/16000 mismatches. After: 0. The test embeds nfkc_cf.nrm as a resource in the test assembly so it runs without requiring the ICU4N.resources satellite assembly (which depends on tooling that isn't available on every dev machine). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 task
NightOwl888
requested changes
Apr 20, 2026
b85c7c1 to
fcefc19
Compare
ReorderingBuffer: - Rewrote <remarks> so the pool-backing mention reflects that ValueStringBuilder is only conditionally backed by an ArrayPool rental (either via the initialCapacity ctor or by outgrowing a stackalloc'd initialBuffer), not unconditionally. - Removed the stale "ICU4N TODO: Evaluate whether this approach makes sense and if not, remove" comment above the (ReadOnlySpan<char>, int) ctor — this PR's fix relies on that ctor shape, so the design question is settled. - Added internal ReorderingBuffer(Normalizer2Impl, StringBuilder?, Span<char>) and (..., int) ctors that Append the StringBuilder directly into the inner ValueStringBuilder, eliminating the outer stackalloc/pool dance that the previous fix used to materialize `first` into a span before handing it to ReorderingBuffer. Normalizer2Impl.Decompose(ReadOnlySpan<char>, StringBuilder, int): - Restored the "ICU4N TODO: Make public TryDecompose()..." reminder above the method (it previously lived above the removed (ref ValueStringBuilder, int) overload; the reviewer wants it kept as a standing note). Norm2AllModes.NormalizeSecondAndAppend(StringBuilder, ...): - Collapsed to the single-allocation form enabled by the new StringBuilder? ctors. No more intermediate char[]/stackalloc to copy `first` into before building the ReorderingBuffer. FCDUTF16CollationIterator.Normalize: - Reworked per reviewer's suggestion: hoist estimatedLength/value above the OpenStringBuilder allocation, size OpenStringBuilder with capacity up front when creating it fresh (avoids internal growths), and restore the "ICU4N: Corrected 2nd parameter" translation marker on the slice expression. Restored ICU4N translation-marker comments on call sites I rewrote: - FCDUTF16CollationIterator.cs:496 — "Corrected 2nd parameter" - RuleBasedCollator.cs:1208 — "value is sliced prior to passing to this method, nfdQCYesLimit is based on this slice" - RuleBasedCollator.cs:1226 — "Corrected 2nd parameter" - RuleBasedCollator.cs:1485 — "Corrected 2nd parameter" (was "3rd", now matches the revised slice) - RuleBasedCollator.cs:1502 — new "Corrected 2nd parameter" on the initial-value slice - RuleBasedCollator.cs:1508 — "Corrected 2nd parameter" (was "3rd") Regression test TestNormalizeSecondAndAppendConcurrency: - Dropped the nfkc_cf.nrm embedded resource from ICU4N.Tests.csproj (it was duplicating production data; will break once NightOwl888#85 removes the source file from the repo). - Switched to Normalizer2.GetInstance(null, "nfkc_cf", Compose), the same null-stream pattern every other Normalizer2 test uses. The test now inherits the same runtime dependency on the ICU4N.resources satellite as the rest of the suite. - Tried the reviewer's suggestion of basing the test on testnorm.nrm but confirmed via TDD it cannot reproduce the bug: testnorm's maximum expansion is ~17% (one case goes 6→7 UTF-16 units), well under the pool-bucket rounding (~2×) that determines whether Grow fires. Without Grow the aliasing hazard never manifests. Documented this rationale inline in the test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fcefc19 to
1c4684e
Compare
NightOwl888
approved these changes
Apr 20, 2026
ReorderingBuffer that corrupted Normalizer2 output
4 tasks
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a thread-safety bug in
ReorderingBufferthat corruptedNormalizer2output under contention.ReorderingBufferhad an internal constructor that took aref ValueStringBuilderand assigned it by value into thestrfield. BecauseValueStringBuilderis aref structthat rents achar[]fromArrayPool<char>.Shared(stored in_arrayToReturnToPool), the struct copy duplicated ownership of the rental between the outer caller'sValueStringBuilderandReorderingBuffer.str. WhenGrow()fired onstr, the old array was returned to the pool while the outer caller's copy still pointed at it and would return it again onDispose(). Under concurrency two threads could rent the same array simultaneously and corrupt each other's normalization output — reproducible at >10% mismatch rates with 8 threads runningNormalizeSecondAndAppendon NFKC_CF input that forcesGrow()via ligature expansion.Fix
Removed the unsafe
ReorderingBuffer(Normalizer2Impl, ref ValueStringBuilder, int)ctor entirely. The only way to safely avoid the copy would be areffield, but C# forbidsreffields that referenceref structtypes (CS9050) — so a ref-field approach is not viable forValueStringBuilder.ReorderingBuffernow always owns its innerValueStringBuilder. To match that invariant without double-buffering,Normalizer2Impl.Decompose(..., ref ValueStringBuilder, ...)is gone; call sites that previously decomposed through an intermediateValueStringBuildernow allocate aReorderingBufferdirectly and call the low-levelDecompose(ReadOnlySpan<char>, scoped ref ReorderingBuffer)overload:RuleBasedCollator.WriteIdenticalLevelRuleBasedCollator.MakeFCD(NFD identical-level-run branch)FCDUTF16CollationIterator.NormalizeFCDIterCollationIterator.NormalizeNorm2AllModes.NormalizeSecondAndAppend(StringBuilder, ReadOnlySpan<char>, bool)Norm2AllModes.NormalizeSecondAndAppend(scoped ref ValueStringBuilder, ReadOnlySpan<char>, bool)Normalizer2Impl.Decompose(ReadOnlySpan<char>, StringBuilder, int)The hazard rationale lives in a
<remarks>block on theReorderingBufferXML doc, documenting why the type owns its ownValueStringBuilderand never aliases a caller's.Regression test
Added
TestNormalizeSecondAndAppendConcurrencyinNormalizerRegressionTests. It runs 8 threads × 2000 iterations ofNormalizeSecondAndAppendagainst NFKC_CF with an input that forcesGrow()via FB03 "ffi" ligature expansion, and fails loudly if any thread's output diverges from the single-threaded expected result.The test embeds
nfkc_cf.nrmas a resource in the test assembly so it runs without requiring theICU4N.resourcessatellite assembly (which depends on tooling that isn't available on every dev machine — e.g. thealassembly linker on non-Windows without Mono).Test plan
TestNormalizeSecondAndAppendConcurrencypasses on net9.0.mainon net9.0 (all pre-existing failures are theICU4N.resourcessatellite missing; delta is exactly +1 passing test — the new regression).This PR was prepared with assistance from Claude Code (Opus 4.7).