
preserve UTF-16 surrogate pairs in MaskXmlInvalidCharacters - fixes #290 #291

Merged
FreeAndNil merged 1 commit into master from Feature/291-MaskXmlInvalidCharacters-surrogate-pairs
Apr 16, 2026

Conversation

@FreeAndNil
Contributor

fixes #290

The regex [^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD] operated on individual UTF-16 code units, so each half of a valid surrogate pair was matched and replaced on its own. This silently corrupted supplementary characters (U+10000–U+10FFFF), such as emoji, in XML log output.

Fix by prepending a surrogate-pair alternative to the regex so valid pairs are matched and preserved as a unit; only lone surrogates and other XML-illegal code units are replaced with the mask string.
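
A minimal sketch of the idea, not the code merged in this PR: the class name XmlMaskSketch, the method names MaskOld and MaskNew, and the named group "pair" are hypothetical and exist only for illustration. It assumes the pair alternative is kept by a MatchEvaluator that returns the match unchanged; the actual change may preserve pairs differently.

```csharp
using System.Text.RegularExpressions;

// Hypothetical illustration only; names do not come from log4net.
public static class XmlMaskSketch
{
    // Pattern quoted in the description: every match is a single UTF-16 code unit,
    // so both halves of a surrogate pair match separately.
    private static readonly Regex CodeUnitOnly = new Regex(
        @"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]");

    // Same pattern with a surrogate-pair alternative in front. Alternation is
    // tried left to right, so a high surrogate followed by a low surrogate is
    // consumed as one two-unit match before the single-unit class is tried.
    private static readonly Regex PairAware = new Regex(
        @"(?<pair>[\uD800-\uDBFF][\uDC00-\uDFFF])|[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]");

    // Old behaviour: each offending code unit is replaced with the mask.
    public static string MaskOld(string text, string mask) =>
        CodeUnitOnly.Replace(text, _ => mask);

    // Sketched new behaviour: valid pairs are returned unchanged, everything
    // else that matched (lone surrogates, illegal code units) is masked.
    public static string MaskNew(string text, string mask) =>
        PairAware.Replace(text, m => m.Groups["pair"].Success ? m.Value : mask);
}
```

With "?" as the mask, MaskOld("a\U0001F600b", "?") yields "a??b", while MaskNew returns the input unchanged. A lone surrogate such as the one in "a\uD83Db" is still masked by MaskNew, giving "a?b".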

Also optimise CountSubstrings: use a char loop for single-character substrings (all current callers) and StringComparison.Ordinal for the multi-character CDATA token path.
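
The two paths described above could look roughly like the following. This is a sketch under stated assumptions, not the merged implementation: the wrapper class CountSketch is hypothetical, and counting non-overlapping occurrences and returning 0 for empty input are assumptions about the intended behaviour.

```csharp
using System;

// Hypothetical illustration only; not the log4net source.
public static class CountSketch
{
    public static int CountSubstrings(string text, string substring)
    {
        // Assumption: null or empty inputs count as zero occurrences.
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(substring))
        {
            return 0;
        }

        // Single-character path: a plain char loop avoids string searching.
        if (substring.Length == 1)
        {
            char wanted = substring[0];
            int hits = 0;
            foreach (char c in text)
            {
                if (c == wanted)
                {
                    hits++;
                }
            }
            return hits;
        }

        // Multi-character path (for example a CDATA end token): ordinal
        // comparison skips culture-sensitive matching.
        int count = 0;
        int offset = 0;
        while ((offset = text.IndexOf(substring, offset, StringComparison.Ordinal)) >= 0)
        {
            count++;
            offset += substring.Length; // assumption: occurrences do not overlap
        }
        return count;
    }
}
```

For example, CountSketch.CountSubstrings("a]]>b]]>", "]]>") returns 2, and CountSketch.CountSubstrings("<a><b>", "<") returns 2 through the single-character path.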

Add unit tests covering surrogate pair preservation, lone surrogates, and CountSubstrings edge cases.




Development

Successfully merging this pull request may close these issues.

MaskXmlInvalidCharacters() corrupts supplementary Unicode characters in XmlLayoutSchemaLog4J
