Update CharUnicodeInfo to Unicode 12.1; update StringInfo to UAX29 #328
| As discussed on Teams, please move the testdata files to https://github.com/dotnet/runtime-assets. I renamed the repository today. cc @akoeplinger | 
Thanks @GrabYourPitchforks for getting this ready!
| Latest iteration reverts changes to the  | 
| Is the draft PR or something else confusing GitHub because the file changes view shows  I assume the change from 12:4:4 to 11:5:4 for the numeric tables gives better compression now and the cutoff isn't useful anymore because of grapheme break data? | 
| I recently implemented a bidi-reordering algorithm (mainly for Arabic support) for the Unity game engine. I had to do so because Unity UI doesn't support it. It would be very helpful if a bidi class API were public. | 
| CI failure is due to a known Helix issue, unrelated to this PR. Marking ready for review. I'm going to spend the day benchmarking. | 
| Modest gains in APIs like  
 | 
| <SystemIOCompressionTestDataVersion>5.0.0-beta.19608.5</SystemIOCompressionTestDataVersion> | ||
| <SystemIOPackagingTestDataVersion>5.0.0-beta.19608.5</SystemIOPackagingTestDataVersion> | ||
| <SystemNetTestDataVersion>5.0.0-beta.19608.5</SystemNetTestDataVersion> | ||
| <SystemPrivateRuntimeUnicodeDataVersion>5.0.0-beta.19610.1</SystemPrivateRuntimeUnicodeDataVersion> | 
you need to add a dependency to eng/Version.Details.xml so that the version gets auto-updated.
#1731 is out. Thanks for catching this! :)
Background
This PR involves four general goals.
- Update CharUnicodeInfo backing tables from Unicode 11.0 to Unicode 12.1. This also fixes some bidi bugs in the existing logic.
- Update CharUnicodeInfo backing tables to carry more information, including UAX29 text segmentation info and optional case folding information. Rewrite GenUnicodeProp to carry this information.
- Optimize Char and CharUnicodeInfo since we were already touching these code paths.
- Update StringInfo and TextElementEnumerator to use UAX29 extended grapheme cluster segmentation (fixes https://github.com/dotnet/corefx/issues/41324 and https://github.com/dotnet/corefx/issues/28416).

Updating the backing data from Unicode 11.0 to Unicode 12.1 is fairly straightforward. We do this every so often, with dotnet/coreclr#20529 being one recent example.
Prior to this PR, the CharUnicodeInfo data contained the following data for all code points U+0000..U+10FFFF:
- … GetUnicodeCategory)
- … IdnMapping)

With this PR, these tables are updated to include for every code point:
The Categories table is an 11:5:4 table that contains offsets into the CategoriesValues array of bytes, where each element has the following layout:
    bit   7   6   5   4   3   2   1   0
        +-------------------------------+
        | w | b | b | c | c | c | c | c |
        +-------------------------------+

This represents a packed structure containing the
UnicodeCategory (bits 0 - 4), restricted bidi class (bits 5 - 6), and whitespace property (bit 7) of each code point.

The NumericGrapheme table is an 11:5:4 table that contains offsets into one of several different arrays. The NumericValues array consists of 8-byte elements interpreted as little-endian
doubles. The DigitValues array consists of 1-byte elements where the high nibble of each element is the digit value and the low nibble of each element is the decimal digit value. The GraphemeSegmentationValues array consists of 1-byte elements where the value of each element maps to an enum value from UAX#29, Table 2.

A quick note on bidi information: it turns out that our previous methodology of pulling bidi information straight from
UnicodeData.txt was incomplete. Certain unassigned/reserved code points (such as U+061D) do not have entries in UnicodeData.txt but do have entries in DerivedBidiClass.txt, so we should still carry their bidi data in the CharUnicodeInfo backing tables. The update to the GenUnicodeProp generation utility takes this scenario into account.

tools/GenUnicodeProp

This is the logic that reads all of the Unicode data files and spits out the
CharUnicodeInfoData.cs backing file. Much of this application was rewritten to remove unneeded logic and to add support for smuggling complex data through the 11:5:4 tables.

If we do want to include simple case mapping / case folding data in the future (see, e.g., https://github.com/dotnet/corefx/issues/17233), this is supported by the tool. Pass the
-IncludeCasingData switch to the tool to generate this information. This switch is not enabled by default, which means that the CharUnicodeInfo backing tables do not carry this information.

CharUnicodeInfo.cs

The methods in this file were rewritten for better performance, and methods were moved around a bit so that logical method groups (such as
GetUnicodeCategory) stay together. Of note is that there are many places where a single table or a single array holds multiple pieces of information.

For example, the Categories 11:5:4 table is used to index into the
CategoriesValues array, where each byte element of that array is essentially a glorified (isWhiteSpace, bidiCategory, unicodeCategory) tuple. The NumericGrapheme 11:5:4 table is used to index into the DigitValues, NumericValues, and GraphemeSegmentationValues arrays.

StrongBidiCategory.cs and IdnMapping.cs

The previous incarnation of
CharUnicodeInfo held full bidi class data. However, it turns out that we don't have a public API to get at this information, and the only consumer of this information in the entire framework is IdnMapping. Furthermore, IdnMapping doesn't care about most bidi classes. It only cares about classes that are strongly left-to-right ("L") or strongly right-to-left ("R", "AL"). So I changed the data stored in CharUnicodeInfo to reflect only this very limited set of information rather than the full bidi class. See https://www.unicode.org/reports/tr44/#BC_Values_Table for more info on these values.

When generating the backing table (see the
GenUnicodeProp tool), all code points marked with "L" are given a "strong left-to-right" marker in the backing data, and all code points marked with "AL" or "R" are given a "strong right-to-left" marker in the backing data. All other bidi information is thrown away when generating CharUnicodeInfoData.cs. This is referred to in source as "restricted" bidi data.

Of note is that
IdnMapping under the invariant globalization mode follows IDNA2003 semantics when performing bidi processing, with the modification that the data is seeded from recent Unicode data instead of being locked to Unicode 3.2 as required by IDNA2003. We could update IdnMapping to follow strict IDNA2003 semantics if we desired, but that's out of scope of this PR. Since most runtimes won't be running under the invariant globalization mode, addressing this would probably be low-priority.

Rune.cs and Char.cs

Slight modifications to these files to react to refactorings in
CharUnicodeInfo, plus some minor optimizations in those same code paths.

TextSegmentationUtility.cs

This is the workhorse class that computes grapheme cluster segmentation boundaries ("display characters", essentially). We use the definition of "extended grapheme cluster" per https://www.unicode.org/reports/tr29/. See Sec. 3.1 and 3.1.1 of that document for the specific algorithm we follow.
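To make "display characters" concrete, here is a minimal sketch of how a caller observes cluster boundaries through the public StringInfo API rather than through the internal TextSegmentationUtility class. The class and method names (GraphemeSketch, GetTextElements) are hypothetical and exist only for illustration; the exact boundaries reported for complex sequences such as emoji depend on the runtime version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, not part of this PR.
public static class GraphemeSketch
{
    // Splits a UTF-16 string into text elements using the public
    // StringInfo / TextElementEnumerator API.
    public static List<string> GetTextElements(string input)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(input);
        while (enumerator.MoveNext())
        {
            // Each text element may span several UTF-16 code units.
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }
}
```

Calling GraphemeSketch.GetTextElements("e\u0301x") should yield two elements: the letter "e" together with its combining acute accent, followed by "x".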
The code is generic so that it can work with either UTF-16 or UTF-8. Currently only the UTF-16 API is public, implemented by passing the delegate
Rune.DecodeFromUtf16 down to the workhorse routine. If we wanted to add UTF-8 support in the future, it would be trivial to do so by instead calling the same workhorse with the Rune.DecodeFromUtf8 delegate.

StringInfo.cs and TextElementEnumerator.cs

Rewritten to use the new "extended grapheme cluster" logic in
TextSegmentationUtility.

Aside from changing the implementation to be UAX29-compliant, there's one additional behavioral change: the
TextElementEnumerator.ElementIndex property getter now throws once the enumeration has completed. This brings its behavior in line with the TextElementEnumerator.Current property getter, which also throws once enumeration has completed; and it matches ElementIndex's documentation, which says that the getter throws after completion.

CoreFx.Private.TestUtilities.Unicode/*

Contains all the logic to process the UCD files. This isn't itself shipping code, but it's pulled in by the unit tests so that we can compare the data coming from APIs like
CharUnicodeInfo.GetUnicodeCategory against the official data coming from the Unicode group.

The
Data folder is unchanged by this PR, as we're now referencing static assets from the runtime-assets repo (see dotnet/runtime-assets#44). I'll send a separate PR in the future to remove the existing Unicode text files from this folder.

CharUnicodeInfoTests.Generated.cs

This file contains tests that iterate through all code points in the UCD files, hitting our APIs. It's intended to verify the contents of the 11:5:4 map we use in
CharUnicodeInfoData.cs and of the logic which reads that data. It's not intended as a substitute for other manually-specified test data, which already exists in CharUnicodeInfoTests.cs.

GraphemeBreakTest.cs

Iterates through the test data pulled from https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakTest.txt, which defines approx. 600 test cases for grapheme boundary calculation.
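For readers unfamiliar with the test file, here is a rough sketch of how one line of GraphemeBreakTest.txt could be turned into the list of expected clusters. It is not the parser used by the tests in this PR; the names GraphemeBreakTestLine and ParseExpectedClusters are hypothetical. It assumes the published line format, where "÷" (U+00F7) marks a boundary, "×" (U+00D7) marks no boundary, code points are written in hex, and everything after "#" is a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical helper, not the parser shipped with the tests.
public static class GraphemeBreakTestLine
{
    private const char Break = '\u00F7';   // "÷": a boundary is expected here
    private const char NoBreak = '\u00D7'; // "×": no boundary is expected here

    // Returns the expected clusters for one line of the test file.
    // Blank and comment-only lines produce an empty list.
    public static List<string> ParseExpectedClusters(string line)
    {
        int hash = line.IndexOf('#');
        string data = (hash >= 0 ? line.Substring(0, hash) : line).Trim();

        var clusters = new List<string>();
        var current = new StringBuilder();
        string[] tokens = data.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string token in tokens)
        {
            if (token.Length == 1 && token[0] == Break)
            {
                // A boundary closes the cluster built so far, if any.
                if (current.Length > 0)
                {
                    clusters.Add(current.ToString());
                    current.Clear();
                }
            }
            else if (token.Length == 1 && token[0] == NoBreak)
            {
                // No boundary: keep appending to the current cluster.
            }
            else
            {
                // Hex code point. Note: ConvertFromUtf32 throws for surrogate
                // code points, which this sketch does not handle.
                int codePoint = int.Parse(token, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
                current.Append(char.ConvertFromUtf32(codePoint));
            }
        }

        // Flush a trailing cluster in case the line lacks a final "÷".
        if (current.Length > 0)
        {
            clusters.Add(current.ToString());
        }
        return clusters;
    }
}
```

A test could then concatenate the clusters into an input string, enumerate it with TextElementEnumerator, and compare the resulting elements against this list.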
Other miscellaneous tests
Some tests had to be modified a bit, especially tests which used "\u0300\u0300" as a sequence break. (Per the most recent iteration of UAX29, those two code points should be treated as a single cluster.) I also added some more complex emoji sequences to the test case data.

VB defines a string reversal function
StrReverse which had to be rewritten since it made now-incorrect assumptions about how StringInfo worked. Specifically, it had copied some of the existing StringInfo segmentation logic, which was causing problems with its other usage of TextElementEnumerator. So I removed the custom logic and made it use TextElementEnumerator always.

Benchmarks
In the earlier drafts, APIs like
char.GetUnicodeCategory(char) saw an approx. 20% perf improvement with this PR. Once the build stabilizes I'll rerun the tests and get more representative numbers.

I suspect this will also improve the performance of certain Regex constructs, such as
"\p{L}". Will need to measure that separately to confirm.