Unicode German sharp S violates an assumption of some string methods #13648

dmk42 · 2019-08-02T23:15:08Z

Chapel strings are currently immutable except for appending and upper/lower case conversion. The case-conversion methods assume that the upper and lower case versions of a letter encode to the same number of UTF-8 bytes. While documenting some properties of strings, I came across a letter that violates that assumption.

Unicode's German lower-case sharp S (ß) was encoded in 1985 (U+00DF) in an earlier character set and eventually absorbed into Unicode.

Unicode's German upper-case sharp S (ẞ) was encoded in 2008 (U+1E9E).

Consequently, the lower case version translates to two bytes in UTF-8, and the upper case version translates to three bytes.
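The size difference is easy to confirm. The snippet below is a minimal sketch in Python, used here only for illustration and not part of the Chapel String module. It encodes both code points to UTF-8 and prints the byte counts.

```python
# Code points taken from the issue text above.
LOWER_SHARP_S = "\u00df"  # ß
UPPER_SHARP_S = "\u1e9e"  # ẞ

lower_bytes = LOWER_SHARP_S.encode("utf-8")
upper_bytes = UPPER_SHARP_S.encode("utf-8")

print(len(lower_bytes), lower_bytes.hex())  # 2 c39f
print(len(upper_bytes), upper_bytes.hex())  # 3 e1ba9e
```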

Darwin's towlower() and towupper() routines compensate by refusing to change the case of a sharp S. Cygwin's and Linux's towupper() will not change the case of a lower case sharp S (preventing buffer overflows), but their towlower() will change the case of an upper case sharp S, generating a character that takes one fewer byte than before.

Eventually it would be a good idea to have the String module change the byte length of strings as necessary to accommodate any differences in the length of the upper vs. lower case characters. Alternatively, the String module could enforce the Darwin approach everywhere, and refuse to change the case of any character where that would change the number of bytes. Another alternative is for the String module to enforce the Linux approach everywhere, and refuse to change the case of any character where that would take more bytes than before.
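The three alternatives can be compared in a rough sketch. This is illustrative Python, not the Chapel implementation. The names `simple_upper`, `simple_lower`, `convert`, and the policy strings are hypothetical, and the ß to ẞ pairing in `simple_upper` is hard-coded as an assumption, because Python's own `str.upper()` maps ß to "SS". Python strings are not fixed-size byte buffers, so this models only the policy decision, not the in-place buffer handling.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes ch occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

def simple_upper(ch: str) -> str:
    # Assumption for illustration: pair ß (U+00DF) with ẞ (U+1E9E).
    if ch == "\u00df":
        return "\u1e9e"
    up = ch.upper()
    # Keep only one-to-one mappings; leave the character alone otherwise.
    return up if len(up) == 1 else ch

def simple_lower(ch: str) -> str:
    low = ch.lower()
    return low if len(low) == 1 else ch

def convert(s: str, mapper, policy: str) -> str:
    """Apply mapper per character under one of three hypothetical policies.

    "resize":      always accept the new character (byte length may change)
    "same_length": skip the change if the byte length would differ (Darwin-like)
    "no_growth":   skip the change only if the byte length would grow (Linux-like)
    """
    out = []
    for ch in s:
        new = mapper(ch)
        old_n, new_n = utf8_len(ch), utf8_len(new)
        if policy == "same_length" and new_n != old_n:
            new = ch
        elif policy == "no_growth" and new_n > old_n:
            new = ch
        out.append(new)
    return "".join(out)

print(convert("straße", simple_upper, "resize"))       # STRAẞE
print(convert("straße", simple_upper, "same_length"))  # STRAßE
print(convert("STRAẞE", simple_lower, "same_length"))  # straẞe
print(convert("STRAẞE", simple_lower, "no_growth"))    # straße
```

Under "same_length" the sharp S is left untouched in both directions. Under "no_growth" it can shrink on lowering but never grow on raising.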

The affected string methods are the following.

toLower()
toUpper()
toTitle()
capitalize()

#13649 adds a future to track the bug.


Add a future for the German sharp S upper/lower case anomaly (#13649) [new future, not reviewed] As discussed in #13648, the Unicode German sharp S takes up a different number of bytes for its upper vs. lower case versions. This change adds a future to track the string methods' assumption that the upper and lower case versions would encode to the same number of bytes. Tested on Cygwin, Darwin, and Linux (including Cray XC systems).

dmk42 · 2019-08-02T23:35:52Z

@e-kayrakli - FYI.


Fix the Unicode sharp S bug (#13662) [reviewed by @e-kayrakli] Upper/lower case-changing string methods assume that the upper case and lower case versions of a character encode to the same number of UTF-8 bytes. The German sharp S character violates that assumption, as detailed in #13648. The macOS towlower() and towupper() functions resolve the issue by refusing to change the case of the character in question. This change adopts that solution for Chapel on all platforms. When a case change is attempted and the new character would encode to a different number of bytes than the old one, the case change is not performed. Passes full local and GASNet testing on linux64. Closes #13648

dmk42 mentioned this issue Aug 2, 2019

Add a future for the German sharp S upper/lower case anomaly #13649

Merged

dmk42 added area: Libraries / Modules type: Bug labels Aug 6, 2019

dmk42 mentioned this issue Aug 6, 2019

Fix the Unicode German sharp S bug #13662

Merged

dmk42 closed this as completed in #13662 Aug 6, 2019

e-kayrakli mentioned this issue Apr 12, 2022

Add .toLower() and .toUpper() methods to Strings Bears-R-Us/arkouda#1265

Closed
