Unicode German sharp S violates an assumption of some string methods #13648

Closed
dmk42 opened this issue Aug 2, 2019 · 1 comment · Fixed by #13662

dmk42 commented Aug 2, 2019

Chapel strings are currently immutable except for appending and upper/lower case conversion. In case conversion, the methods assume that the upper and lower case versions of a letter encode to the same number of UTF-8 bytes. While documenting some properties of strings, I came across a letter that violates that assumption.
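
As a rough way to look for other letters that break the same-byte-count assumption, the following Python sketch scans all code points for single-character lower-case mappings whose UTF-8 length differs from the original. It is only an illustration: `width_changing_lowercase` is a hypothetical helper name, and Python's `str.lower()` applies Unicode's full case mappings, which need not match what a given platform's `towlower()` does.

```python
import sys


def width_changing_lowercase():
    """Return (char, lowered) pairs whose UTF-8 byte lengths differ."""
    hits = []
    for cp in range(sys.maxunicode + 1):
        # Surrogate code points cannot be encoded as UTF-8; skip them.
        if 0xD800 <= cp <= 0xDFFF:
            continue
        ch = chr(cp)
        low = ch.lower()
        # Only consider one-to-one mappings that actually change the character.
        if len(low) == 1 and low != ch:
            if len(low.encode("utf-8")) != len(ch.encode("utf-8")):
                hits.append((ch, low))
    return hits


hits = width_changing_lowercase()
for ch, low in hits:
    print(f"U+{ord(ch):04X} -> U+{ord(low):04X}")
```

The upper-case sharp S (U+1E9E) appears in the result, along with any other code points that Python's Unicode tables map to a shorter or longer lower-case form.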

Unicode's German lower-case sharp S (ß) was encoded in 1985 (U+00DF) in an earlier character set and eventually absorbed into Unicode.

Unicode's German upper-case sharp S (ẞ) was encoded in 2008 (U+1E9E).

Consequently, the lower case version translates to two bytes in UTF-8, and the upper case version translates to three bytes.
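
The byte counts can be checked directly. This is a small Python illustration, not Chapel code; the variable names are chosen here for the example.

```python
lower_sharp_s = "\u00df"  # ß, LATIN SMALL LETTER SHARP S
upper_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

lower_bytes = lower_sharp_s.encode("utf-8")
upper_bytes = upper_sharp_s.encode("utf-8")

print(lower_bytes.hex(), len(lower_bytes))  # c39f 2
print(upper_bytes.hex(), len(upper_bytes))  # e1ba9e 3
```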

Darwin's towlower() and towupper() routines compensate by refusing to change the case of a sharp S. Cygwin's and Linux's towupper() will not change the case of a lower case sharp S (preventing buffer overflows), but their towlower() will change the case of an upper case sharp S, generating a character that takes one fewer byte than before.

Eventually it would be a good idea to have the String module change the byte length of strings as necessary to accommodate any differences in the length of the upper vs. lower case characters. Alternatively, the String module could enforce the Darwin approach everywhere, and refuse to change the case of any character where that would change the number of bytes. Another alternative is for the String module to enforce the Linux approach everywhere, and refuse to change the case of any character where that would take more bytes than before.
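
The two "refuse to change case" alternatives can be sketched as follows. This is a Python illustration under stated assumptions, not the String module's implementation: `change_case` and its `allow_shrink` parameter are hypothetical names, and Python's `str.upper()`/`str.lower()` use full Unicode case mappings (for example, `"ß".upper()` is `"SS"`), so multi-character results are also treated as "do not convert".

```python
def change_case(s, convert, allow_shrink=False):
    """Convert case per character, skipping conversions that change byte width.

    allow_shrink=False models the Darwin-like policy (byte count must match).
    allow_shrink=True models the Linux-like policy (byte count must not grow).
    """
    out = []
    for ch in s:
        new = convert(ch)
        old_len = len(ch.encode("utf-8"))
        new_len = len(new.encode("utf-8"))
        same_width = new_len == old_len
        shrinks = new_len < old_len
        # Reject one-to-many mappings and any disallowed width change.
        if len(new) == 1 and (same_width or (allow_shrink and shrinks)):
            out.append(new)
        else:
            out.append(ch)
    return "".join(out)


print(change_case("Straße", str.upper))                     # STRAßE
print(change_case("STRAẞE", str.lower))                     # straẞe
print(change_case("STRAẞE", str.lower, allow_shrink=True))  # straße
```

Under the strict policy the string's byte length never changes, so an in-place buffer stays valid. The permissive policy can shorten the string, which an in-place implementation would have to account for.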

The affected string methods are the following.

toLower()
toUpper()
toTitle()
capitalize()

#13649 adds a future to track the bug.

dmk42 added a commit that referenced this issue Aug 2, 2019
Add a future for the German sharp S upper/lower case anomaly (#13649)

[new future, not reviewed]

As discussed in #13648 , the Unicode German sharp S takes up a different number of bytes for its upper vs. lower case versions. This change adds a future to track the string methods' assumption that the upper and lower case versions would encode to the same number of bytes.

Tested on Cygwin, Darwin, and Linux (including Cray XC systems).

dmk42 commented Aug 2, 2019

@e-kayrakli - FYI.

dmk42 added a commit that referenced this issue Aug 6, 2019
Fix the Unicode sharp S bug (#13662)

[reviewed by @e-kayrakli ]

Upper/lower case-changing string methods assume that the upper case and lower case version of a character encode to the same number of UTF-8 bytes. The German sharp S character violates that assumption, as detailed in #13648 .

The macOS towlower() and towupper() functions resolve the issue by refusing to change the case of the character in question. This change adopts that solution for Chapel on all platforms. When a case change is attempted and the new character would encode to a different number of bytes than the old one, the case change is not performed.

Passes full local and GASNet testing on linux64.

Closes #13648