Chapel strings are currently immutable except for appending and upper/lower case conversion. In case conversion, the methods assume that the upper and lower case versions of a letter encode to the same number of UTF-8 bytes. While documenting some properties of strings, I came across a letter that violates that assumption.
Unicode's German lower-case sharp S (ß) was encoded in 1985 (U+00DF) in an earlier character set and eventually absorbed into Unicode.
Unicode's German upper-case sharp S (ẞ) was encoded in 2008 (U+1E9E).
Consequently, the lower case version translates to two bytes in UTF-8, and the upper case version translates to three bytes.
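The byte counts can be checked directly. The following is a minimal sketch in Python (not Chapel); the helper name `utf8_len` is made up for illustration.

```python
# Code points taken from the issue text above.
LOWER_SHARP_S = "\u00DF"  # ß
UPPER_SHARP_S = "\u1E9E"  # ẞ


def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode ch in UTF-8 (hypothetical helper)."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in (LOWER_SHARP_S, UPPER_SHARP_S):
        print(f"U+{ord(ch):04X} encodes to {utf8_len(ch)} bytes")
```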
Darwin's towlower() and towupper() routines compensate by refusing to change the case of a sharp S. Cygwin's and Linux's towupper() will not change the case of a lower case sharp S (preventing buffer overflows), but their towlower() will change the case of an upper case sharp S, generating a character that takes one fewer byte than before.
Eventually it would be a good idea to have the String module change the byte length of strings as necessary to accommodate any differences in the length of the upper vs. lower case characters. Alternatively, the String module could enforce the Darwin approach everywhere, and refuse to change the case of any character where that would change the number of bytes. Another alternative is for the String module to enforce the Linux approach everywhere, and refuse to change the case of any character where that would take more bytes than before.
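The three alternatives differ only in when a proposed replacement character is accepted. The sketch below illustrates that decision for a single character. It is written in Python rather than Chapel, and the function `change_case` and the policy names `"resize"`, `"same_width"`, and `"no_growth"` are hypothetical labels, not part of the String module or of any C library.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


def change_case(ch: str, candidate: str, policy: str) -> str:
    """Decide whether to replace ch with its case-converted candidate.

    policy (hypothetical names):
      "resize"     - always accept; the string's byte length may change
      "same_width" - accept only if the byte length is unchanged (Darwin-like)
      "no_growth"  - accept unless the candidate needs more bytes (Linux-like)
    """
    old, new = utf8_len(ch), utf8_len(candidate)
    if policy == "resize":
        return candidate
    if policy == "same_width":
        return candidate if new == old else ch
    if policy == "no_growth":
        return candidate if new <= old else ch
    raise ValueError(f"unknown policy: {policy}")


LOWER_SHARP_S = "\u00DF"  # ß, two bytes in UTF-8
UPPER_SHARP_S = "\u1E9E"  # ẞ, three bytes in UTF-8
```

Under `"same_width"` the sharp S is left alone in both directions, while `"no_growth"` allows the three-byte upper case form to shrink to the two-byte lower case form but never the reverse.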
Add a future for the German sharp S upper/lower case anomaly (#13649)
[new future, not reviewed]
As discussed in #13648, the Unicode German sharp S takes up a different number of bytes for its upper vs. lower case versions. This change adds a future to track the string methods' assumption that the upper and lower case versions would encode to the same number of bytes.
Tested on Cygwin, Darwin, and Linux (including Cray XC systems).
Fix the Unicode sharp S bug (#13662)
[reviewed by @e-kayrakli]
Upper/lower case-changing string methods assume that the upper case and lower case version of a character encode to the same number of UTF-8 bytes. The German sharp S character violates that assumption, as detailed in #13648.
The macOS towlower() and towupper() functions resolve the issue by refusing to change the case of the character in question. This change adopts that solution for Chapel on all platforms. When a case change is attempted and the new character would encode to a different number of bytes than the old one, the case change is not performed.
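Applied to a whole string, the rule described in this commit message keeps the encoded byte length constant. The following Python sketch illustrates the rule only; it is not the Chapel implementation. The function names are hypothetical, and Python's `str.lower()` and `str.upper()` stand in for the per-character case mapping. A mapping that would produce more than one character is also skipped, which is an assumption of this sketch.

```python
def _convert(s: str, mapper) -> str:
    """Case-convert s, skipping any character whose UTF-8 width would change."""
    out = []
    for ch in s:
        cand = mapper(ch)
        same_width = len(cand.encode("utf-8")) == len(ch.encode("utf-8"))
        # Assumption of this sketch: also reject one-to-many mappings.
        out.append(cand if len(cand) == 1 and same_width else ch)
    return "".join(out)


def to_lower_same_width(s: str) -> str:
    return _convert(s, str.lower)


def to_upper_same_width(s: str) -> str:
    return _convert(s, str.upper)


if __name__ == "__main__":
    word = "STRA\u1E9EE"  # "STRAẞE"
    lowered = to_lower_same_width(word)
    print(lowered)
    # The encoded length is unchanged, so an in-place buffer stays valid.
    print(len(word.encode("utf-8")), len(lowered.encode("utf-8")))
```

The upper case sharp S survives lower-casing unchanged, and the lower case sharp S survives upper-casing unchanged, so the result always occupies the same number of bytes as the input.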
Passes full local and GASNet testing on linux64.
Closes #13648
The affected string methods are the following.
#13649 adds a future to track the bug.