0xA3 0xA0 in GB 18030 #338
Comments
Some preliminary analysis:
Usage statistics of GB 18030, GBK, and GB 2312 for websites
See:
Bing search engine results
Searching [] (U+E5E5) in Bing China, most of the resulting web pages date from the 2000s. Some websites quote text from other websites, which produces some more recent results:
Examples
https://www.chinanews.com.cn/n/2004-01-13/26/391173.html (GB 2312): this website uses both U+3000 and U+E5E5.
https://blog.sina.com.cn/s/blog_44c67f2c0102v4at.html (UTF-8): this website uses both U+3000 and U+E5E5.
https://news.sina.com.cn/c/2004-01-21/09272687351.shtml (GB 2312): this website only uses U+3000.
https://edu.sina.com.cn/focus/wq3/index.html (GB 2312, dated 2008): although it appears as a U+E5E5 result, the source code only contains U+3000. |
We've had the current behavior for years and years. What practical problem does the current state of things cause? Changing it would affect the relationship of GBK and GB18030 in the Encoding Standard, right? |
Because GB 18030 is a compulsory standard, according to Article 14 of CHAPTER III of the Standardization Law of the People's Republic of China:
IANAL, but non-conformance to GB 18030 could be seen as a risk. |
We already diverged from GB18030-2022 spec because of the non-round trip mapping proposed by UTC. There is no point in changing 0xA3 0xA0 mapping from the spec conformance perspective. |
What supports your claim about ICU? Did https://unicode-org.atlassian.net/browse/ICU-22420 get reverted? I created web-platform-tests/wpt#49137 to ensure we test this code point. I haven't seen any credible argument in this thread to change this mapping so I'm inclined to close this. |
Sorry, I'm not using the latest version of ICU, you are right. I updated the description above. I think before closing this issue, at least we should analyze the impact of updating and not updating the mapping. |
What is the issue with the Encoding Standard?
https://encoding.spec.whatwg.org/#gb18030-encoder
We didn't update this in #336, so I filed this issue to track it.
https://bugzilla.mozilla.org/show_bug.cgi?id=131837, a bug filed in 2002, mentioned this. The reason behind this mapping was that some websites used 0xA3 0xA0 as a space character, which caused display abnormalities, so Mozilla changed the mapping to U+3000 IDEOGRAPHIC SPACE.
In the Hong Kong Supplementary Character Set, U+E5E5 was used to encode 𨪜 (U+2A89C in Unicode CJK Unified Ideographs Extension I).
We need to analyze how many websites using GB 18030 are still using 0xA3 0xA0 to represent U+3000.
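As a starting point for that analysis, here is a minimal sketch of how one might count 0xA3 0xA0 pairs in a GB 18030 byte stream. The function names are hypothetical and not taken from any library. The pointer arithmetic follows the two-byte formula used by the Encoding Standard's gb18030 decoder. The scanner assumes well-formed input and does not validate four-byte sequences.

```python
# Hypothetical helpers for illustration only; not part of any library.

def gb18030_two_byte_pointer(lead: int, trail: int) -> int:
    """Index pointer of a two-byte sequence, using the formula from the
    Encoding Standard's gb18030 decoder."""
    if not 0x81 <= lead <= 0xFE:
        raise ValueError("lead byte out of range")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFE):
        raise ValueError("trail byte out of range")
    offset = 0x40 if trail < 0x7F else 0x41
    return (lead - 0x81) * 190 + (trail - offset)


def count_a3a0(data: bytes) -> int:
    """Count two-byte sequences equal to 0xA3 0xA0.

    Walks the stream sequence by sequence so that a pair straddling two
    characters (e.g. ... 0xC4 0xA3 | 0xA0 0xA1 ...) is not miscounted.
    Assumes well-formed GB 18030; four-byte sequences are skipped unchecked.
    """
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        # Single-byte sequence, or a dangling lead byte at the end of input.
        if lead < 0x81 or lead == 0xFF or i + 1 >= len(data):
            i += 1
            continue
        second = data[i + 1]
        if 0x30 <= second <= 0x39:
            i += 4  # four-byte sequence
            continue
        if lead == 0xA3 and second == 0xA0:
            count += 1
        i += 2
    return count


print(gb18030_two_byte_pointer(0xA3, 0xA0))  # 6555, the disputed pointer
print(gb18030_two_byte_pointer(0xA1, 0xA1))  # 6176, the regular U+3000
print(count_a3a0(b"ab\xa3\xa0\xa1\xa1\xa3\xa0"))  # 2
```

Running something like `count_a3a0` over the raw bytes of crawled pages, before any decoding, would avoid depending on how a particular decoder maps the pair.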
Currently, iconv seems to map 0xA3 0xA0 to U+E5E5. ICU 74.1+ maps it to U+3000.
The following is some information about this misuse (mostly translated from a Chinese website).
The 0xA3A1 ~ 0xA3FE part of GB18030-2022 is inherited from row 3 of GB 2312, and contains the G0 set of GB/T 1988-80 (ISO 646-CN). GB 2312 does not specify the width of these characters, but subsequent standards (such as GB 5007.1-85) made it clear that the characters in row 3 are full-width; they are mapped to the Halfwidth and Fullwidth Forms Unicode block.
However, the G0 set of GB/T 1988-80 does not include the space character, but, influenced by ASCII, people often consider the space together with the remaining 94 characters. Now let's assume that someone thinks that 0xA3A1 ~ 0xA3FE are full-width ASCII characters (although "$" has been replaced by "¥"); then this person is likely to think that 0xA3 0xA0 should be a full-width space (although the actual full-width space is at 0xA1A1). Because some fonts display .notdef as a 1 em wide space, the rendering is the same even though the corresponding Unicode code points of the two differ (undefined PUA code points in GB encoding will be displayed as .notdef).
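The assumption described above can be sketched in a few lines. This is only an illustration of the reasoning, using a hypothetical helper `row3_bytes_for_ascii` and Python's built-in `gb2312` codec. It extends the "row 3 is ASCII plus 0x80" pattern to the space character, which GB 2312 itself does not define at that position.

```python
def row3_bytes_for_ascii(ch: str) -> bytes:
    """Byte pair obtained by extending the row 3 pattern (0xA3, ASCII + 0x80).

    Hypothetical helper: for 0x21..0x7E this yields the row 3 code,
    for 0x20 (space) it yields the undefined 0xA3 0xA0.
    """
    code = ord(ch)
    if not 0x20 <= code <= 0x7E:
        raise ValueError("not a printable ASCII character or space")
    return bytes([0xA3, code + 0x80])


# Row 3 follows ASCII, except that "$" decodes to a full-width yen sign.
for ch in "A!$":
    pair = row3_bytes_for_ascii(ch)
    print(ch, pair.hex(), f"U+{ord(pair.decode('gb2312')):04X}")

# Extending the pattern to the space character lands on 0xA3 0xA0 ...
print(row3_bytes_for_ascii(" ").hex())  # a3a0

# ... while the actual full-width space lives at 0xA1 0xA1.
print(f"U+{ord(bytes([0xA1, 0xA1]).decode('gb2312')):04X}")  # U+3000
```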