-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize all unicode identifiers to NFC #5462
Conversation
Thanks for taking the lead on this! |
Yes, thanks for moving on this – much needed after all the talking :-) The NFC normalization is completely uncontroversial and clearly a good idea. On the other hand, I think this approach to NFKC collision avoidance is pretty broken. How about separating the two so that we can get the uncontroversial part merged and figure out how to do the harder collision avoidance bit separately? |
This is good, but we might need to do the normalization even earlier to handle the case where different forms of an identifier appear in the same scope (the front end does some identifier matching). I agree with Stefan about NFKC collisions --- they will not necessarily manifest as undefined variables. |
The NFKC warning I added certainly does not cover all possible collisions, or even close, but it still seemed helpful to me. But I'll remove it if you want. @JeffBezanson, where should the normalization be hooked in? In femtolisp somewhere? |
@JeffBezanson, maybe |
Of course, that would have the side effect of normalizing flisp symbols too, but that's ok. |
Okay, I've updated the patch to do the NFC normalization in flisp, and have removed the NFKC warning. |
Looks good to me. I tried it out and it seems to work as intended. |
Whoops, I accidentally deleted my comment about Currently, I'm not performing normalization on the argument of this function on the principle that we currently allow the programmer to call The counter-argument is that it is hard to imagine a circumstance in which a non-NFC symbol is actually desired, and that this may lead to unexpected results if the user calls |
Yeah, that's an interesting case. I think leaving it un-normalized is probably right. |
I would also say to leave it unnormalized. One use case that has come up |
Normalize all unicode identifiers to NFC
This addresses issue #5434. As per the apparent consensus in that issue, all identifiers are normalized to NFC, which canonicalizes composite characters but does not canonicalize easily-confused characters ("compatibility equivalents") such as µ (micro) and μ (mu).
However, when the interpreter throws an "identifier is not defined" exception, it checks whether the identifier differs from its NFKC normalization, and if so it warns that the identifier may contain easily confusable unicode characters. (Even better would be to do an NFKC-normalized lookup of the identifier to see if there is a similar-looking identifier that is defined, but that was more work to implement and this seemed like a good start for now.)This patch adds the utf8proc library to
deps
and tolibjulia
. This is a small (< 600 lines) MIT-licensed C library that implements various unicode normalizations (about 500kB when compiled, since it includes a database of unicode codepoints). As a separate pull request, we should probably add functions toBase
to expose some of this functionality within Julia (e.g. for normalization, Unicode-aware case-folding, diacritical-stripping, etcetera).