Normalize all unicode identifiers to NFC #5462

Merged: 2 commits, Jan 22, 2014

Conversation

stevengj
Member

This addresses issue #5434. As per the apparent consensus in that issue, all identifiers are normalized to NFC, which canonicalizes composite characters but does not canonicalize easily-confused characters ("compatibility equivalents") such as µ (micro) and μ (mu).
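To make the distinction concrete, here is a small sketch in Python rather than Julia or C, using the standard `unicodedata` module instead of utf8proc. The helper names `nfc` and `nfkc` are only for illustration. It shows that NFC merges the two spellings of a composite character while keeping µ (micro) and μ (mu) distinct, whereas NFKC merges those as well.

```python
import unicodedata

def nfc(s):
    """Canonical composition (NFC)."""
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    """Compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", s)

# "é" written two ways: precomposed, and "e" followed by a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"
same_after_nfc = nfc(precomposed) == nfc(decomposed)

# Two look-alike characters that are only compatibility equivalents.
micro = "\u00b5"  # MICRO SIGN
mu = "\u03bc"     # GREEK SMALL LETTER MU
distinct_after_nfc = nfc(micro) != nfc(mu)
merged_after_nfkc = nfkc(micro) == nfkc(mu)

print(same_after_nfc, distinct_after_nfc, merged_after_nfkc)
```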

However, when the interpreter throws an "identifier is not defined" exception, it checks whether the identifier differs from its NFKC normalization, and if so it warns that the identifier may contain easily confused Unicode characters. (Even better would be to do an NFKC-normalized lookup of the identifier to see if there is a similar-looking identifier that is defined, but that was more work to implement and this seemed like a good start for now.)
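The check described above can be sketched roughly as follows. This is Python, not the C code in the patch, and `confusable_hint`, `lookup`, and the dictionary-based namespace are hypothetical names chosen for illustration.

```python
import unicodedata

def confusable_hint(name):
    """Return a hint if `name` differs from its NFKC form, else None."""
    if unicodedata.normalize("NFKC", name) != name:
        return f"{name!r} may contain easily confused Unicode characters"
    return None

def lookup(namespace, name):
    """Look up `name`; on failure, attach the hint to the error message."""
    try:
        return namespace[name]
    except KeyError:
        message = f"{name} not defined"
        hint = confusable_hint(name)
        if hint:
            message += f" ({hint})"
        raise NameError(message) from None

# The variable is defined with Greek mu, but looked up with the micro sign.
namespace = {"\u03bcs": 1e-6}
try:
    lookup(namespace, "\u00b5s")
except NameError as err:
    print(err)
```

Note that the hint only fires when the failing name itself is not NFKC-stable; a name that is already in NFKC form gets the plain error.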

This patch adds the utf8proc library to deps and to libjulia. This is a small (< 600 lines) MIT-licensed C library that implements various Unicode normalizations (about 500kB when compiled, since it includes a database of Unicode codepoints). As a separate pull request, we should probably add functions to Base to expose some of this functionality within Julia (e.g. for normalization, Unicode-aware case-folding, diacritical-stripping, etc.).

@jiahao
Member

jiahao commented Jan 21, 2014

Thanks for taking the lead on this!

@StefanKarpinski
Sponsor Member

Yes, thanks for moving on this – much needed after all the talking :-)

The NFC normalization is completely uncontroversial and clearly a good idea. On the other hand, I think this approach to NFKC collision avoidance is pretty broken. How about separating the two so that we can get the uncontroversial part merged and figure out how to do the harder collision avoidance bit separately?

@JeffBezanson
Sponsor Member

This is good, but we might need to do the normalization even earlier to handle the case where different forms of an identifier appear in the same scope (the front end does some identifier matching).

I agree with Stefan about NFKC collisions: they will not necessarily manifest as undefined variables.
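A small illustration of this point, written in Python with a plain dictionary standing in for a scope. The helper `nfkc_collisions` is hypothetical and not part of the patch. When both look-alike names are defined, every lookup succeeds, so a check that runs only on undefined variables never sees the collision.

```python
import unicodedata

# Two distinct, NFC-stable names that render almost identically.
namespace = {}
namespace["\u03bc"] = 1.0  # GREEK SMALL LETTER MU
namespace["\u00b5"] = 2.0  # MICRO SIGN

# Both lookups succeed, so no "not defined" error is ever raised.
values = (namespace["\u03bc"], namespace["\u00b5"])

def nfkc_collisions(names):
    """Group names by NFKC form and keep groups with more than one member."""
    groups = {}
    for name in names:
        key = unicodedata.normalize("NFKC", name)
        groups.setdefault(key, []).append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}

collisions = nfkc_collisions(namespace)
print(values, collisions)
```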

@stevengj
Member Author

The NFKC warning I added certainly does not cover all possible collisions, or even close, but it still seemed helpful to me. But I'll remove it if you want.

@JeffBezanson, where should the normalization be hooked in? In femtolisp somewhere?

@stevengj
Member Author

@JeffBezanson, maybe flisp.c:mk_symbol is the right place to normalize from?

@StefanKarpinski
Sponsor Member

Of course, that would have the side effect of normalizing flisp symbols too, but that's ok.

@stevengj
Member Author

Okay, I've updated the patch to do the NFC normalization in flisp, and have removed the NFKC warning.
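The idea of normalizing at symbol creation can be sketched as below. This is a hypothetical Python interning table, not the flisp code; `SymbolTable` and `make_symbol` are illustrative names. Because the key is normalized before interning, every spelling of an identifier maps to the same symbol no matter which part of the front end creates it.

```python
import unicodedata

class SymbolTable:
    """Hypothetical interning table that NFC-normalizes names on entry."""

    def __init__(self):
        self._symbols = {}

    def make_symbol(self, text):
        key = unicodedata.normalize("NFC", text)
        # setdefault returns the already-interned object if one exists.
        return self._symbols.setdefault(key, key)

table = SymbolTable()
a = table.make_symbol("caf\u00e9")   # precomposed é
b = table.make_symbol("cafe\u0301")  # e + combining acute accent
print(a is b, len(table._symbols))
```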

@StefanKarpinski
Sponsor Member

Looks good to me. I tried it out and it seems to work as intended.

@stevengj
Member Author

Whoops, I accidentally deleted my comment about jl_symbol_n, which is called by symbol(s::String).

Currently, I'm not normalizing the argument of this function, on the principle that we allow the programmer to call symbol on any string, even strings that cannot be parsed.

The counter-argument is that it is hard to imagine a circumstance in which a non-NFC symbol is actually desired, and that this may lead to unexpected results if the user calls symbol("....") on copy-and-pasted code.
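The trade-off can be illustrated with a Python sketch. The functions `parse_identifier` and `symbol` here are stand-ins that only mimic the behavior described above (the parser normalizes to NFC, the runtime constructor does not); they are not Julia's API.

```python
import unicodedata

def parse_identifier(text):
    """Stand-in for the parser path: identifiers are NFC-normalized."""
    return unicodedata.normalize("NFC", text)

def symbol(text):
    """Stand-in for the runtime constructor: the string is used as-is."""
    return text

# A name pasted in decomposed form: "e" followed by a combining acute accent.
pasted = "cafe\u0301"

# The binding is created through the parser, so its key is NFC.
bindings = {parse_identifier(pasted): 42}

# Building the symbol at runtime from the same pasted text misses the binding.
found = symbol(pasted) in bindings

# Normalizing explicitly before constructing the symbol finds it.
found_normalized = symbol(unicodedata.normalize("NFC", pasted)) in bindings
print(found, found_normalized)
```

Leaving the constructor unnormalized keeps it usable for arbitrary strings, at the cost that callers who want a parsed identifier must normalize first.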

@StefanKarpinski
Sponsor Member

Yeah, that's an interesting case. I think leaving it un-normalized is probably right.

@toivoh
Contributor

toivoh commented Jan 21, 2014

I would also say to leave it unnormalized. One use case that has come up for Symbol is to hold truly immutable strings.

StefanKarpinski added a commit that referenced this pull request Jan 22, 2014
Normalize all unicode identifiers to NFC
StefanKarpinski merged commit c8c547f into JuliaLang:master Jan 22, 2014
stevengj added a commit that referenced this pull request Jan 23, 2014
stevengj added a commit to stevengj/julia that referenced this pull request Feb 1, 2014
jiahao added the unicode label Feb 22, 2014
Labels
domain:unicode Related to unicode characters and encodings

5 participants