Normalize all unicode identifiers to NFC #5462

stevengj · 2014-01-21T15:57:21Z

This addresses issue #5434. As per the apparent consensus in that issue, all identifiers are normalized to NFC, which canonicalizes composite characters (precomposed versus combining-sequence forms) but does not canonicalize easily confused characters ("compatibility equivalents") such as µ (micro, U+00B5) and μ (mu, U+03BC).
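To make the NFC/NFKC distinction concrete, here is a minimal sketch using Python's standard `unicodedata` module rather than utf8proc or Julia itself. The helper names `nfc` and `nfkc` are illustrative only and are not part of this patch.

```python
import unicodedata

def nfc(s):
    """Canonical composition (the form this PR applies to identifiers)."""
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    """Compatibility composition (NOT applied to identifiers by this PR)."""
    return unicodedata.normalize("NFKC", s)

# "é" written two ways: precomposed U+00E9 vs. "e" + combining acute U+0301.
precomposed = "\u00e9"
decomposed = "e\u0301"
assert precomposed != decomposed            # different code point sequences
assert nfc(precomposed) == nfc(decomposed)  # NFC unifies them

# Micro sign (U+00B5) vs. Greek small letter mu (U+03BC).
micro = "\u00b5"
mu = "\u03bc"
assert nfc(micro) != nfc(mu)                # NFC keeps them distinct
assert nfkc(micro) == mu                    # NFKC maps micro to mu
```

Under NFC alone, two identifiers that differ only in composed versus decomposed accents become the same name, while micro and mu remain two different names.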

However, when the interpreter throws an "identifier is not defined" exception, it checks whether the identifier differs from its NFKC normalization, and if so it warns that the identifier may contain easily confusable unicode characters. (Even better would be to do an NFKC-normalized lookup of the identifier to see if there is a similar-looking identifier that is defined, but that was more work to implement and this seemed like a good start for now.)
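The check described above could look roughly like the following sketch, written in Python with the standard `unicodedata` module instead of the C code in this patch. The function name `undefined_identifier_message` and the message wording are hypothetical, chosen only for illustration.

```python
import unicodedata

def undefined_identifier_message(name):
    """Build an error message for an undefined identifier.

    Hypothetical helper: the identifier is first put in NFC (as the patch
    does for all identifiers); if its NFKC form differs, a hint about
    confusable characters is appended.
    """
    name = unicodedata.normalize("NFC", name)
    message = f"{name} not defined"
    if unicodedata.normalize("NFKC", name) != name:
        message += " (identifier may contain easily confused Unicode characters)"
    return message

# Greek mu (U+03BC) is already NFKC-stable, so no hint is added.
print(undefined_identifier_message("\u03bc"))
# The micro sign (U+00B5) changes under NFKC, so the hint is added.
print(undefined_identifier_message("\u00b5"))
```

This only flags identifiers that are not NFKC-stable; it does not look up whether a similar-looking name is actually defined, which is the stronger check mentioned in the parenthetical.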

This patch adds the utf8proc library to deps and to libjulia. This is a small (< 600 lines) MIT-licensed C library that implements various Unicode normalizations; it is about 500kB when compiled, since it includes a database of Unicode codepoints. As a separate pull request, we should probably add functions to Base to expose some of this functionality within Julia (e.g. for normalization, Unicode-aware case-folding, diacritical-stripping, et cetera).

jiahao · 2014-01-21T16:22:12Z

Thanks for taking the lead on this!

StefanKarpinski · 2014-01-21T16:39:44Z

Yes, thanks for moving on this – much needed after all the talking :-)

The NFC normalization is completely uncontroversial and clearly a good idea. On the other hand, I think this approach to NFKC collision avoidance is pretty broken. How about separating the two so that we can get the uncontroversial part merged and figure out how to do the harder collision avoidance bit separately?

JeffBezanson · 2014-01-21T16:43:36Z

This is good, but we might need to do the normalization even earlier to handle the case where different forms of an identifier appear in the same scope (the front end does some identifier matching).

I agree with Stefan about NFKC collisions --- they will not necessarily manifest as undefined variables.

stevengj · 2014-01-21T16:53:39Z

The NFKC warning I added certainly does not cover all possible collisions, or even close, but it still seemed helpful to me. But I'll remove it if you want.

@JeffBezanson, where should the normalization be hooked in? In femtolisp somewhere?

stevengj · 2014-01-21T17:44:40Z

@JeffBezanson, maybe flisp.c:mk_symbol is the right place to normalize from?

StefanKarpinski · 2014-01-21T18:18:32Z

Of course, that would have the side effect of normalizing flisp symbols too, but that's ok.

stevengj · 2014-01-21T18:51:12Z

Okay, I've updated the patch to do the NFC normalization in flisp, and have removed the NFKC warning.

StefanKarpinski · 2014-01-21T19:13:49Z

Looks good to me. I tried it out and it seems to work as intended.

stevengj · 2014-01-21T19:15:43Z

Whoops, I accidentally deleted my comment about jl_symbol_n, which is called by symbol(s::String).

Currently, I'm not performing normalization on the argument of this function, on the principle that we allow the programmer to call symbol on any string, even strings that cannot be parsed.

The counter-argument is that it is hard to imagine a circumstance in which a non-NFC symbol is actually desired, and that this may lead to unexpected results if the user calls symbol("....") on copy-and-pasted code.
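The trade-off can be illustrated with a toy intern table, sketched in Python. The names `intern_raw` and `intern_parsed` are hypothetical stand-ins for the un-normalized `symbol(s)` path and the normalizing parser path, not actual Julia or flisp functions.

```python
import unicodedata

_symbol_table = {}  # toy intern table: name -> canonical stored name

def intern_raw(name):
    """Stand-in for symbol(s): intern the string exactly as given."""
    return _symbol_table.setdefault(name, name)

def intern_parsed(name):
    """Stand-in for the parser path: NFC-normalize, then intern."""
    return intern_raw(unicodedata.normalize("NFC", name))

composed = "\u00e9"      # precomposed "é"
decomposed = "e\u0301"   # "e" + combining acute accent

# Parsed identifiers: both spellings resolve to the same interned symbol.
assert intern_parsed(composed) is intern_parsed(decomposed)

# Raw symbols: a decomposed string stays decomposed, so it does not
# match the symbol that the parser would produce for the same text.
assert intern_raw(decomposed) != intern_parsed(decomposed)
```

The second assertion is the surprise described in the counter-argument: a symbol built from copy-and-pasted text may not equal the identifier the parser sees for that same text.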

StefanKarpinski · 2014-01-21T19:20:48Z

Yeah, that's an interesting case. I think leaving it un-normalized is probably right.

toivoh · 2014-01-21T20:11:18Z

I would also say to leave it unnormalized. One use case that has come up for Symbol is to hold truly immutable strings.

Normalize all unicode identifiers to NFC

…JuliaLang#5434)

added utf8proc to deps for JuliaLang#5434

b1585ac

normalize all flisp symbols to NFC (fix JuliaLang#5434)

7f8dd12

StefanKarpinski added a commit that referenced this pull request Jan 22, 2014

Merge pull request #5462 from stevengj/utf8proc

c8c547f

Normalize all unicode identifiers to NFC

StefanKarpinski merged commit c8c547f into JuliaLang:master Jan 22, 2014

stevengj added a commit that referenced this pull request Jan 23, 2014

NEWS for unicode normalization (#5462)

192a3e8

stevengj added a commit to stevengj/julia that referenced this pull request Jan 27, 2014

export utf8proc functionality in Julia (followup to JuliaLang#5462 and …

5799e46

…JuliaLang#5434)

stevengj added a commit to stevengj/julia that referenced this pull request Jan 27, 2014

export utf8proc functionality in Julia (followup to JuliaLang#5462 and …

bc7cf20

…JuliaLang#5434)

stevengj mentioned this pull request Jan 27, 2014

RFC: export utf8proc Unicode transformation functionality in Julia #5576

Merged

stevengj added a commit to stevengj/julia that referenced this pull request Jan 27, 2014

export utf8proc functionality in Julia (followup to JuliaLang#5462 and …

59b0f18

…JuliaLang#5434)

stevengj added a commit to stevengj/julia that referenced this pull request Jan 29, 2014

export utf8proc functionality in Julia (followup to JuliaLang#5462 and …

bb70a9a

…JuliaLang#5434)

stevengj added a commit that referenced this pull request Feb 1, 2014

export utf8proc functionality in Julia (followup to #5462 and #5434)

9e5ce63

stevengj added a commit to stevengj/julia that referenced this pull request Feb 1, 2014

export utf8proc functionality in Julia (followup to JuliaLang#5462 and …

6039a46

…JuliaLang#5434)

stevengj mentioned this pull request Feb 7, 2014

Random error using variables with unicode characters #5712

Closed

jiahao mentioned this pull request Feb 12, 2014

Profile.print() throws exception if function name has Greek characters #5769

Closed

jiahao added the unicode label Feb 22, 2014

stevengj mentioned this pull request Jul 2, 2015

Add "unusual Julia features" section in the manual noteworthy diffs. #11966

Closed

stevengj mentioned this pull request Jul 18, 2018

don't allow U+0387 (·) in identifiers #28167

Merged

stevengj mentioned this pull request Dec 27, 2023

Base.isidentifier(::Symbol) should check normalization #52641

Open
