WIP: custom Unicode normalization for Julia identifiers #19464

stevengj · 2016-11-30T20:57:54Z

As discussed in #14751, #5903, and JuliaStrings/utf8proc#11, this PR canonicalizes Unicode identifiers/symbols in Julia via a custom normalization, in addition to NFC normalization. That is, we treat certain codepoints as equivalent in identifiers. (Requires a new feature in the soon-to-be-released utf8proc 2.1.)

I don't think we want to go too crazy on "confusable" characters here: we made a conscious choice in #5434 to use NFC normalization, not NFKC normalization (à la Python 3), in order to preserve useful mathematical distinctions. I propose the following criteria for inclusion in our custom normalization, to be applied conservatively when real problems arise in the wild:

Both characters are easy to type by common input methods.
The two characters are so close in appearance that it makes no sense to use them for distinct identifiers in the same scope, and people might reasonably type one when the other is intended.
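
For illustration, the NFC/NFKC trade-off mentioned above can be seen with Python's standard `unicodedata` module. This is only a sketch: Julia goes through utf8proc rather than Python, and the sample characters below are picked for the example, not taken from the PR.

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical composition only (the form chosen in #5434)."""
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    """Compatibility composition (the form Python 3 uses for identifiers)."""
    return unicodedata.normalize("NFKC", s)


# Illustrative samples (assumed for this example, not from the PR).
samples = {
    "e + combining acute": "e\u0301",  # canonical equivalent of U+00E9
    "script capital H": "\u210b",      # distinct math symbol; NFKC folds it to "H"
    "superscript two": "\u00b2",       # NFKC folds it to "2"
    "micro sign": "\u00b5",            # NFKC folds it to Greek mu (U+03BC)
    "latin open e": "\u025b",          # no decomposition: neither form changes it
}

for label, ch in samples.items():
    print(f"{label}: NFC={nfc(ch)!r} NFKC={nfkc(ch)!r}")
```

Note that NFKC would already cover the micro sign but not the open e, and it would also erase distinctions such as script H versus H, which is why a small custom table on top of NFC is being proposed instead.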

Currently, this PR canonicalizes ɛ (U+025B latin small letter open e) to ε (U+03B5 greek small letter epsilon) as discussed in #14751, and canonicalizes µ (U+00B5 micro) to μ (U+03BC greek small letter mu) as discussed in #5434. I think it would also be good to canonicalize fullwidth to halfwidth identifiers as discussed in #5903 (one of the primary reasons Python 3 went with NFKC).
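
A minimal sketch of the idea, written with Python's `unicodedata` rather than utf8proc: fold the two codepoints named above, then apply NFC. The table name and the function `normalize_identifier` are hypothetical, and the fold-then-compose order is an assumption of this sketch, not a statement about how utf8proc 2.1 implements custom mappings.

```python
import unicodedata

# Hypothetical table holding only the two mappings named in this PR.
EXTRA_FOLDS = {
    0x025B: 0x03B5,  # latin small letter open e -> greek small letter epsilon
    0x00B5: 0x03BC,  # micro sign -> greek small letter mu
}


def normalize_identifier(name: str) -> str:
    """Fold the extra codepoints first, then compose with NFC.

    Folding first (an assumption of this sketch) lets a folded letter
    still compose with any combining marks that follow it.
    """
    return unicodedata.normalize("NFC", name.translate(EXTRA_FOLDS))


# Both spellings of each name collapse to a single canonical string.
print(normalize_identifier("\u025b") == normalize_identifier("\u03b5"))
print(normalize_identifier("\u00b5s") == normalize_identifier("\u03bcs"))
```

Characters outside the table, such as script capital H, pass through unchanged apart from ordinary NFC composition.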

cc: @StefanKarpinski, @jiahao

iamed2 · 2016-11-30T22:48:49Z

Could the exact additions to NFC be documented in the manual? And maybe another method of normalize_string?

stevengj · 2016-11-30T23:21:50Z

Yes, we should definitely document it. You automatically get the "Julia" normalization just by converting a string to a symbol, so I'm not 100% sure we need another API.

iamed2 · 2016-12-01T01:35:26Z

Oh, I didn't know that!

stevengj · 2016-12-01T03:57:48Z

Added documentation.

I also added normalization of fullwidth to halfwidth characters (using a subset of NFKC), and I hooked this in at parse time so that things like b＝3:5 now work using the fullwidth ＝ (#5903).

@jiahao, in particular, I applied the NFKC normalization to codepoints in [0xff01:0xff5e; 0x3000; 0xffe0:0xffee], which seems to be what Wikipedia says are the fullwidth forms of ASCII and other symbols. I would appreciate your expertise here regarding anything else that might be needed.
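
As a rough illustration of "a subset of NFKC", the following Python sketch applies NFKC only to characters inside the ranges listed above and leaves everything else alone. The helper names are hypothetical, and Python's `unicodedata` stands in for utf8proc here.

```python
import unicodedata


def in_fullwidth_subset(cp: int) -> bool:
    """True for the codepoint ranges listed in the comment above."""
    return 0xFF01 <= cp <= 0xFF5E or cp == 0x3000 or 0xFFE0 <= cp <= 0xFFEE


def fold_fullwidth(text: str) -> str:
    """Apply NFKC per character, but only inside the fullwidth subset."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if in_fullwidth_subset(ord(ch)) else ch
        for ch in text
    )


# Fullwidth "b=3:5", written with escapes so the codepoints are explicit.
fullwidth_expr = "\uff42\uff1d\uff13\uff1a\uff15"
print(fold_fullwidth(fullwidth_expr))
```

Restricting NFKC to these ranges keeps compatibility characters such as script H or superscript two untouched, which plain NFKC would not.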

stevengj · 2016-12-01T04:11:49Z

@iamed2, I take that back: Symbol(s::String) does not normalize s. However, parse(s::String) does if s is a valid identifier.

pabloferz · 2016-12-01T04:46:13Z

Another pair of characters that I believe meets the proposed criteria is ĸ (U+0138 latin small letter kra) and κ (U+03BA greek small letter kappa). The first one is easy to type on some keyboards.

stevengj · 2016-12-01T04:57:18Z

Hmm, not quite working yet. For one thing, I was too aggressive in normalizing every character read by the parser. We definitely don't want to normalize characters that appear in string and character literals.

stevengj · 2016-12-01T04:58:45Z

@pabloferz, in what circumstances are you likely to type a "kra" when kappa is intended?

stevengj · 2016-12-01T05:01:12Z

Another question for @jiahao: should we only normalize fullwidth identifiers and operators, or should we also normalize fullwidth quotation marks and @ and similar syntactically important symbols? I would think we should normalize those too, as otherwise when typing Julia code in Pinyin mode you'll continually have to switch back to English.

pabloferz · 2016-12-01T05:08:35Z

In a bunch of Linux keyboard layouts typing Right Alt + m gives 'micro' while Right Alt + k gives 'kra'.

stevengj · 2016-12-01T05:11:32Z

Ah, then kra -> kappa indeed makes sense.

stevengj · 2016-12-01T20:36:52Z

Kra seems like a weird case, because apparently it is more like a type of "q" than a type of "k". It seems like it would be like typing ß (german ss) for β (beta). I'm hesitant to do this normalization unless we have more evidence that it is something people are really likely to want.

stevengj · 2016-12-01T21:29:03Z

Okay, I fixed the earlier problems.

Now it does fullwidth->ASCII normalization on operators and identifiers, but also on numeric literals and punctuation (parens, brackets, @, commas, #comments)... basically everything except string and character literals. (Quotation marks still need to be ASCII, too.) So, if I understand correctly, you can type all your Julia code except for string/character literals in e.g. Pinyin input mode without switching back and forth.
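
A toy sketch of "fold everything except string literals", in Python. It is not the Julia parser: it only tracks double-quoted literals opened by an ASCII quote, ignores character literals and comments, and leaves fullwidth quotation marks unfolded so that they never open or close a literal. All names are hypothetical.

```python
import unicodedata

# Fullwidth quotation mark and apostrophe are left alone in this sketch,
# mirroring the remark that quotation marks still need to be ASCII.
_UNFOLDED_QUOTES = {0xFF02, 0xFF07}


def _fold(ch: str) -> str:
    """Fold one fullwidth character to its NFKC form; pass others through."""
    cp = ord(ch)
    if cp in _UNFOLDED_QUOTES:
        return ch
    if 0xFF01 <= cp <= 0xFF5E or cp == 0x3000 or 0xFFE0 <= cp <= 0xFFEE:
        return unicodedata.normalize("NFKC", ch)
    return ch


def fold_outside_strings(source: str) -> str:
    """Fold fullwidth characters everywhere except inside "..." literals."""
    out = []
    in_string = False
    escaped = False
    for ch in source:
        if in_string:
            out.append(ch)  # literal contents are copied verbatim
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':  # only the ASCII quote opens a literal
            in_string = True
            out.append(ch)
        else:
            out.append(_fold(ch))
    return "".join(out)


# Fullwidth s= followed by an ASCII-quoted literal with fullwidth contents.
print(fold_outside_strings('\uff53\uff1d"\uff41\uff42\uff43"'))
```

The identifier and operator are folded to ASCII while the literal's fullwidth contents survive, which is the property the earlier, too-aggressive version lacked.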

nalimilan · 2016-12-02T10:22:23Z

It's kind of weird that Linux layouts offer Kra rather than Kappa. I guess that's because it's considered Latin? Yet Right Alt + m gives Mu, which is a Greek letter. Maybe we should file a bug.

tkelman · 2016-12-02T12:23:35Z

deps/utf8proc.version

@@ -1,2 +1,2 @@
-UTF8PROC_BRANCH=v2.0.2
-UTF8PROC_SHA1=e3a5ed7b8bb5d0c6bb313d3e1f4d072c04113c4b
+UTF8PROC_BRANCH=master


this will need to be a release tag, and update checksums when ready

tkelman · 2016-12-02T12:24:26Z

doc/manual/variables.rst

+some Asian languages.  The Unicode characters ``ɛ`` (U+025B: Latin small letter open e)
+and ``µ`` (U+00B5: micro sign) are treated as equivalent to the corresponding
+Greek letters, because the former are easily accessible via some input methods.
+Different ways of entering Unicode combining are treated as equivalent


combining... characters?

stevengj · 2016-12-07T20:48:25Z

It would be nice to get some feedback on the main decision point here: whether it is desirable to be able to write Julia code (except for literal strings) as either/both ASCII and fullwidth and have them be parsed as equivalent.

I talked briefly with @jiahao about it. On the one hand, it would make it much more convenient to use East Asian-language input modes when writing Julia code. Most Asian-language programmers have learned not to do this because no other programming language allows it. On the other hand, fullwidth code looks a little weird, especially to programmers coming from other languages, and we have to decide whether that diversity in appearance is something we want to encourage.

stevengj · 2016-12-26T21:15:06Z

Rebased. Any chance of a decision on this, or is this postponed to the next release?

tkelman · 2016-12-26T21:23:43Z

deps/utf8proc.version

-UTF8PROC_BRANCH=v2.0.2
-UTF8PROC_SHA1=e3a5ed7b8bb5d0c6bb313d3e1f4d072c04113c4b
+UTF8PROC_BRANCH=v2.1
+UTF8PROC_SHA1=40e605959eb5cb90b2587fa88e3b661558fbc55a


update the checksums

thanks, fixed

tkelman · 2016-12-26T21:26:02Z

test/core.jl

@@ -3364,6 +3364,25 @@ typealias PossiblyInvalidUnion{T} Union{T,Int}
 @test Symbol("x") === Symbol("x")
 @test split(string(gensym("abc")),'#')[3] == "abc"

+# normalization of Unicode symbols (#19464)


these should probably be in test/parse, not core

tkelman · 2016-12-27T01:20:32Z

The fullwidth part of this seems a bit questionable to me.

stevengj · 2016-12-27T02:00:16Z

I can always put the fullwidth part (which is mostly relevant to speakers of Chinese and Japanese, it seems) in a separate PR if we want to keep the ε and μ parts. But the motivations seem rather similar to me: common input methods allow you to type the "same" character as different codepoints.

StefanKarpinski · 2017-01-04T18:24:40Z

In favor. Needs squashing, but otherwise seems good to go? (Failure is the usual on 32-bit.)

stevengj · 2017-01-04T18:47:12Z

Yup, should be good to squash and merge.

Keno · 2017-07-24T18:39:34Z

It would be nice to have this as an option to normalize_string.

stevengj · 2017-07-24T18:41:02Z

@Keno, you can always just call parse on an identifier string...

Keno · 2017-07-24T18:43:11Z

Well a) not if you're trying to replace the parser and b) that seems like a pretty roundabout way of doing it.

stevengj added needs decision A decision on this change is needed unicode Related to unicode characters and encodings labels Nov 30, 2016

stevengj force-pushed the id_norm branch from 93de092 to c477dcf Compare November 30, 2016 21:11

stevengj added needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Nov 30, 2016

stevengj removed the needs docs Documentation for this change is required label Dec 1, 2016

stevengj removed the needs tests Unit tests are required for this change label Dec 1, 2016

tkelman reviewed Dec 2, 2016

stevengj force-pushed the id_norm branch from f455db4 to 1843d97 Compare December 26, 2016 21:14

tkelman reviewed Dec 26, 2016

stevengj and others added 9 commits January 3, 2017 21:01

test fullwidth numeric literals and parens

595ec59

typo/clarification

b6f5217

update to utf8proc-2.1

61d2b50

checksum for utf8proc 2.1

d23a23a

moved symbol-normalization test from test/core to test/parse

01dfaa4

Revert "be more cautious about normalizing chars when parsing, so as not to normalize string literals" This reverts commit 81033fa.

2fdee35

Revert "normalize fullwidth characters during parsing (fixes JuliaLang#5903)" This reverts commit cf61972.

a28c90f

remove more references to fullwidth normalization

aedc4c2

rm fullwidth identifier normalization

b2ee6b6

stevengj force-pushed the id_norm branch from 5172cc6 to b2ee6b6 Compare January 4, 2017 02:02

Merge branch 'master' into id_norm

ab229d7

tkelman merged commit 62c423b into JuliaLang:master Jan 6, 2017

stevengj deleted the id_norm branch January 6, 2017 13:35

tkelman reviewed Jan 7, 2017

stevengj added a commit that referenced this pull request Jan 7, 2017

document that #19464 requires utf8proc 2.1 or later

2f223b7

randy3k mentioned this pull request Apr 24, 2017

Update unicode related code JuliaEditorSupport/Julia-sublime#48

Merged

stevengj mentioned this pull request Sep 22, 2017

Bowtie symbol isn't a valid character. #23820

Open

fredrikekre mentioned this pull request Dec 18, 2017

Make \cdot and \cdotp operators equivalent #25157

Merged

stevengj mentioned this pull request Feb 26, 2018

Unicode: Supporting middle dot & real minus #26193

Closed

chengchingwen mentioned this pull request Dec 5, 2018

WIP: Julia interface for custom Unicode normalization #30275

Closed

stevengj mentioned this pull request Jan 29, 2021

Should U+22A5 and U+27C2 be equivalent? JuliaStrings/utf8proc#218

Closed

epithet mentioned this pull request Jun 8, 2021

apply unicode normalization in help mode #41086

Open

stevengj mentioned this pull request Oct 8, 2021

add Unicode.julia_chartransform Julia-parser normalization #42561

Merged

sostock mentioned this pull request Jan 3, 2022

Remove unnecessary code related to encoding of μ PainterQubits/Unitful.jl#511

Merged

mrluc mentioned this pull request Apr 10, 2022

Compiler regression on 1.14 when using unicode as variables names elixir-lang/elixir#11750

Closed
