Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: custom Unicode normalization for Julia identifiers #19464

Merged
merged 20 commits into from
Jan 6, 2017

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Nov 30, 2016

As discussed in #14751, #5903, and JuliaStrings/utf8proc#11, this PR canonicalizes Unicode identifiers/symbols in Julia via a custom normalization, in addition to NFC normalization. That is, we treat certain codepoints as equivalent in identifiers. (Requires a new feature in the soon-to-be-released utf8proc 2.1.)

I don't think we want to go too crazy on "confusable" characters here—we made a conscious choice in #5434 to use NFC normalization, not NFKC normalization (ala Python 3), in order to preserve useful mathematical distinctions. I propose the following criteria for inclusion in our custom normalization, to be applied conservatively when real problems arise in the wild:

  • Both characters are easy to type by common input methods.
  • The two characters are so close in appearance that it makes no sense to use them for distinct identifiers in the same scope, and people might reasonably type one when the other is intended.

Currently, this PR canonicalizes ɛ (U+025B latin small letter open e) to ε (U+03B5 greek small letter epsilon) as discussed in #14751, and canonicalizes µ (U+00B5 micro) to μ (U+03BC greek small letter mu) as discussed in #5434. I think it would also be good to canonicalize fullwidth to halfwidth identifiers as discussed in #5903 (one of the primary reasons Python 3 went with NFKC).

cc: @StefanKarpinski, @jiahao

@stevengj stevengj added needs decision A decision on this change is needed unicode Related to unicode characters and encodings labels Nov 30, 2016
@iamed2
Copy link
Contributor

iamed2 commented Nov 30, 2016

Could the exact additions to NFC be documented in the manual? And maybe another method of normalize_string?

@stevengj
Copy link
Member Author

Yes, we should definitely document it. You automatically get the "Julia" normalization just by converting a string to a symbol, so I'm not 100% sure we need another API.

@stevengj stevengj added needs docs Documentation for this change is required needs tests Unit tests are required for this change labels Nov 30, 2016
@iamed2
Copy link
Contributor

iamed2 commented Dec 1, 2016

Oh, I didn't know that!

@stevengj stevengj removed the needs docs Documentation for this change is required label Dec 1, 2016
@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

Added documentation.

I also added normalization of fullwidth to halfwidth characters (using a subset of NFKC), and I hooked this in at parse time so that things like b=3:5 now work using the fullwidth (#5903).

@jiahao, in particular, I applied the NFKC normalization to codepoints in [0xff01:0xff5e; 0x3000; 0xffe0:0xffee], which seems to be what Wikipedia says are the fullwidth forms of ASCII and other symbols. I would appreciate your expertise here regarding anything else that might be needed.

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

@iamed2, I take that back: Symbol(s::String) does not normalize s. However, parse(s::String) does if s is valid identifier.

@pabloferz
Copy link
Contributor

Another pair of characters that I believe follow the proposed criteria are ĸ (U+0138 latin small letter kra) and κ (U+03BA greek small letter kappa). The first one is easy to type in some keyboards.

@stevengj stevengj removed the needs tests Unit tests are required for this change label Dec 1, 2016
@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

Hmm, not quite working yet. For one thing, I was too aggressive in normalizing every character read by the parser. We definitely don't want to normalize characters that appear in string and character literals.

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

@pabloferz, in what circumstances are you likely to type a "kra" when kappa is intended?

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

Another question for @jiahao: should we only normalize fullwidth identifiers and operators, or should we also normalize fullwidth quotation marks and @ and similar syntactically important symbols? I would think that the answer is yes, as otherwise when typing Julia code in Pinyin mode you'll continually have to switch back to English.

@pabloferz
Copy link
Contributor

In a bunch of Linux keyboard layouts typing Right Alt + m gives 'micro' while Right Alt + k gives 'kra'.

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

Ah, then kra -> kappa indeed makes sense.

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

Kra seems like a weird case, because apparently it is more like a type of "q" than a type of "k". It seems like it would be like typing ß (german ss) for β (beta). I'm hesitant to do this normalization unless we have more evidence that it is something people are really likely to want.

@stevengj
Copy link
Member Author

stevengj commented Dec 1, 2016

Okay, I fixed the earlier problems.

Now it does fullwidth->ASCII normalization on operators and identifiers, but also on numeric literals and punctuation (parens, brackets, @, commas, #comments)... basically everything except string and character literals. (Quotation marks still need to be ASCII, too.) So basically you can type all your Julia code except for string/character literals in e.g. Pinyin input mode, without switching back and forth if I understand correctly.

@nalimilan
Copy link
Member

It's kind of weird that Linux layouts offer Kra rather than Kappa. I guess that's because it's considered as latin? Yet Right Alt + m gives Mu, which is a greek letter. Maybe we should file a bug.

@@ -1,2 +1,2 @@
UTF8PROC_BRANCH=v2.0.2
UTF8PROC_SHA1=e3a5ed7b8bb5d0c6bb313d3e1f4d072c04113c4b
UTF8PROC_BRANCH=master
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need to be a release tag, and update checksums when ready

some Asian languages. The Unicode characters ``ɛ`` (U+025B: Latin small letter open e)
and ``µ`` (U+00B5: micro sign) are treated as equivalent to the corresponding
Greek letters, because the former are easily accessible via some input methods.
Different ways of entering Unicode combining are treated as equivalent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combining... characters?

@stevengj
Copy link
Member Author

stevengj commented Dec 7, 2016

It would be nice to get some feedback on the main decision point here: whether it is desirable to be able to write Julia code (except for literal strings) as either/both ASCII and fullwidth and have them be parsed as equivalent.

I talked briefly with @jiahao about it. On the one hand, it would make it much more convenient to use East Asian-language input modes when writing Julia code. Most Asian-language programmers have learned not to do this because no other programming language allows it. On the other hand, fullwidth code looks a little weird, especially to programmers coming from other languages, and we have to decide whether that diversity in appearance is something we want to encourage.

@stevengj
Copy link
Member Author

Rebased. Any chance of a decision on this, or is this postponed to the next release?

UTF8PROC_BRANCH=v2.0.2
UTF8PROC_SHA1=e3a5ed7b8bb5d0c6bb313d3e1f4d072c04113c4b
UTF8PROC_BRANCH=v2.1
UTF8PROC_SHA1=40e605959eb5cb90b2587fa88e3b661558fbc55a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the checksums

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, fixed

@@ -3364,6 +3364,25 @@ typealias PossiblyInvalidUnion{T} Union{T,Int}
@test Symbol("x") === Symbol("x")
@test split(string(gensym("abc")),'#')[3] == "abc"

# normalization of Unicode symbols (#19464)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should probably be in test/parse, not core

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@tkelman
Copy link
Contributor

tkelman commented Dec 27, 2016

The fullwidth part of this seems a bit questionable to me.

@stevengj
Copy link
Member Author

I can always put the fullwidth part (which is mostly relevant to speakers Chinese and Japanese languages, it seems) in a separate PR if we want to keep the ε and μ parts. But the motivations seem rather similar to me: common input methods allow you to type the "same" character as different codepoints.

@StefanKarpinski
Copy link
Member

In favor. Needs squashing, but otherwise seems good to go? (Failure is the usual on 32-bit.)

@stevengj
Copy link
Member Author

stevengj commented Jan 4, 2017

Yup, should be good to squash and merge.

@tkelman tkelman merged commit 62c423b into JuliaLang:master Jan 6, 2017
@stevengj stevengj deleted the id_norm branch January 6, 2017 13:35
@Keno
Copy link
Member

Keno commented Jul 24, 2017

It would be nice to have this as an option to normalize_string.

@stevengj
Copy link
Member Author

@Keno, you can always just call parse on an identifier string...

@Keno
Copy link
Member

Keno commented Jul 24, 2017

Well a) not if you're trying to replace the parser and b) that seems like a pretty roundabout way of doing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants