WIP: custom Unicode normalization for Julia identifiers #19464
Conversation
Could the exact additions to NFC be documented in the manual? And maybe another method of …
Yes, we should definitely document it. You automatically get the "Julia" normalization just by converting a string to a symbol, so I'm not 100% sure we need another API.
Oh, I didn't know that!
Added documentation. I also added normalization of fullwidth to halfwidth characters (using a subset of NFKC), and I hooked this in at parse time so that things like … @jiahao, in particular, I applied the NFKC normalization to codepoints in …
@iamed2, I take that back: …
Another pair of characters that I believe follow the proposed criteria are …
Hmm, not quite working yet. For one thing, I was too aggressive in normalizing every character read by the parser. We definitely don't want to normalize characters that appear in string and character literals.
@pabloferz, in what circumstances are you likely to type a "kra" when kappa is intended?
Another question for @jiahao: should we only normalize fullwidth identifiers and operators, or should we also normalize fullwidth quotation marks and @ and similar syntactically important symbols? I would think that the answer is yes, as otherwise when typing Julia code in Pinyin mode you'll continually have to switch back to English.
In a bunch of Linux keyboard layouts typing Right Alt + m gives 'micro' while Right Alt + k gives 'kra'. |
Ah, then kra -> kappa indeed makes sense. |
Kra seems like a weird case, because apparently it is more like a type of "q" than a type of "k". It seems like it would be like typing ß (German ss) for β (beta). I'm hesitant to do this normalization unless we have more evidence that it is something people are really likely to want.
Okay, I fixed the earlier problems. Now it does fullwidth->ASCII normalization on operators and identifiers, but also on numeric literals and punctuation (parens, brackets, @, commas, #comments)... basically everything except string and character literals. (Quotation marks still need to be ASCII, too.) So, if I understand correctly, you can type all your Julia code except for string/character literals in e.g. Pinyin input mode, without switching back and forth.
It's kind of weird that Linux layouts offer kra rather than kappa. I guess that's because it's considered Latin? Yet …
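The fullwidth->ASCII folding described above is exactly the Unicode `<wide>` compatibility mapping that NFKC applies. As a rough illustration (using Python's `unicodedata` as a stand-in for the utf8proc-based normalization in this PR, not Julia's actual parser):

```python
import unicodedata

# Fullwidth forms (U+FF01..U+FF5E) carry <wide> compatibility
# decompositions, so NFKC folds them to their ASCII counterparts.
# This is the subset of NFKC behavior that the parser applies to
# identifiers, operators, numbers, and punctuation.
fullwidth_code = "ｆ（ｘ）＝２＊ｘ＋１"
ascii_code = unicodedata.normalize("NFKC", fullwidth_code)
print(ascii_code)  # f(x)=2*x+1
```

String and character literals are deliberately excluded from this folding in the PR, since their contents must round-trip byte-for-byte.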
@@ -1,2 +1,2 @@
-UTF8PROC_BRANCH=v2.0.2
-UTF8PROC_SHA1=e3a5ed7b8bb5d0c6bb313d3e1f4d072c04113c4b
+UTF8PROC_BRANCH=master
this will need to be a release tag, and update checksums when ready
some Asian languages. The Unicode characters ``ɛ`` (U+025B: Latin small letter open e)
and ``µ`` (U+00B5: micro sign) are treated as equivalent to the corresponding
Greek letters, because the former are easily accessible via some input methods.
Different ways of entering Unicode combining are treated as equivalent
combining... characters?
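The two characters named in the documentation excerpt above behave differently under standard Unicode normalization, which is why a custom mapping is needed at all. Sketching this with Python's `unicodedata` (purely illustrative; the PR itself uses utf8proc):

```python
import unicodedata

# U+00B5 MICRO SIGN has a compatibility decomposition to U+03BC GREEK
# SMALL LETTER MU, so NFKC already folds it:
assert unicodedata.normalize("NFKC", "\u00b5") == "\u03bc"

# U+025B LATIN SMALL LETTER OPEN E has no decomposition at all, so
# neither NFC nor NFKC relates it to U+03B5 GREEK SMALL LETTER EPSILON.
# That equivalence has to come from a custom table on top of NFC:
assert unicodedata.normalize("NFKC", "\u025b") == "\u025b"
assert unicodedata.normalize("NFKC", "\u025b") != "\u03b5"
```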
It would be nice to get some feedback on the main decision point here: whether it is desirable to be able to write Julia code (except for literal strings) in either ASCII or fullwidth characters (or a mixture) and have the two be parsed as equivalent. I talked briefly with @jiahao about it. On the one hand, it would make it much more convenient to use East Asian-language input modes when writing Julia code. Most Asian-language programmers have learned not to do this because no other programming language allows it. On the other hand, fullwidth code looks a little weird, especially to programmers coming from other languages, and we have to decide whether that diversity in appearance is something we want to encourage.
Rebased. Any chance of a decision on this, or is this postponed to the next release?
-UTF8PROC_BRANCH=v2.0.2
-UTF8PROC_SHA1=e3a5ed7b8bb5d0c6bb313d3e1f4d072c04113c4b
+UTF8PROC_BRANCH=v2.1
+UTF8PROC_SHA1=40e605959eb5cb90b2587fa88e3b661558fbc55a
update the checksums
thanks, fixed
@@ -3364,6 +3364,25 @@ typealias PossiblyInvalidUnion{T} Union{T,Int}
 @test Symbol("x") === Symbol("x")
 @test split(string(gensym("abc")),'#')[3] == "abc"

+# normalization of Unicode symbols (#19464)
these should probably be in test/parse, not core
done.
The fullwidth part of this seems a bit questionable to me.
I can always put the fullwidth part (which is mostly relevant to speakers of Chinese and Japanese, it seems) in a separate PR if we want to keep the ε and μ parts. But the motivations seem rather similar to me: common input methods allow you to type the "same" character as different codepoints.
Revert "…not to normalize string literals". This reverts commit 81033fa.
In favor. Needs squashing, but otherwise seems good to go? (The failure is the usual one on 32-bit.)
Yup, should be good to squash and merge.
It would be nice to have this as an option to …
@Keno, you can always just call …
Well, (a) not if you're trying to replace the parser, and (b) that seems like a pretty roundabout way of doing it.
As discussed in #14751, #5903, and JuliaStrings/utf8proc#11, this PR canonicalizes Unicode identifiers/symbols in Julia via a custom normalization, in addition to NFC normalization. That is, we treat certain codepoints as equivalent in identifiers. (Requires a new feature in the soon-to-be-released utf8proc 2.1.)
I don't think we want to go too crazy on "confusable" characters here—we made a conscious choice in #5434 to use NFC normalization, not NFKC normalization (à la Python 3), in order to preserve useful mathematical distinctions. I propose the following criteria for inclusion in our custom normalization, to be applied conservatively when real problems arise in the wild:
Currently, this PR canonicalizes ɛ (U+025B latin small letter open e) to ε (U+03B5 greek small letter epsilon) as discussed in #14751, and canonicalizes µ (U+00B5 micro) to μ (U+03BC greek small letter mu) as discussed in #5434. I think it would also be good to canonicalize fullwidth to halfwidth identifiers as discussed in #5903 (one of the primary reasons Python 3 went with NFKC).
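Conceptually, the normalization described above is standard NFC plus a small table of extra codepoint equivalences. A minimal sketch in Python (the names `JULIA_EXTRA_MAP` and `normalize_identifier` are hypothetical; the real implementation is a custom-mapping callback in utf8proc 2.1, applied in Julia's parser):

```python
import unicodedata

# Hypothetical equivalence table mirroring the two mappings this PR adds
# on top of NFC. The actual mapping lives in utf8proc, not in Python.
JULIA_EXTRA_MAP = {
    "\u025b": "\u03b5",  # latin small letter open e -> greek epsilon (#14751)
    "\u00b5": "\u03bc",  # micro sign -> greek mu (#5434)
}

def normalize_identifier(name: str) -> str:
    """NFC-normalize, then apply the extra Julia-specific equivalences."""
    nfc = unicodedata.normalize("NFC", name)
    return "".join(JULIA_EXTRA_MAP.get(ch, ch) for ch in nfc)

print(normalize_identifier("\u00b5m"))  # μm (micro sign folded to Greek mu)
```

The fullwidth->halfwidth folding would extend the same idea with the `<wide>` subset of NFKC's compatibility mappings.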
cc: @StefanKarpinski, @jiahao