Allow four more characters to start identifiers. #11267
- Mathematical bold 0, 1 (U+1D7CE, U+1D7CF)
- Mathematical double-struck 0, 1 (U+1D7D8, U+1D7D9)

which are sometimes used to represent additive and multiplicative identities.

Closes #10762
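For concreteness, the proposal amounts to accepting four extra code points at the start of an identifier. A minimal sketch in C, assuming a hypothetical helper named `is_identity_digit` (the name is illustrative and does not come from the PR or from Julia's parser):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true only for the four code points listed in
   the PR description. */
static bool is_identity_digit(uint32_t wc)
{
    return wc == 0x1D7CE || wc == 0x1D7CF ||   /* mathematical bold 0, 1 */
           wc == 0x1D7D8 || wc == 0x1D7D9;     /* mathematical double-struck 0, 1 */
}
```

A predicate like this could be OR-ed into an existing identifier-start check. Whether the actual patch is structured that way is not shown in this thread.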
I'm ambivalent about this. Should we just allow all the double-struck digits? |
This does seem to clash with the idea of normalizing characters that could be mistaken for each other. |
Well, people do use the fancy 0 and 1 to mean additive and multiplicative identities. I'm not aware of uses for 2-9. The fancy digits are not canonicalizable to the ordinary digits. |
My main concern is that the rules for "what is an allowed identifier" are getting pretty complicated. |
I disagree strongly with this proposed change. It will be too easy to create identifiers that look like numbers, and can cause great confusion depending on what fonts are in use by the user. I like the general policy of "numberlike characters cannot be used to start identifier names" - it's simple, intuitive, and minimally subjective. |
@dpsanders it's up to you to defend this one. |
@sbromberger "It will be too easy to create identifiers that look like numbers" - I know it's probably not Julia's place, but if these characters appeared bigger, would that solve the issue? Usually, programming editors have used monospaced fonts. I'm not sure whether that is outdated with Unicode.. Unicode has halfwidth and fullwidth at least.. Just thinking, at least in editors/IDEs like Juno (that most will probably use anyway - with time), could there be a special case that makes "numberlikes" at least taller? @JeffBezanson: "I don't think it's the place of a programming language to try to ban characters." - I think I agree. Security is usually for the data coming in. Is the program itself not the programmer's responsibility? Some lint program could give a warning? Anyway I will never use these letters and do not care either way, just found the issue interesting.. |
@PallHaraldsson so now we're going to be forced to use a specific editor (I don't use Juno/LT, btw) with a specific set of fonts just to avoid confusion with identifiers looking like numbers? I really don't think that's a reasonable suggestion. |
No, not "forced" to use a different editor, the code would still work.. and be secure against outside attacks, just not be readable.. - unless you either avoid the letter (in your own code) or use say Juno - or a linter that warns you. Anyway.. just a thought.. and does anyone know if halfwidth/fullwidth, and proportional fonts in general, are used in editors..? |
I agree with @sbromberger on this one... julia programs are not identical to equations in a math textbook, as much as people seem to try to make it so (I'm not against that, don't get me wrong), but making something that is inconsistent with all programming languages I'm aware of (using a numberlike symbol by itself as an identifier) seems like it will just lead to confusion for many people... What's wrong with having to prefix these 4 characters with i (for identity), a (for array), or m (for matrix), when making a variable name? |
Scott and others: I just noticed, even just in the REPL:
𝟙 = 1 #yes, easier to see with:
a𝟙 = 1
Just like:
丙=3
Yes, here in my Thunderbird editor (and I guess vi) you see no size
on fullwidth #5903 |
I don't understand why a programming language, which can be composed and edited in any one of a number of methods (including pen!), would want to tie its hands with glyphs that can be confused with other symbols and that must rely on the user to avoid specific fonts / methods of editing in order to eliminate confusion and ambiguity. Just from an accessibility perspective, this becomes a nightmare. |
Allowing this is very different from packages actually using it; it's not like these identifiers are going to become widely popular... or used at all outside of this small niche (they'll never end up in Base, for example). Unicode in Julia has been incredibly useful... yet it's still often panned for the same reasons you cite (possible ambiguity, accessibility, needs a modern font). It's easy to write terrible code with similar-looking identifiers, even in ASCII. Who are we protecting here? I don't think 𝟙 should be aliased to 1. |
To me, this is horribly inconsistent... why these four, and not the double-struck 2..9, if the argument is that basically any Unicode character should be allowed at any position in an identifier... |
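To make the "why these four" question concrete: the mathematical digits occupy one contiguous run, U+1D7CE through U+1D7FF, laid out as five styles of ten digits each. The sketch below decodes a code point in that run into its style and digit value. The enum and function names are illustrative assumptions and do not come from Julia or utf8proc.

```c
#include <stddef.h>
#include <stdint.h>

/* Styles in the order they appear in U+1D7CE..U+1D7FF. */
enum math_digit_style {
    MATH_BOLD,            /* U+1D7CE..U+1D7D7 */
    MATH_DOUBLE_STRUCK,   /* U+1D7D8..U+1D7E1 */
    MATH_SANS,            /* U+1D7E2..U+1D7EB */
    MATH_SANS_BOLD,       /* U+1D7EC..U+1D7F5 */
    MATH_MONOSPACE        /* U+1D7F6..U+1D7FF */
};

/* Hypothetical helper: returns the digit value 0..9 and stores the style
   in *style (if style is not NULL), or returns -1 when wc is not a
   mathematical digit. */
static int math_digit_value(uint32_t wc, enum math_digit_style *style)
{
    if (wc < 0x1D7CE || wc > 0x1D7FF)
        return -1;
    uint32_t offset = wc - 0x1D7CE;
    if (style != NULL)
        *style = (enum math_digit_style)(offset / 10);
    return (int)(offset % 10);
}
```

Seen this way, the four proposed characters are the 0 and 1 entries of the bold and double-struck rows. Allowing all double-struck digits would instead be a single range check on U+1D7D8..U+1D7E1.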
@ScottPJones this PR is not merged. There is nothing to revert. Everyone, let's not complicate things here.
|
My gut feeling is that unicode character categories provide a good objective basis for decisions like this. I don't think it can be about fonts or appearances one way or the other. After all there are tons of pairs of similar-looking characters in unicode. However I wouldn't want to normalize 𝟙 and 1 to the same character. The only reason 𝟙 exists is to have a different symbol, not to write digits in a nifty-looking font. The standard arguably got this one wrong, and 𝟙 should have category Sm (math symbol). |
@jiahao I thought it was merged because of the comment by @sbromberger in #10762, i.e.:
I also don't approve of Hungarian notation (one of the many evils foisted upon the world by M$, IMO 😀). My point about using i, m, or a as prefixes was simply that those could retain most of the terseness of using the 𝟙 character, while still being a valid identifier under the current identifier-start rules... I never meant to imply that one should use "System" Hungarian notation... (there are still some valid arguments in favor of "Apps" Hungarian notation, not that I use it anyway).

There is a huge visual distinguishability issue with using this for the many people who use iOS. There is a Unicode standard (annex) about this issue... see http://unicode.org/reports/tr31/

Finally: this is _way_ too complicated already... who can remember these rules (except maybe Dr. @JeffBezanson)? (BTW, why all the special casing of the Sm category? Which ones aren't allowed?)

return (cat == UTF8PROC_CATEGORY_LU || cat == UTF8PROC_CATEGORY_LL ||
cat == UTF8PROC_CATEGORY_LT || cat == UTF8PROC_CATEGORY_LM ||
cat == UTF8PROC_CATEGORY_LO || cat == UTF8PROC_CATEGORY_NL ||
cat == UTF8PROC_CATEGORY_SC || // allow currency symbols
cat == UTF8PROC_CATEGORY_SO || // other symbols
// math symbol (category Sm) whitelist
(wc >= 0x2140 && wc <= 0x2a1c &&
((wc >= 0x2140 && wc <= 0x2144) || // ⅀, ⅁, ⅂, ⅃, ⅄
wc == 0x223f || wc == 0x22be || wc == 0x22bf || // ∿, ⊾, ⊿
wc == 0x22a4 || wc == 0x22a5 || // ⊤ ⊥
(wc >= 0x22ee && wc <= 0x22f1) || // ⋮, ⋯, ⋰, ⋱
(wc >= 0x2202 && wc <= 0x2233 &&
(wc == 0x2202 || wc == 0x2205 || wc == 0x2206 || // ∂, ∅, ∆
wc == 0x2207 || wc == 0x220e || wc == 0x220f || // ∇, ∎, ∏
wc == 0x2210 || wc == 0x2211 || // ∐, ∑
wc == 0x221e || wc == 0x221f || // ∞, ∟
wc >= 0x222b)) || // ∫, ∬, ∭, ∮, ∯, ∰, ∱, ∲, ∳
(wc >= 0x22c0 && wc <= 0x22c3) || // N-ary big ops: ⋀, ⋁, ⋂, ⋃
(wc >= 0x25F8 && wc <= 0x25ff) || // ◸, ◹, ◺, ◻, ◼, ◽, ◾, ◿
(wc >= 0x266f &&
(wc == 0x266f || wc == 0x27d8 || wc == 0x27d9 || // ♯, ⟘, ⟙
(wc >= 0x27c0 && wc <= 0x27c2) || // ⟀, ⟁, ⟂
(wc >= 0x29b0 && wc <= 0x29b4) || // ⦰, ⦱, ⦲, ⦳, ⦴
(wc >= 0x2a00 && wc <= 0x2a06) || // ⨀, ⨁, ⨂, ⨃, ⨄, ⨅, ⨆
(wc >= 0x2a09 && wc <= 0x2a16) || // ⨉, ⨊, ⨋, ⨌, ⨍, ⨎, ⨏, ⨐, ⨑, ⨒, ⨓, ⨔, ⨕, ⨖
wc == 0x2a1b || wc == 0x2a1c)))) || // ⨛, ⨜
(wc >= 0x1d6c1 && // variants of \nabla and \partial
(wc == 0x1d6c1 || wc == 0x1d6db ||
wc == 0x1d6fb || wc == 0x1d715 ||
wc == 0x1d735 || wc == 0x1d74f ||
wc == 0x1d76f || wc == 0x1d789 ||
wc == 0x1d7a9 || wc == 0x1d7c3)) ||
// super- and subscript +-=()
(wc >= 0x207a && wc <= 0x207e) ||
(wc >= 0x208a && wc <= 0x208e) ||
// angle symbols
(wc >= 0x2220 && wc <= 0x2222) || // ∠, ∡, ∢
(wc >= 0x299b && wc <= 0x29af) || // ⦛, ⦜, ⦝, ⦞, ⦟, ⦠, ⦡, ⦢, ⦣, ⦤, ⦥, ⦦, ⦧, ⦨, ⦩, ⦪, ⦫, ⦬, ⦭, ⦮, ⦯
// Other_ID_Start
wc == 0x2118 || wc == 0x212E || // ℘, ℮
(wc >= 0x309B && wc <= 0x309C)); // katakana-hiragana sound marks |
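As one possible alternative to the nested comparisons above, a whitelist like this could be kept as a sorted table of ranges and searched with a binary search. The sketch below covers only a handful of the ranges quoted above. `cp_range`, `sm_start_ranges`, and `in_ranges` are hypothetical names, and this is not how Julia's parser is actually written.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct cp_range { uint32_t lo, hi; };   /* inclusive code point range */

/* A subset of the whitelisted ranges, sorted by lo and non-overlapping. */
static const struct cp_range sm_start_ranges[] = {
    { 0x2140, 0x2144 },   /* ⅀, ⅁, ⅂, ⅃, ⅄ */
    { 0x2220, 0x2222 },   /* ∠, ∡, ∢ */
    { 0x22c0, 0x22c3 },   /* ⋀, ⋁, ⋂, ⋃ */
    { 0x22ee, 0x22f1 },   /* ⋮, ⋯, ⋰, ⋱ */
    { 0x2a00, 0x2a06 },   /* ⨀, ⨁, ⨂, ⨃, ⨄, ⨅, ⨆ */
};

/* Binary search over n sorted ranges; true if wc falls inside one. */
static bool in_ranges(uint32_t wc, const struct cp_range *r, size_t n)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc < r[mid].lo)
            hi = mid;
        else if (wc > r[mid].hi)
            lo = mid + 1;
        else
            return true;
    }
    return false;
}
```

The trade-off is readability of the data against the inline comments of the original: each table row still carries its glyphs, but the control flow no longer depends on getting the nested parentheses right.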
@ScottPJones, the reason for the special-casing of category Sm is that this category is something of an intractable mess where parsing is concerned:
In practice, having things like You're right that, given all the special-casing that we already do in the name of mathematical conventions, it is not completely crazy to special-case bold 0 and 1. |
OK, thanks! |
@ScottPJones, note that the parser also needs Unicode normalization, not just categories etc. Also, this code is not performance-critical last I checked, so that shouldn't be a major design criterion. And since we now maintain utf8proc, updating to new Unicode versions (which doesn't happen very often anyway) is easy already. |
(You can work on improving whatever you like, of course, and cleaner code is always welcome. But I just want to suggest that rewriting working, non-performance-critical, code should probably not be your priority.) |
@stevengj This particular pet project would accomplish a few things:
|
@stevengj About maintaining |
No, Julia does not need to depend on it if there is a nicer alternative. However, we probably need a C library for normalization, since it has to be executed in the flisp parser. (Updating to new unicode versions should take only a few minutes of work, since the data import is fully automated: update the URLs, run I just hope that, along with all the emojis, they finally add a superscript "q". |
Well, from what I saw, all that is needed in C is two id character lookup functions, and the normalization function... everything else can be done in Julia, accessing generated tables. (but tables structured so that doing normalization, checking an id, or lower/upper casing of a string doesn't wipe out your L1/L2 cache!) |
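The "generated tables" idea can be illustrated with a two-stage lookup: a small first-stage array maps each 256-code-point page to a block, and all pages with no members share one all-zero block. This is only a sketch under stated assumptions. The names `page_index`, `blocks`, `table_add`, and `table_contains` are hypothetical, and the table is filled at run time here purely to keep the example self-contained, whereas a real generator would presumably emit constant arrays.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_PAGES  (0x110000 >> 8)   /* 4352 pages of 256 code points each */
#define MAX_BLOCKS 8                 /* block 0 is the shared all-zero block */

static uint8_t page_index[NUM_PAGES];   /* stage 1: page -> block number */
static uint8_t blocks[MAX_BLOCKS][32];  /* stage 2: 256-bit bitmap per block */
static int blocks_used = 1;

/* Mark wc as a member; returns false if wc is out of range or if the
   sketch's fixed block budget is exhausted. */
static bool table_add(uint32_t wc)
{
    if (wc >= 0x110000)
        return false;
    uint32_t page = wc >> 8;
    if (page_index[page] == 0) {
        if (blocks_used == MAX_BLOCKS)
            return false;
        page_index[page] = (uint8_t)blocks_used++;
    }
    blocks[page_index[page]][(wc & 0xFF) >> 3] |= (uint8_t)(1u << (wc & 7));
    return true;
}

/* Two array reads and a bit test per lookup. */
static bool table_contains(uint32_t wc)
{
    if (wc >= 0x110000)
        return false;
    const uint8_t *block = blocks[page_index[wc >> 8]];
    return ((block[(wc & 0xFF) >> 3] >> (wc & 7)) & 1) != 0;
}
```

With these sizes the first stage is 4352 bytes and each populated page costs 32 bytes, so a sparse set such as an identifier-start whitelist stays small.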
Closing as too contentious. |