Allow four more characters to start identifiers. #11267

jiahao · 2015-05-14T18:21:41Z

- Mathematical bold 0, 1 (U+1D7CE, U+1D7CF)
- Mathematical double-struck 0, 1 (U+1D7D8, U+1D7D9)

which are sometimes used to represent additive and multiplicative identities.

Closes #10762

stevengj · 2015-05-15T16:16:06Z

I'm ambivalent about this. Should we just allow all the double-struck digits?

StefanKarpinski · 2015-05-15T17:00:43Z

This does seem to clash with the idea of normalizing characters that could be mistaken for each other.

jiahao · 2015-05-15T17:53:47Z

Well, people do use the fancy 0 and 1 to mean additive and multiplicative identities. I'm not aware of uses for 2-9.

The fancy digits are not canonicalizable to the ordinary digits.
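A minimal sketch of that last point, using Python's standard `unicodedata` module as a stand-in for utf8proc (an assumption of this example, not what the parser actually calls). The helper names are made up for illustration. Canonical normalization (NFC) leaves the four characters alone; only compatibility normalization (NFKC) folds them onto the ASCII digits.

```python
import unicodedata

# The four characters proposed in this PR, mapped to the ASCII digit they resemble.
FANCY_DIGITS = {
    "\U0001D7CE": "0",  # MATHEMATICAL BOLD DIGIT ZERO
    "\U0001D7CF": "1",  # MATHEMATICAL BOLD DIGIT ONE
    "\U0001D7D8": "0",  # MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO
    "\U0001D7D9": "1",  # MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
}

def survives_canonical(ch: str) -> bool:
    """True if canonical normalization (NFC) leaves ch unchanged."""
    return unicodedata.normalize("NFC", ch) == ch

def compat_fold(ch: str) -> str:
    """Result of compatibility normalization (NFKC) applied to ch."""
    return unicodedata.normalize("NFKC", ch)
```

So whether 𝟙 and 1 collide depends entirely on which normalization form is applied to identifiers.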

stevengj · 2015-05-15T18:55:41Z

My main concern is that the rules for "what is an allowed identifier" are getting pretty complicated.

sbromberger · 2015-05-21T22:06:04Z

I disagree strongly with this proposed change. It will be too easy to create identifiers that look like numbers, and can cause great confusion depending on what fonts are in use by the user.

I like the general policy of "numberlike characters cannot be used to start identifier names" - it's simple, intuitive, and minimally subjective.

jiahao · 2015-05-22T00:50:44Z

@dpsanders it's up to you to defend this one.

PallHaraldsson · 2015-05-29T14:07:49Z

@sbromberger "It will be too easy to create identifiers that look like numbers" - I know it's probably not Julia's place, but if this would appear bigger would it solve the issue?

Usually, programming editors have used monospaced fonts. I'm not sure whether that is outdated with Unicode; Unicode has halfwidth and fullwidth forms at least. Just thinking: at least in editors/IDEs like Juno (which most will probably use anyway, with time), could there be a special case that makes "numberlikes" at least taller?

@JeffBezanson: "I don't think it's the place of a programming language to try to ban characters." - I think I agree. Security is usually for the data coming in. Is the program itself not the programmer's responsibility? Some lint program could give a warning? Anyway, I will never use these letters and do not care either way; I just found the issue interesting.
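As a rough idea of what such a lint warning could look like, here is a sketch in Python (not an existing Julia linter; the function name `numberlike_warnings` and the rule "flag every non-ASCII character of category Nd" are assumptions made for this example):

```python
import unicodedata

def numberlike_warnings(source: str):
    """Return (line, column, character name) for each non-ASCII decimal digit."""
    warnings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Category Nd covers the ASCII digits too, so skip code points below 0x80.
            if ord(ch) > 0x7F and unicodedata.category(ch) == "Nd":
                warnings.append((lineno, col, unicodedata.name(ch)))
    return warnings
```

A check like this works on the source text alone, so it does not depend on which editor or font the reader happens to use.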

sbromberger · 2015-05-29T14:13:23Z

@PallHaraldsson so now we're going to be forced to use a specific editor (I don't use Juno/LT, btw) with a specific set of fonts just to avoid confusion with identifiers looking like numbers? I really don't think that's a reasonable suggestion.

PallHaraldsson · 2015-05-29T14:16:45Z

No, not "forced" to use a different editor; the code would still work and be secure against outside attacks, just not be readable, unless you either avoid the letter (in your own code) or use, say, Juno or a linter that warns you. Anyway, just a thought. Does anyone know whether halfwidth/fullwidth and proportional fonts in general are used in editors?

ScottPJones · 2015-05-29T14:21:56Z

I agree with @sbromberger on this one... julia programs are not identical to equations in a math textbook, as much as people seem to try to make it so (I'm not against that, don't get me wrong), but making something that is inconsistent with all programming languages I'm aware of (using a numberlike symbol by itself as an identifier) seems like it will just lead to confusion for many people... What's wrong with having to prefix these 4 characters with i (for identity), a (for array), or m (for matrix), when making a variable name?
I think this should be reverted... (sorry, @jiahao!)

PallHaraldsson · 2015-05-29T14:35:39Z

Scott and others:

I just noticed, even just in the REPL:

𝟙 = 1 #yes, easier to see with: a𝟙 = 1

Just like:

丙＝3

Yes, here in my Thunderbird editor (and I guess vi) you see no size difference. (And I do not use Juno yet. I guess I should; I'm postponing until there is a debugger, which would be a really good reason.) The REPL is amazingly just useful enough for me, and vi/less when I do edit(function).

PallHaraldsson · 2015-05-29T14:37:23Z

See #5903 on fullwidth.

sbromberger · 2015-05-29T14:39:48Z

I don't understand why a programming language, which can be composed and edited in any one of a number of methods (including pen!), would want to tie its hands with glyphs that can be confused with other symbols and that must rely on the user to avoid specific fonts / methods of editing in order to eliminate confusion and ambiguity.

Just from an accessibility perspective, this becomes a nightmare.

hayd · 2015-05-29T14:50:06Z

Allowing this is very different from packages actually using it; it's not like these identifiers are going to become widely popular... or used at all outside of this small niche (they'll never end up in Base, for example). Unicode in Julia has been incredibly useful... yet it's still often panned for the same reasons you cite (possible ambiguity, accessibility, needs a modern font).

It's easy to write terrible code with similar looking identifiers, even in ascii. Who are we protecting here?

I don't think 𝟙 should be aliased to 1.

ScottPJones · 2015-05-29T15:05:25Z

To me, this is horribly inconsistent... why these four, and not the double strike 2..9, if the argument is that basically any Unicode should be allowed at any position in an identifier...
I don't have any problem with allowing Unicode characters as operators, identifiers, etc., but I think they do need to follow certain fairly standard rules... (i.e. letters, letterlike, plus a few other things for initial character of an identifier... followed by those + numbers, numberlike, etc.).
I also don't think this really is the case that @JeffBezanson brought up... i.e. "I don't think it's the place of a programming language to try to ban characters."
I agree with him, in general, but this isn't banning the character; it is simply saying that its classification means it shouldn't be allowed as an identifier start character, just like 0..9, or :, or +, ...

jiahao · 2015-05-30T02:46:44Z

@ScottPJones this PR is not merged. There is nothing to revert.

Everyone, let's not complicate things here.

1. Hungarian notation is not idiomatic Julia, so saying that you can always use Hungarian notation is irrelevant.
2. The visual distinguishability issue is in general one we should be cognizant of. However, these specific characters, even written by hand, are designed to be distinguishable from ordinary letters and numerals. In fact, these characters derive from how mathematicians write them on paper and on blackboards. My own rendering of this looks like:

   [image: photo of the handwritten characters]

   Any properly designed font should respect the reason why these characters exist. Therefore I also consider this point irrelevant.
3. The main issue is whether the rules for valid identifiers are already too complicated as they stand.

JeffBezanson · 2015-05-30T03:31:39Z

My gut feeling is that unicode character categories provide a good objective basis for decisions like this. I don't think it can be about fonts or appearances one way or the other. After all there are tons of pairs of similar-looking characters in unicode.

However I wouldn't want to normalize 𝟙 and 1 to the same character. The only reason 𝟙 exists is to have a different symbol, not to write digits in a nifty-looking font. The standard arguably got this one wrong, and 𝟙 should have category Sm (math symbol).
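For reference, the categories in question can be checked with a short Python sketch. The `unicodedata` module is a stand-in here for the category lookup that utf8proc provides, and the helper name `category_of` is made up for this example.

```python
import unicodedata

def category_of(ch: str) -> str:
    """Return the two-letter Unicode general category of a single character."""
    return unicodedata.category(ch)

samples = {
    "1": category_of("1"),                    # ASCII digit one
    "\U0001D7D9": category_of("\U0001D7D9"),  # mathematical double-struck digit one
    "\u2207": category_of("\u2207"),          # nabla
}
```

Both digits report `Nd` (decimal number), while ∇ reports `Sm`, which is why a purely category-based rule treats 𝟙 like an ordinary digit.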

ScottPJones · 2015-05-30T08:04:36Z

@jiahao I thought it was merged because of the comment by @sbromberger in #10762, i.e.:

> ... but I see that a commit has already been made to allow this. I'll just go once more on the record that I think it's a bad idea, and will move on.

I also don't approve of Hungarian notation (one of the many evils foisted upon the world by M$, IMO 😀) My point about using i, m, or a, as prefixes was simply that those could retain most of the terseness of using the 𝟙 character, while still being a valid identifier using the current identifier start rules... I never meant to imply that one should use "System" Hungarian notation... (there are still some valid arguments in favor of "Apps" Hungarian notation, not that I use it anyway).

There is a huge visual distinguishability issue about using this, for the many people who use iOS, and read julia-users, julia-dev, and GitHub on their iPhone or iPad... there's no way (unless you jailbreak your device, something most people won't do) to change the font... and these characters just come out as boxes... without @jiahao's nice photo of a hand drawing, I wouldn't have seen the character at all until I got to my Mac this morning... at least with i𝟙 or a𝟙, you can see that it's probably an identifier... Is typing one extra character so much of a burden?

There is a Unicode standard (annex) about this issue... see http://unicode.org/reports/tr31/

Finally: this is _way_ too complicated already... who can remember these rules (except maybe Dr. @JeffBezanson)? (BTW, why all the special casing of the Sm category? Which ones aren't allowed?)

    return (cat == UTF8PROC_CATEGORY_LU || cat == UTF8PROC_CATEGORY_LL ||
            cat == UTF8PROC_CATEGORY_LT || cat == UTF8PROC_CATEGORY_LM ||
            cat == UTF8PROC_CATEGORY_LO || cat == UTF8PROC_CATEGORY_NL ||
            cat == UTF8PROC_CATEGORY_SC ||  // allow currency symbols
            cat == UTF8PROC_CATEGORY_SO ||  // other symbols

            // math symbol (category Sm) whitelist
            (wc >= 0x2140 && wc <= 0x2a1c &&
             ((wc >= 0x2140 && wc <= 0x2144) || // ⅀, ⅁, ⅂, ⅃, ⅄
              wc == 0x223f || wc == 0x22be || wc == 0x22bf || // ∿, ⊾, ⊿
              wc == 0x22a4 || wc == 0x22a5 ||   // ⊤ ⊥
              (wc >= 0x22ee && wc <= 0x22f1) || // ⋮, ⋯, ⋰, ⋱

              (wc >= 0x2202 && wc <= 0x2233 &&
               (wc == 0x2202 || wc == 0x2205 || wc == 0x2206 || // ∂, ∅, ∆
                wc == 0x2207 || wc == 0x220e || wc == 0x220f || // ∇, ∎, ∏
                wc == 0x2210 || wc == 0x2211 || // ∐, ∑
                wc == 0x221e || wc == 0x221f || // ∞, ∟
                wc >= 0x222b)) || // ∫, ∬, ∭, ∮, ∯, ∰, ∱, ∲, ∳

              (wc >= 0x22c0 && wc <= 0x22c3) ||  // N-ary big ops: ⋀, ⋁, ⋂, ⋃
              (wc >= 0x25F8 && wc <= 0x25ff) ||  // ◸, ◹, ◺, ◻, ◼, ◽, ◾, ◿

              (wc >= 0x266f &&
               (wc == 0x266f || wc == 0x27d8 || wc == 0x27d9 || // ♯, ⟘, ⟙
                (wc >= 0x27c0 && wc <= 0x27c2) ||  // ⟀, ⟁, ⟂
                (wc >= 0x29b0 && wc <= 0x29b4) ||  // ⦰, ⦱, ⦲, ⦳, ⦴
                (wc >= 0x2a00 && wc <= 0x2a06) ||  // ⨀, ⨁, ⨂, ⨃, ⨄, ⨅, ⨆
                (wc >= 0x2a09 && wc <= 0x2a16) ||  // ⨉, ⨊, ⨋, ⨌, ⨍, ⨎, ⨏, ⨐, ⨑, ⨒, ⨓, ⨔, ⨕, ⨖
                wc == 0x2a1b || wc == 0x2a1c)))) || // ⨛, ⨜

            (wc >= 0x1d6c1 && // variants of \nabla and \partial
             (wc == 0x1d6c1 || wc == 0x1d6db ||
              wc == 0x1d6fb || wc == 0x1d715 ||
              wc == 0x1d735 || wc == 0x1d74f ||
              wc == 0x1d76f || wc == 0x1d789 ||
              wc == 0x1d7a9 || wc == 0x1d7c3)) ||

            // super- and subscript +-=()
            (wc >= 0x207a && wc <= 0x207e) ||
            (wc >= 0x208a && wc <= 0x208e) ||

            // angle symbols
            (wc >= 0x2220 && wc <= 0x2222) || // ∠, ∡, ∢
            (wc >= 0x299b && wc <= 0x29af) || // ⦛, ⦜, ⦝, ⦞, ⦟, ⦠, ⦡, ⦢, ⦣, ⦤, ⦥, ⦦, ⦧, ⦨, ⦩, ⦪, ⦫, ⦬, ⦭, ⦮, ⦯

            // Other_ID_Start
            wc == 0x2118 || wc == 0x212E || // ℘, ℮
            (wc >= 0x309B && wc <= 0x309C)); // katakana-hiragana sound marks

stevengj · 2015-05-30T12:47:00Z

@ScottPJones, the reason for the special-casing of category Sm is that this category is something of an intractable mess where parsing is concerned:

- It contains symbols that we definitely want to allow in identifiers, like \nabla.
- It contains symbols that we want to parse as infix operators, like \oplus.
- It contains punctuation-like characters such as the U+23a4 bracket fragment that we probably do not want to allow in identifiers or operators.

In practice, having things like ⊕ available as infix operators, and things like x⁽⁺⁾ available as identifiers, is just way too useful to abandon, and in practice they don't seem to be confusing because they follow standard mathematical conventions, but they require special-casing.

You're right that, given all the special-casing that we already do in the name of mathematical conventions, it is not completely crazy to special-case bold 0 and 1.
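To illustrate the shape of that special-casing, here is a heavily abridged Python sketch of a category check plus an Sm whitelist. It is not the parser's code: the whitelist holds only four of the code points from the C excerpt quoted earlier, and the `extra` parameter is a hypothetical knob for trying out the four characters this PR proposes.

```python
import unicodedata

# Categories accepted outright as identifier-start, as in the C excerpt quoted earlier.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Sc", "So"}

# Deliberately tiny subset of the Sm whitelist, for illustration only.
SM_WHITELIST = {0x2202, 0x2205, 0x2207, 0x221E}  # ∂, ∅, ∇, ∞

# The four code points this PR proposes to allow.
PROPOSED = frozenset({0x1D7CE, 0x1D7CF, 0x1D7D8, 0x1D7D9})

def is_id_start(ch: str, extra: frozenset = frozenset()) -> bool:
    """True if ch may start an identifier under this simplified rule set."""
    if unicodedata.category(ch) in START_CATEGORIES:
        return True
    cp = ord(ch)
    return cp in SM_WHITELIST or cp in extra
```

With the default arguments ∇ is accepted, ⊕ is rejected (it stays available as an operator), and 𝟙 is rejected like any other digit; passing `extra=PROPOSED` is the one-line special case under discussion.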

ScottPJones · 2015-05-30T12:59:19Z

OK, thanks!
Somewhat OT: I am planning on making this table driven, unless people object... will be a lot faster, save space, etc., and adding new identifiers can be done by just running a julia program to regenerate the table (as a C file, like utf8proc_data.c) (could even be loadable at startup instead). That way it can directly pick up changes to the Unicode standard (instead of having to do it indirectly via remaking utf8proc, linking to a new version of utf8proc, etc.) Thoughts? 🍅s?
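A rough sketch of what "table driven" could mean here, written in Python for brevity rather than as the generated C file: a sorted list of code point ranges searched by bisection. The table holds only a handful of ranges copied from the C excerpt quoted earlier, and the names `ID_START_RANGES` and `in_table` are invented for this example.

```python
from bisect import bisect_right

# Sorted, non-overlapping (first, last) code point ranges.
ID_START_RANGES = [
    (0x2118, 0x2118),  # ℘
    (0x212E, 0x212E),  # ℮
    (0x2140, 0x2144),  # ⅀ .. ⅄
    (0x2220, 0x2222),  # ∠ .. ∢
    (0x22C0, 0x22C3),  # ⋀ .. ⋃
    (0x309B, 0x309C),  # katakana-hiragana sound marks
]
_FIRSTS = [first for first, _ in ID_START_RANGES]

def in_table(cp: int) -> bool:
    """True if code point cp falls inside one of the table's ranges."""
    # Find the last range whose first code point is <= cp, then check its upper bound.
    i = bisect_right(_FIRSTS, cp) - 1
    return i >= 0 and cp <= ID_START_RANGES[i][1]
```

Adding a character then means regenerating the table rather than editing a nested condition, at the cost of keeping the generator and the table in sync.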

stevengj · 2015-05-30T13:17:26Z

@ScottPJones, note that the parser also needs Unicode normalization, not just categories etc. Also, this code is not performance-critical last I checked, so that shouldn't be a major design criterion. And since we now maintain utf8proc, updating to new Unicode versions (which doesn't happen very often anyway), is easy already.

stevengj · 2015-05-30T13:29:28Z

(You can work on improving whatever you like, of course, and cleaner code is always welcome. But I just want to suggest that rewriting working, non-performance-critical, code should probably not be your priority.)

ScottPJones · 2015-05-30T13:35:25Z

@stevengj This particular pet project would accomplish a few things:

- Improve my knowledge of Julia data structures, which I definitely know I will need for our project
- Reduce the size of memory for Julia (no 1MB chunk of very spread out data... it's not laid out well for cache performance)
- Reduce the dependence of core Julia on an external C library
- Develop the techniques in Julia to handle the ones that will probably be more performance critical for me... (for example, normalization)
- Have the nice side benefit of speeding up some operations (whether or not it's performance critical) and simplifying the code (and getting it better documented)

Does that make sense?

ScottPJones · 2015-05-30T13:38:07Z

@stevengj About maintaining utf8proc... it can still be maintained by various julians, but does Julia really need to depend on it in the future, if there is a better alternative?
(and I think Unicode standards are going to be coming much quicker... mostly because of adding emoji... that's become a big issue for the Unicode organization)

stevengj · 2015-05-30T17:28:16Z

No, Julia does not need to depend on it if there is a nicer alternative. However, we probably need a C library for normalization, since it has to be executed in the flisp parser. (Updating to new unicode versions should take only a few minutes of work, since the data import is fully automated: update the URLs, run make update, commit, and update the commit number in Julia.)

I just hope that, along with all the emojis, they finally add a superscript "q".

ScottPJones · 2015-05-30T18:01:12Z

Well, from what I saw, all that is needed in C is two id character lookup functions, and the normalization function... everything else can be done in Julia, accessing generated tables. (but tables structured so that doing normalization, checking an id, or lower/upper casing of a string doesn't wipe out your L1/L2 cache!)

jiahao · 2015-08-24T20:50:55Z

Closing as too contentious.

Allow four more characters to start identifiers.

db9625e


jiahao force-pushed the cjh/doublestruck01 branch from da26e3c to db9625e May 14, 2015 19:23

jiahao closed this Aug 24, 2015

jiahao deleted the cjh/doublestruck01 branch October 22, 2015 02:56
