Restrict codepoints of valid identifiers #5936

Closed
stevengj opened this issue Feb 24, 2014 · 7 comments · Fixed by #6805
Labels: breaking (this change will break code), needs decision (a decision on this change is needed), unicode (related to Unicode characters and encodings)

Comments

@stevengj
Member

As mentioned in #5434, separate from the question of what Unicode normalization we should use for identifiers, it would probably be a good idea to restrict the codepoints of valid identifiers. Currently, you can do crazy things like:

julia> ² = 1
1
julia> 2²
2
julia> 2= 1
1
julia> 1 + 2
2
julia> –3 = 3
3
julia> -3 + –3
0

Python 3's valid identifiers provide one possible model.
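
For reference, Python 3's rule can be probed directly with `str.isidentifier()`. The snippet below is only a sketch of how that model treats characters like the ones in the examples above; the candidate list is my own choice, written with escapes so the codepoints are unambiguous.

```python
# Probe Python 3's identifier rule with characters like those in the examples above.
# The candidate list is illustrative and not taken from any specification.
candidates = [
    "x",          # plain ASCII letter
    "\u03b1",     # GREEK SMALL LETTER ALPHA
    "\u00b2",     # SUPERSCRIPT TWO
    "\u20133",    # EN DASH followed by the digit 3
    "x\u2032",    # x followed by PRIME
    "\u2211",     # N-ARY SUMMATION
]

results = {s: s.isidentifier() for s in candidates}

for s, ok in results.items():
    print(f"{s!r:10} -> {ok}")
```

Under Python's rule only the two letter cases are accepted; the superscript, the en dash, the prime, and the math symbol are all rejected.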

@stevengj
Member Author

Another possible model would be the Fortress language specification, which is fairly detailed (see chapter 5, although it doesn't discuss normalization) and was unburdened by backwards compatibility (unlike Python).

@stevengj
Member Author

cc: @malmaud, @jiahao

@JeffBezanson
Member

It's starting to look like we need more and more of the ICU library. It would be great to rely on libc and call iswalpha, but (1) some libc implementations are quite far behind the Unicode standard, and (2) for some reason this function is locale-dependent. I don't really see how whether a character is a letter should depend on locale...

@stevengj
Member Author

The utf8proc library will tell us the Unicode category of a codepoint, in a locale-independent way.
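
To illustrate what a locale-independent category lookup returns, here is a small sketch using Python's standard `unicodedata` module as a stand-in for utf8proc (it does not call utf8proc itself). The sample characters and their labels are my own picks, chosen to resemble the ones discussed in this issue.

```python
import unicodedata

# Sample characters; the dictionary keys are just descriptive labels.
samples = {
    "superscript two": "\u00b2",
    "en dash": "\u2013",
    "prime": "\u2032",
    "n-ary summation": "\u2211",
    "latin x": "x",
}

# unicodedata.category returns the two-letter general category,
# independent of the process locale.
categories = {name: unicodedata.category(ch) for name, ch in samples.items()}

for name, cat in categories.items():
    print(f"{name:16} {cat}")
```

The superscript is a number (No), the en dash is dash punctuation (Pd), the prime is other punctuation (Po), and the summation sign is a math symbol (Sm), so none of them would count as a letter.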

@JeffBezanson
Member

Excellent.

@stevengj
Member Author

What character categories do we want to allow in identifiers?

Certainly we want Sm (symbol, math) to be allowed, unlike Python.

As another example, Python does not allow Po (punctuation, other) characters in identifiers. Currently, Julia does, so you can have e.g. x′ as an identifier using the prime character. Do we want to allow this?
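
One way to make the question concrete is to write the rule as a pair of category allowlists. The sketch below is purely hypothetical: the two sets are my assumption of what "Python's categories plus Sm, with Po allowed after the first character" might look like, not a decision from this thread, and it again uses Python's `unicodedata` rather than utf8proc.

```python
import unicodedata

# Hypothetical policy for illustration only.
# Start: letters, letter-like numbers, and math symbols (Sm).
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Sm"}
# Continue: everything allowed at the start, plus marks, decimal digits,
# connector punctuation, and other punctuation (Po).
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc", "Po"}


def is_identifier(name):
    """Return True if `name` satisfies the hypothetical category policy."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if first != "_" and unicodedata.category(first) not in START_CATEGORIES:
        return False
    return all(unicodedata.category(c) in CONTINUE_CATEGORIES for c in rest)


print(is_identifier("x\u2032"))   # x followed by PRIME
print(is_identifier("\u00b2"))    # SUPERSCRIPT TWO alone
print(is_identifier("x.y"))       # ASCII full stop is also category Po
```

Note that admitting all of Po is very broad: the ASCII full stop, comma, and exclamation mark are in that category too, so this policy accepts `x.y` as a single name. A whole-category rule for Po would need exceptions.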

@StefanKarpinski
Member

I really like using prime in variable names. However, we probably want to use other mathematical operators as, well, operators. So, I suspect we'll have to go through the math pages and decide on a case-by-case basis whether they should be allowed in identifiers or become operators (and how they should parse).
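
A case-by-case decision could be recorded as explicit per-character tables consulted before any category rule. The following is a rough sketch of that shape only; the specific assignments (which symbols are operators, which are identifier characters) are invented for illustration and are not proposals from this thread.

```python
import unicodedata

# Hypothetical per-character decisions, for illustration only.
OPERATORS = {"+", "\u00d7", "\u2264"}        # PLUS SIGN, MULTIPLICATION SIGN, LESS-THAN OR EQUAL TO
IDENTIFIER_EXTRAS = {"\u2032", "\u221e"}     # PRIME, INFINITY


def classify(ch):
    """Classify one character as 'operator', 'identifier', or 'rejected'."""
    if ch in OPERATORS:
        return "operator"
    if ch in IDENTIFIER_EXTRAS:
        return "identifier"
    # Fall back to the general category: any letter is an identifier character.
    if unicodedata.category(ch).startswith("L"):
        return "identifier"
    return "rejected"


for ch in ["x", "\u2264", "\u2032", "\u221e", "\u2013"]:
    print(f"U+{ord(ch):04X} {classify(ch)}")
```

With tables like these, the prime stays usable in names while other symbols become operators, and anything not explicitly listed (such as the en dash) falls through to the category rule and is rejected.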
