Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
use unicode character categories to classify identifier characters
1 parent f4d1e94, commit 82e34b6
Showing 2 changed files with 63 additions and 13 deletions.
82e34b6
@JeffBezanson Ever since this commit, I can no longer `using AWS` successfully.
On 82e34b6, we seem to have a memory leak, as calling `using AWS` will freeze, then eat up all available memory until I forcibly kill it.
On f5b5b63 (which is `master` at the time of writing) I get an error message:
82e34b6
82e34b6 itself is no good; it was fixed by subsequent commits.
In s3_types.jl there is a zero-width space on line 652. I suppose that could be considered an identifier-continue character?
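A stray character like this can be located by scanning the source for code points in the Unicode format category (`Cf`), which includes the zero-width space U+200B. The following is a minimal Python sketch for illustration, not part of the Julia parser; the function name `find_format_chars` and the sample text are hypothetical.

```python
import unicodedata

def find_format_chars(text):
    """Return (line, column, code point, name) for every category-Cf character."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) == "Cf":
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((lineno, col, f"U+{ord(ch):04X}", name))
    return hits

# Hypothetical sample: a zero-width space hidden inside a field declaration.
sample = "type Bucket\n    name\u200b::String\nend\n"
for hit in find_format_chars(sample):
    print(hit)
```

Line and column numbers are 1-based, so the report can be checked directly in an editor that reveals hidden characters.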
82e34b6
@amitmurthy
82e34b6
Another possibility is that we should strip "default-ignorable" characters entirely. Any opinions @stevengj @jiahao ?
82e34b6
How about printing a clear error message? Having zero-width spaces in code seems like asking for trouble.
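The two options on the table, silently stripping the characters versus rejecting them with a clear error, can be contrasted in a short Python sketch. This is an illustration only: it assumes category `Cf` as a stand-in for the "default-ignorable" set (the real Unicode property is broader), and the helper names are hypothetical.

```python
import unicodedata

def is_format_char(ch):
    # Assumption: category Cf approximates "default-ignorable" here.
    return unicodedata.category(ch) == "Cf"

def strip_format_chars(source):
    """Option 1: drop invisible format characters before parsing."""
    return "".join(ch for ch in source if not is_format_char(ch))

def reject_format_chars(source):
    """Option 2: refuse the input and say exactly what was found and where."""
    for offset, ch in enumerate(source):
        if is_format_char(ch):
            name = unicodedata.name(ch, "unnamed character")
            raise SyntaxError(
                f"invisible character U+{ord(ch):04X} ({name}) at offset {offset}"
            )
    return source

code = "foo\u200bbar = 1"  # hypothetical input with a zero-width space
print(strip_format_chars(code))
try:
    reject_format_chars(code)
except SyntaxError as err:
    print(err)
```

Stripping makes the pasted code work but hides the fact that `foo` and `bar` were ever separated; rejecting surfaces the problem at the cost of a manual cleanup.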
82e34b6
I'm in favor of the error message option. Some of the 'default-ignorable' characters like soft hyphen, `\u00ad`, have rather complex behaviors that may be semantically meaningful in context. (Others, like zero-width joiner, `\u200d`, seem like pure trouble.)
82e34b6
Why use the zero-width joiner when you can use the zero-width non-joiner? Or my favorite, the zero-width non-breaking space.
82e34b6
This discussion lacks breadth, period.
82e34b6
FWIW, I don't know how those characters got in there in the first place. I use Kate as an editor, and that particular file may have had quite a bit of copy-pasting of variable names from either Amazon's documentation on the web or from a PDF.
Thus, these characters may creep in again in regular use in other similar circumstances. So, I would prefer "stripping 'default-ignorable' characters entirely".
82e34b6
@jiahao you deserve some kind of award. Also, "breadth."
82e34b6
TIL that it is possible to deprecate the meaning of characters in the Unicode standard. Somehow I feel that this would make for excellent Existential Comics fodder.
82e34b6
@MichaelHatherly reported that ∑ (summation) is no longer accepted as a valid symbol name. Σ (sigma) is still valid. Is this intended?
82e34b6
@mlubin, I think we didn't want to include the entire category Sm as valid identifier characters, because many of these should be infix operators instead.
However, we should probably whitelist more of the category Sm characters (currently just ℘ is allowed from Sm) as allowed identifiers. I would suggest at least those in https://gist.github.com/stevengj/0dd4927f019bf504df47
(I can't paste them in here as Github doesn't allow some of those unicode characters in comments, grrr.)
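The difference between the two look-alike characters comes down to their Unicode general category, which can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch (the `categories` dictionary is a hypothetical name); it says nothing about how the Julia parser itself looks categories up.

```python
import unicodedata

chars = [
    "\u2211",  # N-ARY SUMMATION
    "\u03a3",  # GREEK CAPITAL LETTER SIGMA
    "\u220f",  # N-ARY PRODUCT
    "\u03a0",  # GREEK CAPITAL LETTER PI
    "\u2118",  # SCRIPT CAPITAL P
]

# Map each character to its two-letter general category.
categories = {ch: unicodedata.category(ch) for ch in chars}

for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

The Greek letters are in category `Lu` (uppercase letter), while the n-ary summation and product signs and ℘ are in `Sm` (math symbol), which is why a rule based on categories treats them differently.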
82e34b6
One possibility would be to make everything in category Sm an infix operator except a small whitelist of identifier chars (plus a small list of prefix operators).
(Alternatively, make everything in Sm a valid identifier except a list of infix operators. However, it seems like there are a lot more infix operators than identifier chars in Sm. See my comment in #6582.)
82e34b6
In JuMP we recently supported `∑{}` and `Σ{}` for `sum{}` and `∏{}` for `prod{}`. I didn't realize the product character was no longer valid either.
@IainNZ @joehuchette
82e34b6
Yes I think we should whitelist those.
82e34b6
Also JuliaQuant/MarketTechnicals.jl#32 it seems
82e34b6
Hmm, because of the precedence issue (we need to manually pick what precedence each infix operator has), it might actually be better to:
This would also be easier to implement for 0.3 (adding Sm is a one-liner) and less traumatic in terms of backwards compatibility with JuMP etc.
82e34b6
May I also request the super and subscript +-= 207A-208C as valid identifiers?
82e34b6
Let's please just allow anything in Sm as an identifier unless it is specifically listed as an infix/prefix operator. @loladiro, super/subscript +/– are in Sm, so this would fix that.
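To see which code points in the requested range 207A-208C actually fall in Sm, the range can be listed with Python's `unicodedata`. This is a sketch for checking the claim, not parser code; `sm_in_range` is a hypothetical name.

```python
import unicodedata

FIRST, LAST = 0x207A, 0x208C  # range requested in the discussion

# Print every code point in the range with its general category.
for cp in range(FIRST, LAST + 1):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.category(ch)} {unicodedata.name(ch, '<unassigned>')}")

# Collect only the code points whose category is Sm.
sm_in_range = [
    f"U+{cp:04X}"
    for cp in range(FIRST, LAST + 1)
    if unicodedata.category(chr(cp)) == "Sm"
]
print(sm_in_range)
```

Only the super/subscript plus, minus, and equals signs come out as Sm; the range also spans superscript parentheses, superscript n, and the subscript digits, which an Sm-based rule would not cover.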
82e34b6
Yeah, that seems like the correct solution to me.
82e34b6
The safest thing is to leave most characters invalid, and then gradually classify them as either identifier characters or correctly-parsed operators. We already have an identifier character whitelist and an operator list, and we will keep growing both of those. I think it is equally difficult to enumerate either the operators in Sm or the non-operators in Sm.
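The "invalid by default" policy described here can be sketched as a three-way classifier: a character is an operator if it is on the operator list, an identifier character if it is on the whitelist or in a letter category, and invalid otherwise. The Python below is a simplified illustration under stated assumptions: both lists are hypothetical and deliberately tiny, and the letter rule is far coarser than the real identifier rules.

```python
import unicodedata

# Hypothetical, deliberately tiny lists; the real ones are much longer.
IDENTIFIER_WHITELIST = {"\u2118"}            # SCRIPT CAPITAL P, the one Sm character noted as allowed
KNOWN_OPERATORS = {"+", "\u2264", "\u2208"}  # plus, less-than-or-equal, element-of

def classify(ch):
    """Classify one character as 'operator', 'identifier', or 'invalid'."""
    if ch in KNOWN_OPERATORS:
        return "operator"
    if ch in IDENTIFIER_WHITELIST:
        return "identifier"
    cat = unicodedata.category(ch)
    # Assumption: letters, letter-numbers, and underscore are identifier characters.
    if cat.startswith("L") or cat == "Nl" or ch == "_":
        return "identifier"
    # Everything else, including unlisted Sm characters, stays invalid for now.
    return "invalid"

for ch in ["x", "\u03a3", "\u2211", "\u2264", "\u2118"]:
    print(f"U+{ord(ch):04X}: {classify(ch)}")
```

Under this scheme, promoting a character later (to either list) only makes previously rejected code valid, so it cannot break existing programs.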
82e34b6
@JeffBezanson, I've come to the conclusion that you're probably right in this, since switching a character from an identifier-char (if Sm were allowed by default) to an operator-char would cause unexpected code breakage.