Lexical syntax simplification #90
Another benefit of this is that the output of the lexer can be only spans and their associated token type, rather than having to do any work.
LIT_STR_RAW
  : 'r' LIT_STR_RAW_INNER
  | 'r' '"' .*? '"'
This can't just be `'r' LIT_STR_RAW_INNER2`? (and the inner tokens should probably be swapped).
Indeed it can.
This needs to take into account rust-lang/rust#14400 still.
  ;
LIT_FLOAT
  : [0-9][0-9_]* ('.' [0-9][0-9_]*)? ([eE] [-+]? [0-9][0-9]*)? FLOAT_SUFFIX?
The exponent `[0-9]*` part: should it be `[0-9_]*`?
Also I think this will be tightening what is accepted; at present, for example, `1.` is acceptable (but not `1.f32` for clear reasons), but this change will break that. Is that deliberate? Desirable? &c.
Good catch. That was not deliberate, I didn't mean to change the float literal syntax at all.
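To make the two points in this thread concrete, here is a minimal sketch that transcribes the proposed `LIT_FLOAT` rule into a Python regular expression and probes it. `FLOAT_SUFFIX` is not spelled out in this conversation, so the `f32`/`f64` alternatives below are an assumption, and `accepts` is a hypothetical helper, not part of the patch.

```python
import re

# Assumption: FLOAT_SUFFIX is not defined in this thread; f32/f64 is a guess.
FLOAT_SUFFIX = r"(?:f32|f64)"

# Direct transcription of the LIT_FLOAT rule as proposed in the diff.
PROPOSED = re.compile(
    r"[0-9][0-9_]*"               # integer part
    r"(?:\.[0-9][0-9_]*)?"        # optional fraction, needs a digit after '.'
    r"(?:[eE][-+]?[0-9][0-9]*)?"  # optional exponent, no underscores allowed
    + FLOAT_SUFFIX + r"?"
)

def accepts(text):
    """Return True if the whole string matches the proposed rule."""
    return PROPOSED.fullmatch(text) is not None

print(accepts("1.0"))    # fraction present
print(accepts("1."))     # trailing dot, the case raised in review
print(accepts("1e1_0"))  # underscore in the exponent
```

Under this transcription, `1.` is rejected because the fraction requires at least one digit after the dot, and `1e1_0` is rejected because the exponent digits use `[0-9]*` rather than `[0-9_]*`.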
+1, sounds like an improvement for
@kballard is the CRLF stuff correct? I extended the places that accept newline to also accept '\r\n', but not '\r', and I've removed '\r' from the whitespace skipping.
@cmr My patch actually allows bare
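As a rough illustration of the newline handling @cmr describes above (accept `'\n'` and `'\r\n'`, but not a lone `'\r'`), the rule can be written as a single pattern. The name `NEWLINE` is hypothetical and this is only a sketch of that description, not the grammar text from the patch.

```python
import re

# Hypothetical NEWLINE rule: LF or CRLF, but never a bare CR.
NEWLINE = re.compile(r"\r?\n")

def is_newline(text):
    """Return True if the whole string is one accepted line ending."""
    return NEWLINE.fullmatch(text) is not None

for sample in ["\n", "\r\n", "\r"]:
    print(repr(sample), is_newline(sample))
```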
  ;
LIT_CHAR
  : '\'' ( '\\' CHAR_ESCAPE | [^'\n\t\r] ) '\''
Shouldn't `[^'\n\t\r]` be `~['\n\t\r]`?
Also the character set needs to include `\\`, or else invalid character escapes will end up matching anyway.
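The backslash point can be checked with a small sketch that compares the rule as written against a variant whose negated set also excludes the backslash. `CHAR_ESCAPE` is not defined in this thread, so the escape set below is a simplified assumption, and the names `LOOSE` and `STRICT` are hypothetical.

```python
import re

# Assumption: a simplified stand-in for CHAR_ESCAPE (n, r, t, 0, backslash, quote).
ESCAPE = r"[nrt0\\']"

# LIT_CHAR as written in the diff: the negated set still admits a backslash.
LOOSE = re.compile(r"'(?:\\" + ESCAPE + r"|[^'\n\t\r])'")

# Variant with the backslash added to the excluded characters.
STRICT = re.compile(r"'(?:\\" + ESCAPE + r"|[^\\'\n\t\r])'")

lone_backslash = "'\\'"  # the three characters ' \ '

print(LOOSE.fullmatch(lone_backslash) is not None)   # accepted by the rule as written
print(STRICT.fullmatch(lone_backslash) is not None)  # rejected once '\' is excluded
```

With the loose set, a lone backslash between quotes is consumed by the second alternative and lexes as a complete character literal, which is the kind of invalid escape the comment is warning about.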
This grammar is wildly ambiguous. Identifiers, numbers and operators can be tokenized in multiple ways.
@zwarich do you have an example of an ambiguous sequence of tokens?
"12.12" could be INTEGER(12) DOT INTEGER(12)
etc. will be relatively easy to fix.
Doesn't antlr4 pick the longest matching token?
(Yeah, isn't the maximal munch principle the standard way to resolve "ambiguities" like this?)
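For readers unfamiliar with maximal munch, the following is a minimal sketch of a longest-match tokenizer applied to the `"12.12"` example. The three rules and the `tokenize` helper are hypothetical simplifications for illustration, not the grammar in this PR; ties are broken in favour of the rule listed first.

```python
import re

# Hypothetical, simplified rules; earlier rules win ties.
RULES = [
    ("FLOAT", re.compile(r"[0-9][0-9_]*\.[0-9][0-9_]*")),
    ("INTEGER", re.compile(r"[0-9][0-9_]*")),
    ("DOT", re.compile(r"\.")),
]

def tokenize(text):
    """Split text into (rule name, lexeme) pairs using longest match."""
    pos = 0
    tokens = []
    while pos < len(text):
        candidates = []
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at offset {pos}")
        # max() keeps the first of several equally long candidates.
        name, lexeme = max(candidates, key=lambda c: len(c[1]))
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens

print(tokenize("12.12"))
print(tokenize("12."))
```

Under longest match, `"12.12"` yields a single `FLOAT` token rather than `INTEGER DOT INTEGER`, while `"12."` still falls back to `INTEGER` followed by `DOT`.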
I like the idea of keeping comments after lexing so pretty-printers / refactoring tools can use the same lexer as the compiler, but how about we just make comment dropping a micropass between the lexer and parser instead of adding to the parser workload?
Sure, whatever.
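The micropass idea could look roughly like the sketch below: the lexer emits every token, including comments, as a kind plus a span, and a separate filter removes comments before the parser sees the stream. The `Token` type, the `"COMMENT"` kind name, and `drop_comments` are all hypothetical names chosen for illustration.

```python
from typing import Iterable, Iterator, NamedTuple

class Token(NamedTuple):
    kind: str   # token type name, e.g. "IDENT" or "COMMENT" (hypothetical)
    start: int  # byte offset where the token begins
    end: int    # byte offset one past the token's last byte

def drop_comments(tokens: Iterable[Token]) -> Iterator[Token]:
    """Micropass between lexer and parser: forward everything but comments."""
    for tok in tokens:
        if tok.kind != "COMMENT":
            yield tok

# Hand-written spans for the source text "x // hi\ny" (whitespace omitted).
lexed = [Token("IDENT", 0, 1), Token("COMMENT", 2, 7), Token("IDENT", 8, 9)]
for_parser = list(drop_comments(lexed))
print(for_parser)
```

A pretty-printer or refactoring tool would consume `lexed` directly, while the parser would only ever receive `for_parser`.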
Fixed most things, and verified that it works as I expect.
cc @nikomatsakis @pcwalton @brson I've updated this. It behaves as I expect for the code I've run it against, and accepts/rejects everything it should in the compiler/libs/testsuite/servo.