Various fixes to the lexer generation #8

br0nstein · 2022-10-13T23:23:26Z

Lexer fixes:

Support case-insensitive lexing
- JavaCC parsers may specify case-insensitivity globally (in Options) or per token production using IGNORE_CASE. ANTLR 4.10 supports this natively as well; however, many other projects (like antlr4ts) do not yet support it. Therefore, I have implemented it (for ASCII only) the traditional way: by generating letter fragments for A through Z and rewriting literals to use these fragments.
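To illustrate the fragment-based rewriting described above, here is a minimal sketch. The class name, method names, and the single-letter fragment names (`A` to `Z`) are assumptions made for this example; they are not taken from the JavaCC2ANTLR source.

```java
// Hypothetical helper: illustrates the technique only, not the PR's actual code.
final class CaseInsensitiveLiterals {

    /** Rewrites a literal such as "select" into letter fragments: S E L E C T. */
    static String rewrite(String literal) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // run of non-letter characters
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            boolean asciiLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (asciiLetter) {
                flushQuoted(pending, out);
                appendPart(out, String.valueOf(Character.toUpperCase(c)));
            } else {
                pending.append(c); // non-letters stay in an ordinary quoted literal
            }
        }
        flushQuoted(pending, out);
        return out.toString();
    }

    /** Emits the fragment definition for one upper-case ASCII letter. */
    static String fragmentFor(char upper) {
        return "fragment " + upper + " : [" + Character.toLowerCase(upper) + upper + "] ;";
    }

    private static void flushQuoted(StringBuilder pending, StringBuilder out) {
        if (pending.length() == 0) {
            return;
        }
        // Escape backslashes and single quotes inside the quoted literal.
        String escaped = pending.toString().replace("\\", "\\\\").replace("'", "\\'");
        appendPart(out, "'" + escaped + "'");
        pending.setLength(0);
    }

    private static void appendPart(StringBuilder out, String part) {
        if (out.length() > 0) {
            out.append(' ');
        }
        out.append(part);
    }
}
```

With this sketch, `rewrite("Group_By")` yields `G R O U P '_' B Y`, and `fragmentFor('A')` yields `fragment A : [aA] ;`.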
Handle matching tokens in non-default lexical state (by using type command)
- ANTLR lexer rule names must be unique across all modes, so these rules were being prefixed with the mode name. However, the parser could not match them, since a rule's token type defaults to its rule name. Since parsers match by type, handle this by generating type commands.
Handle matching token only defined in non-default lexical states
- Introduce the concept of a canonical state: the mode in which the rule name is generated without a mode prefix. This is DEFAULT_MODE if the rule exists in that mode; otherwise one of the other modes is picked. Tokens not defined in the default mode must not be prefixed in their canonical state, so that the parser can find them. Note: type commands can only refer to valid rule names, therefore the rule name must be generated without prefixing in some mode.
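The interaction between mode prefixes, type commands, and the canonical state can be sketched as follows. This is an illustration under assumptions: the `MODE_TOKEN` prefix format, the choice of the first declaring mode as the fallback, and all names below are hypothetical rather than taken from the PR.

```java
import java.util.List;

// Hypothetical helper: illustrates canonical-state naming, not the PR's actual code.
final class CanonicalState {
    static final String DEFAULT_MODE = "DEFAULT_MODE";

    /** DEFAULT_MODE if the token is defined there, otherwise the first declaring mode (assumed tie-break). */
    static String canonicalMode(List<String> declaringModes) {
        return declaringModes.contains(DEFAULT_MODE) ? DEFAULT_MODE : declaringModes.get(0);
    }

    /** Renders the lexer rule for one token in one mode. */
    static String ruleFor(String token, String body, String mode, List<String> declaringModes) {
        if (mode.equals(canonicalMode(declaringModes))) {
            // Canonical state: unprefixed name, so the token type equals the rule name.
            return token + " : " + body + " ;";
        }
        // Any other mode: a unique prefixed name, mapped back to the shared type.
        return mode + "_" + token + " : " + body + " -> type(" + token + ") ;";
    }
}
```

For a token `ID` declared in `DEFAULT_MODE` and `IN_COMMENT`, the rule in `IN_COMMENT` would be rendered as `IN_COMMENT_ID : [a-z]+ -> type(ID) ;`, while the rule in the default mode keeps the plain name `ID`.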
Escape single quotes within literals and ] and - within character sets
- Per ANTLR lexer rules
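As a rough sketch of these escaping rules, assuming backslash itself is also escaped (the class and method names are hypothetical):

```java
// Hypothetical helper: illustrates ANTLR literal and character-set escaping.
final class AntlrEscapes {

    /** Wraps raw text in single quotes, escaping backslashes and single quotes. */
    static String literal(String raw) {
        return "'" + raw.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Builds a character set from the given members, escaping backslash, ']' and '-'. */
    static String charSet(String members) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < members.length(); i++) {
            char c = members.charAt(i);
            if (c == '\\' || c == ']' || c == '-') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append(']').toString();
    }
}
```

Here `literal("it's")` produces `'it\'s'` and `charSet("a]-")` produces `[a\]\-]`.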
Automatically generate rule names for unnamed more/skip tokens
- Generate the rule names using an incrementing counter for each type
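A per-type counter of this kind might look like the sketch below. The `KIND_n` naming scheme is an assumption for illustration; the names actually generated by the PR may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: one incrementing counter per token kind.
final class UnnamedRuleNamer {
    private final Map<String, Integer> counters = new HashMap<>();

    /** Returns the next generated rule name for the given kind, e.g. SKIP_1, SKIP_2. */
    String next(String kind) {
        int n = counters.merge(kind, 1, Integer::sum); // starts at 1, then increments
        return kind + "_" + n;
    }
}
```

Counters are independent per kind, so the first unnamed SKIP token and the first unnamed MORE token both receive the suffix `_1`.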
Correct commands mapping for more/special/token to default to mode(..) commands for changes in lexical state
- It is not robust to infer that changing to the default state means popMode and that any other state change means pushMode. This doesn't work for JavaCC parsers that implement pushMode/popMode themselves in actions, using SwitchTo statements with a state stack, and that need to maintain a semantic difference between a simple state switch and a push/pop. Mapping every transition to the default state to a popMode command assumes that the previous mode on the ANTLR mode stack is the default mode, which may not actually be the case. Additionally, SPECIAL tokens were previously being mapped to skip commands: it is more accurate to map them to channel(HIDDEN) commands, which allows them to be retrieved manually from the ANTLR token stream, as you could do in JavaCC. Lastly, MORE tokens were being dropped: it is more accurate to map them to more commands.
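The default mapping described above can be sketched as a small function. It assumes that the JavaCC default state is named `DEFAULT` and is rendered as ANTLR's `DEFAULT_MODE`; the class, enum, and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: maps a JavaCC token kind and optional state change to ANTLR lexer commands.
final class LexerCommands {
    enum Kind { TOKEN, SPECIAL_TOKEN, SKIP, MORE }

    /** Returns the " -> ..." suffix for a lexer rule, or "" when no command is needed. */
    static String commands(Kind kind, String nextState) {
        List<String> cmds = new ArrayList<>();
        switch (kind) {
            case SPECIAL_TOKEN: cmds.add("channel(HIDDEN)"); break; // still retrievable from the token stream
            case SKIP:          cmds.add("skip"); break;
            case MORE:          cmds.add("more"); break;
            default:            break; // regular TOKEN: no extra command
        }
        if (nextState != null) {
            // A plain state switch becomes mode(..), never an inferred pushMode/popMode.
            String mode = nextState.equals("DEFAULT") ? "DEFAULT_MODE" : nextState;
            cmds.add("mode(" + mode + ")");
        }
        return cmds.isEmpty() ? "" : " -> " + String.join(", ", cmds);
    }
}
```

For example, a MORE token that switches to a state named `IN_COMMENT` gets the suffix ` -> more, mode(IN_COMMENT)`, and a regular token that switches back to the default state gets ` -> mode(DEFAULT_MODE)`.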
Add feature to allow user to specify which token actions should result in pushMode/popMode commands (don't infer popMode from switch to default state)
- See readme. Two alternatives were considered. (1) Have the user add a known javadoc comment to the TOKEN_MGR_DECLS functions to identify them to JavaCC2ANTLR. This is less flexible when the user does not control the TOKEN_MGR_DECLS functions and can only append to the block, as is the case for Calcite users. (2) Have the user pass arguments to JavaCC2ANTLR instead of updating their grammar. The chosen approach is less error-prone because the reference sits closer to the actual code, in case the functions are renamed.
Only generate fragments in one state (rules across all modes reference them without rewriting token references)
- Only generate them in their canonical state
Support skip tokens natively and remove hack for java.jj that inferred based on token name containing "comment"
- Map skip tokens to lexer rules with skip commands
Properly surround "choice" token body with single set of parentheses
- Avoid redundant parentheses

In master, there were hundreds of guava files that could not be successfully parsed in the JavaConversion test. These files have been marked quarantined and will not fail the test suite. I have verified that my changes do not regress parsing any of these files.


ftomassetti

Thank you for your PR. I am happy to accept it

ftomassetti approved these changes Oct 14, 2022


ftomassetti merged commit 8add27a into ftomassetti:master Oct 14, 2022
