@br0nstein (Contributor) commented Oct 13, 2022

Lexer fixes:

  • Support case-insensitive lexing
    • JavaCC parsers may specify case-insensitivity globally (in Options) or per token production using IGNORE_CASE. ANTLR 4.10 also supports this natively; however, many other projects (like antlr4ts) do not yet support it. Therefore, I have implemented it (for ASCII only) the traditional way: generating letter fragments for A through Z and rewriting literals to use these fragments.
  • Handle matching tokens in non-default lexical state (by using type command)
    • ANTLR lexer rule names must be unique across all modes, so these rules were being prefixed with the mode name. However, the parser could not match them, since a rule's token type defaults to its rule name. Because parsers match by type, this is handled by generating type commands.
  • Handle matching tokens defined only in non-default lexical states
    • Introduce the concept of a canonical state: the mode in which the rule name is generated without a mode prefix. This is DEFAULT_MODE if the rule exists in that mode; otherwise one of the rule's modes is picked. Tokens not defined in the default mode must be left unprefixed in their canonical state so that the parser can find them. Note: type commands can only refer to valid rule names, so the rule name must be generated without a prefix in some mode.
  • Escape single quotes within literals and ] and - within character sets
  • Automatically generate rule names for unnamed more/skip tokens
    • Generate the rule names using an incrementing counter for each type
  • Correct the command mapping for more/special/token so that changes in lexical state default to mode(..) commands
    • It is not robust to infer that a change to the default state means popMode and that any other state change means pushMode. This does not work for JavaCC parsers that implement pushMode/popMode themselves in actions, using SwitchTo statements with a state stack, and that need to maintain a semantic difference between a simple state switch and a push/pop. Mapping every transition to the default state to a popMode command assumes that the previous mode on the ANTLR mode stack is the default mode, which may not be the case. Additionally, SPECIAL tokens were previously mapped to skip commands; it is more accurate to map them to channel(HIDDEN) commands, which allows them to be retrieved manually from the ANTLR token stream, as is possible in JavaCC. Lastly, MORE tokens were being dropped; it is more accurate to map them to more commands.
  • Add feature to allow user to specify which token actions should result in pushMode/popMode commands (don't infer popMode from switch to default state)
    • See the README. Alternatives considered: (1) have the user add a known Javadoc comment to the TOKEN_MGR_DECLS functions to identify them to JavaCC2ANTLR (less flexible when the user does not control the TOKEN_MGR_DECLS functions and can only append to the block, as is the case for Calcite users); (2) have the user pass arguments to JavaCC2ANTLR instead of updating their grammar (the chosen approach is less error-prone because the reference sits closer to the actual code, in case the functions are renamed).
  • Only generate fragments in one state (rules across all modes reference them without rewriting token references)
    • Only generate them in their canonical state
  • Support skip tokens natively and remove hack for java.jj that inferred based on token name containing "comment"
    • Map skip tokens to lexer rules with skip commands
  • Properly surround "choice" token body with single set of parentheses
    • Avoid redundant parentheses
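
The case-insensitive rewrite described in the first item can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: the function names and the single-letter fragment names (A through Z) are assumptions, and the sketch does not escape special characters in the non-letter parts of a literal.

```python
import string


def letter_fragments():
    """Build one ANTLR fragment rule per ASCII letter, e.g. `fragment A : [aA] ;`.

    The fragment naming (A..Z) is an assumption made for this sketch.
    """
    return [f"fragment {c} : [{c.lower()}{c}] ;" for c in string.ascii_uppercase]


def case_insensitive_literal(literal):
    """Rewrite a literal as a sequence of letter-fragment references.

    ASCII letters become references to the matching fragment; runs of other
    characters stay together as one quoted literal.
    """
    parts = []
    pending = ""  # current run of non-letter characters
    for ch in literal:
        if ch in string.ascii_letters:
            if pending:
                parts.append(f"'{pending}'")
                pending = ""
            parts.append(ch.upper())
        else:
            pending += ch
    if pending:
        parts.append(f"'{pending}'")
    return " ".join(parts)
```

With these helpers, a keyword such as select would be emitted as the fragment sequence S E L E C T, so the generated grammar does not depend on ANTLR's built-in case-insensitivity option.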

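The canonical-state idea and the generated type commands from the list above could look something like this. It is a hedged sketch only: the `STATE_TOKEN` prefix format, the use of the JavaCC state name `DEFAULT`, and the tie-break of taking the alphabetically first state when a token has no default-state definition are all assumptions for illustration.

```python
def canonical_state(states):
    """Pick the state in which the rule keeps its unprefixed name.

    Prefers the default state; otherwise falls back to the alphabetically
    first state (an arbitrary choice made for this sketch).
    """
    return "DEFAULT" if "DEFAULT" in states else sorted(states)[0]


def rule_headers(token, states):
    """Map each lexical state to (rule name, type command or None).

    Only the canonical state keeps the bare rule name. Every other state
    gets a prefixed name plus a type(...) command, so the parser still
    sees a single token type.
    """
    canon = canonical_state(states)
    headers = {}
    for state in states:
        if state == canon:
            headers[state] = (token, None)
        else:
            headers[state] = (f"{state}_{token}", f"type({token})")
    return headers
```

For a token ID defined in both DEFAULT and IN_COMMENT, this yields the rule ID in the default mode and IN_COMMENT_ID with `-> type(ID)` in the other mode. For a token defined only in non-default states, one of those states still emits the bare name, which keeps the type command valid.
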
On master, hundreds of Guava files could not be parsed successfully in the JavaConversion test. These files have been marked as quarantined and will not fail the test suite. I have verified that my changes do not regress the parsing of any of these files.
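
The corrected command mapping described in the list above (skip for SKIP, channel(HIDDEN) for SPECIAL_TOKEN, more for MORE, and a plain mode(..) for any lexical state change) can be sketched as below. The function and table names are hypothetical, and the user-configured pushMode/popMode feature is left out of this sketch.

```python
# JavaCC regular-expression kinds mapped to the ANTLR lexer command they imply.
KIND_COMMAND = {
    "TOKEN": None,                       # ordinary token, no extra command
    "SKIP": "skip",                      # discarded by the lexer
    "SPECIAL_TOKEN": "channel(HIDDEN)",  # kept in the stream, hidden from the parser
    "MORE": "more",                      # text is prepended to the next token
}


def lexer_commands(kind, next_state=None):
    """Return the ANTLR lexer commands for one JavaCC token production.

    A state change always becomes mode(..); nothing is inferred about
    pushMode or popMode from the target state.
    """
    commands = []
    if KIND_COMMAND[kind] is not None:
        commands.append(KIND_COMMAND[kind])
    if next_state is not None:
        target = "DEFAULT_MODE" if next_state == "DEFAULT" else next_state
        commands.append(f"mode({target})")
    return commands
```

For example, a SPECIAL_TOKEN that switches back to the default state would produce `-> channel(HIDDEN), mode(DEFAULT_MODE)` rather than a popMode command.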

@ftomassetti (Owner) left a comment

Thank you for your PR. I am happy to accept it.

@ftomassetti merged commit 8add27a into ftomassetti:master on Oct 14, 2022