Skip to content

Conversation

@martint
Copy link
Member

@martint martint commented Nov 17, 2022

When the parsing failure occurs in the middle of a rule (as opposed to at the start of a rule), do not suppress the candidate tokens even if the rule is marked as suppressed.

For instance, if the parsing fails immediately when recursing into primaryExpression, the candidates from that rule will be suppressed to avoid polluting the error message with all the possible prefixes of that rule (EXISTS, CASE, TRY_CAST, CURRENT_XXX, etc).

However, if the parsing fails after is matched within the alternative, include in set of expected candidates.

Fixes #15044

Release notes

(x) This is not user-visible or docs only and no release notes are required.

@cla-bot cla-bot bot added the cla-signed label Nov 17, 2022
@martint martint requested review from findepi and kasiafi November 17, 2022 18:26
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem right. The error position is reported as the end of UNBOUNDED:

SELECT X() OVER (ROWS UNBOUNDED) FROM T
                               ^

The only valid tokens at that point are FOLLOWING and PRECEDING, so we shouldn't be reporting all of these extra tokens.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I know. I’m trying to figure out why it’s doing that. There’s some other bug lurking in there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out these are all valid tokens per the grammar. It's because UNBOUNDED is a non-reserved word, so it can take the place of an identifier. In particular, it's considering alternatives that include the following productions:

over
    : OVER ([...] | '(' windowSpecification ')')
    ;

windowSpecification
    : [...]
      windowFrame?
    ;

windowFrame
    : [...]
      frameExtent
    ;

frameExtent
    : [...]
    | frameType=ROWS start=frameBound
    | [...]
    ;

frameBound
    : [...]
    | expression boundType=(PRECEDING | FOLLOWING)  #boundedFrame
    ;

And <expression> can be an <identifier> followed by a number of things, including ., [, OVER, etc.

For instance, this goes through the parser successfully even though it's clearly wrong if we consider UNBOUNDED a special keyword:

SELECT X() OVER (ROWS UNBOUNDED -> 1 PRECEDING) FROM t

It's treating UNBOUNDED -> 1 as a lambda expression, which is a valid (syntax-wise) expression in the "#boundedFrame" frame bound above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I mark UNBOUNDED as a reserved word, then it produces the expected message:

line 1:32: mismatched input ')'. Expecting: 'FOLLOWING', 'PRECEDING'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to make the error handler treat UNBOUNDED as a reserved word without actually making it reserved? In my opinion, printing all alternatives for the case of UNBOUNDED being an identifier is unnecessary and confusing, even if those are all correct per grammar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue I can spot here is that the error handler suggests GROUPS, RANGE, and ROWS at position 23:

Arguments.of("SELECT x() over (ROWS select) FROM t",
                        "line 1:23: mismatched input 'select'. Expecting: 'BETWEEN', 'CURRENT', 'UNBOUNDED', <expression>"),
                        "line 1:23: mismatched input 'select'. Expecting: ')', 'BETWEEN', 'CURRENT', 'GROUPS', 'MEASURES', 'ORDER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', <expression>"),

It is not correct, as position 23 is after the word ROWS. It seems that the reported error position is the innermost one, but the suggestions come from an outer rule.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is correct (grammar-wise) because ROWS can be interpreted as an identifier representing a named window. After that, any of the terms PARTITION, ORDER, or any of the prefix tokens in windowFrame are valid.

At the end of the day, this fixes a bug in the parser error handler that should've never existed in the initial implementation, so we should get it in. I can't think of any way to avoid the spurious tokens due to the non-reserved keyword handling without adding semantic knowledge to the grammar about the behavior of each rule, but I have some ideas that are out of the scope of this PR.

@martint martint force-pushed the parser-error-handling branch from 5e1e180 to 0dffe76 Compare November 19, 2022 19:27
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can remove ignoredRule from ErrorHandler now?
AFAIK, nonReserved was the only ignored rule.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove that.

@martint martint force-pushed the parser-error-handling branch from 0dffe76 to 2cc1690 Compare December 1, 2022 22:02
@martint martint force-pushed the parser-error-handling branch from 2cc1690 to 335dfbe Compare January 17, 2023 02:34
@martint martint force-pushed the parser-error-handling branch from 335dfbe to 016af5b Compare January 27, 2023 05:49
When the parsing failure occurs in the middle of a rule (as opposed
to at the start of a rule), do not suppress the candidate tokens
even if the rule is marked as suppressed.

For instance, if the parsing fails immediately when recursing into
primaryExpression, the candidates from that rule will be suppressed
to avoid polluting the error message with all the possible prefixes
of that rule (EXISTS, CASE, TRY_CAST, CURRENT_XXX, etc).

However, if the parsing fails after <identifier> is matched within
the <identifier> <string> alternative, include <string> in set of
expected candidates.
@martint martint force-pushed the parser-error-handling branch from 016af5b to 0e6e498 Compare January 27, 2023 22:43
@martint martint merged commit 72cac61 into trinodb:master Jan 28, 2023
@martint martint deleted the parser-error-handling branch January 28, 2023 03:04
@github-actions github-actions bot added this to the 407 milestone Jan 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Syntax error throws incorrect error message

4 participants