-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Do not suppress candidate tokens in the middle of a rule #15085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right. The error position is reported as the end of UNBOUNDED:
SELECT X() OVER (ROWS UNBOUNDED) FROM T
^
The only valid tokens at that point are FOLLOWING and PRECEDING, so we shouldn't be reporting all of these extra tokens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I know. I’m trying to figure out why it’s doing that. There’s some other bug lurking in there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out these are all valid tokens per the grammar. It's because UNBOUNDED is a non-reserved word, so it can take the place of an identifier. In particular, it's considering alternatives that include the following productions:
over
: OVER ([...] | '(' windowSpecification ')')
;
windowSpecification
: [...]
windowFrame?
;
windowFrame
: [...]
frameExtent
;
frameExtent
: [...]
| frameType=ROWS start=frameBound
| [...]
;
frameBound
: [...]
| expression boundType=(PRECEDING | FOLLOWING) #boundedFrame
;
And <expression> can be an <identifier> followed by a number of things, including ., [, OVER, etc.
For instance, this goes through the parser successfully even though it's clearly wrong if we consider UNBOUNDED a special keyword:
SELECT X() OVER (ROWS UNBOUNDED -> 1 PRECEDING) FROM t
It's treating UNBOUNDED -> 1 as a lambda expression, which is a valid (syntax-wise) expression in the "#boundedFrame" frame bound above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I mark UNBOUNDED as a reserved word, then it produces the expected message:
line 1:32: mismatched input ')'. Expecting: 'FOLLOWING', 'PRECEDING'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to make the error handler treat UNBOUNDED as a reserved word without actually making it reserved? In my opinion, printing all alternatives for the case of UNBOUNDED being an identifier is unnecessary and confusing, even if those are all correct per grammar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue I can spot here is that the error handler suggests GROUPS, RANGE, and ROWS at position 23:
Arguments.of("SELECT x() over (ROWS select) FROM t",
"line 1:23: mismatched input 'select'. Expecting: 'BETWEEN', 'CURRENT', 'UNBOUNDED', <expression>"),
"line 1:23: mismatched input 'select'. Expecting: ')', 'BETWEEN', 'CURRENT', 'GROUPS', 'MEASURES', 'ORDER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', <expression>"),
It is not correct, as position 23 is after the word ROWS. It seems that the reported error position is the innermost one, but the suggestions come from an outer rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is correct (grammar-wise) because ROWS can be interpreted as an identifier representing a named window. After that, any of the terms PARTITION, ORDER, or any of the prefix tokens in windowFrame are valid.
At the end of the day, this fixes a bug in the parser error handler that should've never existed in the initial implementation, so we should get it in. I can't think of any way to avoid the spurious tokens due to the non-reserved keyword handling without adding semantic knowledge to the grammar about the behavior of each rule, but I have some ideas that are out of the scope of this PR.
core/trino-parser/src/test/java/io/trino/sql/parser/TestSqlParserErrorHandling.java
Outdated
Show resolved
Hide resolved
5e1e180 to
0dffe76
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we can remove ignoredRule from ErrorHandler now?
AFAIK, nonReserved was the only ignored rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can remove that.
core/trino-parser/src/main/java/io/trino/sql/parser/ErrorHandler.java
Outdated
Show resolved
Hide resolved
0dffe76 to
2cc1690
Compare
core/trino-parser/src/main/java/io/trino/sql/parser/SqlParser.java
Outdated
Show resolved
Hide resolved
core/trino-parser/src/test/java/io/trino/sql/parser/TestSqlParserErrorHandling.java
Outdated
Show resolved
Hide resolved
2cc1690 to
335dfbe
Compare
335dfbe to
016af5b
Compare
When the parsing failure occurs in the middle of a rule (as opposed to at the start of a rule), do not suppress the candidate tokens even if the rule is marked as suppressed. For instance, if the parsing fails immediately when recursing into primaryExpression, the candidates from that rule will be suppressed to avoid polluting the error message with all the possible prefixes of that rule (EXISTS, CASE, TRY_CAST, CURRENT_XXX, etc). However, if the parsing fails after <identifier> is matched within the <identifier> <string> alternative, include <string> in set of expected candidates.
016af5b to
0e6e498
Compare
When the parsing failure occurs in the middle of a rule (as opposed to at the start of a rule), do not suppress the candidate tokens even if the rule is marked as suppressed.
For instance, if the parsing fails immediately when recursing into primaryExpression, the candidates from that rule will be suppressed to avoid polluting the error message with all the possible prefixes of that rule (EXISTS, CASE, TRY_CAST, CURRENT_XXX, etc).
However, if the parsing fails after is matched within the alternative, include in set of expected candidates.
Fixes #15044
Release notes
(x) This is not user-visible or docs only and no release notes are required.