fix: schema error when parsing order-by expressions #10234
Thank you for this PR @jonahgao -- it is on my review list for tomorrow
Thanks @jonahgao for addressing this problem. I'm wondering whether we can avoid the additional schemas in the signature; this is usually super confusing, and we recently removed something similar from windows. Perhaps we can construct or amend the schema before calling the order by?
Here we need to distinguish these two schemas. When the order by expression is a literal, such as |
TLDR thank you @jonahgao -- while I also have some concerns about this PR as described below, given it fixes a bug I think we could merge it as is. However, I think it might be good to have a broader discussion about this
Currently, when building order-by expressions, only the input plan's schema (derived from the select list) is used.
For the following query, this will cause the column reference a in ORDER BY SUM(a) to fail to normalize.
I feel like I am missing something in this explanation. The following query works without this PR (and shows that ORDER BY can reference columns from the FROM clause, not just what is in the SELECT list)
> create table foo(x int, y int);
0 row(s) fetched.
Elapsed 0.002 seconds.
> select x from foo order by y;
+---+
| x |
+---+
+---+
0 row(s) fetched.
Elapsed 0.019 seconds.
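As a rough cross-check outside DataFusion, the same two statements can be run against SQLite through Python's standard sqlite3 module. This is only a sketch: SQLite is a stand-in engine, and the two sample rows and the function name order_by_unselected_column are assumptions added for illustration (the session above used an empty table).

```python
import sqlite3


def order_by_unselected_column():
    """Select x while ordering by y, which is not in the select list."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table foo(x int, y int)")
    # Sample rows are an assumption; the original session had no data.
    conn.executemany("insert into foo values (?, ?)", [(1, 20), (2, 10)])
    rows = conn.execute("select x from foo order by y").fetchall()
    conn.close()
    return rows


print(order_by_unselected_column())  # [(2,), (1,)]
```

The row with the smaller y comes first even though y is never projected, which matches the point that ORDER BY can see columns from the FROM clause.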
The only type of query that this PR seems to solve involves the HAVING clause -- maybe the issue is that the schema used to resolve the HAVING clause needs to be treated like the ORDER BY clause?
Update: I see the example with sum() now
query I
SELECT
SUM(column1)
FROM foo
Could we also add an example that doesn't have a HAVING as well as one with GROUP BY? something like this perhaps
SELECT SUM(column1) FROM foo ORDER BY SUM(column1)
SELECT column2 FROM foo ORDER BY SUM(column1)
SELECT SUM(column1) FROM foo GROUP BY column2 ORDER BY SUM(column1)
Added. Thank you @alamb !
Co-authored-by: Andrew Lamb <[email protected]>
@alamb DataFusion CLI v37.1.0
> create table t(a int, b int);
> select a from t union select 1 order by b;
Error during planning: For SELECT DISTINCT, ORDER BY expressions b must appear in select list
> select a from t union all select 1 order by b;
Schema error: No field named t.a. Valid fields are a, t.b.
Doing it for UNION makes the error messages hard to understand.
NOTE: When used in conjunction with set operators, the ORDER BY clause applies to the result set of the entire query; it doesn't apply only to the closest SELECT statement.
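For comparison only, here is a small sketch of how another engine treats the same UNION ALL query. It uses SQLite through Python's sqlite3 module, not DataFusion, and the helper name union_order_by_error is hypothetical. SQLite's documented rule for compound SELECTs is that an ORDER BY term must match a column of the combined result, so b is expected to be rejected rather than resolved against t.

```python
import sqlite3


def union_order_by_error():
    """Return the error text SQLite raises for ORDER BY b on a UNION ALL, or None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t(a int, b int)")
    try:
        # b is not a column of the combined result set, only of table t.
        conn.execute("select a from t union all select 1 order by b").fetchall()
    except sqlite3.OperationalError as err:
        return str(err)
    finally:
        conn.close()
    return None


print(union_order_by_error())
```

A message of that kind names the ORDER BY term directly, which is the sort of clarity the two DataFusion errors above are missing.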
@alamb I think we should make the column qualified, not the other way around. The problem of the I think that we should handle |
Yes, this makes sense -- so in this case perhaps it would mean removing add_missing_columns and unifying with this code path. I'll file a ticket to discuss
I filed #10326 to discuss unifying the resolution logic
Which issue does this PR close?
Closes #10013.
Rationale for this change
In the following query syntax,
SELECT select_list FROM table_expression ORDER BY expressions
the order-by expressions can not only reference the column names in the select list, but also reference the column names in the FROM clause.
Currently, when building order-by expressions, only the input plan's schema (derived from the select list) is used.
For the following query, this will cause the column reference a in ORDER BY SUM(a) to fail to normalize.
The solution is, when constructing order-by expressions, we can use both the schema of the select list and the schema of the FROM clause.
To achieve this, we need to handle the ORDER BY clause within the planning of select, that is, select_to_plan.
This approach may also have other benefits:
- DISTINCT and ORDER BY, because distinct is defined within the select list.
- ORDER BY, such as supporting order by unprojected aggregate expressions and unprojected window functions. The term 'unprojected' means that they do not appear in the select list.
These can be implemented in subsequent PRs.
What changes are included in this PR?
Use both the schema of the select list and the FROM clause to construct order-by expressions.
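The idea can be pictured with a toy model. This is a minimal sketch, not DataFusion's implementation: the Field class, the resolve function, and the field names are all hypothetical, and searching the select-list schema before the FROM schema is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Field:
    qualifier: Optional[str]  # table name, or None for computed outputs
    name: str


def resolve(name: str, schemas: Sequence[Sequence[Field]]) -> Optional[Field]:
    """Return the first field called `name`, searching the schemas in order."""
    for schema in schemas:
        for field in schema:
            if field.name == name:
                return field
    return None


# Toy schemas for: SELECT SUM(a) FROM t ORDER BY SUM(a)
select_schema = [Field(None, "SUM(t.a)")]          # select-list output
from_schema = [Field("t", "a"), Field("t", "b")]   # FROM clause output

# With only the select-list schema, the inner column a cannot be resolved.
print(resolve("a", [select_schema]))               # None
# With both schemas, a resolves to the qualified column t.a.
print(resolve("a", [select_schema, from_schema]))  # Field(qualifier='t', name='a')
```

With a single schema the lookup for a fails, which mirrors the normalization failure described in the rationale; passing both schemas lets the reference fall through to the FROM clause.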
Are these changes tested?
Yes.
By existing tests and new tests.
Are there any user-facing changes?
No