Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: schema error when parsing order-by expressions #10234

Merged
merged 8 commits into from
May 1, 2024

Conversation

jonahgao
Copy link
Member

@jonahgao jonahgao commented Apr 25, 2024

Which issue does this PR close?

Closes #10013.

Rationale for this change

In the following query syntax, SELECT select_list FROM table_expression ORDER BY expressions, the order-by expressions can not only reference the column names in the select list, but also reference the column names in the FROM clause.

Currently, when building order-by expressions, only the input plan's schema (derived from the select list) is used.
For the following query, this will cause the column reference a in ORDER BY SUM(a) to fail to normalize.

DataFusion CLI v37.0.0
> create table t(a int);
0 row(s) fetched.
Elapsed 0.005 seconds.

> SELECT
   SUM(a)
FROM t
having SUM(a) > 0
order by SUM(a);
Schema error: No field named a. Valid fields are "SUM(t.a)", "SUM(t.a)".

The solution is, when constructing order-by expressions, we can use both the schema of the select list and the schema of the FROM clause.

To achieve this, we need to handle the ORDER BY clause within the planning of select, that is, select_to_plan.
This approach may also have other benefits:

  • Refine the current add_missing_columns logic by directly adding the missing columns to the select list, which should be more efficient and less susceptible to errors.
  • Easier to coordinate DISTINCT and ORDER BY, because distinct is defined within the select list.
  • Easier to expand the functionality of ORDER BY, such as supporting order by unprojected aggregate expressions and unprojected window functions. The term 'unprojected' means that they do not appear in the select list.

These can be implemented in subsequent PRs.

What changes are included in this PR?

Use both the schema of the select list and the FROM clause to construct order-by expressions.

Are these changes tested?

Yes.
By existing tests and new tests.

Are there any user-facing changes?

No

@jonahgao jonahgao marked this pull request as draft April 25, 2024 16:42
@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Apr 25, 2024
@jonahgao jonahgao marked this pull request as ready for review April 27, 2024 15:47
@alamb
Copy link
Contributor

alamb commented Apr 29, 2024

Thank you for this PR @jonahgao -- it is on my review list for tomorrow

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @jonahgao for addressing this problem. Im thinking if we can avoid the additional schemas in the signature, this is usually super confusing, we recently removed the similar from windows. Perhaps we can construct or amend the schema before calling the order by?

@jonahgao
Copy link
Member Author

thanks @jonahgao for addressing this problem. Im thinking if we can avoid the additional schemas in the signature, this is usually super confusing, we recently removed the similar from windows. Perhaps we can construct or amend the schema before calling the order by?

Here we need to distinguish these two schemas. When the order by expression is a literal, such as ORDER BY 1, it can only refer to the input schema, that is, the literal is the index of the select list.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR thank you @jonahgao -- while I also have some concerns about this PR as described below, given it fixes a bug I think we could merge it as is. However, I think it might be good to have a broader discussion about this

Currently, when building order-by expressions, only the input plan's schema (derived from the select list) is used.
For the following query, this will cause the column reference a in ORDER BY SUM(a) to fail to normalize.

I feel like I am missing something in this explanation. The following query works without this PR (and shows that ORDER BY can reference columns from the FROM clause, not just what is in the SELECT list)

> create table foo(x int, y int);
0 row(s) fetched.
Elapsed 0.002 seconds.

> select x from foo order by y;
+---+
| x |
+---+
+---+
0 row(s) fetched.
Elapsed 0.019 seconds.

The only type of query that this PR seems to solve involves the HAVING clause -- maybe the issue is that the schema used to to resolve the HAVING clause needs to be treated like the ORDER BY clause?

Update I see the example with sum() now

datafusion/sqllogictest/test_files/order.slt Outdated Show resolved Hide resolved
query I
SELECT
SUM(column1)
FROM foo
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add an example that doesn't have a HAVING as well as one with GROUP BY? something like this perhaps

SELECT SUM(column1) FROM foo ORDER BY SUM(column1)
SELECT column2 FROM foo ORDER BY SUM(column1)
SELECT SUM(column1) FROM foo ORDER BY SUM(column1) GROUP BY column2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. Thank you @alamb !

@jonahgao
Copy link
Member Author

jonahgao commented May 1, 2024

The following query works without this PR (and shows that ORDER BY can reference columns from the FROM clause, not just what is in the SELECT list)

@alamb select x from foo order by y can be covered by add_missing_columns, by blindly adding columns into the descendant projection node. Another issue is that we should not run add_missing_columns for other SetExprs except SELECT.

DataFusion CLI v37.1.0
> create table t(a int, b int);

> select a from t union select 1 order by b;
Error during planning: For SELECT DISTINCT, ORDER BY expressions b must appear in select list
> select a from t union all select 1 order by b;
Schema error: No field named t.a. Valid fields are a, t.b.

Doing it for UNION makes the error messages hard to understand.
PostgreSQL refuses to do this for UNION, order by can only references output columns.

ORDER BY can be applied to the result of a UNION, INTERSECT, or EXCEPT combination, but in this case it is only permitted to sort by output column names or numbers.

NOTE: When used in conjunction with set operators, the ORDER BY clause applies to the result set of the entire query; it doesn't apply only to the closest SELECT statement.

@jonahgao
Copy link
Member Author

jonahgao commented May 1, 2024

The only type of query that this PR seems to solve involves the HAVING clause -- maybe the issue is that the schema used to to resolve the HAVING clause needs to be treated like the ORDER BY clause?

@alamb I think we should make the column qualified, not the other way around. The problem of the SUM() example is that it can't resolve the qualifier of a in ORDER BY SUM(a), to match the select list item which has been normalized into SUM(t.a).

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node. Their processing logic may be reusable. I agree it might be good to have a broader discussion about this.

@alamb
Copy link
Contributor

alamb commented May 1, 2024

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node. Their processing logic may be reusable. I agree it might be good to have a broader discussion about this.

Yes, this makes sense -- so in this case perhaps it would mean removing add_missing_columns and unifying with this code path.

I'll file a ticket to discuss

@alamb
Copy link
Contributor

alamb commented May 1, 2024

I filed #10326 to discuss unifying the resolution logic

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @jonahgao and @comphead - I think this PR is an improvement in that it fixes a bug, even though there is likely more room for code improvement (tracked in #10326)

@alamb alamb merged commit d3237b2 into apache:main May 1, 2024
23 checks passed
@jonahgao
Copy link
Member Author

jonahgao commented May 1, 2024

Thanks for the review @comphead @alamb ❤️

@jonahgao jonahgao deleted the fix_order_by branch May 1, 2024 15:22
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HAVING doesn't work with ORDER BY
3 participants