fix: schema error when parsing order-by expressions #10234

jonahgao · 2024-04-25T16:41:55Z

Which issue does this PR close?

Rationale for this change

In the following query syntax, SELECT select_list FROM table_expression ORDER BY expressions, the order-by expressions can not only reference the column names in the select list, but also reference the column names in the FROM clause.

Currently, when building order-by expressions, only the input plan's schema (derived from the select list) is used.
For the following query, this will cause the column reference a in ORDER BY SUM(a) to fail to normalize.

DataFusion CLI v37.0.0
> create table t(a int);
0 row(s) fetched.
Elapsed 0.005 seconds.

> SELECT
   SUM(a)
FROM t
having SUM(a) > 0
order by SUM(a);
Schema error: No field named a. Valid fields are "SUM(t.a)", "SUM(t.a)".

The solution is, when constructing order-by expressions, we can use both the schema of the select list and the schema of the FROM clause.

To achieve this, we need to handle the ORDER BY clause within the planning of select, that is, select_to_plan.
This approach may also have other benefits:

Refine the current add_missing_columns logic by directly adding the missing columns to the select list, which should be more efficient and less susceptible to errors.
Easier to coordinate DISTINCT and ORDER BY, because distinct is defined within the select list.
Easier to expand the functionality of ORDER BY, such as supporting order by unprojected aggregate expressions and unprojected window functions. The term 'unprojected' means that they do not appear in the select list.

These can be implemented in subsequent PRs.

What changes are included in this PR?

Use both the schema of the select list and the FROM clause to construct order-by expressions.

Are these changes tested?

Yes.
By existing tests and new tests.

Are there any user-facing changes?

No

alamb · 2024-04-29T19:49:09Z

Thank you for this PR @jonahgao -- it is on my review list for tomorrow

comphead

thanks @jonahgao for addressing this problem. Im thinking if we can avoid the additional schemas in the signature, this is usually super confusing, we recently removed the similar from windows. Perhaps we can construct or amend the schema before calling the order by?

jonahgao · 2024-04-30T15:47:42Z

thanks @jonahgao for addressing this problem. Im thinking if we can avoid the additional schemas in the signature, this is usually super confusing, we recently removed the similar from windows. Perhaps we can construct or amend the schema before calling the order by?

Here we need to distinguish these two schemas. When the order by expression is a literal, such as ORDER BY 1, it can only refer to the input schema, that is, the literal is the index of the select list.

alamb

TLDR thank you @jonahgao -- while I also have some concerns about this PR as described below, given it fixes a bug I think we could merge it as is. However, I think it might be good to have a broader discussion about this

Currently, when building order-by expressions, only the input plan's schema (derived from the select list) is used.
For the following query, this will cause the column reference a in ORDER BY SUM(a) to fail to normalize.

I feel like I am missing something in this explanation. The following query works without this PR (and shows that ORDER BY can reference columns from the FROM clause, not just what is in the SELECT list)

> create table foo(x int, y int);
0 row(s) fetched.
Elapsed 0.002 seconds.

> select x from foo order by y;
+---+
| x |
+---+
+---+
0 row(s) fetched.
Elapsed 0.019 seconds.

The only type of query that this PR seems to solve involves the HAVING clause -- maybe the issue is that the schema used to to resolve the HAVING clause needs to be treated like the ORDER BY clause?

Update I see the example with sum() now

datafusion/sqllogictest/test_files/order.slt

alamb · 2024-04-30T22:06:23Z

datafusion/sqllogictest/test_files/order.slt

+query I
+SELECT
+   SUM(column1)
+FROM foo


Could we also add an example that doesn't have a HAVING as well as one with GROUP BY? something like this perhaps

SELECT SUM(column1) FROM foo ORDER BY SUM(column1)

SELECT column2 FROM foo ORDER BY SUM(column1)

SELECT SUM(column1) FROM foo ORDER BY SUM(column1) GROUP BY column2

Added. Thank you @alamb !

Co-authored-by: Andrew Lamb <[email protected]>

jonahgao · 2024-05-01T00:14:50Z

The following query works without this PR (and shows that ORDER BY can reference columns from the FROM clause, not just what is in the SELECT list)

@alamb select x from foo order by y can be covered by add_missing_columns, by blindly adding columns into the descendant projection node. Another issue is that we should not run add_missing_columns for other SetExprs except SELECT.

DataFusion CLI v37.1.0
> create table t(a int, b int);

> select a from t union select 1 order by b;
Error during planning: For SELECT DISTINCT, ORDER BY expressions b must appear in select list
> select a from t union all select 1 order by b;
Schema error: No field named t.a. Valid fields are a, t.b.

Doing it for UNION makes the error messages hard to understand.
PostgreSQL refuses to do this for UNION, order by can only references output columns.

ORDER BY can be applied to the result of a UNION, INTERSECT, or EXCEPT combination, but in this case it is only permitted to sort by output column names or numbers.

NOTE: When used in conjunction with set operators, the ORDER BY clause applies to the result set of the entire query; it doesn't apply only to the closest SELECT statement.

jonahgao · 2024-05-01T00:21:42Z

The only type of query that this PR seems to solve involves the HAVING clause -- maybe the issue is that the schema used to to resolve the HAVING clause needs to be treated like the ORDER BY clause?

@alamb I think we should make the column qualified, not the other way around. The problem of the SUM() example is that it can't resolve the qualifier of a in ORDER BY SUM(a), to match the select list item which has been normalized into SUM(t.a).

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node. Their processing logic may be reusable. I agree it might be good to have a broader discussion about this.

alamb · 2024-05-01T11:08:29Z

I think that we should handle ORDER BY similarly to HAVING, use the merged schema, add the missing columns directly in the select list, instead of traversing the plan looking for projection node. Their processing logic may be reusable. I agree it might be good to have a broader discussion about this.

Yes, this makes sense -- so in this case perhaps it would mean removing add_missing_columns and unifying with this code path.

I'll file a ticket to discuss

alamb · 2024-05-01T11:13:42Z

I filed #10326 to discuss unifying the resolution logic

alamb

Thanks again @jonahgao and @comphead - I think this PR is an improvement in that it fixes a bug, even though there is likely more room for code improvement (tracked in #10326)

jonahgao · 2024-05-01T15:22:11Z

Thanks for the review @comphead @alamb ❤️

fix: schema error when parsing order-by expressions

9d5d83e

jonahgao marked this pull request as draft April 25, 2024 16:42

github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Apr 25, 2024

jonahgao added 4 commits April 26, 2024 00:47

add test from issue

e53a15e

Merge branch 'main' into fix_order_by

f385213

improve order_by_to_sort_expr

462f70b

add test

a7c5f33

jonahgao marked this pull request as ready for review April 27, 2024 15:47

alamb mentioned this pull request Apr 29, 2024

DataFusion weekly project plan (Andrew Lamb) - April 29, 2024 #10283

Closed

8 tasks

comphead reviewed Apr 30, 2024

View reviewed changes

alamb reviewed Apr 30, 2024

View reviewed changes

datafusion/sqllogictest/test_files/order.slt Outdated Show resolved Hide resolved

alamb approved these changes Apr 30, 2024

View reviewed changes

jonahgao and others added 3 commits May 1, 2024 07:31

Update datafusion/sqllogictest/test_files/order.slt

7744076

Co-authored-by: Andrew Lamb <[email protected]>

fix tests

1bc4b71

add test

e984a3e

alamb mentioned this pull request May 1, 2024

Unify SQL planning for ORDER BY, HAVING, DISTINCT, etc #10326

Open

alamb approved these changes May 1, 2024

View reviewed changes

alamb merged commit d3237b2 into apache:main May 1, 2024
23 checks passed

jonahgao deleted the fix_order_by branch May 1, 2024 15:22

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: schema error when parsing order-by expressions #10234

fix: schema error when parsing order-by expressions #10234

jonahgao commented Apr 25, 2024 •

edited

Loading

alamb commented Apr 29, 2024

comphead left a comment •

edited

Loading

jonahgao commented Apr 30, 2024

alamb left a comment •

edited

Loading

alamb Apr 30, 2024

jonahgao Apr 30, 2024

jonahgao commented May 1, 2024 •

edited

Loading

jonahgao commented May 1, 2024 •

edited

Loading

alamb commented May 1, 2024

alamb commented May 1, 2024

alamb left a comment

jonahgao commented May 1, 2024

fix: schema error when parsing order-by expressions #10234

fix: schema error when parsing order-by expressions #10234

Conversation

jonahgao commented Apr 25, 2024 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb commented Apr 29, 2024

comphead left a comment • edited Loading

Choose a reason for hiding this comment

jonahgao commented Apr 30, 2024

alamb left a comment • edited Loading

Choose a reason for hiding this comment

alamb Apr 30, 2024

Choose a reason for hiding this comment

jonahgao Apr 30, 2024

Choose a reason for hiding this comment

jonahgao commented May 1, 2024 • edited Loading

jonahgao commented May 1, 2024 • edited Loading

alamb commented May 1, 2024

alamb commented May 1, 2024

alamb left a comment

Choose a reason for hiding this comment

jonahgao commented May 1, 2024

jonahgao commented Apr 25, 2024 •

edited

Loading

comphead left a comment •

edited

Loading

alamb left a comment •

edited

Loading

jonahgao commented May 1, 2024 •

edited

Loading

jonahgao commented May 1, 2024 •

edited

Loading