Add support_ignore_nulls and support_ordering in Aggregate expression #9991

huaxingao · 2024-04-08T06:24:40Z

Which issue does this PR close?

Closes #9924.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

huaxingao · 2024-04-08T06:32:50Z

datafusion/sqllogictest/test_files/group_by.slt

----StreamingTableExec: partition_sizes=1, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST]
-
-query III
+statement error DataFusion error: This feature is not implemented: ORDER BY is not implemented for SUM


I am not sure if I understand the requirement of #9924 correctly:

In my PR, I have set support_ordering to true for first, last, nth_value and array_agg_ordered. For all other aggregate functions, support_ordering is false and ORDER BY returns a not-implemented error.

Is this what we want? This seems like a breaking change to me.

I think we should rewrite the test that excluded ordering for SUM in this case 🤔

@mustafasrepo / @ozankabak / @metesynnada can you help us understand if the ordering for SUM(c ORDER BY a DESC) has meaning?

DataFusion currently ignores cases

We propose making it an error to include clauses that make no sense

However, I could see an argument for permitting the user to write ORDER BY even if the aggregate didn't care about ordering (and simply remove the ORDER BY as an optimization)

I can also see the rationale for failing fast and erroring as the user probably didn't mean to specify an ordering on SUM 🤔

A first-principles viewpoint suggests this should depend on the data type. For integral types, ORDER BY wouldn't matter for SUM, but it shouldn't be an error to still specify it -- the optimizer should just remove it. For floating-point types, the summation order actually makes a difference in the result, because floating-point addition is not associative. For any data type whose addition is order-sensitive in this way, ORDER BY will have an impact on the result.
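To make the floating-point point concrete, here is a minimal sketch in plain Rust (no DataFusion APIs). The helper names `sequential_sum` and `sorted_ascending` are hypothetical and exist only for this illustration; the sketch assumes the aggregate adds its inputs strictly left to right.

```rust
/// Sum the values strictly left to right, the way a simple streaming
/// accumulator would consume its input.
pub fn sequential_sum(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

/// Return a copy of `values` sorted ascending, mimicking `ORDER BY x`.
pub fn sorted_ascending(values: &[f64]) -> Vec<f64> {
    let mut sorted = values.to_vec();
    // total_cmp gives a total order over f64, so sorting cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    sorted
}
```

With the input `[1e17, -1e17, 1.0]`, summing in arrival order cancels the two large values first and keeps the `1.0`, while summing the sorted input absorbs the `1.0` into `-1e17` before the cancellation, so the two orders produce different results.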

However, I'm not sure if the SQL standard says anything definitive on this matter. If not, it would be prudent to follow the results of this first-principles analysis.

I hadn't thought about the implications for SUM(ORDER BY ...) for floats 🤔 It appears that this is consistent with what postgres does:

postgres=# create table foo (x float);
CREATE TABLE
postgres=# insert into foo values (1.0);
INSERT 0 1
postgres=# insert into foo values (2.0);
INSERT 0 1
postgres=# insert into foo values (-1.0);
INSERT 0 1
postgres=# select sum(x ORDER BY x) from foo;
 sum
-----
   2
(1 row)

postgres=# select sum(x IGNORE NULLS) from foo;
ERROR:  syntax error at or near "IGNORE"
LINE 1: select sum(x IGNORE NULLS) from foo;
                     ^
postgres=#

I also verified that postgres actually does sort the input:

postgres=# explain select sum(x ORDER BY x) from foo;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=169.81..169.82 rows=1 width=8)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
         Sort Key: x
         ->  Seq Scan on foo  (cost=0.00..32.60 rows=2260 width=8)
(4 rows)

postgres=#

jayzhan211 · 2024-04-09T13:15:57Z

datafusion/physical-expr/src/aggregate/build_in.rs

+    };
+
+    let agg_name = aggregate_expr.name();
+    if ignore_nulls && !aggregate_expr.support_ignore_nulls() {


I would prefer we check these at the beginning of the function, so we can avoid unnecessary computation for invalid cases.
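A rough sketch of the "validate first, build later" shape suggested here, using a hypothetical `AggregateKind` enum and `create_aggregate` function rather than the real code in build_in.rs. Only the method names `support_ignore_nulls` and `support_ordering` and the error wording come from this PR; which functions accept IGNORE NULLS is an assumption made for the example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AggregateKind {
    Sum,
    Count,
    FirstValue,
    LastValue,
    NthValue,
    ArrayAggOrdered,
}

impl AggregateKind {
    pub fn name(self) -> &'static str {
        match self {
            AggregateKind::Sum => "SUM",
            AggregateKind::Count => "COUNT",
            AggregateKind::FirstValue => "FIRST_VALUE",
            AggregateKind::LastValue => "LAST_VALUE",
            AggregateKind::NthValue => "NTH_VALUE",
            AggregateKind::ArrayAggOrdered => "ARRAY_AGG",
        }
    }

    /// Ordering support as described in this PR.
    pub fn support_ordering(self) -> bool {
        !matches!(self, AggregateKind::Sum | AggregateKind::Count)
    }

    /// Assumption for this sketch: only FIRST_VALUE and LAST_VALUE.
    pub fn support_ignore_nulls(self) -> bool {
        matches!(self, AggregateKind::FirstValue | AggregateKind::LastValue)
    }
}

/// Reject unsupported clauses before doing any construction work.
pub fn create_aggregate(
    kind: AggregateKind,
    ignore_nulls: bool,
    has_order_by: bool,
) -> Result<String, String> {
    if ignore_nulls && !kind.support_ignore_nulls() {
        return Err(format!(
            "This feature is not implemented: IGNORE NULLS is not implemented for {}",
            kind.name()
        ));
    }
    if has_order_by && !kind.support_ordering() {
        return Err(format!(
            "This feature is not implemented: ORDER BY is not implemented for {}",
            kind.name()
        ));
    }
    // Stand-in for the (potentially expensive) expression construction.
    Ok(format!(
        "{}(ignore_nulls={}, ordered={})",
        kind.name(),
        ignore_nulls,
        has_order_by
    ))
}
```

Because the capability checks depend only on the function kind and the clause flags, they can run before any schema or argument processing; the real function may need a different arrangement if the capabilities are only known once the expression is built.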

comphead · 2024-04-09T15:45:32Z

datafusion/sqllogictest/test_files/aggregate.slt

+
+# Test for IGNORE NULLS / ORDER BY not implemented
+statement error DataFusion error: This feature is not implemented: IGNORE NULLS is not implemented for COUNT
+SELECT COUNT(*) IGNORE NULLS FROM (values (1), (null), (2));


Good; however, this should exit early in the parser... I'll create a follow-up for the parser.

If it's doable in the parser, we may want to revert those checks/tests from DF.

Filed sqlparser-rs/sqlparser-rs#1206

I agree it is probably better to check in the parser. Spark throws an exception in the parser if IGNORE NULLS is not supported.

I assume that we should still have the checks in the physical plan, for use cases such as Comet where we are not using the DataFusion SQL parsing?

I think we need to check in the physical plan layer as well for the reason @andygrove says.

Another approach could be an analyzer rule (following the model of how certain subqueries are handled) like this:

https://github.com/apache/arrow-datafusion/blob/582050728914650c6d4340ca803a0e9af087d8ec/datafusion/optimizer/src/analyzer/subquery.rs#L32-L39

Queries like that should fail as invalid syntax, and it is the parser's job to stop them, IMHO. Spark doesn't allow a query like that either; it fails at the query compilation stage.

scala> spark.sql("SELECT COUNT(*) IGNORE NULLS over () FROM (values (1), (null), (2));").show(false)
org.apache.spark.sql.AnalysisException: Function count does not support IGNORE NULLS.; line 1 pos 7
  at org.apache.spark.sql.errors.QueryCompilationErrors$.functionWithUnsupportedSyntaxError(QueryCompilationErrors.scala:602)

alamb

Thanks @huaxingao -- I agree this does what the ticket requests; However given what you have found about SUM(ORDER BY ...) it may make sense to permit that case 🤔

alamb · 2024-04-10T17:23:38Z

datafusion/sqllogictest/test_files/aggregate.slt

+
+# Test for IGNORE NULLS / ORDER BY not implemented
+statement error DataFusion error: This feature is not implemented: IGNORE NULLS is not implemented for COUNT
+SELECT COUNT(*) IGNORE NULLS FROM (values (1), (null), (2));


I think we need to check in the physical plan layer as well for the reason @andygrove says.

Another approach could be an analyzer rule (following the model of how certain subqueries are handled) like this:

https://github.com/apache/arrow-datafusion/blob/582050728914650c6d4340ca803a0e9af087d8ec/datafusion/optimizer/src/analyzer/subquery.rs#L32-L39

alamb · 2024-04-10T17:26:41Z

datafusion/sqllogictest/test_files/group_by.slt

----StreamingTableExec: partition_sizes=1, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST]
-
-query III
+statement error DataFusion error: This feature is not implemented: ORDER BY is not implemented for SUM


@mustafasrepo / @ozankabak / @metesynnada can you help us understand if the ordering for SUM(c ORDER BY a DESC) has meaning?

DataFusion currently ignores cases

We propose making it an error to include clauses that make no sense

However, I could see an argument for permitting the user to write ORDER BY even if the aggregate didn't care about ordering (and simply remove the ORDER BY as an optimization)

I can also see the rationale for failing fast and erroring as the user probably didn't mean to specify an ordering on SUM 🤔

comphead · 2024-04-10T18:01:27Z

> Thanks @huaxingao -- I agree this does what the ticket requests; However given what you have found about SUM(ORDER BY ...) it may make sense to permit that case 🤔

This query is invalid; our parser allows it, but it should fail. IGNORE NULLS works with only a limited set of functions, and the parsers of all other query engines reject it.

alamb · 2024-04-10T18:21:12Z

> This query is invalid; our parser allows it, but it should fail. IGNORE NULLS works with only a limited set of functions, and the parsers of all other query engines reject it.

IGNORE NULLS definitely makes sense to error if the aggregate doesn't handle it (as it would result in incorrect answers)

ORDER BY seems more gray to me -- as in the answers won't be incorrect but it would potentially be doing more work than necessary
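To illustrate why silently dropping IGNORE NULLS gives a wrong answer rather than just extra work, here is a small standalone sketch. `first_value` is a hypothetical helper over `Option<i64>` values, not DataFusion's accumulator.

```rust
/// Return the first value of the input, optionally skipping NULLs
/// (represented here as `None`).
pub fn first_value(values: &[Option<i64>], ignore_nulls: bool) -> Option<i64> {
    if ignore_nulls {
        // Skip NULLs and take the first non-NULL value, if any.
        values.iter().copied().flatten().next()
    } else {
        // Take the first row as-is, even when it is NULL.
        values.first().copied().flatten()
    }
}
```

For the input `[NULL, 1, 2]`, the call with `ignore_nulls = true` yields `1`, while an implementation that ignores the clause yields NULL, so the user gets a different result from the one they asked for.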

alamb · 2024-04-13T13:37:37Z

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

huaxingao · 2024-04-13T16:43:13Z

@alamb

I am not so sure how to proceed with this PR. Do we still need this PR?
For IGNORE NULLS, I think we have consensus to error in the parser if aggregates don't support IGNORE NULLS.

For ORDER BY, there are two options:

1. Keep the current behavior, that is, allow SUM(ORDER BY ...). The rationale is:
   - The DataFusion optimizer would remove the ORDER BY if the order doesn't matter.
   - For float type, the order matters, so we need the ORDER BY
   - Postgres allows the SUM(ORDER BY ...).
2. Error in the parser if ordering doesn't matter (That's what Spark does).

So, we will either change the parser or leave the current implementation as is. It seems to me we don't need this PR anymore?

ozankabak · 2024-04-13T17:33:59Z

> For ORDER BY, there are two options:
>
> 1. Keep the current behavior, that is, allow SUM(ORDER BY ...). The rationale is:
>    - The DataFusion optimizer would remove the ORDER BY if the order doesn't matter.
>    - For float type, the order matters, so we need the ORDER BY
>    - Postgres allows the SUM(ORDER BY ...).

Yes, let's follow this approach.
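The agreed rule can be sketched as a tiny decision function. This is only an illustration of the reasoning in this thread, with hypothetical names (`SumInputType`, `sum_order_matters`, `retained_order_by`); it is not how DataFusion's optimizer is implemented.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SumInputType {
    Integer,
    Float,
}

/// Per the discussion: summation order only affects the result for
/// floating-point inputs.
pub fn sum_order_matters(input: SumInputType) -> bool {
    matches!(input, SumInputType::Float)
}

/// Return the ORDER BY expression the plan should keep for SUM, if any.
/// The clause is never an error; it is dropped when it cannot matter.
pub fn retained_order_by<'a>(
    input: SumInputType,
    order_by: Option<&'a str>,
) -> Option<&'a str> {
    if sum_order_matters(input) {
        order_by
    } else {
        None
    }
}
```

Under this sketch, `SUM(c ORDER BY a DESC)` over an integer column plans without a sort, while the same query over a float column keeps the requested ordering.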

alamb · 2024-04-13T20:10:07Z

> For IGNORE NULLS, I think we have consensus to error in the parser if aggregates don't support IGNORE NULLS.

Sounds good to me -- we can also move the error later if we want to enforce it for the expr_fn API as well

Thanks @huaxingao and @ozankabak and @comphead

github-actions · 2024-06-13T01:48:26Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Add support_ignore_nulls and support_ordering in Aggregate expression

f27d7de

github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Apr 8, 2024

huaxingao commented Apr 8, 2024


alamb mentioned this pull request Apr 8, 2024

DataFusion weekly project plan (Andrew Lamb) - April 8, 2024 #10002

Closed

9 tasks

jayzhan211 reviewed Apr 9, 2024


comphead reviewed Apr 9, 2024


alamb reviewed Apr 10, 2024


alamb marked this pull request as draft April 13, 2024 13:37

github-actions bot added the Stale PR has not had any activity for some time label Jun 13, 2024

github-actions bot closed this Jun 21, 2024

This pull request was closed.
