Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support_ignore_nulls and support_ordering in Aggregate expression #9991

Closed
wants to merge 1 commit into from

Conversation

huaxingao
Copy link
Contributor

Which issue does this PR close?

Closes #9924.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) labels Apr 8, 2024
----StreamingTableExec: partition_sizes=1, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST]

query III
statement error DataFusion error: This feature is not implemented: ORDER BY is not implemented for SUM
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I understand the requirement of #9924 correctly:

In my PR, I have support_ordering to true for first, last, nth_value and array_agg_ordered. For all the other aggregate functions, support_ordering is false and ORDER BY returns not implemented error.

Is this what we want? This seems to be a breaking change for me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rewrite the test that excluded ordering for SUM in this case 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mustafasrepo / @ozankabak / @metesynnada can you help us understand if the ordering for SUM(c ORDER BY a DESC) has meaning?

DataFusion currently ignores cases

We propose making it an error to include clauses that make no sense

However, I could see an argument for permitting the user to write ORDER BY even if the aggregate didn't care about ordering (and simply remove the ORDER BY as a optimization)

I can also see the rationale for failing fast and erroring as the user probably didn't meant to specify an ordering on SUM 🤔

Copy link
Contributor

@ozankabak ozankabak Apr 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A first principles viewpoint suggests this should depend on the data type. For integral types, ORDER BY wouldn't matter for SUM, but it shouldn't be an error to still specify it -- the optimizer should just remove it. For floating-point types, the summation order actually makes a difference in the result. Any data type for which addition doesn't commute, ORDER BY will have an impact on the result.

However, I'm not sure if SQL standard says anything definitive in this matter. If not, it would be prudent to follow the results of this first-principles analysis.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't thought about the implications for SUM(ORDER BY ...) for floats 🤔 It appears that this is consistent with what postgres does:

postgres=# create table foo (x float);
CREATE TABLE
postgres=# insert into foo values (1.0);
INSERT 0 1
postgres=# insert into foo values (2.0);
INSERT 0 1
postgres=# insert into foo values (-1.0);
INSERT 0 1
postgres=# select sum(x ORDER BY x) from foo;
 sum
-----
   2
(1 row)

postgres=# select sum(x IGNORE NULLS) from foo;
ERROR:  syntax error at or near "IGNORE"
LINE 1: select sum(x IGNORE NULLS) from foo;
                     ^
postgres=#

I also verified that postgres actually does sort the input:

postgres=# explain select sum(x ORDER BY x) from foo;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=169.81..169.82 rows=1 width=8)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
         Sort Key: x
         ->  Seq Scan on foo  (cost=0.00..32.60 rows=2260 width=8)
(4 rows)

postgres=#

};

let agg_name = aggregate_expr.name();
if ignore_nulls && !aggregate_expr.support_ignore_nulls() {
Copy link
Contributor

@jayzhan211 jayzhan211 Apr 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer we check these at the beginning of the function, so we can avoid unnecessary computing for invalid cases


# Test for IGNORE NULLS / ORDER BY not implemented
statement error DataFusion error: This feature is not implemented: IGNORE NULLS is not implemented for COUNT
SELECT COUNT(*) IGNORE NULLS FROM (values (1), (null), (2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good, however this should be early exited on parser... I'll create a follow up for the parser

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its doable in parser we probably may want to revert those checks/tests from DF

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is probably better to check on parser. Spark throws Exception on parser if IGNORE NULLS is not supported.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that we should still have the checks in the physical plan, for use cases such as Comet where we are not using the DataFusion SQL parsing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check in the physical plan layer as well for the reason @andygrove says.

Another approach could be an analyzer rule (following the model of how certain subqueries are handled) like this:

https://github.com/apache/arrow-datafusion/blob/582050728914650c6d4340ca803a0e9af087d8ec/datafusion/optimizer/src/analyzer/subquery.rs#L32-L39

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Queries like that should fail by the invalid syntax and this is parsers work to stop it IMHO. And spark doesn't allow query like that, it fails on query compile stage.

scala> spark.sql("SELECT COUNT(*) IGNORE NULLS over () FROM (values (1), (null), (2));").show(false)
org.apache.spark.sql.AnalysisException: Function count does not support IGNORE NULLS.; line 1 pos 7
  at org.apache.spark.sql.errors.QueryCompilationErrors$.functionWithUnsupportedSyntaxError(QueryCompilationErrors.scala:602)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @huaxingao -- I agree this does what the ticket requests; However given what you have found about SUM(ORDER BY ...) it may make sense to permit that case 🤔


# Test for IGNORE NULLS / ORDER BY not implemented
statement error DataFusion error: This feature is not implemented: IGNORE NULLS is not implemented for COUNT
SELECT COUNT(*) IGNORE NULLS FROM (values (1), (null), (2));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to check in the physical plan layer as well for the reason @andygrove says.

Another approach could be an analyzer rule (following the model of how certain subqueries are handled) like this:

https://github.com/apache/arrow-datafusion/blob/582050728914650c6d4340ca803a0e9af087d8ec/datafusion/optimizer/src/analyzer/subquery.rs#L32-L39

----StreamingTableExec: partition_sizes=1, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST]

query III
statement error DataFusion error: This feature is not implemented: ORDER BY is not implemented for SUM
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mustafasrepo / @ozankabak / @metesynnada can you help us understand if the ordering for SUM(c ORDER BY a DESC) has meaning?

DataFusion currently ignores cases

We propose making it an error to include clauses that make no sense

However, I could see an argument for permitting the user to write ORDER BY even if the aggregate didn't care about ordering (and simply remove the ORDER BY as a optimization)

I can also see the rationale for failing fast and erroring as the user probably didn't meant to specify an ordering on SUM 🤔

@comphead
Copy link
Contributor

comphead commented Apr 10, 2024

Thanks @huaxingao -- I agree this does what the ticket requests; However given what you have found about SUM(ORDER BY ...) it may make sense to permit that case 🤔

this query is invalid, our parser allows it but it should fail. IGNORE NULLS have a limited functions to be working with. all other query engine parsers does.

@alamb
Copy link
Contributor

alamb commented Apr 10, 2024

this query is invalid, our parser allows it but it should fail. IGNORE NULLS have a limited functions to be working with. all other query engine parsers does.

IGNORE NULLS definitely makes sense to error if the aggregate doesn't handle it (as it would result in incorrect answers)

ORDER BY seems more gray to me -- as in the answers won't be incorrect but it would potentially be doing more work than necessary

@alamb
Copy link
Contributor

alamb commented Apr 13, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@alamb alamb marked this pull request as draft April 13, 2024 13:37
@huaxingao
Copy link
Contributor Author

@alamb

I am not so sure how to proceed with this PR. Do we still need this PR?
For IGNORE NULLS, I think we have consensus to error in the parser if aggregates don't support IGNORE NULLS.

For ORDER BY, there are two options:

  1. Keep the current behavior, that is, allow SUM(ORDER BY ...). The rationale is:
  • The DataFusion optimizer would remove the ORDER BY if the order doesn't matter.
  • For float type, the order matters, so we need the ORDER BY
  • Postgres allows the SUM(ORDER BY ...).
  1. Error in the parser if ordering doesn't matter (That's what Spark does).

So, we will either change the parser or leave the current implementation as is. It seems to me we don't need this PR anymore?

@ozankabak
Copy link
Contributor

For ORDER BY, there are two options:

  1. Keep the current behavior, that is, allow SUM(ORDER BY ...). The rationale is:
  • The DataFusion optimizer would remove the ORDER BY if the order doesn't matter.
  • For float type, the order matters, so we need the ORDER BY
  • Postgres allows the SUM(ORDER BY ...).

Yes, let's follow this approach.

@alamb
Copy link
Contributor

alamb commented Apr 13, 2024

For IGNORE NULLS, I think we have consensus to error in the parser if aggregates don't support IGNORE NULLS.

Sounds good to me -- we can also move the error later if we want to enforce it for the expr_fn API as well

Thanks @huaxingao and @ozankabak and @comphead

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Jun 13, 2024
@github-actions github-actions bot closed this Jun 21, 2024
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some aggregates silently ignore IGNORE NULLS and ORDER BY on arguments
6 participants