-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support_ignore_nulls and support_ordering in Aggregate expression #9991
Conversation
----StreamingTableExec: partition_sizes=1, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST] | ||
|
||
query III | ||
statement error DataFusion error: This feature is not implemented: ORDER BY is not implemented for SUM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if I understand the requirement of #9924 correctly:
In my PR, I have support_ordering
to true for first
, last
, nth_value
and array_agg_ordered
. For all the other aggregate functions, support_ordering
is false and ORDER BY
returns not implemented
error.
Is this what we want? This seems to be a breaking change for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rewrite the test that excluded ordering for SUM in this case 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mustafasrepo / @ozankabak / @metesynnada can you help us understand if the ordering for SUM(c ORDER BY a DESC)
has meaning?
DataFusion currently ignores cases
We propose making it an error to include clauses that make no sense
However, I could see an argument for permitting the user to write ORDER BY
even if the aggregate didn't care about ordering (and simply remove the ORDER BY as a optimization)
I can also see the rationale for failing fast and erroring as the user probably didn't meant to specify an ordering on SUM 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A first principles viewpoint suggests this should depend on the data type. For integral types, ORDER BY
wouldn't matter for SUM
, but it shouldn't be an error to still specify it -- the optimizer should just remove it. For floating-point types, the summation order actually makes a difference in the result. Any data type for which addition doesn't commute, ORDER BY
will have an impact on the result.
However, I'm not sure if SQL standard says anything definitive in this matter. If not, it would be prudent to follow the results of this first-principles analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't thought about the implications for SUM(ORDER BY ...)
for floats 🤔 It appears that this is consistent with what postgres does:
postgres=# create table foo (x float);
CREATE TABLE
postgres=# insert into foo values (1.0);
INSERT 0 1
postgres=# insert into foo values (2.0);
INSERT 0 1
postgres=# insert into foo values (-1.0);
INSERT 0 1
postgres=# select sum(x ORDER BY x) from foo;
sum
-----
2
(1 row)
postgres=# select sum(x IGNORE NULLS) from foo;
ERROR: syntax error at or near "IGNORE"
LINE 1: select sum(x IGNORE NULLS) from foo;
^
postgres=#
I also verified that postgres actually does sort the input:
postgres=# explain select sum(x ORDER BY x) from foo;
QUERY PLAN
-------------------------------------------------------------------
Aggregate (cost=169.81..169.82 rows=1 width=8)
-> Sort (cost=158.51..164.16 rows=2260 width=8)
Sort Key: x
-> Seq Scan on foo (cost=0.00..32.60 rows=2260 width=8)
(4 rows)
postgres=#
}; | ||
|
||
let agg_name = aggregate_expr.name(); | ||
if ignore_nulls && !aggregate_expr.support_ignore_nulls() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer we check these at the beginning of the function, so we can avoid unnecessary computing for invalid cases
|
||
# Test for IGNORE NULLS / ORDER BY not implemented | ||
statement error DataFusion error: This feature is not implemented: IGNORE NULLS is not implemented for COUNT | ||
SELECT COUNT(*) IGNORE NULLS FROM (values (1), (null), (2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good, however this should be early exited on parser... I'll create a follow up for the parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its doable in parser we probably may want to revert those checks/tests from DF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree it is probably better to check on parser. Spark throws Exception on parser if IGNORE NULLS
is not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that we should still have the checks in the physical plan, for use cases such as Comet where we are not using the DataFusion SQL parsing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to check in the physical plan layer as well for the reason @andygrove says.
Another approach could be an analyzer rule (following the model of how certain subqueries are handled) like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Queries like that should fail by the invalid syntax and this is parsers work to stop it IMHO. And spark doesn't allow query like that, it fails on query compile stage.
scala> spark.sql("SELECT COUNT(*) IGNORE NULLS over () FROM (values (1), (null), (2));").show(false)
org.apache.spark.sql.AnalysisException: Function count does not support IGNORE NULLS.; line 1 pos 7
at org.apache.spark.sql.errors.QueryCompilationErrors$.functionWithUnsupportedSyntaxError(QueryCompilationErrors.scala:602)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @huaxingao -- I agree this does what the ticket requests; However given what you have found about SUM(ORDER BY ...)
it may make sense to permit that case 🤔
|
||
# Test for IGNORE NULLS / ORDER BY not implemented | ||
statement error DataFusion error: This feature is not implemented: IGNORE NULLS is not implemented for COUNT | ||
SELECT COUNT(*) IGNORE NULLS FROM (values (1), (null), (2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to check in the physical plan layer as well for the reason @andygrove says.
Another approach could be an analyzer rule (following the model of how certain subqueries are handled) like this:
----StreamingTableExec: partition_sizes=1, projection=[a, c, d], infinite_source=true, output_ordering=[a@0 ASC NULLS LAST] | ||
|
||
query III | ||
statement error DataFusion error: This feature is not implemented: ORDER BY is not implemented for SUM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mustafasrepo / @ozankabak / @metesynnada can you help us understand if the ordering for SUM(c ORDER BY a DESC)
has meaning?
DataFusion currently ignores cases
We propose making it an error to include clauses that make no sense
However, I could see an argument for permitting the user to write ORDER BY
even if the aggregate didn't care about ordering (and simply remove the ORDER BY as a optimization)
I can also see the rationale for failing fast and erroring as the user probably didn't meant to specify an ordering on SUM 🤔
this query is invalid, our parser allows it but it should fail. IGNORE NULLS have a limited functions to be working with. all other query engine parsers does. |
|
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look |
I am not so sure how to proceed with this PR. Do we still need this PR? For
So, we will either change the parser or leave the current implementation as is. It seems to me we don't need this PR anymore? |
Yes, let's follow this approach. |
Sounds good to me -- we can also move the error later if we want to enforce it for the Thanks @huaxingao and @ozankabak and @comphead |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Closes #9924.
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?