Add "skip nulls" property to aggregate function invocation #401
Comments
What should the default behavior be? I think I'd prefer skipping nulls by default, but will every function "skip"? I don't have any good counter-example at the moment (postgres' Default - No special behavior. Nulls are skipped unless otherwise specified by the function. Also, I'm not super familiar with the rationale behind the null handling logic in Acero. Do you know why this "emit null if any input is null" behavior was desired in the first place? I think Acero might be a bit of an oddball here.
I agree that the default should be to skip them, and not skipping them should be the exceptional case. In R, there are many scalar aggregate functions which have an If it's exceptionally niche and not something we'd want to support here, there are workarounds I can implement in the R Substrait producer (i.e. wrap the bindings to the scalar aggregate functions in further calls to other functions which first check if any results are NULL, and return either NULL or the calculated value depending on the outcome).
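To make the workaround above concrete, here is a minimal sketch in plain Python, with `None` standing in for NULL. The names `sum_skip_nulls` and `null_if_any_null` are hypothetical and are not part of the R Substrait producer, Substrait, or Acero; the sketch only illustrates wrapping a null-skipping aggregate in a check that returns NULL when any input is NULL.

```python
from typing import Callable, Optional, Sequence

Value = Optional[float]


def sum_skip_nulls(values: Sequence[Value]) -> Value:
    """Null-skipping SUM: ignore None; return None if nothing is left."""
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def null_if_any_null(
    aggregate: Callable[[Sequence[Value]], Value],
) -> Callable[[Sequence[Value]], Value]:
    """Hypothetical wrapper: emit None when any input is None,
    otherwise defer to the wrapped null-skipping aggregate."""

    def wrapped(values: Sequence[Value]) -> Value:
        if any(v is None for v in values):
            return None
        return aggregate(values)

    return wrapped


# The "do not skip nulls" variant, built only from the skipping aggregate.
strict_sum = null_if_any_null(sum_skip_nulls)
```

With this wrapper, `sum_skip_nulls([1, None, 2])` yields `3`, while `strict_sum([1, None, 2])` yields `None`.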
That would work. We could also consider this a "physical optimization" in Acero and, if we recognize this pattern (collapse-to-null followed by aggregate function), we could collapse it into a single aggregate operator with skip_nulls=false. That being said, I'd consider R/dplyr to be a separate "engine" and so now I suppose there are two engines that support this feature. I'd be curious to hear what others think.
The default should definitely be to skip nulls. That's what most engines do. IMO it does seem worthwhile to expose an option for aggregates to emit null if the input contains any nulls. Besides R working this way by default (which is peculiar), the aggregate functions in pandas also offer this as an option ( |
I was thinking about this some more this weekend and realized we already have the capability to express this logically. The message
So Now, I agree that skipping nulls is something that can be done more cheaply than applying an arbitrary filter. So, in a physical operator, it might make sense to have a dedicated |
I just want to comment that a very common case of opting to not skip nulls is for a COUNT type of aggregation.
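As an illustration of the COUNT case, SQL's `COUNT(*)` counts every row while `COUNT(col)` counts only the non-NULL values of `col`. The sketch below mirrors that distinction in plain Python with `None` standing in for NULL; the function names are hypothetical and do not correspond to any Substrait or Acero function.

```python
from typing import Optional, Sequence


def count_all(values: Sequence[Optional[float]]) -> int:
    """Analogue of COUNT(*): every row counts, NULL or not."""
    return len(values)


def count_non_null(values: Sequence[Optional[float]]) -> int:
    """Analogue of COUNT(col): NULL values are skipped."""
    return sum(1 for v in values if v is not None)


column = [10.0, None, 30.0, None]
total_rows = count_all(column)
non_null_rows = count_non_null(column)
```

Here `total_rows` is 4 and `non_null_rows` is 2, so the same input gives different counts depending on whether nulls are skipped.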
@wmalpica the Substrait equivalent of |
Some backends (e.g. DuckDB) remove NULL values by default in computations involving scalar aggregate functions; others (e.g. Acero) allow specifying an option to return a NULL value if any of the input values are NULL.
In #388 (closed, not merged), I proposed adding this as an option to each of the scalar aggregate functions, but in the review, there was discussion recommending instead adding this as a property to the aggregate function invocation.
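A rough sketch of the difference between the two designs: instead of each aggregate function carrying its own option, the flag lives on the invocation and applies uniformly to whichever function is invoked. The `AggregateInvocation` class and its `skip_nulls` field below are hypothetical names chosen for illustration; they are not the Substrait message definition, and `None` stands in for NULL.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class AggregateInvocation:
    """Hypothetical invocation record: a function plus a skip_nulls property."""

    function: Callable[[Sequence[float]], float]
    skip_nulls: bool = True  # assumed default, per the discussion in this issue

    def evaluate(self, values: Sequence[Optional[float]]) -> Optional[float]:
        # With skip_nulls=False, a single NULL input makes the result NULL.
        if not self.skip_nulls and any(v is None for v in values):
            return None
        non_null = [v for v in values if v is not None]
        # Simplification: an input with no non-NULL values yields NULL.
        # (A COUNT-style function would return 0 here instead.)
        if not non_null:
            return None
        return self.function(non_null)


skipping = AggregateInvocation(function=sum)
strict = AggregateInvocation(function=max, skip_nulls=False)
```

The same property works for `sum`, `max`, or any other function passed in, which is the appeal of putting it on the invocation rather than repeating it per function.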