Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "skip nulls" property to aggregate function invocation #401

Open
thisisnic opened this issue Dec 5, 2022 · 7 comments
Open

Add "skip nulls" property to aggregate function invocation #401

thisisnic opened this issue Dec 5, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@thisisnic
Copy link
Contributor

Some backends (e.g. DuckDB) remove NULL values by default in computations involving scalar aggregate functions, others (e.g. Acero) allow specifying this option to return an NULL value if any of the input values are NULL.

In #388 (closed, not merged), I proposed adding this as an option to each of the scalar aggregate functions, but in the review, there was discussion recommending instead adding this as a property to the aggregate function invocation.

@westonpace
Copy link
Member

What should the default behavior be? I think I'd prefer skipping nulls by default but will every function "skip". I don't have any good counter-example at the moment (postgres' array_agg does not skip nulls but it's a bit of an odd aggregate function and doesn't exist in Substrait).

Default - No special behavior. Nulls are skipped unless otherwise specified by the function.
Alternate - If any input value is null then the output value is null.

Also, I'm not super familiar with the rationale behind the null handling logic in Acero. Do you know why this "emit null if any input is null" behavior was desired in the first place? I think Acero might be a bit of an oddball here.

@thisisnic
Copy link
Contributor Author

I agree that the default should be to skip them, and not skipping them should be the exceptional case.

In R, there are many scalar aggregate functions which have an na.rm argument which equates to Acero's skip_nulls; this may have been why it was added there (and is why I'm requesting this feature now).

If it's exceptionally niche and not something we'd want to support here, there are workarounds I can implement in the R Substrait producer (i.e. wrap the bindings to the scalar aggregate functions in further calls to other functions which first check if any results are NULL, and return either NULL or the calculated value depending on the outcome).

@westonpace
Copy link
Member

If it's exceptionally niche and not something we'd want to support here, there are workarounds I can implement in the R Substrait producer (i.e. wrap the bindings to the scalar aggregate functions in further calls to other functions which first check if any results are NULL, and return either NULL or the calculated value depending on the outcome).

That would work. We could also consider this a "physical optimization" in Acero and, if we recognize this pattern (collapse-to-null followed by aggregate function) we could collapse it into a single aggregate operator with skip_nulls=false.

That being said, I'd consider R/dplyr to be a separate "engine" and so now I suppose there are two engines that support this feature. I'd be curious to hear what others think.

@ianmcook
Copy link
Contributor

ianmcook commented Dec 5, 2022

The default should definitely be to skip nulls. That's what most engines do.

IMO it does seem worthwhile to expose an option for aggregates to emit null if the input contains any nulls. Besides R working this way by default (which is peculiar), the aggregate functions in pandas also offer this as an option (skipna=False). So it seems worthwhile to allow producers to easily represent this intended behavior in Substrait plans. It seems like a win in terms of expressiveness and explicitness, even if support for it on the consumers side is spotty.

@westonpace
Copy link
Member

I was thinking about this some more this weekend and realized we already have the capability to express this logically. The message AggregateRel::Measure is defined as:

  message Measure {
    AggregateFunction measure = 1;

    // An optional boolean expression that acts to filter which records are
    // included in the measure. True means include this record for calculation
    // within the measure.
    // Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking opportunities for optimization
    Expression filter = 2;
  }

So skip_nulls should be equivalent to defining a measure where filter is x is not null.

Now, I agree that skipping nulls is something that can be done more cheaply than applying an arbitrary filter. So, in a physical operator, it might make sense to have a dedicated skip_nulls. I don't think it belongs in the logical AggregateRel that we have today.

@wmalpica
Copy link

I just want to comment that a very common case of opting to not skip nulls is for a COUNT type of aggregation.
If you think of a SQL statement SELECT COUNT(*) as c_star, COUNT(colA) as c_A FROM my_table GROUP BY colB then COUNT(*) would not ignore nulls while COUNT(colA) would ignore nulls

@westonpace
Copy link
Member

@wmalpica the Substrait equivalent of COUNT(*) is the 0-arg version e.g. count(). Defined in more detail here: https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants