Add "skip nulls" property to aggregate function invocation #401

thisisnic · 2022-12-05T11:42:59Z

Some backends (e.g. DuckDB) remove NULL values by default in computations involving scalar aggregate functions. Others (e.g. Acero) offer an option to instead return a NULL value if any of the input values is NULL.

In #388 (closed, not merged), I proposed adding this as an option to each of the scalar aggregate functions, but the review discussion recommended instead adding it as a property of the aggregate function invocation.

westonpace · 2022-12-05T14:20:16Z

What should the default behavior be? I think I'd prefer skipping nulls by default, but will every function "skip"? I don't have a good counter-example at the moment (Postgres' array_agg does not skip nulls, but it's a bit of an odd aggregate function and doesn't exist in Substrait).

Default - No special behavior. Nulls are skipped unless otherwise specified by the function.
Alternate - If any input value is null then the output value is null.
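To make the two options concrete, here is a minimal sketch in Python. The function name `sum_agg` and the `skip_nulls` parameter are illustrative assumptions, not part of Substrait or any engine's API; `None` stands in for null.

```python
from typing import Optional, Sequence


def sum_agg(values: Sequence[Optional[float]], skip_nulls: bool = True) -> Optional[float]:
    """Hypothetical sum aggregate showing the two proposed null behaviors."""
    if not skip_nulls and any(v is None for v in values):
        # Alternate: if any input value is null, the output is null.
        return None
    non_null = [v for v in values if v is not None]
    # Default: nulls are skipped. An empty or all-null input yields null,
    # which mirrors how SQL SUM behaves.
    return sum(non_null) if non_null else None


default_result = sum_agg([1, None, 2])                      # nulls skipped
alternate_result = sum_agg([1, None, 2], skip_nulls=False)  # null propagates
```

With the default, the null is ignored and the remaining values are summed; with the alternate, a single null input is enough to make the result null.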

Also, I'm not super familiar with the rationale behind the null handling logic in Acero. Do you know why this "emit null if any input is null" behavior was desired in the first place? I think Acero might be a bit of an oddball here.

thisisnic · 2022-12-05T14:42:04Z

I agree that the default should be to skip them, and not skipping them should be the exceptional case.

In R, there are many scalar aggregate functions which have an na.rm argument which equates to Acero's skip_nulls; this may have been why it was added there (and is why I'm requesting this feature now).

If it's exceptionally niche and not something we'd want to support here, there are workarounds I can implement in the R Substrait producer (i.e. wrap the bindings to the scalar aggregate functions in further calls to other functions which first check if any results are NULL, and return either NULL or the calculated value depending on the outcome).
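A rough sketch of that workaround, written in Python rather than as an actual Substrait plan. All function names here are hypothetical; the idea is only that a null-skipping aggregate can be wrapped in a check that compares the non-null count with the row count.

```python
from typing import Optional, Sequence


def count_rows(values: Sequence[Optional[float]]) -> int:
    # Counts every row, null or not (like COUNT(*)).
    return len(values)


def count_non_null(values: Sequence[Optional[float]]) -> int:
    # Counts only non-null rows (like COUNT(x)).
    return sum(1 for v in values if v is not None)


def sum_skip_nulls(values: Sequence[Optional[float]]) -> Optional[float]:
    # Stand-in for an engine aggregate that always skips nulls.
    non_null = [v for v in values if v is not None]
    return sum(non_null) if non_null else None


def sum_emit_null(values: Sequence[Optional[float]]) -> Optional[float]:
    # Wrapper: return null if any input was null, else the computed value.
    if count_non_null(values) < count_rows(values):
        return None
    return sum_skip_nulls(values)
```

The wrapper needs two extra aggregates per measure, which is the cost of expressing this on the producer side instead of as a property of the invocation.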

westonpace · 2022-12-05T16:46:30Z

> If it's exceptionally niche and not something we'd want to support here, there are workarounds I can implement in the R Substrait producer (i.e. wrap the bindings to the scalar aggregate functions in further calls to other functions which first check if any results are NULL, and return either NULL or the calculated value depending on the outcome).

That would work. We could also consider this a "physical optimization" in Acero: if we recognize this pattern (collapse-to-null followed by an aggregate function), we could collapse it into a single aggregate operator with skip_nulls=false.

That being said, I'd consider R/dplyr to be a separate "engine" and so now I suppose there are two engines that support this feature. I'd be curious to hear what others think.

ianmcook · 2022-12-05T18:20:33Z

The default should definitely be to skip nulls. That's what most engines do.

IMO it does seem worthwhile to expose an option for aggregates to emit null if the input contains any nulls. Besides R working this way by default (which is peculiar), the aggregate functions in pandas also offer this as an option (skipna=False). So it seems worthwhile to allow producers to easily represent this intended behavior in Substrait plans. It seems like a win in terms of expressiveness and explicitness, even if support for it on the consumers side is spotty.

westonpace · 2023-03-05T13:15:20Z

I was thinking about this some more this weekend and realized we already have the capability to express this logically. The message AggregateRel::Measure is defined as:

  message Measure {
    AggregateFunction measure = 1;

    // An optional boolean expression that acts to filter which records are
    // included in the measure. True means include this record for calculation
    // within the measure.
    // Helps to support SUM(<c>) FILTER(WHERE...) syntax without masking opportunities for optimization
    Expression filter = 2;
  }

So skip_nulls should be equivalent to defining a measure whose filter is "x is not null", where x is the argument of the aggregate function.
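A toy model of that equivalence, in Python. The `Measure` class below only imitates the shape of the protobuf message (an aggregate plus an optional filter); it is not the Substrait API, and `strict_sum` is a hypothetical aggregate that, by assumption, returns null whenever it sees a null.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def strict_sum(values: List[Optional[float]]) -> Optional[float]:
    # Hypothetical aggregate that does NOT skip nulls by itself.
    if not values or any(v is None for v in values):
        return None
    return sum(values)


@dataclass
class Measure:
    """Toy stand-in for AggregateRel.Measure: a function plus an optional filter."""
    function: Callable[[List[Optional[float]]], Optional[float]]
    filter: Optional[Callable[[Optional[float]], bool]] = None

    def evaluate(self, column: List[Optional[float]]) -> Optional[float]:
        # Only records for which the filter is true reach the aggregate.
        rows = column if self.filter is None else [v for v in column if self.filter(v)]
        return self.function(rows)


column = [1, None, 2]
unfiltered = Measure(strict_sum).evaluate(column)
filtered = Measure(strict_sum, filter=lambda x: x is not None).evaluate(column)
```

Here the unfiltered measure sees the null and returns null, while the measure filtered on "x is not null" behaves like a null-skipping sum.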

Now, I agree that skipping nulls is something that can be done more cheaply than applying an arbitrary filter. So, in a physical operator, it might make sense to have a dedicated skip_nulls. I don't think it belongs in the logical AggregateRel that we have today.

wmalpica · 2023-03-22T19:27:25Z

I just want to comment that a very common case of opting not to skip nulls is a COUNT type of aggregation.
Consider the SQL statement SELECT COUNT(*) as c_star, COUNT(colA) as c_A FROM my_table GROUP BY colB. Here COUNT(*) would not ignore nulls, while COUNT(colA) would ignore nulls.
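The difference is easy to reproduce with Python's built-in sqlite3 module. The table contents are made up for illustration, and colB plus an ORDER BY are added to the query only to make the output readable and deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (colA INTEGER, colB TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "x"), (None, "x"), (3, "y")],  # one null colA in group "x"
)

rows = conn.execute(
    "SELECT colB, COUNT(*) AS c_star, COUNT(colA) AS c_A "
    "FROM my_table GROUP BY colB ORDER BY colB"
).fetchall()
conn.close()

# Group "x" has two rows but only one non-null colA.
print(rows)  # [('x', 2, 1), ('y', 1, 1)]
```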

westonpace · 2023-03-23T13:23:28Z

@wmalpica the Substrait equivalent of COUNT(*) is the 0-arg version, e.g. count(). It is defined in more detail here: https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml

thisisnic added the "enhancement" (New feature or request) label on Dec 5, 2022

thisisnic mentioned this issue on Dec 5, 2022 in: feat: add skip_nulls option to scalar aggregate functions #388 (Closed)

westonpace mentioned this issue on Mar 7, 2023 in: feat: adding SUM0 definition for aggregate functions #465 (Merged)
