Removes min/max/count comparison based on name in aggregate statistics #12296

edmondop · 2024-09-02T18:55:57Z

Which issue does this PR close?

Closes #11151.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 · 2024-09-02T23:52:50Z

datafusion/expr/src/udaf.rs

@@ -262,6 +262,19 @@ impl AggregateUDF {
        self.inner.is_descending()
    }

+    /// Returns true if the function is min. Used by the optimizer
+    pub fn is_min(&self) -> bool {


This is not what we should do...
We need to understand the context and have a general function instead of a specialized name-matching function.

default_value is a good example: it is only used in count for now, but it could be extended to any function if needed.

@jayzhan211 looking at this comment from @alamb, #11151 (comment), it may be that those "general functions" do not exist. I will do some homework.

If there is really no "general context", I think it is fine to just leave them as they are. Matching on specific function names inside the trait impl doesn't make sense to me 🤔

I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

    /// Return the value of this aggregate function from statistics
    ///
    /// If the value of this aggregate can be determined using only the
    /// statistics, return `Some(value)`, otherwise return `None` (the default)
    ///
    /// # Arguments
    /// * `statistics`: the statistics describing the input to this aggregate function
    /// * `arguments`: the arguments passed to the aggregate function
    ///
    /// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX`
    /// can be determined using statistics, if known
    fn value_from_stats(
        &self,
        statistics: &Statistics,
        arguments: &[Arc<dyn PhysicalExpr>],
    ) -> Option<ScalarValue> {
        None
    }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like the schema, for types), but I think it would be pretty straightforward.
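To make the shape of this suggestion concrete, here is a minimal, self-contained sketch. It does not use the real DataFusion types: `Precision`, `Statistics`, `ScalarValue`, and the `AggregateImpl` trait below are hypothetical, trimmed-down stand-ins, and the `arguments` parameter is left out for brevity. The point is only that the default returns `None` and a function such as count overrides it.

```rust
/// Hypothetical, reduced stand-in for a statistics precision wrapper.
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical, reduced stand-in for table-level statistics.
#[derive(Debug, Clone)]
pub struct Statistics {
    pub num_rows: Precision<usize>,
}

/// Hypothetical, reduced stand-in for a scalar value.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Int64(Option<i64>),
}

/// Hypothetical aggregate trait: the hook describes a property of the
/// function ("can be answered from statistics"), not its name.
pub trait AggregateImpl: std::fmt::Debug {
    fn name(&self) -> &str;

    /// Default: the value cannot be derived from statistics.
    fn value_from_stats(&self, _statistics: &Statistics) -> Option<ScalarValue> {
        None
    }
}

#[derive(Debug)]
pub struct Count;

impl AggregateImpl for Count {
    fn name(&self) -> &str {
        "count"
    }

    /// Only an exact row count is trusted; anything else yields `None`.
    fn value_from_stats(&self, statistics: &Statistics) -> Option<ScalarValue> {
        match statistics.num_rows {
            Precision::Exact(n) => Some(ScalarValue::Int64(Some(n as i64))),
            _ => None,
        }
    }
}

#[derive(Debug)]
pub struct Sum;

impl AggregateImpl for Sum {
    fn name(&self) -> &str {
        "sum"
    }
    // No override: `Sum` keeps the default and is never answered from stats.
}
```

With this shape, a caller never has to ask whether a function "is count"; it only asks for a value and falls back to normal execution on `None`.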

alamb

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

alamb · 2024-09-06T14:53:42Z


alamb · 2024-09-11T14:56:07Z

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

edmondop · 2024-09-13T23:34:05Z

I tried to implement that solution, but the problem is now here:

https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437

There is no real distinction between count and min/max, so the non-distinct count is not picked up at line 61 but only at line https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437R65

We have two options, in my opinion:

- pass distinct to value_from_stats
- have a separate method for count

jayzhan211 · 2024-09-14T00:34:14Z

datafusion/physical-optimizer/src/aggregate_statistics.rs

-        }
-    }
-    None
+    let value = agg_expr.fun().value_from_stats(


@edmondop
We have agg_expr.is_distinct() here, so you can differentiate the distinct count from the non-distinct case.

Maybe we can have StatisticsArgs similar to AccumulatorArgs.

    pub struct StatisticsArgs<'a> {
        statistics: &'a Statistics,
        return_type: &'a DataType,
        /// Whether the aggregate function is distinct.
        ///
        /// ```sql
        /// SELECT COUNT(DISTINCT column1) FROM t;
        /// ```
        pub is_distinct: bool,
        /// The physical expression of arguments the aggregate function takes.
        pub exprs: &'a [Arc<dyn PhysicalExpr>],
    }
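As a rough illustration of how such an argument bundle could carry the distinct flag, here is a small sketch. The types are hypothetical stand-ins rather than the DataFusion ones, the bundle is cut down to two fields, and `count_value_from_stats` is a made-up free function standing in for count's implementation of the hook.

```rust
/// Hypothetical, reduced stand-in for a statistics precision wrapper.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical, reduced stand-in for table-level statistics.
pub struct Statistics {
    pub num_rows: Precision<usize>,
}

/// Reduced version of the proposed argument bundle (assumption: only the
/// two fields needed for this example are kept).
pub struct StatisticsArgs<'a> {
    pub statistics: &'a Statistics,
    /// Whether the aggregate was written as `COUNT(DISTINCT ...)`.
    pub is_distinct: bool,
}

/// Hypothetical count-side hook: a distinct count cannot be read off the
/// row count, so it returns `None`; a plain count uses an exact row count.
pub fn count_value_from_stats(args: &StatisticsArgs) -> Option<i64> {
    if args.is_distinct {
        return None;
    }
    match args.statistics.num_rows {
        Precision::Exact(n) => Some(n as i64),
        _ => None,
    }
}
```

Bundling the inputs in one struct means that a later need (for example a schema or return type) adds a field instead of changing the hook's signature.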

So the function for count would return None if the aggregate is distinct? It feels like leaking logic required by the physical optimizer into the UDF.

Re: "It felt like leaking in the UDF logic required by physical optimiser": I think it makes sense. By default it returns None, but we can also tune the function for the optimizer 🤔

Maybe we need to find a better name than value_from_stats? It is not clear to me why distinct matters for count if you are simply "getting the value" from stats when the precision is Exact.

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

The optimization for count is mainly for count(*). Since it doesn't make sense for the distinct case, we only care about non-distinct count in value_from_stats, not distinct count.

Re: "It seems that distinct has specifically to do with the optimizer and not with 'getting a value for the specific UDF from statistics'": distinct lets value_from_stats know which kind of function we have, either count or distinct count.
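For the optimizer side of this discussion, the following is a hedged sketch of what a name-agnostic rule could look like once each function exposes such a hook. Everything here is hypothetical: `AggExpr`, `CountStar`, `SumCol`, and `try_resolve_all` are illustrative names, and the all-or-nothing behavior (replace the aggregation only if every expression resolves) is an assumption of the sketch, not a statement about the PR's code.

```rust
/// Hypothetical stand-in for an aggregate expression in a physical plan.
pub trait AggExpr {
    /// Returns the final value if an exact row count alone determines it.
    fn value_from_stats(&self, exact_num_rows: Option<usize>) -> Option<i64>;
}

/// Stand-in for a non-distinct `count(*)`.
pub struct CountStar;

/// Stand-in for an aggregate that statistics cannot answer.
pub struct SumCol;

impl AggExpr for CountStar {
    fn value_from_stats(&self, exact_num_rows: Option<usize>) -> Option<i64> {
        exact_num_rows.map(|n| n as i64)
    }
}

impl AggExpr for SumCol {
    fn value_from_stats(&self, _exact_num_rows: Option<usize>) -> Option<i64> {
        None
    }
}

/// Returns one constant per aggregate only if every aggregate can be
/// answered from statistics; a single `None` makes the whole result `None`.
/// The rule never inspects function names.
pub fn try_resolve_all(
    exprs: &[Box<dyn AggExpr>],
    exact_num_rows: Option<usize>,
) -> Option<Vec<i64>> {
    exprs
        .iter()
        .map(|e| e.value_from_stats(exact_num_rows))
        .collect()
}
```

Collecting an iterator of `Option` values into `Option<Vec<_>>` short-circuits on the first `None`, which matches the all-or-nothing assumption. Note that an empty slice resolves to an empty vector, so a real rule would likely guard against that case.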

github-actions bot added the logical-expr (Logical plan and expressions), optimizer (Optimizer rules), and functions labels on Sep 2, 2024

Removes min/max/count comparison based on name in aggregate statistics

f478344

edmondop force-pushed the issue-11151 branch from b9262ec to f478344 on September 2, 2024 18:57

jayzhan211 reviewed Sep 2, 2024

alamb reviewed Sep 6, 2024

This was referenced Sep 6, 2024

Remove special casting of Min / Max built in AggregateFunctions #11151

Open

DataFusion weekly project plan (Andrew Lamb) - Sep 2, 2024 #12336

Closed

alamb marked this pull request as draft September 11, 2024 14:56

edmondop added 2 commits September 13, 2024 23:17

Abstracting away value from statistics

751e35a

Removing imports

016decf

jayzhan211 reviewed Sep 14, 2024
