Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes min/max/count comparison based on name in aggregate statistics #12296

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

edmondop
Copy link
Contributor

@edmondop edmondop commented Sep 2, 2024

Which issue does this PR close?

Closes #11151 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules functions labels Sep 2, 2024
@@ -262,6 +262,19 @@ impl AggregateUDF {
self.inner.is_descending()
}

/// Returns true if the function is min. Used by the optimizer
pub fn is_min(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not we should do...
We need to understand the context and have a general function instead of a specialize name matching function.

default_value is a good example, it is only used in count for now, but it could extend to any function if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 looking to this comment from @alamb #11151 (comment) it might be that those "general functions" might not exists, will do some homework

Copy link
Contributor

@jayzhan211 jayzhan211 Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is really no "general context", I think it is fine to just leave them as it is. Having specific function name matching in Impl Trait doesn't make sense to me 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

/// Return the value of this aggregate function from statistics
///
/// If the value of this aggregate, can be determined using only the 
/// statistics, return `Some(value)`, otherwise return `None` (the default)
///
/// # Arguments
/// * `statistics`: the statistics describing the input to this aggregate functions
/// * `args`: the arguments passed to the aggregate function 
///
/// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX` 
/// can be determined using statistics, if known
///
fn value_from_stats(&self, statistics: &Statistics, arguments: &[Arc<PhysicalExpr>]) -> Option<ScalarValue> { None }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like schema for types, for example) but I think it would be pretty straight forward

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

@@ -262,6 +262,19 @@ impl AggregateUDF {
self.inner.is_descending()
}

/// Returns true if the function is min. Used by the optimizer
pub fn is_min(&self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jayzhan211 that it is important that these methods describe some property about the function rather than "what the function is"

What if we added a function like this:

/// Return the value of this aggregate function from statistics
///
/// If the value of this aggregate, can be determined using only the 
/// statistics, return `Some(value)`, otherwise return `None` (the default)
///
/// # Arguments
/// * `statistics`: the statistics describing the input to this aggregate functions
/// * `args`: the arguments passed to the aggregate function 
///
/// The value of some aggregate functions such as `COUNT`, `MIN` and `MAX` 
/// can be determined using statistics, if known
///
fn value_from_stats(&self, statistics: &Statistics, arguments: &[Arc<PhysicalExpr>]) -> Option<ScalarValue> { None }

I think you could then implement this function for min / max and count (moving logic out of the aggregate statistics optimizer). It might need some other information (like schema for types, for example) but I think it would be pretty straight forward

@alamb
Copy link
Contributor

alamb commented Sep 11, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@edmondop
Copy link
Contributor Author

Thanks @edmondop and @jayzhan211 - I have a suggestion about how to proceed. Let me know if that makes sense

Sorry (again) for the delay

I tried to implement that solution, but the problem is now here

https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437

that there is no real distinction between count and min_max. So the non distinct count doesn't get taken at line 61 but only at line https://github.com/apache/datafusion/pull/12296/files#diff-bca16ed42e39a82d942b706ad36b0d49502c12c9bc1c50e33ed443ce8c4d0437R65

We have two options imho:

  • passing distinct to get_value from stats
  • have a separate method for count?

}
}
None
let value = agg_expr.fun().value_from_stats(
Copy link
Contributor

@jayzhan211 jayzhan211 Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edmondop
We have agg_expr.is_distinct() here, so you can differentiate distinct count and non-distinct case.

Maybe we can have StatisticsArgs similar to AccumulatorArgs.

pub struct StatisticsArgs<'a> {
    statistics: &'a Statistics,
    return_type: &'a DataType,
    /// Whether the aggregate function is distinct.
    ///
    /// ```sql
    /// SELECT COUNT(DISTINCT column1) FROM t;
    /// ```
    pub is_distinct: bool,
    /// The physical expression of arguments the aggregate function takes.
    pub exprs: &'a [Arc<dyn PhysicalExpr>],
}

Copy link
Contributor Author

@edmondop edmondop Sep 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the function for count would return None if the aggregate is distinct? It felt like leaking in the UDF logic required by physical optimiser

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It felt like leaking in the UDF logic required by physical optimiser

I think it makes sense, by default it has None, but we can also tune the function for optimizer 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need to find a better name than value_from_stats? I am not very clear why the distinct for count is important if you are simply "getting the value" from stats, if the precision is Exact

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optimization for count is mainly for count(*), since it doesn't make sense for distinct case, so we only care about non-distinct count for value_from_stats, but not distinct count.

It seems that distinct has specifically to do with the optimizer and not with "getting a value for the specific UDF from statistics"

distinct is for value_from_stats to know what kind of function we have, either count or distinct count.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove special casting of Min / Max built in AggregateFunctions
3 participants