Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move Covariance (Sample) covar / covar_samp to be a User Defined Aggregate Function #10372

Merged
merged 7 commits into from
May 6, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented May 4, 2024

Which issue does this PR close?

Part of #8708

Rationale for this change

We are moving aggregate functions out of the core to ensure the core APIs are sufficient to implement all aggregates and make datafusion more configurable

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels May 4, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review May 4, 2024 03:26
@alamb alamb changed the title UDAF: Covariance Sample Move Covariance (Sample) covariance_samp to be a User Defined Aggregate Function May 5, 2024
@@ -63,8 +63,6 @@ pub enum AggregateFunction {
Stddev,
/// Standard Deviation (Population)
StddevPop,
/// Covariance (Sample)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR is to remove this variant and make it a user defined aggregate

@alamb alamb changed the title Move Covariance (Sample) covariance_samp to be a User Defined Aggregate Function Move Covariance (Sample) covar / covar_samp to be a User Defined Aggregate Function May 5, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211

This looks great and I think it is a nice verification that we can extract aggregates from the code. I left some small suggestions but I don't think they are necessary -- I also updated the title and description of this PR to be more description

I believe this function (and related ones) are part of what @yyy1000 was working on when we started down the path to extract this type of thing from the core.

// specific language governing permissions and limitations
// under the License.

//! Defines the covariance aggregations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! Defines the covariance aggregations.
//! [`CovarianceSample`]: covariance aggregations.

f.debug_struct("CovarianceSample")
.field("name", &self.name())
.field("signature", &self.signature)
.field("accumulator", &"<FUNC>")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably don't need the accumulator field in the debug printoug as it doesn't exist in the structure

}

/// An accumulator to compute covariance
/// The algrithm used is an online implementation and numerically stable. It is derived from the following paper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// The algrithm used is an online implementation and numerically stable. It is derived from the following paper
/// The algorithm used is an online implementation and numerically stable. It is derived from the following paper

}

fn name(&self) -> &str {
"covar"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor nitpick here is that the name of the struct is CovarianceSample but the name is covar (with alias covar_samp)

It would be better in my opinion of name() and the struct name were consistent -- so Covariance or name() to return "covariance_pop"

Copy link
Contributor Author

@jayzhan211 jayzhan211 May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rename it to covar_samp like Postgres, covar is now alias

@@ -0,0 +1,25 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the StatsType is only used for functions in defined in datafusion-functions-aggregate so this module could go in datafusion-functions-aggregate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me take a to-do note. Before all other functions are moved to functions-aggregate we need it to keep in common, since we don't import physical-expr into functions-aggregate

@alamb alamb added the api change Changes the API exposed to users of the crate label May 5, 2024
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented May 5, 2024

Thank you @jayzhan211

This looks great and I think it is a nice verification that we can extract aggregates from the code. I left some small suggestions but I don't think they are necessary -- I also updated the title and description of this PR to be more description

I believe this function (and related ones) are part of what @yyy1000 was working on when we started down the path to extract this type of thing from the core.

Thanks for your review.
I plan to work on Sum or others that have additional features, but not too complex ones. covar_pop can be a good first issue.

@alamb alamb merged commit a0fccbf into apache:main May 6, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented May 6, 2024

Thanks again @jayzhan211

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request May 13, 2024
Upstream is continuing it's migration to UDFs.

Ref apache/datafusion#10098
Ref apache/datafusion#10372
andygrove pushed a commit to apache/datafusion-python that referenced this pull request May 14, 2024
* chore: upgrade datafusion Deps

Ref #690

* update concat and concat_ws to use datafusion_functions

Moved in apache/datafusion#10089

* feat: upgrade functions.rs

Upstream is continuing it's migration to UDFs.

Ref apache/datafusion#10098
Ref apache/datafusion#10372

* fix ScalarUDF import

* feat: remove deprecated suppors_filter_pushdown and impl supports_filters_pushdown

Deprecated function removed in apache/datafusion#9923

* use `unnest_columns_with_options` instead of deprecated `unnest_column_with_option`

* remove ScalarFunction wrappers

These relied on upstream BuiltinScalarFunction, which are now removed.

Ref apache/datafusion#10098

* update dataframe `test_describe`

`null_count` was fixed upstream.

Ref apache/datafusion#10260

* remove PyDFField and related methods

DFField was removed upstream.

Ref: apache/datafusion#9595

* bump `datafusion-python` package version to 38.0.0

* re-implement `PyExpr::column_name`

The previous implementation relied on `DFField` which was removed upstream.

Ref: apache/datafusion#9595
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants