Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert ApproxPercentileCont and ApproxPercentileContWithWeight to UDAF #10917

Merged

Conversation

goldmedal
Copy link
Contributor

Which issue does this PR close?

Closes #10870 and convert ApproxPercentileContWithWeight to udaf.

Rationale for this change

What changes are included in this PR?

  • Pass original logical expression args to AccumulatorArgs.
  • Remove args_num from AccumulatorArgs, use args.len() instead.
  • convert ApproxPercentileCont and ApproxPercentileContWithWeight to udaf

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jun 14, 2024
@goldmedal goldmedal marked this pull request as draft June 14, 2024 16:16
@github-actions github-actions bot added the optimizer Optimizer rules label Jun 14, 2024
Comment on lines +499 to +501
// TODO: 'logical_exprs' is not supported for UDAF yet.
// approx_percentile_cont and approx_percentile_cont_weight are not supported for UDAF from protobuf yet.
let logical_exprs = &[];
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find an easy way to get the logical plan from protobuf. I think maybe we need to convert the physical plan back to the logical plan. Otherwise, approx_percentile_cont and approx_percentile_cont_weighted can't be supported for protobuf.

@jayzhan211, what do you think? Are there any existing functions or tools that can do this?
Thanks.

Copy link
Contributor

@jayzhan211 jayzhan211 Jun 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the error if we leave it empty? I run the test in logical plan that includes approx_percentile_cont and it seems correct 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will be correct for the logical plan. However, I think it will cause some issues when processing the physical plan. Curiously, should we expect the physical roundtrip to work? I tried to add a roundtrip test for it indatafusion/proto/tests/cases/roundtrip_physical_plan.rs like:

#[test]
fn roundtrip_approx_percentile_cont_udaf() -> Result<()> {
    let field_a = Field::new("a", DataType::Int64, false);
    let field_b = Field::new("b", DataType::Float64, false);
    let schema = Arc::new(Schema::new(vec![field_a, field_b]));
    let udaf = (*approx_percentile_cont_udaf()).clone();

    let ctx = SessionContext::new();
    ctx.register_udaf(udaf.clone());

    let groups: Vec<(Arc<dyn PhysicalExpr>, String)> =
        vec![(col("a", &schema)?, "unused".to_string())];

    let aggregates: Vec<Arc<dyn AggregateExpr>> = vec![udaf::create_aggregate_expr(
        &udaf,
        &[col("b", &schema)?, lit(ScalarValue::Float64(Some(0.5)))],
        &[datafusion_expr::col("b"), datafusion_expr::lit(ScalarValue::Float64(Some(0.5)))],
        &[],
        &[],
        &schema,
        "approx_percentile_cont_agg",
        false,
        false,
    )?];

    roundtrip_test_with_context(
        Arc::new(AggregateExec::try_new(
            AggregateMode::Final,
            PhysicalGroupBy::new_single(groups.clone()),
            aggregates.clone(),
            vec![None],
            Arc::new(EmptyExec::new(schema.clone())),
            schema,
        )?),
        &ctx,
    )
}

It will lose the logical_plan when converting back to the physical plan from protobuf.

Is this test necessary? I only see an physical-plan roudtrip testing of udaf called roundtrip_aggregate_udaf.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to care about logical plan conversion in physical plan roundtrip, but I'm not pretty sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I thought Flight SQL might be a use case for from_proto. However, I tried to query approx_percentile_cont through Flight SQL, and it worked well. Maybe we can just ignore it and leave this TODO comment until we find any from_proto use case.

@goldmedal goldmedal marked this pull request as ready for review June 14, 2024 17:28
@goldmedal goldmedal marked this pull request as draft June 14, 2024 17:52
@goldmedal
Copy link
Contributor Author

It's really weird. I can't reproduce the CI failed in my local environment.
https://github.com/apache/datafusion/actions/runs/9529829557/job/26268998329?pr=10917

External error: statement is expected to fail with error:
	(regex) DataFusion error: Error during planning: .* No function matches the given name and argument types 'approx_percentile_cont_with_weight\(Utf8, Int8, Float64\)'\. You might need to add explicit type casts\.(.|\n)*
but got error:
	DataFusion error: Error during planning: Error during planning: Coercion from [Utf8, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed.
[SQL] SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM aggregate_test_100
at test_files/aggregate.slt:79

The error message thrown in my local is difference from the CI environment... :(

@goldmedal goldmedal marked this pull request as ready for review June 15, 2024 18:25
@jayzhan211
Copy link
Contributor

It's really weird. I can't reproduce the CI failed in my local environment. https://github.com/apache/datafusion/actions/runs/9529829557/job/26268998329?pr=10917

External error: statement is expected to fail with error:
	(regex) DataFusion error: Error during planning: .* No function matches the given name and argument types 'approx_percentile_cont_with_weight\(Utf8, Int8, Float64\)'\. You might need to add explicit type casts\.(.|\n)*
but got error:
	DataFusion error: Error during planning: Error during planning: Coercion from [Utf8, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed.
[SQL] SELECT approx_percentile_cont_with_weight(c1, c2, 0.95) FROM aggregate_test_100
at test_files/aggregate.slt:79

The error message thrown in my local is difference from the CI environment... :(

I usually just remove the error message, although it is not an idea solution, but I don't know why the CI message is different from local too.

}
}
Self {
signature: Signature::one_of(variants, Volatility::Immutable),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to have Signature::one_of(vec![Numeric::(2), Numeric::(3)])?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found we can't use Signature::numeric. I tried to use it, but I discovered that all the input arguments will have the same type as the first argument. For example, in a SQL query like:

select approx_percentile_cont(float64_col, 0.5, 100) from t

the input arguments will be vec![Float64, Float64, Float64]. However, the last argument should be int64. This will cause a failure.

I'm not sure if it's related, but I found something weird. Why is AccumulatorArgs#input_type not a vec but a single datatype? If the number of arguments is greater than 1, how do we handle the other arguments?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is single type because we have not yet meet any function that takes multiple input to build accumulator. If there is, then we might need to extend it to Vec

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the input arguments will be vec![Float64, Float64, Float64]

Oh, yes. It will coerce types to the unify one.

&self,
args: AccumulatorArgs,
) -> datafusion_common::Result<ApproxPercentileAccumulator> {
let percentile = validate_input_percentile_expr(&args.args[1])?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we rename args.args to args.input_exprs? I think it would be more clear and consistent with the name of args.input_type

&self,
acc_args: AccumulatorArgs,
) -> datafusion_common::Result<Box<dyn Accumulator>> {
Ok(Box::new(self.create_accumulator(acc_args)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can inline the code of create_accumulator here

})
pub fn new() -> Self {
Self {
signature: Signature::one_of(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use Signature::Numeric(3)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"approx_percentile_cont_with_weight requires numeric weight input types"
);
}
if !arg_types[2].is_floating() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with signature::numeric(3), we can just check if the third arg is f64

&& self.column_expr.eq(&x.column_expr)
&& self.weight_expr.eq(&x.weight_expr)
&& self.percentile_expr.eq(&x.percentile_expr)
self.signature == x.signature
Copy link
Contributor

@jayzhan211 jayzhan211 Jun 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this, we can compare them with the function name

Comment on lines +499 to +501
// TODO: 'logical_exprs' is not supported for UDAF yet.
// approx_percentile_cont and approx_percentile_cont_weight are not supported for UDAF from protobuf yet.
let logical_exprs = &[];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need to care about logical plan conversion in physical plan roundtrip, but I'm not pretty sure.

@goldmedal
Copy link
Contributor Author

External error: statement is expected to fail with error:
(regex) DataFusion error: Error during planning: .* No function matches the given name and argument types
I usually just remove the error message, although it is not an idea solution, but I don't know why the CI message is different from local too.

I modified the expected pattern of error message to match the message thrown in CI flow. The full message is

        DataFusion error: Error during planning: Error during planning: Coercion from [Utf8, Int8, Float64] to the signature OneOf([Exact([Int8, Int8, Float64]), Exact([Int16, Int16, Float64]), Exact([Int32, Int32, Float64]), Exact([Int64, Int64, Float64]), Exact([UInt8, UInt8, Float64]), Exact([UInt16, UInt16, Float64]), Exact([UInt32, UInt32, Float64]), Exact([UInt64, UInt64, Float64]), Exact([Float32, Float32, Float64]), Exact([Float64, Float64, Float64])]) failed. No function matches the given name and argument types 'approx_percentile_cont_with_weight(Utf8, Int8, Float64)'. You might need to add explicit type casts.
        Candidate functions:
        approx_percentile_cont_with_weight(Int8, Int8, Float64)
        approx_percentile_cont_with_weight(Int16, Int16, Float64)
        approx_percentile_cont_with_weight(Int32, Int32, Float64)
        approx_percentile_cont_with_weight(Int64, Int64, Float64)
        approx_percentile_cont_with_weight(UInt8, UInt8, Float64)
        approx_percentile_cont_with_weight(UInt16, UInt16, Float64)
        approx_percentile_cont_with_weight(UInt32, UInt32, Float64)
        approx_percentile_cont_with_weight(UInt64, UInt64, Float64)
        approx_percentile_cont_with_weight(Float32, Float32, Float64)
        approx_percentile_cont_with_weight(Float64, Float64, Float64)

I noticed that the CI flow is missing the tail part of the error message: You might need to add explicit type casts..... There might be a length limitation for the error message, but I'm not sure. Just making a note of it here.

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@goldmedal
Copy link
Contributor Author

Thanks @jayzhan211 approved it. : )
If no more comments, how about merge this PR first? I guess it may cause some conflicts for other UDAF-related PRs.

@jayzhan211
Copy link
Contributor

Thanks @jayzhan211 approved it. : ) If no more comments, how about merge this PR first? I guess it may cause some conflicts for other UDAF-related PRs.

Sure!

@jayzhan211 jayzhan211 merged commit c884bdb into apache:main Jun 16, 2024
23 checks passed
@goldmedal goldmedal deleted the feature/conver-approx-percentile-udaf branch June 16, 2024 14:46
@goldmedal
Copy link
Contributor Author

Thanks again @jayzhan211 :)

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…AF (apache#10917)

* pass logical expr of arguments for udaf

* implement approx_percentile_cont udaf

* register udaf

* remove ApproxPercentileCont

* convert with_wegiht to udaf and remove original

* fix conflict

* fix compile check

* fix doc and testing

* evaluate args through physical plan

* public use Literal

* fix tests

* rollback the experimental tests

* remove unused import

* rename args and inline code

* remove unnecessary partial eq trait

* fix error message
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jul 24, 2024
… UDAF

Ref: approx_distinct apache/datafusion#10851
Ref: approx_median apache/datafusion#10840
Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jul 24, 2024
… UDAF

Ref: approx_distinct apache/datafusion#10851
Ref: approx_median apache/datafusion#10840
Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jul 25, 2024
… UDAF

Ref: approx_distinct apache/datafusion#10851
Ref: approx_median apache/datafusion#10840
Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Jul 31, 2024
* chore: update datafusion deps

* feat: impl ExecutionPlan::static_name() for DatasetExec

This required trait method was added upstream [0] and recommends to simply forward to `static_name`.

[0]: apache/datafusion#10266

* feat: update first_value and last_value wrappers.

Upstream signatures were changed for the new new `AggregateBuilder` api [0].

This simply gets the code to work. We should better incorporate that API into `datafusion-python`.

[0] apache/datafusion#10560

* migrate count to UDAF

Builtin Count was removed upstream.

TBD whether we want to re-implement `count_star` with new API.

Ref: apache/datafusion#10893

* migrate approx_percentile_cont, approx_distinct, and approx_median to UDAF

Ref: approx_distinct apache/datafusion#10851
Ref: approx_median apache/datafusion#10840
Ref: approx_percentile_cont and _with_weight apache/datafusion#10917

* migrate avg to UDAF

Ref: apache/datafusion#10964

* migrage corr to UDAF

Ref: apache/datafusion#10884

* migrate grouping to UDAF

Ref: apache/datafusion#10906

* add alias `mean` for UDAF `avg`

* migrate stddev to UDAF

Ref: apache/datafusion#10827

* remove rust alias for stddev

The python wrapper now provides stddev_samp alias.

* migrage var_pop to UDAF

Ref: apache/datafusion#10836

* migrate regr_* functions to UDAF

Ref: apache/datafusion#10898

* migrate bitwise functions to UDAF

The functions now take a single expression instead of a Vec<_>.

Ref: apache/datafusion#10930

* add missing variants for ScalarValue with todo

* fix typo in approx_percentile_cont

* add distinct arg to count

* comment out failing test

`approx_percentile_cont` is now returning a DoubleArray instead of an IntArray.

This may be a bug upstream; it requires further investigation.

* update tests to expect lowercase `sum` in query plans

This was changed upstream.

Ref: apache/datafusion#10831

* update ScalarType data_type map

* add docs dependency pickleshare

* re-implement count_star

* lint: ruff python lint

* lint: rust cargo fmt

* include name of window function in error for find_window_fn

* refactor `find_window_fn` for debug clarity

* search default aggregate functions by both name and aliases

The alias list no longer includes the name of the function.

Ref: apache/datafusion#10658

* fix markdown in find_window_fn docs

* parameterize test_window_functions

`first_value` and `last_value` are currently failing and marked as xfail.

* add test ids to test_simple_select tests marked xfail

* update find_window_fn to search built-ins first

The behavior of `first_value` and `last_value` UDAFs currently does not match the built-in behavior.
This allowed me to remove `marks=pytest.xfail` from the window tests.

* improve first_call and last_call use of the builder API

* remove trailing todos

* fix examples/substrait.py

* chore: remove explicit aliases from functions.rs

Ref: #779

* remove `array_fn!` aliases

* remove alias rules for `expr_fn_vec!`

* remove alias rules from `expr_fn!` macro

* remove unnecessary pyo3 var-arg signatures in functions.rs

* remove pyo3 signatures that provided defaults for first_value and last_value

* parametrize test_string_functions

* test regr_ function wrappers

Closes #778
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Convert ApproxPercentileCont and `ApproxPercentileContWithWeight to UDAF
2 participants