Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert approx_median to UDAF #10838

Closed
jayzhan211 opened this issue Jun 9, 2024 · 7 comments · Fixed by #10840
Closed

Convert approx_median to UDAF #10838

jayzhan211 opened this issue Jun 9, 2024 · 7 comments · Fixed by #10840
Assignees
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@jayzhan211
Copy link
Contributor

Is your feature request related to a problem or challenge?

Similar to #10713 #10836 #10834 and others in #8708

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@jayzhan211 jayzhan211 added enhancement New feature or request good first issue Good for newcomers labels Jun 9, 2024
@goldmedal
Copy link
Contributor

take

@goldmedal
Copy link
Contributor

Hi @jayzhan211,

After some research, I found that approx_median heavily relies on ApproxPercentileCont for accumulation.

https://github.com/goldmedal/datafusion/blob/1600a307f272f9c716131dcca7a55ba1761774a6/datafusion/physical-expr/src/aggregate/approx_median.rs#L46

I think we should convert ApproxPercentileCont to a UDAF first. We can move ApproxPercentileCont and TDigest to function-aggregate, so that ApproxMedian can use them.

WDYT?

@jayzhan211
Copy link
Contributor Author

approx_median

I agree, we need to convert ApproxPercentileCont first. For TDigest, we should move it to datafusion/physical-expr-common/src/aggregate/mod.rs where it can used not only for functions-aggregate but also other users that import physical-expr-common

@goldmedal
Copy link
Contributor

goldmedal commented Jun 9, 2024

I agree, we need to convert ApproxPercentileCont first. For TDigest, we should move it to datafusion/physical-expr-common/src/aggregate/mod.rs where it can used not only for functions-aggregate but also other users that import physical-expr-common

Thanks. I will have a PR for it first.

@goldmedal
Copy link
Contributor

Hi @jayzhan211
I tried to convert ApproxPercentileCont, but I encountered some challenges. ApproxPercentileCont creates the accumulator based on the percentile parameter, but I couldn't find a way to access this parameter when defining the UDAF.
https://github.com/goldmedal/datafusion/blob/24a08465e12bc07275cafe5310a7ac44898e39de/datafusion/physical-expr/src/aggregate/approx_percentile_cont.rs#L104

The AccumulatorArgs doesn't carry the original argument expression, making it difficult to create the accumulator for ApproxPercentileCont.
https://github.com/goldmedal/datafusion/blob/24a08465e12bc07275cafe5310a7ac44898e39de/datafusion/expr/src/function.rs#L45

I think we need another PR to address this issue. For now, I just moved TDigest and ApproxPercentileAccumulator in #10840 so that I can convert approx_median to a UDAF.

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jun 9, 2024

Hi @jayzhan211 I tried to convert ApproxPercentileCont, but I encountered some challenges. ApproxPercentileCont creates the accumulator based on the percentile parameter, but I couldn't find a way to access this parameter when defining the UDAF. https://github.com/goldmedal/datafusion/blob/24a08465e12bc07275cafe5310a7ac44898e39de/datafusion/physical-expr/src/aggregate/approx_percentile_cont.rs#L104

The AccumulatorArgs doesn't carry the original argument expression, making it difficult to create the accumulator for ApproxPercentileCont. https://github.com/goldmedal/datafusion/blob/24a08465e12bc07275cafe5310a7ac44898e39de/datafusion/expr/src/function.rs#L45

I think we need another PR to address this issue. For now, I just moved TDigest and ApproxPercentileAccumulator in #10840 so that I can convert approx_median to a UDAF.

We can add self.args to AccumulatorArgs and compute percentile and tdigest max size in accumulator.

fn create_accumulator(&self) -> Result<Box<dyn Accumulator>> {
let acc_args = AccumulatorArgs {
data_type: &self.data_type,
schema: &self.schema,
ignore_nulls: self.ignore_nulls,
sort_exprs: &self.sort_exprs,
is_distinct: self.is_distinct,
input_type: &self.input_type,
args_num: self.args.len(),
name: &self.name,
};
self.fun.accumulator(acc_args)
}

@goldmedal
Copy link
Contributor

We can add self.args to AccumulatorArgs and compute percentile and tdigest max size in accumulator.

Thanks for the suggestion. I can make a PR after #10840.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants