Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate OrderSensitiveArrayAgg to be a user defined aggregate #11564

Merged
merged 8 commits into from
Jul 23, 2024

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jul 20, 2024

Which issue does this PR close?

Part of #8708

Rationale for this change

Migrating DataFusion so there is no distinction between built in and user defined functions

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Jul 20, 2024
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jul 20, 2024
Signed-off-by: jayzhan211 <[email protected]>
limited_convert_logical_expr_to_physical_expr_with_dfschema(expr, dfschema)?,
),
Expr::Column(col) => {
let idx = dfschema.index_of_column(col)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that I need dfschema in order to get the correct field, so I change all the UDAF to create_aggregate_expr_with_dfschema which is used internally for UDAF only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After removing limited_convert_logical_expr_to_physical_expr_with_dfschema, I think we don't need DFSchema anymore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does that mean limited_convert_logical_expr_to_physical_expr_with_dfschema will be temporary?

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>

#[allow(clippy::too_many_arguments)]
// This is not for external usage, consider creating with `create_aggregate_expr` instead.
pub fn create_aggregate_expr_with_dfschema(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first/last and ordered_array_agg are switched to this function. After #11359 , I think we can have a single create_aggregate_expr again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make we can also make a builder to make using this function easier (e.g. it might be easy to swap one of the last three bools by accident)

Copy link
Contributor Author

@jayzhan211 jayzhan211 Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I already filed the ticket in #11543.

@jayzhan211 jayzhan211 marked this pull request as ready for review July 22, 2024 10:03
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- we are getting very close 🤞

I do think some of the APIs are getting a little clumsy but perhaps we can refine them over time as follow on PRs (e.g. with builders, etc)

@@ -57,6 +57,9 @@ pub struct AccumulatorArgs<'a> {
/// The schema of the input arguments
pub schema: &'a Schema,

/// The schema of the input arguments
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I view this addition as basically making the UDAF framework as full featured as the built in one is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we will have either dfschema or schema at the end 🤔

///
/// `is_reversed` is used to indicate whether the aggregation is running in reverse order,
/// it could be used to hint Accumulator to accumulate in the reversed order,
/// you can just set to false if you are not reversing expression
#[allow(clippy::too_many_arguments)]
pub fn create_aggregate_expr(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the number of arguments it is getting, maybe it would be good to define a builder like the following:

let builder = AggregateExprBuilder::new(fun),
  .with_input_phyiscal_exprs(input_phy_exprs)
  ...
  .with_name(name)
  .build()?

🤔


#[allow(clippy::too_many_arguments)]
// This is not for external usage, consider creating with `create_aggregate_expr` instead.
pub fn create_aggregate_expr_with_dfschema(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make we can also make a builder to make using this function easier (e.g. it might be easy to swap one of the last three bools by accident)

@@ -495,18 +573,23 @@ impl AggregateExpr for AggregateFunctionExpr {
})
.collect::<Vec<_>>();
let mut name = self.name().to_string();
replace_order_by_clause(&mut name);
// TODO: Generalize order-by clause rewrite
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we track this in a follow on ticket?

limited_convert_logical_expr_to_physical_expr_with_dfschema(expr, dfschema)?,
),
Expr::Column(col) => {
let idx = dfschema.index_of_column(col)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So does that mean limited_convert_logical_expr_to_physical_expr_with_dfschema will be temporary?

@alamb alamb changed the title Move OrderSensitiveArrayAgg to UDAF Migrate OrderSensitiveArrayAgg to be a user defined aggregate Jul 22, 2024
@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 22, 2024

limited_convert_logical_expr_to_physical_expr_with_dfschema and limited_convert_logical_expr_to_physical_expr are both temporary until we refactor #11359

@jayzhan211
Copy link
Contributor Author

jayzhan211 commented Jul 23, 2024

TODO:

  1. Remove ArrayAgg Builtin definition and lowercase the name
  2. Generalize order-by clause rewrite
  3. Expr Builder (not sure before the refactor or after 🤔 )

@jayzhan211
Copy link
Contributor Author

Thanks @alamb !

@jayzhan211 jayzhan211 merged commit d941dc3 into apache:main Jul 23, 2024
26 checks passed
@jayzhan211 jayzhan211 deleted the array-udf-order branch July 23, 2024 00:02
@alamb
Copy link
Contributor

alamb commented Jul 23, 2024

Thanks @alamb !

Thank you -- it is so close I can feel it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants