Move min and max to user defined aggregate function, remove `AggregateFunction` / `AggregateFunctionDefinition::BuiltIn` #11013

edmondop · 2024-06-19T18:52:33Z

Which issue does this PR close?

Closes #10943 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

datafusion/physical-expr/src/aggregate/build_in.rs

edmondop · 2024-06-20T20:04:35Z

I do have something that's starting to look reasonable, but some optimizer tests are now failing for reasons I can't understand:

running 1 test
test custom_sources_cases::optimizers_catch_all_statistics ... FAILED

successes:

successes:

failures:

---- custom_sources_cases::optimizers_catch_all_statistics stdout ----
thread 'custom_sources_cases::optimizers_catch_all_statistics' panicked at datafusion/core/tests/custom_sources_cases/mod.rs:274:5:
Expected aggregate_statistics optimizations missing: AggregateExec { mode: Single, group_by: PhysicalGroupBy { expr: [], null_expr: [], groups: [[]] }, aggr_expr: [AggregateFunctionExpr { fun: AggregateUDF { inner: Count { name: "COUNT", signature: Signature { type_signature: VariadicAny, volatility: Immutable } } }, args: [Literal { value: Int64(1) }], logical_args: [Literal(Int64(1))], data_type: Int64, name: "COUNT(*)", schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, sort_exprs: [], ordering_req: [], ignore_nulls: false, ordering_fields: [], is_distinct: false, input_type: Int64 }, AggregateFunctionExpr { fun: AggregateUDF { inner: Min { signature: Signature { type_signature: Numeric(1), volatility: Immutable }, aliases: ["min"] } }, args: [Column { name: "c1", index: 0 }], logical_args: [Column(Column { relation: Some(Bare { table: "test" }), name: "c1" })], data_type: Int32, name: "MIN(test.c1)", schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, sort_exprs: [], ordering_req: [], ignore_nulls: false, ordering_fields: [], is_distinct: false, input_type: Int32 }, AggregateFunctionExpr { fun: AggregateUDF { inner: Max { signature: Signature { type_signature: Numeric(1), volatility: Immutable }, aliases: ["max"] } }, args: [Column { name: "c1", index: 0 }], logical_args: [Column(Column { relation: Some(Bare { table: "test" }), name: "c1" })], data_type: Int32, name: "MAX(test.c1)", schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, sort_exprs: [], ordering_req: [], ignore_nulls: false, ordering_fields: [], is_distinct: false, input_type: Int32 }], filter_expr: [None, None, None], limit: None, input: CustomExecutionPlan { projection: Some([0]), cache: PlanProperties { 
eq_properties: EquivalenceProperties { eq_group: EquivalenceGroup { classes: [] }, oeq_class: OrderingEquivalenceClass { orderings: [] }, constants: [], schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} } }, partitioning: UnknownPartitioning(1), execution_mode: Bounded, output_ordering: None } }, schema: Schema { fields: [Field { name: "COUNT(*)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MIN(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MAX(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, input_schema: Schema { fields: [Field { name: "c1", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, metrics: ExecutionPlanMetricsSet { inner: Mutex { data: MetricsSet { metrics: [] } } }, required_input_ordering: None, input_order_mode: Linear, cache: PlanProperties { eq_properties: EquivalenceProperties { eq_group: EquivalenceGroup { classes: [] }, oeq_class: OrderingEquivalenceClass { orderings: [] }, constants: [], schema: Schema { fields: [Field { name: "COUNT(*)", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MIN(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "MAX(test.c1)", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} } }, partitioning: UnknownPartitioning(1), execution_mode: Bounded, output_ordering: None } }

jayzhan211 · 2024-06-21T02:29:48Z

I guess you skip the aggregate statistic optimization for min/max

datafusion/datafusion/core/src/physical_optimizer/aggregate_statistics.rs

Lines 177 to 224 in 18042fd

    
    fn take_optimizable_min(
        agg_expr: &dyn AggregateExpr,
        stats: &Statistics,
    ) -> Option<(ScalarValue, String)> {
        if let Precision::Exact(num_rows) = &stats.num_rows {
            match *num_rows {
                0 => {
                    // MIN/MAX with 0 rows is always null
                    if let Some(casted_expr) =
                        agg_expr.as_any().downcast_ref::<expressions::Min>()
                    {
                        if let Ok(min_data_type) =
                            ScalarValue::try_from(casted_expr.field().unwrap().data_type())
                        {
                            return Some((min_data_type, casted_expr.name().to_string()));
                        }
                    }
                }
                value if value > 0 => {
                    let col_stats = &stats.column_statistics;
                    if let Some(casted_expr) =
                        agg_expr.as_any().downcast_ref::<expressions::Min>()
                    {
                        if casted_expr.expressions().len() == 1 {
                            // TODO optimize with exprs other than Column
                            if let Some(col_expr) = casted_expr.expressions()[0]
                                .as_any()
                                .downcast_ref::<expressions::Column>()
                            {
                                if let Precision::Exact(val) =
                                    &col_stats[col_expr.index()].min_value
                                {
                                    if !val.is_null() {
                                        return Some((
                                            val.clone(),
                                            casted_expr.name().to_string(),
                                        ));
                                    }
                                }
                            }
                        }
                    }
                }
                _ => {}
            }
        }
        None
    }

You might need to check if the AggregateExpr is min/max in take_optimizable_min and take_optimizable_max
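
One way to read this suggestion: once MIN/MAX are wrapped in an `AggregateUDF`, a `downcast_ref::<expressions::Min>()` check no longer matches, so the optimization quietly returns `None`. The following is only a sketch of that failure mode with hypothetical stand-in types (`AggExpr`, `BuiltInMin`, `UdafExpr`, `is_min_by_type`, and `is_min_udaf` are illustrative names, not DataFusion's API):

```rust
use std::any::Any;

/// Hypothetical stand-in for a physical aggregate expression trait.
trait AggExpr {
    fn as_any(&self) -> &dyn Any;
}

/// Stand-in for the old built-in MIN expression type.
struct BuiltInMin;

/// Stand-in for a UDAF-backed expression: every function shares this
/// concrete type, and only the function name tells them apart.
struct UdafExpr {
    fun_name: String,
}

impl AggExpr for BuiltInMin {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

impl AggExpr for UdafExpr {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Old-style check: matches only the built-in concrete type.
fn is_min_by_type(expr: &dyn AggExpr) -> bool {
    expr.as_any().downcast_ref::<BuiltInMin>().is_some()
}

/// Illustrative replacement: downcast to the UDAF wrapper, then compare names.
fn is_min_udaf(expr: &dyn AggExpr) -> bool {
    expr.as_any()
        .downcast_ref::<UdafExpr>()
        .map(|e| e.fun_name.eq_ignore_ascii_case("min"))
        .unwrap_or(false)
}
```

With these stand-ins, `is_min_by_type` returns `false` for a `UdafExpr` named "MIN", which mirrors the missing optimization in the failing test. A name comparison is the simplest fix, though it is not very robust.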

edmondop · 2024-06-21T16:02:24Z

I fixed this but now I have a test that doesn't pass on the optimizer (there are two actually)

---- single_distinct_to_groupby::tests::two_distinct_and_one_common stdout ----
thread 'single_distinct_to_groupby::tests::two_distinct_and_one_common' panicked at datafusion/optimizer/src/test/mod.rs:200:5:
assertion `left == right` failed
  left: "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n  Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n    Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"
 right: "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias1) AS MAX(DISTINCT test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(DISTINCT test.b):UInt32;N]\n  Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias1)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias1):UInt32;N]\n    Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2]] [a:UInt32, alias1:UInt32, alias2:UInt64;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"

That suggests that the optimizer cannot use, or doesn't understand, the existing aliases that provide DISTINCT test.b. I'm still looking into it; any tips would be highly appreciated.

jayzhan211 · 2024-06-22T00:11:02Z

I think we should add distinct for MIN/MAX so we can get the distinct after group by is converted to distinct function

But I think there is no difference between MIN and Distinct Min, maybe we could remove distinct for MIN/MAX beforehand? Introduce EliminateDistinct optimize rule for MIN/MAX.
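
The claim that DISTINCT does not change MIN/MAX can be checked with a small sketch. The helper names below (`min_all`, `min_distinct`, `max_all`, `max_distinct`) are hypothetical and work on plain slices rather than on DataFusion accumulators:

```rust
use std::collections::BTreeSet;

/// MIN over every value, duplicates included.
fn min_all(values: &[u32]) -> Option<u32> {
    values.iter().copied().min()
}

/// MIN over the distinct values only, i.e. MIN(DISTINCT x).
fn min_distinct(values: &[u32]) -> Option<u32> {
    let distinct: BTreeSet<u32> = values.iter().copied().collect();
    distinct.into_iter().min()
}

/// MAX over every value, duplicates included.
fn max_all(values: &[u32]) -> Option<u32> {
    values.iter().copied().max()
}

/// MAX over the distinct values only, i.e. MAX(DISTINCT x).
fn max_distinct(values: &[u32]) -> Option<u32> {
    let distinct: BTreeSet<u32> = values.iter().copied().collect();
    distinct.into_iter().max()
}
```

Removing duplicates never removes the smallest or largest element, so each pair of functions agrees on every input, including the empty one.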

edmondop · 2024-06-22T01:04:52Z

Is this a part of the optimizer, i.e. https://github.com/edmondop/arrow-datafusion/blob/main/datafusion/optimizer/src/replace_distinct_aggregate.rs ? Thank you for your help, btw.

jayzhan211 · 2024-06-22T01:43:00Z

I don't think so, Distinct/Distinct On is different from distinct in the function.

edmondop · 2024-06-23T20:27:08Z

@jayzhan211 I have started experimenting with an optimizer rule, but removing the distinct results in the following error:

running 2 tests
test eliminate_distinct::tests::eliminate_distinct_from_min_expr ... FAILED
test eliminate_nested_union::tests::eliminate_distinct_nothing ... ok

failures:

---- eliminate_distinct::tests::eliminate_distinct_from_min_expr stdout ----
Transformed yes true
Error: Context("Optimizer rule 'eliminate_distinct' failed", Context("eliminate_distinct", Internal("Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: \"a\", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"MIN(DISTINCT test.b)\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: \"test\" }), None], functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1], nullable: false, mode: Single }] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: \"a\", data_type: UInt32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"MIN(test.b)\", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [Some(Bare { table: \"test\" }), None], functional_dependencies: FunctionalDependencies { deps: [FunctionalDependence { source_indices: [0], target_indices: [0, 1], nullable: false, mode: Single }] } }")))

Do I also need to change the equivalence rules?

jayzhan211 · 2024-06-23T23:59:37Z

eliminate_distinct_from_min_expr

You can take single_distinct_to_groupby as a reference; it adds an alias so that the output schema stays equivalent.
Also, I suggest we introduce this rule in another PR, not mixing this with MIN/MAX UDAF.
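
The schema error reported above comes from the output column being renamed from `MIN(DISTINCT test.b)` to `MIN(test.b)`. A minimal sketch of the alias approach follows, using a hypothetical toy `Expr` enum (not DataFusion's `Expr`) and an illustrative `eliminate_distinct` function:

```rust
/// Toy expression tree, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Agg { func: String, distinct: bool, arg: Box<Expr> },
    Alias(Box<Expr>, String),
}

impl Expr {
    /// Name the expression would get as an output column.
    fn display_name(&self) -> String {
        match self {
            Expr::Column(name) => name.clone(),
            Expr::Agg { func, distinct, arg } => {
                let prefix = if *distinct { "DISTINCT " } else { "" };
                format!("{}({}{})", func, prefix, arg.display_name())
            }
            Expr::Alias(_, name) => name.clone(),
        }
    }
}

/// Drop DISTINCT from MIN/MAX, then alias the result back to the
/// original name so the output schema is unchanged.
fn eliminate_distinct(expr: Expr) -> Expr {
    match expr {
        Expr::Agg { func, distinct: true, arg } if func == "MIN" || func == "MAX" => {
            let original_name = format!("{}(DISTINCT {})", func, arg.display_name());
            let rewritten = Expr::Agg { func, distinct: false, arg };
            Expr::Alias(Box::new(rewritten), original_name)
        }
        other => other,
    }
}
```

In this sketch the rewritten aggregate is `MIN(test.b)`, but the column it produces is still called `MIN(DISTINCT test.b)`, which is the property the schema check is enforcing.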

edmondop · 2024-06-24T00:12:59Z

Thanks. I guess I wasn't clear in my comment here #11013 (comment). How should that test failure be addressed? It seems that the min/max UDAF uses other aliases and does not reuse the intermediate results that are already available.

jayzhan211 · 2024-06-24T06:56:15Z

If we eliminate DISTINCT from min/max before single_distinct_to_group_by runs, we don't expect to see a distinct min/max at this point, so we should rewrite the test to use another function, such as sum.

edmondop · 2024-06-24T11:39:08Z

---- single_distinct_to_groupby::tests::two_distinct_and_one_common

Wouldn't eliminating it require the optimizer rule? Or do you suggest I update the test case? Or the expected value?

jayzhan211 · 2024-06-24T11:56:53Z

Yes, I suggest we update the test like this:

    #[test]
    fn one_distinct_and_two_common() -> Result<()> {
        let table_scan = test_table_scan()?;

        let plan = LogicalPlanBuilder::from(table_scan)
            .aggregate(
                vec![col("a")],
                vec![sum(col("c")), count_distinct(col("b")), max(col("b"))],
            )?
            .build()?;
        // Should work
        let expected = "Projection: test.a, sum(alias2) AS sum(test.c), COUNT(alias1) AS COUNT(DISTINCT test.b), MAX(alias3) AS MAX(test.b) [a:UInt32, sum(test.c):UInt64;N, COUNT(DISTINCT test.b):Int64;N, MAX(test.b):UInt32;N]\n  Aggregate: groupBy=[[test.a]], aggr=[[sum(alias2), COUNT(alias1), MAX(alias3)]] [a:UInt32, sum(alias2):UInt64;N, COUNT(alias1):Int64;N, MAX(alias3):UInt32;N]\n    Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]";

        assert_optimized_plan_equal(plan, expected)
    }

edmondop · 2024-06-24T12:56:30Z

There seems to be a column added to the Aggregate node in the logical plan, can that affect performance and/or memory footprint? This was the reason why I didn't update the test in the first place

This is a subset of the new plan

aggr=[[sum(test.c) AS alias2, MAX(test.b) AS alias3]] [a:UInt32, alias1:UInt32, alias2:UInt64;N, alias3:UInt32;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"

while this is the subset from the previous plan

Aggregate: groupBy=[[test.a, test.b AS alias1]], aggr=[[sum(test.c) AS alias2]] [a:UInt32, alias1:UInt32, alias2:UInt64;N]\n      TableScan: test [a:UInt32, b:UInt32, c:UInt32]"

There is an alias3:UInt32 column that gets added.

jayzhan211 · 2024-06-24T14:14:41Z

If you remove the Min/Max matching in is_single_distinct_agg, the extra alias goes away.

datafusion/optimizer/src/eliminate_distinct.rs

alamb

Thank you so much @edmondop -- I took a look at this PR and I think in general it is quite close.

It needs:

to remove the old min/max implementation in https://github.com/apache/datafusion/blob/5bb6b356277ea1c6f1d7af64e2d66f005d7e1ed4/datafusion/physical-expr/src/aggregate/min_max.rs
resolve some merge conflicts

There is also a follow on issue / PR I would like to make regarding the optimizer check

Given this PR has hung out for a while and has some merge conflicts now I am going to try and help polish it up

datafusion/expr/src/test/function_stub.rs

datafusion/core/src/physical_optimizer/aggregate_statistics.rs

edmondop · 2024-06-27T20:52:08Z

I think as long as you can explain to me how to resolve the current test failure I should be fine. I agree that using names for min and max unwrapping is not very robust.

Working out the diffs

edmondop · 2024-08-01T14:16:12Z

@jayzhan211 FYI, all the tests in scalar subquery failed with the stubs, so I restored the import that uses the real implementation.

jayzhan211 · 2024-08-01T14:27:00Z

wait, we should use stubs instead of the real function, otherwise we import the functions-aggregate dependency

jayzhan211 · 2024-08-01T14:40:13Z

You use the name "min" in Max, and "max" in Min for stubs

edmondop · 2024-08-01T14:44:37Z

wait, we should use stubs instead of the real function, otherwise we import the functions-aggregate dependency

It is already available; I didn't have to do anything or update the Cargo manifest.

jayzhan211 · 2024-08-01T15:07:04Z

I see. We don't need stubs.

edmondop · 2024-08-01T15:28:21Z

Do you expect any additional changes? I fixed the stub function name anyway.

jayzhan211

👍

jayzhan211 · 2024-08-02T00:18:36Z

datafusion/functions-aggregate/src/min_max.rs

-            self.nullable,
-        ))
+    fn name(&self) -> &str {
+        "MIN"


Follow up PR is to lowercase the name

I logged this as #11779. I am not sure whether we should lowercase this or make the other UDFs uppercase.
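
The debug output earlier in this thread shows `Min` carrying an uppercase name together with `aliases: ["min"]`. As a rough sketch of why both spellings can resolve in that setup, here is a hypothetical registry keyed by name and alias (`Udaf`, `build_registry`, and `resolve` are illustrative names, not DataFusion's function registry API):

```rust
use std::collections::HashMap;

/// Hypothetical description of an aggregate function: a primary name
/// plus any number of aliases.
struct Udaf {
    name: &'static str,
    aliases: &'static [&'static str],
}

/// Map every name and alias to the function's primary name.
fn build_registry(udafs: &[Udaf]) -> HashMap<&'static str, &'static str> {
    let mut registry = HashMap::new();
    for udaf in udafs {
        registry.insert(udaf.name, udaf.name);
        for alias in udaf.aliases {
            registry.insert(*alias, udaf.name);
        }
    }
    registry
}

/// Exact-match lookup: only registered spellings resolve.
fn resolve(
    registry: &HashMap<&'static str, &'static str>,
    name: &str,
) -> Option<&'static str> {
    registry.get(name).copied()
}
```

Under this assumption, "MIN" and "min" both resolve while "Min" does not, which is one reason a single naming convention across UDFs is easier to reason about.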

jayzhan211 · 2024-08-02T02:34:58Z

datafusion/expr/src/test/function_stub.rs

+
+impl std::fmt::Debug for Max {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        f.debug_struct("Min")


this should be "Max"
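
For illustration, the corrected shape would look roughly like the sketch below. The `Min` and `Max` structs here are simplified, hypothetical stand-ins for the stubs in function_stub.rs (the real stubs carry more state), and the lowercase names are an assumption based on the review comments in this thread:

```rust
use std::fmt;

/// Simplified stand-in for the MIN stub.
struct Min;

/// Simplified stand-in for the MAX stub.
struct Max;

impl Min {
    fn name(&self) -> &str {
        "min"
    }
}

impl Max {
    fn name(&self) -> &str {
        "max"
    }
}

impl fmt::Debug for Min {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Min").field("name", &self.name()).finish()
    }
}

impl fmt::Debug for Max {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // The struct label must match the type, not its sibling.
        f.debug_struct("Max").field("name", &self.name()).finish()
    }
}
```

A small assertion that the debug output of each stub starts with its own type name would catch this kind of copy-paste slip.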

alamb

Thank you @edmondop and @jayzhan211 -- this is epic work and I think the codebase is that much better for it 🙏

After this PR is merged I think we can also remove the AggregateFunctionDefinition enum, which now only has a single variant, as a follow-on cleanup

https://github.com/edmondop/arrow-datafusion/blob/cbdc71485fe1be5fa8b120653db3286f04362a4f/datafusion/expr/src/expr.rs#L632-L631

alamb · 2024-08-03T10:48:23Z

datafusion/proto/gen/src/main.rs

+    println!(
+        "Copying {} to {}",
+        prost.clone().display(),
+        proto_dir.join("src/generated/prost.rs").display()
+    );


I think it is fine to print out some status reporting while regenerating protos 👍

alamb · 2024-08-03T10:49:17Z

datafusion/proto/proto/datafusion.proto

@@ -466,51 +464,6 @@ message InListNode {
  bool negated = 3;
 }

-enum AggregateFunction {


alamb · 2024-08-03T10:53:14Z

datafusion/proto/src/physical_plan/mod.rs

@@ -477,30 +477,10 @@ impl AsExecutionPlan for protobuf::PhysicalPlanNode {
                            ExprType::AggregateExpr(agg_node) => {
                                let input_phy_expr: Vec<Arc<dyn PhysicalExpr>> = agg_node.expr.iter()
                                    .map(|e| parse_physical_expr(e, registry, &physical_schema, extension_codec)).collect::<Result<Vec<_>>>()?;
-                                let ordering_req: Vec<PhysicalSortExpr> = agg_node.ordering_req.iter()
+                                let _ordering_req: Vec<PhysicalSortExpr> = agg_node.ordering_req.iter()


this is an interesting change -- does it mean ordering is not carried into the udf?

Or maybe it is redundant and now is entirely determined by the udf

Ordering is not supported for UDAFs yet. I left a TODO comment below.

Thanks

Filed #11804 / #11805 to track

jayzhan211 · 2024-08-03T11:17:41Z

TODO:

MIN/MAX lowercase name
Remove AggregateFunctionDefinition enum
Support ordering for UDAF in proto

jayzhan211 · 2024-08-03T11:18:06Z

Thanks @edmondop and @alamb

alamb · 2024-08-03T11:25:42Z

TODO:

@jayzhan211 should we file tickets to track this work? (I think @edmondop has a PR for the lowercase name.) The AggregateFunctionDefinition removal would be a good "Good First Issue" ticket.

I am happy to file the tickets if you would like

jayzhan211 · 2024-08-04T00:18:56Z

Sure, thanks

alamb · 2024-08-04T11:34:57Z

MIN/MAX lowercase name

I think this is tracked by #11779 (and @edmondop has a PR for it #11795)

Remove AggregateFunctionDefinition enum

Turns out @lewiszlw already has a PR for this one: #11803 (review)

Support ordering for UDAF in proto

Filed #11804 and will add a ticket link to the code

alamb · 2024-08-04T11:39:26Z

#11805 <-- comment PR

github-actions bot added the sql, logical-expr, physical-expr, optimizer, and core labels Jun 19, 2024

edmondop force-pushed the issue-10943 branch from d6cf206 to 552f52b Compare June 19, 2024 18:54

jayzhan211 reviewed Jun 19, 2024

datafusion/physical-expr/src/aggregate/build_in.rs

github-actions bot removed the sql (SQL Planner) label Jun 20, 2024

edmondop requested a review from jayzhan211 June 23, 2024 20:29

edmondop marked this pull request as ready for review June 23, 2024 20:29

jayzhan211 reviewed Jun 24, 2024

datafusion/optimizer/src/eliminate_distinct.rs (outdated, resolved)

alamb previously approved these changes Jun 27, 2024

datafusion/expr/src/test/function_stub.rs (outdated, resolved)

datafusion/expr/src/test/function_stub.rs (outdated, resolved)

datafusion/core/src/physical_optimizer/aggregate_statistics.rs (outdated, resolved)

alamb changed the title ~~Moving min and max to new API and removing from protobuf~~ Moving min and max to user defined aggregate function Jun 27, 2024

alamb mentioned this pull request Jun 27, 2024

Alamb/merge resolve edmondop/arrow-datafusion#1

Closed

edmondop requested a review from jayzhan211 August 1, 2024 14:15

jayzhan211 approved these changes Aug 2, 2024

jayzhan211 reviewed Aug 2, 2024

edmondop force-pushed the issue-10943 branch from 9794e89 to 706d6e8 Compare August 2, 2024 12:50

edmondop added 2 commits August 2, 2024 08:54

Fixed wrong name

3884faa

Fixing name

7ec7170

edmondop force-pushed the issue-10943 branch from b075bbb to 7ec7170 Compare August 2, 2024 12:54

edmondop mentioned this pull request Aug 2, 2024

Align UDF names to lowercase or uppercase #11779

Closed

alamb mentioned this pull request Aug 2, 2024

DataFusion weekly project plan (Andrew Lamb) - July 29, 2024 #11710

Closed

8 tasks

alamb changed the title ~~Move min and max to user defined aggregate function~~ Move min and max to user defined aggregate function, remove AggregateFunction Aug 3, 2024

alamb changed the title ~~Move min and max to user defined aggregate function, remove AggregateFunction~~ Move min and max to user defined aggregate function, remove AggregateFunction / AggregateFunctionDefinition::BuiltIn Aug 3, 2024

alamb approved these changes Aug 3, 2024


jayzhan211 merged commit f4e519f into apache:main Aug 3, 2024
29 checks passed

alamb mentioned this pull request Aug 4, 2024

Implement ordering serialization for AggregateUdf in datafusion-proto #11804

Closed

alamb mentioned this pull request Aug 5, 2024

DataFusion weekly project plan (Andrew Lamb) - Aug 5, 2024 #11826

Closed

6 tasks

edmondop mentioned this pull request Aug 28, 2024

Confirming UDF aliases are serialized correctly #12219

Merged

jcsherin mentioned this pull request Sep 1, 2024

NthValue UDAF removed in v41 #12278

Closed

This pull request was closed.
