Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce reverse_expr for UDAF #10214

Closed
wants to merge 2 commits into from

Conversation

jayzhan211
Copy link
Contributor

Which issue does this PR close?

Closes #.
Part of #10091

Rationale for this change

reverse_expr has a different meaning between returning clone and None. Together with reversing, there are three cases.
I think the proposed Enum makes the user understand what they can do in reverse_expr.

What changes are included in this PR?

Are these changes tested?

I added a simple UDAF for testing. I want to avoid adding too many other things just for test only because most of them can be covered after moving UDAF from built-in.

Are there any user-facing changes?

Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Apr 24, 2024
Signed-off-by: jayzhan211 <[email protected]>
/// Typically the "reverse" expression is itself (e.g. SUM, COUNT).
/// For aggregates that do not support calculation in reverse,
/// returns None (which is the default value).
fn reverse_expr(&self) -> ReversedExpr {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to take the owned type, but

function arguments must have a statically known size, borrowed types always have a known size: `&`

)?;

// TODO: We don't have a nice way to test the change without introducing many other things
// We check with the output string. `ignore nulls` is expeceted to be false.
Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can even remove fn test_reverse_udaf if first/last are moved to UDAF based.

@jayzhan211 jayzhan211 marked this pull request as ready for review April 24, 2024 14:11
@alamb
Copy link
Contributor

alamb commented Apr 25, 2024

Thanks @jayzhan211 -- I'll check this out tomorrow morning

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thnaks @jayzhan211 -- looking neat. I had some questions about the actual API though

/// The expression is the same as the original expression, like SUM, COUNT
Identical,
/// The expression does not support reverse calculation, like ArrayAgg
NotSupported,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spent some time trying to understand this -- is the reason that ArrayAgg doesn't support reversed calculations that the array elements would be in a different order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because ArrayAgg has no ordering info.

You can also take a look at this comment #9972 (comment)

Although I found that the counter-example might not be correct after a few days, ARRAY_AGG(b ORDER BY c DESC) is OrderSensitiveArrayAgg, so it is possible to produce reverse expr, but it still makes sense to me that order insensitive ArrayAgg (which is ArrayAgg) does not support reverse_expr. cc @mustafasrepo

/// The expression does not support reverse calculation, like ArrayAgg
NotSupported,
/// The expression is different from the original expression
Reversed(AggregateFunction),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about returning an AggregateFunction here -- this method is part of AggregateUDFImpl trait, but AggregateFunction has arguments and other fields that I don't think will be accessable to an instance of AggregateUDFImpl.

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
pub struct AggregateFunction {
/// Name of the function
pub func_def: AggregateFunctionDefinition,
/// List of expressions to feed to the functions as arguments
pub args: Vec<Expr>,
/// Whether this is a DISTINCT aggregation or not
pub distinct: bool,
/// Optional filter
pub filter: Option<Box<Expr>>,
/// Optional ordering
pub order_by: Option<Vec<Expr>>,
pub null_treatment: Option<NullTreatment>,
}

Would it make more sense to return an AggregateUDF here:

/// Logical representation of a user-defined [aggregate function] (UDAF).
///
/// An aggregate function combines the values from multiple input rows
/// into a single output "aggregate" (summary) row. It is different
/// from a scalar function because it is stateful across batches. User
/// defined aggregate functions can be used as normal SQL aggregate
/// functions (`GROUP BY` clause) as well as window functions (`OVER`
/// clause).
///
/// `AggregateUDF` provides DataFusion the information needed to plan and call
/// aggregate functions, including name, type information, and a factory
/// function to create an [`Accumulator`] instance, to perform the actual
/// aggregation.
///
/// For more information, please see [the examples]:
///
/// 1. For simple use cases, use [`create_udaf`] (examples in [`simple_udaf.rs`]).
///
/// 2. For advanced use cases, use [`AggregateUDFImpl`] which provides full API
/// access (examples in [`advanced_udaf.rs`]).
///
/// # API Note
/// This is a separate struct from `AggregateUDFImpl` to maintain backwards
/// compatibility with the older API.
///
/// [the examples]: https://github.com/apache/datafusion/tree/main/datafusion-examples#single-process
/// [aggregate function]: https://en.wikipedia.org/wiki/Aggregate_function
/// [`Accumulator`]: crate::Accumulator
/// [`create_udaf`]: crate::expr_fn::create_udaf
/// [`simple_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udaf.rs
/// [`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs
#[derive(Debug, Clone)]
pub struct AggregateUDF {
inner: Arc<dyn AggregateUDFImpl>,
}
here instead?

Maybe it would help guide the API design to implement this API for first_value/last_value udafs 🤔

Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the reverse expr in OrderSensitiveArrayAgg

fn reverse_expr(&self) -> Option<Arc<dyn AggregateExpr>> {
        Some(Arc::new(Self {
            name: self.name.to_string(),
            input_data_type: self.input_data_type.clone(),
            expr: self.expr.clone(),
            nullable: self.nullable,
            order_by_data_types: self.order_by_data_types.clone(),
            // Reverse requirement:
            ordering_req: reverse_order_bys(&self.ordering_req),
            reverse: !self.reverse,
        }))
    }

We need the ordering info, it is not contained in either AggregateUDF or AggregateUDFImpl function. Therefore, I return AggregateFunction instead. Then, we can create AggregateExec with create_physical_expr to get the physical-expr from these logical exprs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I wonder why does the AggregateUDFImpl need to actually reverse the ordering? Couldn't the code that calls reverse_expr() do the actual call to reverse_order_by?

As in maybe AggregateFunctionExpr::reverse_expr could call reverse_order_by and AggregateUdfImpl::reverse_expr would only return an AggregateUDF

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take a look at the code that uses reverse_expr, I guess some kind of rewrite may help.

@@ -39,3 +39,4 @@ path = "src/lib.rs"
arrow = { workspace = true }
datafusion-common = { workspace = true, default-features = true }
datafusion-expr = { workspace = true }
sqlparser = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally the physical expr shouldn't be depending on sql parser (for people who are not using SQL)

Though now I see that perhaps it is because AggregateFunction has a null treatment flag on it 🤔 That might be a nice dependency to avoid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is easy to avoid this dependency, we can check this token in parser, then pass boolean all the way down.

Copy link
Contributor Author

@jayzhan211 jayzhan211 Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me file an issue for this

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants