Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistent API to set parameters of aggregate and window functions (AggregateExt --> ExprFunctionExt) #11550

Merged
merged 22 commits into from
Jul 24, 2024

Conversation

timsaucer
Copy link
Contributor

Which issue does this PR close?

Closes #6747.

Rationale for this change

There was excellent work done on #10560 that makes a much more easy to use interface for developers to set parameters on aggregate functions. This PR is designed to bring the same interface to window functions.

Currently it can be very clunky to write a window function for a relatively simple query. For example to create a lag function you would do:

let lag_expr = Expr::WindowFunction(WindowFunction::new(
        WindowFunctionDefinition::BuiltInWindowFunction(
            BuiltInWindowFunction::Lag,
        ),
        vec![col("array_col")],
        vec![],
        vec![],
        WindowFrame::new(None),
        None,
    ));

With this PR it becomes a simpler

let lag_expr = lag(col("array_col"));

Even more powerful is that you can now convert window function expressions into their builder to set additional properties

let lag_expr = lag(col("array_col")).null_treatment(NullTreatment::IgnoreNulls).build()?;

What changes are included in this PR?

This PR refactors AggregateExt to be ExprFunctionExt and adds in additional functions for partition_by and window_frame. It also adds in a variety of useful helper functions for things like lead, lag, row_number etc.

Are these changes tested?

All existing CI tests pass. Some were updated to use the new interface, but all of the functionality remains the same.

Are there any user-facing changes?

WindowFunction::new now only requires two parameters, the function definition and arguments. Other parameters will need to be set using the builder methods provided. For example, if the user was previously calling

let max1 = Expr::WindowFunction(expr::WindowFunction::new(
             WindowFunctionDefinition::AggregateFunction(AggregateFunction::Max),
             vec![col("name")],
            vec![],
            vec![age_asc.clone(), name_desc.clone()],
            WindowFrame::new(Some(false)),
            None,
        ));

They would change this to

let max1 = Expr::WindowFunction(expr::WindowFunction::new(
             WindowFunctionDefinition::AggregateFunction(AggregateFunction::Max),
             vec![col("name")]))
        .order_by(vec![age_asc.clone(), name_desc.clone()])
        .build()
        .unwrap();

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate labels Jul 19, 2024
@timsaucer
Copy link
Contributor Author

I don't think I have permission on this repo to add labels to this PR so I wasn't able to add one for having user facing changes

@alamb alamb added the api change Changes the API exposed to users of the crate label Jul 20, 2024
@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

I don't think I have permission on this repo to add labels to this PR so I wasn't able to add one for having user facing changes

Thanks @timsaucer -- I added the api-change label. This PR looks amazing. I plan to review it soon (I have a bunch of PRs to review tomorrow). Looking forward to it ❤️

@timsaucer
Copy link
Contributor Author

Actually maybe hold off on reviewing. I have an idea for the morning that might make it simpler

@alamb alamb marked this pull request as draft July 20, 2024 11:06
@alamb
Copy link
Contributor

alamb commented Jul 20, 2024

marking as draft as it sounds like @timsaucer has some additional ideas

@timsaucer
Copy link
Contributor Author

Copying @jayzhan211 's comment from

I had tried to directly modify Expr before, but ends up builder API. I couldn't remember the reason

My guess is that it has to do with all of the other functions returning Expr rather than Result<Expr> and to fit neatly in with the rest of the code base, you wanted one point of checking? Aside from that for Aggregate I don't see a strong need to have the builder. Actually, window functions have a better argument for them with the fact that you need to (sometimes) build the window frame if it is not specified.

The reasons I can see sticking with a builder function instead of updating the Expr directly:

  1. Minimal change to the API. It should only require updating the import since all of the signatures in AggregateExt are identical in ExprFunctionExt and there are only new methods added in. We could even keep AggregateExt and mark it deprecated for a release or two.
  2. Builder allows checking for errors at one point instead of needing to do ? after every operation. Right now there are two types of errors in AggregateExt, calling one of the functions on an Expr that is not an aggregate function or setting and order_by expression that is not a Sort.
  3. As mentioned above, it actually does make setting up the window functions cleaner. With the window functions, if the user specifies a window frame we want to use that. If not, we want to create it based on a few criteria. This is easier with a builder function so we can perform this operation once. In my alternate approach, I update WindowsFunction to make the frame optional. Then at the point of consuming the window function data I perform a get_frame_or_default() that I added in. However these uses I don't expect to be something the end user experiences and is mostly upon the DataFusion code to maintain.

The reasons I can see preferring to work directly with the Expr:

  1. I think this does create a better user experience in the long run. We can perform our checks along the way, which I think is more idiomatic. Plus it relieves the user of needing to understand that we're going from an expression into a builder of an expression.
  2. The code to do this is in most ways simpler. We can remove the trait entirely, and simply have functions that are implemented on Expr. The notable exception to the simplification is the window frame on window functions mentioned above.

Laying it out like this, I can see a good argument for sticking with the builder function if that's the way you two (and others) would like to go.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 22, 2024

I guess the chain of ? is probably the reason why I choose builder, so we have build() for checking validity at once.

But I would like to optimize for user experience on these APIs, if directly mutating Expr is actually easier and straightforward to use. We should enable it. The only concern is that method that expect Result might become more and more. How about we keep both 😆 🤔

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

I had tried to directly modify Expr before, but ends up builder API. I couldn't remember the reason

My guess is that it has to do with all of the other functions returning Expr rather than Result and to fit neatly in with the rest of the code base

I think another reason was that it might be easier to understand what APIs could be called wnen

for example, if we created Expr::order_by there are many Exprs this doesn't make sense for (like Expr::Like for example). By making a builder then the APIs used for creating aggregate expressions can be in a single trait and documented with examples, etc

But I would like to optimize for user experience on these APIs, if directly mutating Expr is actually easier and straightforward to use

I agree with this (though I do think if we change the API we should try to avoid any more churn than necessary).

I would tend to favor @timsaucer 's opinion on what is easier / more intuitive. I think I am too close to DataFusion / Rust these days to really know what is easier from a "someone who is new to the datafusion codebase" perspective

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

I looked at #11582 and after reading the rest of the commentary on this PR I think this PR's proposal is best approach

Thus I suggest:

  1. Get this PR ready for review
  2. Add some sort of backwards compatibility thing to help migration

Backwards compatibilty: basically give a deprecation message that tells users how to upgrade

For example, something like

// Compatibility: ensure existing users of AggregateExt  get a message about ExprFunctionExt
// Note the syntax isn't quite right for the deprecated macro
#[deprecated(version = 40, message=:"ExprFunctionExt instead")]
trait AggregateExt: ExprFunctionExt {}

@@ -676,6 +679,266 @@ pub fn interval_month_day_nano_lit(value: &str) -> Expr {
Expr::Literal(ScalarValue::IntervalMonthDayNano(interval))
}

/// Extensions for configuring [`Expr::AggregateFunction`]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could perhaps update this comment to refer to Window functions as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

/// .build()?;
/// # Ok(())
/// # }
/// ```
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would also be awesome to add an example here for a window function too (with partition by)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

Thanks @timsaucer and @jayzhan211 -- this is looking pretty sweet

/// See [`ExprFunctionExt`] for usage and examples
#[derive(Debug, Clone)]
pub struct ExprFuncBuilder {
fun: Option<ExprFuncKind>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess fun is the must have field, option is not neccessary. fun: ExprFuncKind

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are sticking with the builder approach, don't we need this in case someone attempts to create a builder from an Expr that is not aggregate or window?

col("a").order_by(vec![col("b")])

This is a wrong use but won't be caught until we call .build() so up to that point don't we need the fun field to be None?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are sticking with the builder approach, don't we need this in case someone attempts to create a builder from an Expr that is not aggregate or window?

col("a").order_by(vec![col("b")])

This is a wrong use but won't be caught until we call .build() so up to that point don't we need the fun field to be None?

Makes sense, it is for incorrect usage

@timsaucer
Copy link
Contributor Author

Question: it looks like we have some functions that have both aggregate functions and window functions defined. Specifically looking at first_value and last_value. My thought is that the current PR just keeps with the existing pattern but I think we should merge these two, right? Should all aggregate functions be viable window functions?

Right now I have datafusion_expr::window_function::first_value() will return a window function expression of type BuiltInWindowFunction::FirstValue and datafusion_functions_aggregate::first_last::first_value() will return an aggregate expression of type AggregateUDF and datafusion_functions_aggregate::first_last::FirstValue.

My thinking is that we should stick with this PR for now but open an issue to converge all of these. I would expect the following psuedocode to be equivalent

let window_expr = datafusion_expr::window_function::first_value(col("a"));
let agg_expr = datafusion_functions_aggregate::first_last::first_value(col("a"));

Please correct me if I'm missing something here.

@timsaucer
Copy link
Contributor Author

From ^ comment I think things like the helper function first_value() and such are adding technical debt. I'm going to dig in tomorrow. If we're updating the window functions now we might as well clean this up at the same time.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Jul 23, 2024

Question: it looks like we have some functions that have both aggregate functions and window functions defined. Specifically looking at first_value and last_value. My thought is that the current PR just keeps with the existing pattern but I think we should merge these two, right? Should all aggregate functions be viable window functions?

Right now I have datafusion_expr::window_function::first_value() will return a window function expression of type BuiltInWindowFunction::FirstValue and datafusion_functions_aggregate::first_last::first_value() will return an aggregate expression of type AggregateUDF and datafusion_functions_aggregate::first_last::FirstValue.

My thinking is that we should stick with this PR for now but open an issue to converge all of these. I would expect the following psuedocode to be equivalent

let window_expr = datafusion_expr::window_function::first_value(col("a"));
let agg_expr = datafusion_functions_aggregate::first_last::first_value(col("a"));

Please correct me if I'm missing something here.

I hadn't look into the detail of window_first implementation, but I guess we could reuse the aggregate function.

let window_expr = datafusion_expr::window_function::first_value(col("a"));
let agg_expr = datafusion_functions_aggregate::first_last::first_value(col("a"));

I think these two are not equivalent. Only the function first_value is the same.
window_expr is window_expr that could add window specific infomation (partition_by or window frame).
agg_expr is aggregate expr, no window specific information involved.

TLDR, function is the same but expr is different

I guess things get easier after udwf is introduced #8709.
We could keep them separate for now.

@timsaucer
Copy link
Contributor Author

I removed first_value and last_value so that it doesn't create confusion. Ready for review.

@timsaucer timsaucer marked this pull request as ready for review July 24, 2024 11:59
…guments. Other parameters will be set via the ExprFuncBuilder
…g ExprFunctionExt trait so we can guarantee a consistent user experience no matter which they call on the Expr and which on the builder
@timsaucer timsaucer force-pushed the feature/expr-function-extension branch from 180e9d3 to 532f262 Compare July 24, 2024 12:20
/// Window function
///
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@@ -16,6 +16,8 @@
// under the License.
// Make cheap clones clear: https://github.com/apache/datafusion/issues/11143
#![deny(clippy::clone_on_ref_ptr)]
// TODO When the deprecated trait AggregateExt is removed, remove this unstable feature.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this feature requires nightly rust, which isn't possible for DataFusion I don't think. I'll push a commit to this PR that removes the attempt at backwards compatibility and just updated the description

@alamb alamb changed the title Consistent approach to setting parameters on aggregate functions and window functions Consistent approach to setting parameters on aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Jul 24, 2024
@alamb alamb changed the title Consistent approach to setting parameters on aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Consisitent API for parameters of aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Jul 24, 2024
@alamb alamb changed the title Consisitent API for parameters of aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Consistent API for parameters of aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Jul 24, 2024
vec![col("time").sort(true, true)], // ORDER BY time ASC
WindowFrame::new(None),
);
let window_expr = smooth_it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is so much nicer

@alamb
Copy link
Contributor

alamb commented Jul 24, 2024

I am so excited for this PR -- thank you @timsaucer -- I pushed a few more commits to improve the docs a bit more and get the CI passing. I think this is a really nice improvement to the usability of DataFusion

@alamb alamb changed the title Consistent API for parameters of aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Consistent API to set parameters of aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Jul 24, 2024
@alamb alamb changed the title Consistent API to set parameters of aggregate functions and window functions (remove AggregateExt add ExprFunctionExt) Consistent API to set parameters of aggregate and window functions (remove AggregateExt add ExprFunctionExt) Jul 24, 2024
@alamb alamb changed the title Consistent API to set parameters of aggregate and window functions (remove AggregateExt add ExprFunctionExt) Consistent API to set parameters of aggregate and window functions (AggregateExt --> ExprFunctionExt) Jul 24, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is looking great -- thank you @timsaucer and @jayzhan211

@jayzhan211
Copy link
Contributor

Looks great! Thanks @timsaucer and @alamb

@jayzhan211 jayzhan211 merged commit 886e8ac into apache:main Jul 24, 2024
25 checks passed
@timsaucer timsaucer deleted the feature/expr-function-extension branch August 7, 2024 13:02
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 10, 2024
Also removed `sqlparser` dependency since it's re-exported upstream.

Ref: apache/datafusion#11550
Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Aug 20, 2024
Also removed `sqlparser` dependency since it's re-exported upstream.

Ref: apache/datafusion#11550
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Aug 23, 2024
* update datafusion deps to point to githuc.com/apache/datafusion

Datafusion 41 is not yet released on crates.io.

* update TableProvider::scan

Ref: apache/datafusion#11516

* use SessionStateBuilder

The old constructor is deprecated.

Ref: apache/datafusion#11403

* update AggregateFunction

Upstream Changes:
- The field name was switched from `func_name` to func.
- AggregateFunctionDefinition was removed

Ref: apache/datafusion#11803

* update imports in catalog

Catlog API was extracted to a separate crate.

Ref: apache/datafusion#11516

* use appropriate path for approx_distinct

Ref: apache/datafusion#11644

* migrate AggregateExt to ExprFunctionExt

Also removed `sqlparser` dependency since it's re-exported upstream.

Ref: apache/datafusion#11550

* update regr_count tests for new return type

Ref: apache/datafusion#11731

* migrate from function-array to functions-nested

The package was renamed upstream.

Ref: apache/datafusion#11602

* cargo fmt

* lock datafusion deps to 41

* remove todo from cargo.toml

All the datafusion dependencies are re-exported, but I still need to figure out *why*.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make it easier to create WindowFunctions with the Expr API
4 participants