Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

User defined window functions #5781

Closed
4 tasks
alamb opened this issue Mar 29, 2023 · 11 comments · Fixed by #6703
Closed
4 tasks

User defined window functions #5781

alamb opened this issue Mar 29, 2023 · 11 comments · Fixed by #6703
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 29, 2023

Is your feature request related to a problem or challenge?

When implementing our InfuxQL frontend for DataFusion, we found certain functions we would like to express as window functions (like a specialized interpolation function for example)

We can write user defined aggregates:

select 
  x, my_aggregate(x, time) 
from t;

But those do not produce the same number of rows that go in (they reduce cardinality)

We would like to use our own window functions like

select 
  x, my_interpolation_function(x, time) OVER (PARTITION BY y, ORDER BY time) 
from t;

Describe the solution you'd like

I would like the ability to define, and register user defined window functions , like we have for user defined aggregate functions:

https://github.com/apache/arrow-datafusion/blob/8139ed40d06d77498217438ff52fe73a4ea16f61/datafusion-examples/examples/simple_udaf.rs#L18-L19

This would likely involve

It isn't clear to me if we would want to expose AggregateWindowExpr, for example

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label Mar 29, 2023
@alamb
Copy link
Contributor Author

alamb commented Apr 25, 2023

FYI @mustafasrepo and @ozankabak I think @stuartcarnie is contemplating what implementing user defined window functions might entail. I am not sure if you have any thoughts on this matter you would like to share

@ozankabak
Copy link
Contributor

Yes, we will be using UDWF too. Is there a design doc we can read and comment on?

@doki23
Copy link
Contributor

doki23 commented Apr 26, 2023

How we distinguish UDWF from UDAF since window function already supports AggregateUDF, I mean UDWF is a super set of udaf in window function.

@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2023

Yes, we will be using UDWF too. Is there a design doc we can read and comment on?

There is no document that I know of -- it is probably time to start one

@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2023

How we distinguish UDWF from UDAF since window function already supports AggregateUDF, I mean UDWF is a super set of udaf in window function.

@doki23 I am not sure

The current code has this:
https://docs.rs/datafusion-expr/23.0.0/datafusion_expr/window_function/enum.WindowFunction.html

I never really understood why DataFusion makes a distinction between built in aggregate functions and user defined aggregate functions (it would be really nice if all aggregate functions had the same interface)

@alamb
Copy link
Contributor Author

alamb commented Jun 6, 2023

Unless someone beats me to it I plan to start working on a proposed design for this feature in the next few days

@alamb alamb self-assigned this Jun 7, 2023
@alamb
Copy link
Contributor Author

alamb commented Jun 7, 2023

Here is a proposed design.

LogicalPlan / Expr

The current (Expr) representation of window functions:

#[derive(Clone, PartialEq, Eq, Hash)]
pub enum Expr {
...
    /// Represents the call of a window function with arguments.
    WindowFunction(WindowFunction),
}

Which is defined like:

/// Window function
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct WindowFunction {
    /// Name of the function
    pub fun: window_function::WindowFunction,
    /// List of expressions to feed to the functions as arguments
    pub args: Vec<Expr>,
    /// List of partition by expressions
    pub partition_by: Vec<Expr>,
    /// List of order by expressions
    pub order_by: Vec<Expr>,
    /// Window frame
    pub window_frame: window_frame::WindowFrame,
}

Note that window_function::WindowFunction already handles user defined functions (AggregateUDF), so I propose to add an addutionl user defined window function variant there:

/// WindowFunction (in `window_function`):
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum WindowFunction {
    /// A built in aggregate function that leverages an aggregate function
    AggregateFunction(AggregateFunction),
    /// A a built-in window function
    BuiltInWindowFunction(BuiltInWindowFunction),
    /// A user defined aggregate function
    AggregateUDF(Arc<AggregateUDF>),
    /// A user defined aggregate function  <---- This is NEW
    WindowUDF(Arc<WindowUDF>),
}

WindowUDF

WindowUDF will be a structure that looks very similar to AggregateUDF:

https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/udaf.rs#L30-L40

And similarly to the way that an AggregateUDF instantiates an Accumulator (via AccumulatorFunctionImplementation) the WindowUDF will need to provide a PartitionEvaluator instance for each partition of the data.

pub trait PartitionEvaluator: Debug + Send {

I am a little unclear on certain parts of the PartitionEvaluator AP (specifically, the stateful / stateless partition evaluation) and if we can make it straight forward to implement this for datafusion

Looking at how this code is used in the physical_plan module, what is needed is something that implements the BuiltInWindowFunctionExpr. I am not sure yet if it makes sense to have WindowUDF do so directly of if another structure would be better

Here are the next steps I plan:

So steps:

  • Add some additional docstrings to PartitionEvaluator which will both improve the code, and allow me to sort out some of my questions: Minor: Add additional docstrings to Window function implementations #6592
  • Propose renaming BuiltInWindowFunctionExpr to WindowFunctionExpr to reflect they will be used for more than BuiltIn
  • Propose renaming BuiltinWindowState to WindowState (in the same theory that it is not related to built ins)
  • Work on a POC / technical spike showing how this might work, including an end to end example

@stuartcarnie
Copy link
Contributor

Great stuff, @alamb! I did have similar ideas in my own test branch, but I wasn't sure how to replicate BuiltInWindowFunctionExpr and BuiltinWindowState. Renaming them makes a lot of sense 👍🏻

@ozankabak
Copy link
Contributor

Looks reasonable to me, thanks @alamb

@alamb
Copy link
Contributor Author

alamb commented Jun 8, 2023

While working on #6592 I think I can now articulate a key design question about WindowUDF. Specifically:

Use existing API / Traits

What this would mean:

This would mean exposing (at least) BuiltInWindowFunctionExpr and PartitionEvaluator directly for people to implement a WindowUDF

This would mean making those traits pub and part of a WindowUDF and

Pros

  1. This is the same pattern used by AggregateUDF that exposes the Accumulator trait directly to implementors
  2. Allow user defined window functions to access all the features and performance as built in window functions (both now and in the future)

Cons

BuiltInWindowFunctionExpr and PartitionEvaluator are are fairly complicated (as they support many features) which could make implementing user defined window functions harder

Also, BuiltInWindowFunctionExpr and PartitionEvaluator would likely need some changes such as:

  1. renamed so they weren't called "BuiltIn*"
  2. Moving the trait definitions into datafusion_expr
  3. Update the state management to be extendable (e.g. the enum for BuiltinWindowState would have to be updated to to allow user defined state somehow)

Make new APIs and traits

What this would mean:

In this case, we would make new traits that are subsets of what is in
BuiltInWindowFunctionExpr and PartitionEvaluator that users would
implement. Internally we would implement BuiltInWindowFunctionExpr
and PartitionEvaluator in terms of these new traits.

Pros

  1. We can tailor the API to precisely the needs of user defined window function writers
  2. Would allow changing the internal APIs without having to change the WindowUDF API

Cons

A new API would mean introducing another api layer that needs to be tested and kept up to date.

The new API also might not expose all the functionality of the built in window functions if such functionality was added at a later date

Discussion

I am going to try and bash out a technical proof of concept of the first approach (exposing the existing APIs) and we can see how it would look. On the balance that is my preferred approach due to the power of the API and the similarities with AggregateUDFs

@alamb
Copy link
Contributor Author

alamb commented Jun 9, 2023

Here is a proposed API #6617 showing how a User Defined Window function would work. Any feedback on the approach would be most appreciated.

While waiting for feedback on #6617 I plan to try and simplify some of the window function implementation as some small PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants