Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support User Defined Table Function #8306

Merged
merged 10 commits into from
Nov 30, 2023
Merged

Support User Defined Table Function #8306

merged 10 commits into from
Nov 30, 2023

Conversation

Veeupup
Copy link
Contributor

@Veeupup Veeupup commented Nov 22, 2023

Which issue does this PR close?

Closes #7926

Rationale for this change

This is a draft PR to show how to implement User-Defined-Table Function.

It remains a lot of things to do, but maybe it's a good start.

Such as:

  • Internal Table Defined Function
  • More input functions types
  • More friendly register
  • ...

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner core Core DataFusion crate labels Nov 22, 2023
@Veeupup
Copy link
Contributor Author

Veeupup commented Nov 22, 2023

@alamb PTAL, a draft implementation of user-defined-table-function : )

@alamb
Copy link
Contributor

alamb commented Nov 22, 2023

Thank you @Veeupup -- I am very excited to check this out. I plan to do so tomorrow if possible (though it is a holiday in the US so it may not be until Friday or the weekend)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I couldn't help myself and took a quick look now @Veeupup -- this is really cool, thank you. Also thank you for putting up a draft PR for early feedback.

I left some comments, let me know what you think.

Thanks again! 🏆

}

/// Get the function implementation and generate a table
pub fn create_table_provider(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we would allow users to provide their own TableProvider directly (rather than a PartitionStream) as this permits them to implement predicate / projection / limit pushdown, among other things

I like how you have made a simple version in the example -- maybe you can just move the creation of StreamingTable::try_new() into datafusion-examples/examples/simple_udtf.rs 🤔

I am also thinking about how to support TableFunctions that take streams as input: #7926 (comment). Any ideas on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we would allow users to provide their own TableProvider directly (rather than a PartitionStream) as this permits them to implement predicate / projection / limit pushdown, among other things

sure! Agree with you, just implementing their TableProvider seems more reasonable. I have change the func signature.

I am also thinking about how to support TableFunctions that take streams as input:

maybe users can implement PartitionStream and use inner StreamTable to take streams as input? see it in example

/// Get the function implementation and generate a table
pub fn create_table_provider(
&self,
args: &[String],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of taking strings what would you think about taking Expr instead (which would allow for more sophisticated things like passing in now() as sugested by @yukkit in #7926 (comment))?

Copy link
Contributor Author

@Veeupup Veeupup Nov 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this idea too! I'll try to figure out if we can find a way to evaluate Expr correctly!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb PTAL I have madeExprs and provider inputs and it looks just fine!

but we still have some problems to solve, you can check TODO in the latest commit.

  • how to evaluate expr simpler?
    • for now, I left it in fn scan to construct Expr -> LogicalPlan -> PhysicalPlan and evaluate it then, but I think we can make it simpler or elsewhere to evaluate Exprs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love this idea too! I'll try to figure out if we can find a way to evaluate Expr correctly!

BTW since I agree evaluating expressions is non trivial, I added some additional documentation and examples of how to do so here: #8377

@Veeupup
Copy link
Contributor Author

Veeupup commented Nov 23, 2023

Happy Thanksgiving!! Enjoy your holiday! 😺

I'll checkout your constructive comments later today when I'm free : )

@Veeupup Veeupup force-pushed the table_function branch 2 times, most recently from 26ddf6a to 050ab9d Compare November 25, 2023 07:47
@Veeupup Veeupup requested a review from alamb November 25, 2023 07:55
@Veeupup Veeupup marked this pull request as ready for review November 25, 2023 08:10
@Veeupup
Copy link
Contributor Author

Veeupup commented Nov 27, 2023

cc @yukkit hi, this PR is almost ready for review, comments are welcome!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Veeupup -- I think this is looking very close. (Another) really nice PR 🦾

What I think is needed prior to merging this PR:

  1. Gather any more feedback on the API
  2. Add tests (in addition to the example) in https://github.com/apache/arrow-datafusion/tree/main/datafusion/core/tests/user_defined (the example outputs are not validated)

I also added some suggestions on how to make the code simpler / better but I don't think they are strictly necessary

}

/// Get the function implementation and generate a table
pub fn create_table_provider(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// A trait for table function implementations
pub trait TableFunctionImpl: Sync + Send {
/// Create a table provider
fn call(&self, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is nice and is as specified in #7926. I think it will work for using a table function as a relation in the query (aka like a table with parameters)

The one thing I don't think this API supports is TableFunctions that take other arguments (aka that are fed the result of a table / can use the value of correlated subqueries as mentioned by @yukkit and @Jesse-Bakker #7926 (comment).

I can think of two options:

  1. Leave this API as is , and add a follow on / new API somehow to support that usecase
  2. Try to extend this API somehow to support table inputs

I personally prefer 1 as I think it offers several additional use cases, even though it doesn't cover "take a table input".

Any other thoughts?

Copy link
Contributor

@Jesse-Bakker Jesse-Bakker Nov 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One specific use case for table-valued arguments to table-valued functions is, for example windowing tvf's like in apache flink.

Example which cannot be expressed by taking Expr arguments (maybe if Expr::Row() is added?):

SELECT window_start, window_end, SUM(price)
  FROM TABLE(
    TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
  GROUP BY window_start, window_end;

That can also be emulated, however, using something like:

SELECT window_start, window_end, SUM(price)
  FROM Bid,
    TUMBLE(Bid.bidtime, INTERVAL '10' MINUTES))
    GROUP BY window_start, window_end;

which doesn't need table-valued arguments (but does need to resolve Expr::Column(name=bidtime). I'm not sure if the current API can do that?).

Anyway, the current API is nice, and definitely very useful 👍

datafusion/core/src/datasource/function.rs Outdated Show resolved Hide resolved
datafusion/sql/src/planner.rs Outdated Show resolved Hide resolved
FunctionArg::Unnamed(FunctionArgExpr::Expr(expr)) => {
let expr = self.sql_expr_to_logical_expr(
expr,
// TODO(veeupup): for now, maybe it's little diffcult to resolve tables' schema before create provider
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the function is used like a base relation (aka a TableFactor / item in the FROM clause) I don't think there is any schema to use to resolve the input arguments (aka this code is correct and probably shouldn't be a todo).

I don't really know how we would represent the correlated subquery case in #7926 (comment) 🤔 I think it would need some sort of analysis pass after the initial plan is built

datafusion/sql/src/relation/mod.rs Outdated Show resolved Hide resolved
datafusion/sql/src/relation/mod.rs Show resolved Hide resolved
datafusion-examples/examples/simple_udtf.rs Outdated Show resolved Hide resolved
}
}

// Option2: (use StreamingTable to make it simpler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think adding a second option is more confusing in this example, so I would recommend removing this part (perhaps you can just add a note that it is posisble)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Veeupup -- I think this can be merged as is. Really nice

I have a suggestion about trying to simplify the code a bit (we can do it as a follow on PR as well): Veeupup#1

@Veeupup
Copy link
Contributor Author

Veeupup commented Nov 30, 2023

we should support table function inputs with Expr::Column and more complicated Expr.

Where should we track it? Still track it in #7926 ? or start a new issue

I can make some trials later.

@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

I took the liberty of merging this branch up from main. If the tests pass I'll plan to merge it. While waiting I will likely file follow on tickets

@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

I filed #8383 to track follow on work. Is there any other specific improvement that we discussed that isn't covered?

@alamb
Copy link
Contributor

alamb commented Nov 30, 2023

🚀

@alamb alamb merged commit e19c669 into apache:main Nov 30, 2023
23 checks passed
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 14, 2023
* Support User Defined Table Function

Signed-off-by: veeupup <[email protected]>

* fix comments

Signed-off-by: veeupup <[email protected]>

* add udtf test

Signed-off-by: veeupup <[email protected]>

* add file header

* Simply table function example, add some comments

* Simplfy exprs

* make clippy happy

* Update datafusion/core/tests/user_defined/user_defined_table_functions.rs

---------

Signed-off-by: veeupup <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support User Defined Table Functions / Table Value Functions
3 participants