-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create datafusion-functions
crate, extract encode and decode to
#8705
Conversation
d9528ab
to
d9378a9
Compare
@@ -69,14 +69,10 @@ pub enum BuiltinScalarFunction { | |||
Cos, | |||
/// cos | |||
Cosh, | |||
/// Decode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key point of this PR: to begin removing items from this (giant) enum
d9378a9
to
7f860dc
Compare
@@ -758,30 +752,6 @@ impl BuiltinScalarFunction { | |||
BuiltinScalarFunction::Digest => { | |||
utf8_or_binary_to_binary_type(&input_expr_types[0], "digest") | |||
} | |||
BuiltinScalarFunction::Encode => Ok(match input_expr_types[0] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is moved into the ScalarUDFImpl
for Encode/Decod
@@ -1310,34 +1306,4 @@ mod test { | |||
unreachable!(); | |||
} | |||
} | |||
|
|||
#[test] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are testing the expression construction which I don't think is hugely valuable -- what is far more important is that encode/decode can be called as SQL and the dataframe API
encode/decode is covered by in slt here: https://github.com/apache/arrow-datafusion/blob/d68c0fcb491d4d0a2159ff433b59fa77f78edca4/datafusion/sqllogictest/test_files/encoding.slt#L18-L17
And I have added an encode/decode dataframe test in this PR
@@ -524,6 +527,30 @@ async fn roundtrip_logical_plan_with_extension() -> Result<()> { | |||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn roundtrip_expr_api() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are tests demonstrating that using ScalarUDF to implement functions are properly serialized
7f860dc
to
1a1e0f4
Compare
@@ -750,3 +756,48 @@ async fn test_fn_upper() -> Result<()> { | |||
|
|||
Ok(()) | |||
} | |||
|
|||
#[tokio::test] | |||
async fn test_fn_encode() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these functions weren't previously tested. I added tests to show that it is still possible to use the fluent API even with the moving of the expr_fn
implementation
not_impl_err!("Registering ScalarUDF") | ||
} | ||
|
||
// TODO add register_udaf and register_udwf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will file follow on tickets for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
/// A [`FunctionRegistry`] that uses in memory [`HashMap`]s | ||
#[derive(Default, Debug)] | ||
pub struct MemoryFunctionRegistry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is currently only used for testing, but I plan to reduce the duplication between SessionContext, and TaskContext using it
under the License. | ||
--> | ||
|
||
# DataFusion Function Library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the basic idea -- make a new crate with all the function definitions
} | ||
}; | ||
} | ||
|
||
#[cfg(feature = "crypto_expressions")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other existing feature flags such as crypto_expressions
are obvious candidates to extract into datafusion-functions
datafusion-functions
cratedatafusion-functions
crate, extract encode and decode to
I think we should wait to merge this until after releasing #8863 |
@metesynnada PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!. I think this PR is great in accomplishing your aim:
This is the start of removing the distinction between "Built in" functions and "User Defined Functions"
This will enable force us to use, create good APIs for external users. Since now every scalar function is like User Defined function. Thanks @alamb for this PR.
I think, in the future your intent is to move all functions in the BuiltinScalarFunction
enum to this package. If this is the intent, maybe we can add a recipe somewhere (can be a issue body) for explaining steps to convert existing BuiltinScalarFunction
s to ScalarUDFImpl
s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @mustafasrepo @metesynnada and @crepererum for your reviews. I plan to merge this PR tomorrow unless anyone else would like extra time to review, and file follow on issues to start organizing the bigger project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This looks like a pretty nice feature!
} | ||
|
||
fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
// Put a feature flag here to make sure this is only compiled when the feature is activated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a TODO here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good call -- I think the comment is out of date -- this whole module is only compiled when the feature is activated. I will remove it
macro_rules! make_package { | ||
($name:ident, $feature:literal, $DOC:expr) => { | ||
#[cfg(feature = $feature)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I wonder if all function packages have corresponding features? I guess for built-in scalar functions they don't have feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an excellent observation
I think eventually it would be nice to have as many functions as possible be optional and behind a feature flat. There are some functions that back built in Expr
like make_array
What I am thinking is that we'll first break out the functions that do have their own feature flags already and establish a pattern and then we can design / lift out others that don't. I hope to write down more of the plan shortly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a discussion / proposal on how the functions could be organized: #9100
Looks nice, I think this should really help with the readability of the code |
Co-authored-by: Liang-Chi Hsieh <[email protected]>
…fusion into alamb/function_packages
🚀 |
Thank you everyone who helped review and provide commentary on this. @viirya @sunchao @2010YOUY01 @crepererum @metesynnada and @mustafasrepo -- this is only the first step but I am pretty excited about where this will go |
A couple of questions on this:
|
I responded to these questions above at #9100 (comment) |
Note: I view this PR as strategically very important because:
rust
tosqllogictest
#6195)While this PR looks large, it is a lot of boiler plate and comments. The actual changes are quite small (move two functions)
Which issue does this PR close?
Part of #8045
Closes #8046
Closes #8157 (it is not needed anymore)
Rationale for this change
This is the start of removing the distinction between "Built in" functions and "User Defined Functions"
Removing the distinction between "Built in" functions and "User Defined Functions" has two benefits:
What changes are included in this PR?
datafusion-functions
crate (eventually I imagine there may be several subcrates, to allow faster compilation, but let's start with one)register_udf
function to theFunctionRegistry
traitencode
anddecode
functions into the new crateAre these changes tested?
Yes, there are new unit tests
Are there any user-facing changes?
encode
anddecode
are now defined indatafusion-functions
rather thandatafusion-expr
. For anyone who just uses theexpr_fn
APIs this will mean a new dependency on datafusion-physical-expr.