RFC: Demonstrate what a function package might look like -- encoding expressions #8046

Closed
wants to merge 24 commits into from

Conversation

@alamb alamb commented Nov 3, 2023

Which issue does this PR close?

Builds on #8039

Demonstrates what #8045 might look like

Rationale for this change

This PR demonstrates what a function package API might look like by removing the encoding expressions (encode/decode) from the BuiltinScalarFunction enum and adding them to a separate crate (datafusion-functions).

What changes are included in this PR?

  1. A new FunctionImplementation trait, integrated into ScalarUDF, to make it easier to write ScalarUDFs.
  2. A new datafusion-functions crate that contains the implementations of encode and decode.
  3. Automatic registration of these functions as part of SessionState::new(), similar to the automatically registered ListingTables.
  4. TODO: optionally enable functions based on a feature flag.
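
To make item 1 concrete, here is a minimal, self-contained sketch of how a trait can be adapted into a closure-based UDF struct so that existing callers keep working. Everything in it is a simplified stand-in invented for illustration: `DataType`, `ScalarUdf`, the trait's method set, and the toy `HexEncode` are assumptions, not the definitions from this PR or from DataFusion.

```rust
use std::sync::Arc;

// Simplified stand-in; the real DataFusion `DataType` is much richer.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Utf8,
    Binary,
}

/// Hypothetical trait in the spirit of `FunctionImplementation`.
pub trait FunctionImplementation {
    fn name(&self) -> &str;
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String>;
    fn invoke(&self, args: &[String]) -> Result<String, String>;
}

type ReturnTypeFn = Arc<dyn Fn(&[DataType]) -> Result<DataType, String> + Send + Sync>;
type InvokeFn = Arc<dyn Fn(&[String]) -> Result<String, String> + Send + Sync>;

/// Closure-based UDF, standing in for the shape of the existing `ScalarUDF`.
pub struct ScalarUdf {
    pub name: String,
    pub return_type: ReturnTypeFn,
    pub fun: InvokeFn,
}

impl ScalarUdf {
    /// Wraps a trait object in the closure-based struct.
    pub fn from_impl(imp: Arc<dyn FunctionImplementation + Send + Sync>) -> Self {
        let for_type = Arc::clone(&imp);
        let for_call = Arc::clone(&imp);
        Self {
            name: imp.name().to_string(),
            return_type: Arc::new(move |types: &[DataType]| for_type.return_type(types)),
            fun: Arc::new(move |args: &[String]| for_call.invoke(args)),
        }
    }
}

/// Toy function that hex-encodes its first argument (not the real `encode`).
pub struct HexEncode;

impl FunctionImplementation for HexEncode {
    fn name(&self) -> &str {
        "encode"
    }

    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType, String> {
        match arg_types.first() {
            Some(DataType::Utf8) | Some(DataType::Binary) => Ok(DataType::Utf8),
            None => Err("encode expects at least one argument".to_string()),
        }
    }

    fn invoke(&self, args: &[String]) -> Result<String, String> {
        let input = args.first().ok_or("encode expects at least one argument")?;
        Ok(input.bytes().map(|b| format!("{b:02x}")).collect())
    }
}
```

With this shape, `ScalarUdf::from_impl(Arc::new(HexEncode))` yields a value that code written against the closure fields can use unchanged, which is the backwards-compatibility property the trait integration is after.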

Open Questions:

  1. To support the expr_fns encode and decode, I think we will need an Expr::ScalarFunction call or something that can take a function by name rather than a fully resolved function.
  2. Extract registration functions from SessionContext into their own trait / consolidate the function registry code rather than passing
    around a set of HashMaps... and make a way to actually modify them.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels Nov 3, 2023
@alamb alamb force-pushed the alamb/extract_encoding_expressions branch from b3e25be to c441a0d Compare November 3, 2023 21:42
datafusion/expr/src/expr_fn.rs (outdated)
@@ -710,30 +704,6 @@ impl BuiltinScalarFunction {
BuiltinScalarFunction::Digest => {
utf8_or_binary_to_binary_type(&input_expr_types[0], "digest")
}
BuiltinScalarFunction::Encode => Ok(match input_expr_types[0] {

@alamb alamb Nov 3, 2023

This metadata about the functions has now moved into the functions/encoding.rs module, alongside the implementation.

}
}

/// Convenience trait for implementing ScalarUDF. See [`ScalarUDF::from_impl()`]

This echoes the trait that @2010YOUY01 proposed in #7752, but does so in a backwards-compatible way (it makes a ScalarUDF out of the trait).


pub(super) struct EncodeFunc {}

static ENCODE_SIGNATURE: OnceLock<Signature> = OnceLock::new();

Here is what encode and decode look like using the ScalarUDF API -- I think they are much clearer when all this type information is in one place (though I still kept it separate from the implementation to show the implementation did not change at all)
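
As a rough illustration of the `OnceLock` pattern in the snippet above, the following sketch builds a signature lazily on first use and hands out the same instance afterwards. The `Signature` struct and the list of accepted types are simplified assumptions, not DataFusion's real `Signature` or the actual type list for encode.

```rust
use std::sync::OnceLock;

// Simplified stand-in for DataFusion's `Signature`.
#[derive(Debug, PartialEq)]
pub struct Signature {
    pub accepted_arg_types: Vec<&'static str>,
}

// Starts empty; filled in by the first caller of `signature()`.
static ENCODE_SIGNATURE: OnceLock<Signature> = OnceLock::new();

pub struct EncodeFunc {}

impl EncodeFunc {
    /// Builds the signature on the first call and returns the same
    /// instance on every later call.
    pub fn signature(&self) -> &'static Signature {
        ENCODE_SIGNATURE.get_or_init(|| Signature {
            // Assumed type list, for illustration only.
            accepted_arg_types: vec!["Utf8", "LargeUtf8", "Binary", "LargeBinary"],
        })
    }
}
```

Because `get_or_init` runs its closure at most once, the signature is allocated a single time no matter how many plans ask for it.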

use std::collections::HashMap;
use std::sync::Arc;

/// Registers the `encode` and `decode` functions with the function registry

Here is the conditional registration of these functions based on a feature flag -- there are probably nicer ways to do this, but I don't think it is any worse than the current solution.
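
One way such conditional registration could look is sketched below, using `#[cfg(feature = ...)]` with a no-op fallback. The feature name `encoding_expressions` and the `ScalarUdf` stand-in are assumptions made for this example and may not match what the PR does.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for `ScalarUDF`.
pub struct ScalarUdf {
    pub name: String,
}

/// Registers `encode` and `decode`, but only when the (assumed)
/// `encoding_expressions` Cargo feature is enabled.
#[cfg(feature = "encoding_expressions")]
pub fn register(registry: &mut HashMap<String, Arc<ScalarUdf>>) {
    for name in ["encode", "decode"] {
        let udf = ScalarUdf { name: name.to_string() };
        registry.insert(name.to_string(), Arc::new(udf));
    }
}

/// No-op with the same signature, so callers need no `cfg` of their own.
#[cfg(not(feature = "encoding_expressions"))]
pub fn register(_registry: &mut HashMap<String, Arc<ScalarUdf>>) {}

/// Calls every package's `register`; disabled packages contribute nothing.
pub fn register_all(registry: &mut HashMap<String, Arc<ScalarUdf>>) {
    register(registry);
}
```

Keeping a no-op variant means `register_all` stays free of `cfg` attributes even as more packages are added.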


/// Registers all "built in" functions from this crate with the provided registry
pub fn register_all(registry: &mut HashMap<String, Arc<ScalarUDF>>) {
encoding::register(registry);

I envision extending this list with other packages over time.

alamb commented Nov 4, 2023

@2010YOUY01 and @viirya I wonder if you have any thoughts on this approach / proposal?

@2010YOUY01 2010YOUY01 left a comment

Thank you, this looks great. I have several questions/suggestions:

  1. To support the expr_fns encode and decode, I think we will need an Expr::ScalarFunction call or something that can take a function by name rather than a fully resolved function.

Constructing an Expr for a built-in function is currently stateless (it does not require a context), so staying backwards compatible with the Expr API is tricky. The best solution I can think of is to also support initializing a UDF Expr from just a name string, and to resolve it during logical plan optimization.
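
The name-then-resolve idea can be sketched as follows. This is only an illustration under stated assumptions: `Expr`, `FuncRef`, `ScalarUdf`, and the standalone `resolve` function are hypothetical stand-ins, and a real implementation would live in a plan rewrite rule rather than a free function.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for `ScalarUDF`.
pub struct ScalarUdf {
    pub name: String,
}

/// A function reference: either a bare name or a resolved UDF.
pub enum FuncRef {
    Name(String),
    Resolved(Arc<ScalarUdf>),
}

/// Tiny expression tree, just large enough to show the idea.
pub enum Expr {
    Literal(String),
    ScalarFunction { func: FuncRef, args: Vec<Expr> },
}

/// Stateless constructor: needs no session or registry.
pub fn encode(args: Vec<Expr>) -> Expr {
    Expr::ScalarFunction { func: FuncRef::Name("encode".to_string()), args }
}

/// Replaces names with registered UDFs, recursing into the arguments.
pub fn resolve(
    expr: Expr,
    registry: &HashMap<String, Arc<ScalarUdf>>,
) -> Result<Expr, String> {
    match expr {
        Expr::Literal(value) => Ok(Expr::Literal(value)),
        Expr::ScalarFunction { func, args } => {
            let func = match func {
                FuncRef::Name(name) => {
                    let udf = registry
                        .get(&name)
                        .ok_or_else(|| format!("unknown function: {name}"))?;
                    FuncRef::Resolved(Arc::clone(udf))
                }
                resolved => resolved,
            };
            let args = args
                .into_iter()
                .map(|arg| resolve(arg, registry))
                .collect::<Result<Vec<_>, _>>()?;
            Ok(Expr::ScalarFunction { func, args })
        }
    }
}
```

The cost of this design is that an unknown function name is reported when the plan is resolved, not when the expression is built.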

  2. Extract registration functions from SessionContext into their own trait / consolidate the function registry code rather than passing
    around a set of HashMaps... and make a way to actually modify them.

It's a good idea to pack 3 HashMaps for scalar/aggr/window UDFs into a new struct like FunctionRegistry 👍🏼

pub mod utils;

/// Registers all "built in" functions from this crate with the provided registry
pub fn register_all(registry: &mut HashMap<String, Arc<ScalarUDF>>) {

I think we should support registering a single function here. There might be a use case where the user wants to override only one function from a function package, possibly by changing this interface to something like:

pub fn register_all() {
    register_package(encoding::all_functions());
    register_function(my_encoding::decode()); // override a method in default function package
}
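
A self-contained sketch of the override idea in the snippet above, where registering a function under an existing name replaces the earlier one. `Registry`, the `origin` field, `encoding_all_functions`, and `my_decode` are hypothetical names made up for this example.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for `ScalarUDF`; `origin` exists only to make
// the override visible.
pub struct ScalarUdf {
    pub name: String,
    pub origin: &'static str,
}

#[derive(Default)]
pub struct Registry {
    functions: HashMap<String, Arc<ScalarUdf>>,
}

impl Registry {
    /// Registers one function, replacing any earlier one with the same name.
    pub fn register_function(&mut self, udf: ScalarUdf) {
        self.functions.insert(udf.name.clone(), Arc::new(udf));
    }

    /// Registers every function in a package, in order.
    pub fn register_package(&mut self, package: Vec<ScalarUdf>) {
        for udf in package {
            self.register_function(udf);
        }
    }

    pub fn get(&self, name: &str) -> Option<&Arc<ScalarUdf>> {
        self.functions.get(name)
    }
}

/// Hypothetical default package, standing in for `encoding::all_functions()`.
pub fn encoding_all_functions() -> Vec<ScalarUdf> {
    ["encode", "decode"]
        .into_iter()
        .map(|name| ScalarUdf { name: name.to_string(), origin: "encoding" })
        .collect()
}

/// Hypothetical user replacement, standing in for `my_encoding::decode()`.
pub fn my_decode() -> ScalarUdf {
    ScalarUdf { name: "decode".to_string(), origin: "my_encoding" }
}

pub fn register_all(registry: &mut Registry) {
    registry.register_package(encoding_all_functions());
    registry.register_function(my_decode()); // the later registration wins
}
```

Here the override relies purely on registration order, so a user only has to register a replacement after the default package.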

Comment on lines 120 to 137
/// Returns this function's name
pub fn name(&self) -> &str {
    &self.name
}
/// Returns this function's signature
pub fn signature(&self) -> &Signature {
    &self.signature
}
/// return the return type of this function given the types of the arguments
pub fn return_type(&self, args: &[DataType]) -> Result<DataType> {
    // Old API returns an Arc of the datatype for some reason
    let res = (self.return_type)(args)?;
    Ok(res.as_ref().clone())
}
/// return the implementation of this function
pub fn fun(&self) -> &ScalarFunctionImplementation {
    &self.fun
}

Is this set of interfaces meant to be internal-facing, used for execution, so that we might extend it while separating out function packages?
And is the FunctionImplementation trait the user-facing API for defining functions in separate crates?

alamb commented Nov 6, 2023

Thank you, this looks great. I have several questions/suggestions:

  1. To support the expr_fns encode and decode, I think we will need an Expr::ScalarFunction call or something that can take a function by name rather than a fully resolved function.

Constructing an Expr for a built-in function is currently stateless (it does not require a context), so staying backwards compatible with the Expr API is tricky. The best solution I can think of is to also support initializing a UDF Expr from just a name string, and to resolve it during logical plan optimization.

Yes, I agree this approach is the best I can come up with.

  2. Extract registration functions from SessionContext into their own trait / consolidate the function registry code rather than passing
    around a set of HashMaps... and make a way to actually modify them.

It's a good idea to pack 3 HashMaps for scalar/aggr/window UDFs into a new struct like FunctionRegistry 👍🏼

👍 Unfortunately that name is already taken :) Maybe MemoryFunctionRegistry 🤔
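
A possible shape for such a struct is sketched below: one owner for the three maps, with a shared lookup helper. The name `MemoryFunctionRegistry` is taken from the suggestion above; the UDF structs, the `add_*` methods, and the error strings are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-ins for the three UDF kinds.
pub struct ScalarUdf { pub name: String }
pub struct AggregateUdf { pub name: String }
pub struct WindowUdf { pub name: String }

/// Owns the three maps that would otherwise be passed around separately.
#[derive(Default)]
pub struct MemoryFunctionRegistry {
    scalar: HashMap<String, Arc<ScalarUdf>>,
    aggregate: HashMap<String, Arc<AggregateUdf>>,
    window: HashMap<String, Arc<WindowUdf>>,
}

/// Shared lookup so each kind reports a missing function the same way.
fn lookup<T>(
    map: &HashMap<String, Arc<T>>,
    kind: &str,
    name: &str,
) -> Result<Arc<T>, String> {
    map.get(name)
        .cloned()
        .ok_or_else(|| format!("no {kind} function named '{name}'"))
}

impl MemoryFunctionRegistry {
    pub fn add_udf(&mut self, f: ScalarUdf) {
        self.scalar.insert(f.name.clone(), Arc::new(f));
    }

    pub fn add_udaf(&mut self, f: AggregateUdf) {
        self.aggregate.insert(f.name.clone(), Arc::new(f));
    }

    pub fn add_udwf(&mut self, f: WindowUdf) {
        self.window.insert(f.name.clone(), Arc::new(f));
    }

    pub fn udf(&self, name: &str) -> Result<Arc<ScalarUdf>, String> {
        lookup(&self.scalar, "scalar", name)
    }

    pub fn udaf(&self, name: &str) -> Result<Arc<AggregateUdf>, String> {
        lookup(&self.aggregate, "aggregate", name)
    }

    pub fn udwf(&self, name: &str) -> Result<Arc<WindowUdf>, String> {
        lookup(&self.window, "window", name)
    }
}
```

Each kind keeps its own namespace, so a scalar function and an aggregate function may share a name without colliding.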

@github-actions github-actions bot removed the optimizer Optimizer rules label Nov 18, 2023
@@ -34,6 +34,17 @@ pub trait FunctionRegistry {

/// Returns a reference to the udwf named `name`.
fn udwf(&self, name: &str) -> Result<Arc<WindowUDF>>;

/// Registers a new `ScalarUDF`, returning any previously registered

This is a newly proposed API that allows registering new scalar UDFs with a FunctionRegistry.
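
A minimal sketch of what a registration method that returns the previously registered function could look like. The default method body that returns an error (so existing implementors keep compiling) is an assumption of this sketch, as are `ScalarUdf`, `MapRegistry`, `ReadOnlyRegistry`, and the `String` error type.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for `ScalarUDF`; `version` only distinguishes
// the old registration from the new one.
pub struct ScalarUdf {
    pub name: String,
    pub version: u32,
}

pub trait FunctionRegistry {
    /// Returns the scalar UDF named `name`.
    fn udf(&self, name: &str) -> Result<Arc<ScalarUdf>, String>;

    /// Registers a new scalar UDF, returning any previously registered
    /// one with the same name. The erroring default is an assumption:
    /// it lets existing implementors compile without changes.
    fn register_udf(
        &mut self,
        _udf: Arc<ScalarUdf>,
    ) -> Result<Option<Arc<ScalarUdf>>, String> {
        Err("this registry does not support registering scalar UDFs".to_string())
    }
}

#[derive(Default)]
pub struct MapRegistry {
    udfs: HashMap<String, Arc<ScalarUdf>>,
}

impl FunctionRegistry for MapRegistry {
    fn udf(&self, name: &str) -> Result<Arc<ScalarUdf>, String> {
        self.udfs
            .get(name)
            .cloned()
            .ok_or_else(|| format!("no scalar function named '{name}'"))
    }

    fn register_udf(
        &mut self,
        udf: Arc<ScalarUdf>,
    ) -> Result<Option<Arc<ScalarUdf>>, String> {
        // `HashMap::insert` already returns the value it replaced, if any.
        Ok(self.udfs.insert(udf.name.clone(), udf))
    }
}

/// A registry that never opted in to registration and uses the default.
pub struct ReadOnlyRegistry;

impl FunctionRegistry for ReadOnlyRegistry {
    fn udf(&self, name: &str) -> Result<Arc<ScalarUdf>, String> {
        Err(format!("no scalar function named '{name}'"))
    }
}
```

Returning the replaced function lets a caller detect an accidental override or restore the old definition later.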

@@ -1228,30 +1229,4 @@ mod test {
unreachable!();
}
}

#[test]
fn encode_function_definitions() {

@alamb alamb Nov 19, 2023

I don't think these tests add a lot -- they simply encode the signature again. The same behavior is already covered by the expr API / dataframe tests, which actually call encode().

pub mod expr_fn {
use super::*;
/// Return encode(arg)
pub fn encode(args: Vec<Expr>) -> Expr {

Here are the new expr_fn implementations.
