Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement trait based API for defining ScalarUDFs #8568

Closed
alamb opened this issue Dec 17, 2023 · 3 comments · Fixed by #8578
Closed

Implement trait based API for defining ScalarUDFs #8568

alamb opened this issue Dec 17, 2023 · 3 comments · Fixed by #8578
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Dec 17, 2023

Is your feature request related to a problem or challenge?

The current way a user implements a ScalarUDF is awkward:

They must wade through several Arc<dyn<...> typedefs to figure out how to provide the type signature and implementation

pub fn new(
    name: &str,
    signature: &Signature,
    return_type: &Arc<dyn Fn(&[DataType]) -> Result<Arc<DataType>, DataFusionError> + Send + Sync>,
    fun: &Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue, DataFusionError> + Send + Sync>
) -> ScalarUDF

The create_udf is somewhat easier to use, but it still requires Arc's of anonymous functions

Describe the solution you'd like

I am not sure why this API is implemented like it is. If I were a user I would expect to be able to use a trait object

like

struct MyUDF { 
..
}

impl FunctionDefintion for MyUDF {
  fn name(&self) -> &str, 
  fn return_type(&self) -> &DataType, 
...
}

Describe alternatives you've considered

No response

Additional context

I want to make ScalarUDFs easy to define for two reasons:

  1. they are a key API for DataFusion
  2. As we work on making all scalar functions ScalarUDF [EPIC] Unify Function Interface (remove BuiltInScalarFunction) #8045 the easer it is to make UDFs the easier it is to complete that task
@alamb alamb added the enhancement New feature or request label Dec 17, 2023
@alamb alamb self-assigned this Dec 17, 2023
@universalmind303
Copy link
Contributor

Ideally, i'd prefer something like this for implementing scalar udfs.

pub trait ScalarUdf: Send + Sync {
    /// The name for this function
    fn name(&self) -> &'static str;

    /// Return the signature for this function
    fn signature(&self) -> Signature;

    /// Return type of the function
    fn return_type(&self) -> DataType;
    
    /// udf implementation
    fn func(&self, args: &[&ColumnarValue]) -> Result<ColumnarValue, DataFusionError>
}

@universalmind303
Copy link
Contributor

I'm also curious.

Are there plans to ever allow udfs to support return types based off the input value instead of the type? I know there is currently special handling for arrow_cast because of this limitation.

@alamb
Copy link
Contributor Author

alamb commented Dec 22, 2023

Ideally, i'd prefer something like this for implementing scalar udfs.

@universalmind303 -- perhaps you could check out #8578 which proposes the following trait for implementing UDFs:

(Edit: I see you did in fact look at that PR)

pub trait ScalarUDFImpl {
    /// Returns this object as an [`Any`] trait object
    fn as_any(&self) -> &dyn Any;

    /// Returns this function's name
    fn name(&self) -> &str;

    /// Returns the function's [`Signature`] for information about what input
    /// types are accepted and the function's Volatility.
    fn signature(&self) -> &Signature;

    /// What [`DataType`] will be returned by this function, given the types of
    /// the arguments
    fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>;

    /// Invoke the function on `args`, returning the appropriate result
    ///
    /// The function will be invoked passed with the slice of [`ColumnarValue`]
    /// (either scalar or array).
    ///
    /// # Zero Argument Functions
    /// If the function has zero parameters (e.g. `now()`) it will be passed a
    /// single element slice which is a a null array to indicate the batch's row
    /// count (so the function can know the resulting array size).
    ///
    /// # Performance
    ///
    /// For the best performance, the implementations of `invoke` should handle
    /// the common case when one or more of their arguments are constant values
    /// (aka  [`ColumnarValue::Scalar`]). Calling [`ColumnarValue::into_array`]
    /// and treating all arguments as arrays will work, but will be slower.
    fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue>;

    /// Returns any aliases (alternate names) for this function.
    ///
    /// Aliases can be used to invoke the same function using different names.
    /// For example in some databases `now()` and `current_timestamp()` are
    /// aliases for the same function. This behavior can be obtained by
    /// returning `current_timestamp` as an alias for the `now` function.
    ///
    /// Note: `aliases` should only include names other than [`Self::name`].
    /// Defaults to `[]` (no aliases)
    fn aliases(&self) -> &[String] {
        &[]
    }
}

I'm also curious.

Are there plans to ever allow udfs to support return types based off the input value instead of the type? I know there is currently special handling for arrow_cast because of this limitation.

We discussed it in #7657 and I just filed #8624

Once we have a trait based API I believe we will be in a much better position to support this feature

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants