Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support User Defined Table Functions / Table Value Functions #7926

Closed
alamb opened this issue Oct 25, 2023 · 13 comments · Fixed by #8306
Closed

Support User Defined Table Functions / Table Value Functions #7926

alamb opened this issue Oct 25, 2023 · 13 comments · Fixed by #8306
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Oct 25, 2023

Is your feature request related to a problem or challenge?

It is sometimes helpful to have a custom table functions to extend DataFusion's functionality. As we continue to get more feature requests, such as #7859 it is important to support such usecases without having to add everything to the DataFusion core.

For example in the following query my_custom_fun is a table function

SELECT foo, bar FROM my_custom_fun(city='NYC', year=2020)

A specific example might be a function that fetches the contents of remote csv file and parses it into a table.

SELECT date, value FROM parse_remote_csv('https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD')

You can do something similar to this with a TableProvider, but the main differences are:

  1. A TableProvider has no way to pass parameters
  2. A TableProvider's schema is fixed (it can't be a function of the parameters)

Prior Art

Other examples include the read_parquet etc functions in DuckDB

SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);
...
SELECT * FROM parquet_schema('test.parquet');

Describe the solution you'd like

I would like to be able to have a table function that supported everything a TableProvider does, including filter and projection pushdown. One way to do so would be to actually return a TableProvider:

Option 1: Add to FunctionRegistry:

We could add Table Functions to the datafusion::execution::FunctionRegistry along with the UDFs, UDAs, etc which arguably would make them easier to discover

Something like

trait FunctionRegistry  { 
...
/// Return a  `TableProvider` for executing the `name` table function
fn udtf(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;

}

We would probably also need a

trait TableUDF {
  /// Return a  `TableProvider` for executing this  table function, given the specified
  /// arguments
  fn invoke(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}

Describe alternatives you've considered

This API is very powerful and would allow Table Functions to do anything a table provider does. We could also offer a stripped down version of the API potentially

We can probably add something like datafusion::logical_expr::create_udf to make it easier to construct basic table functions (e.g that produce a single SendableRecordBatchStream)

Add to SchemaProvider:

We could also add Table Functions to datafusion::catalog::schema::SchemaProvider

This might make sense given how similar TableFunctions are to TableProviders

trait SchemaProvider  { 
...
/// Return a  `TableProvider` for executing the `name` table function, given the specified
/// arguments
fn table_function(name: &str) -> Result<Arc<dyn TableProvider>>;

/// Register the `TableFunction` with the specified name, returning the previously registered function, if any
fn register_table_function(name: &str) -> Result<Option<Arc<dyn TableProvider>>>;
...
}

Additional context

I thought there was an existing ticket for this, but I can not find one

This came up several times, including:

@alamb alamb added the enhancement New feature or request label Oct 25, 2023
@Veeupup
Copy link
Contributor

Veeupup commented Nov 12, 2023

I'm interested by this topic! Seems like a very powerful feature! : )

And I want to study the related codes and figure it out to learn more about datafusion!

Now it looks a little difficult for me, but maybe I can help in future weeks if I can handle it.

@alamb
Copy link
Contributor Author

alamb commented Nov 13, 2023

I'm interested by this topic! Seems like a very powerful feature! : )

Yes, I think so

And I want to study the related codes and figure it out to learn more about datafusion!

Thank you! It is very much appreciated.

Now it looks a little difficult for me, but maybe I can help in future weeks if I can handle it.

Thank you.

My suggestion is to first make a RFC / proposal PR (for example like #8046) that sketches out the main APIs and identifies any potential problems / areas that need additional work prior to trying implement a fully mergeable PR. This would allow us to solidify the design without wasted effort on polishing code / tests

@Veeupup
Copy link
Contributor

Veeupup commented Nov 14, 2023

Before User Defined Table Functions, I can not find Internal Table Function either.

So I suppose that maybe we can start by implmenting the Internal Table Functions first, just like

SELECT * FROM read_parquet(['folder1/*.parquet', 'folder2/*.parquet']);

how do you think @alamb : )

Then we can let users define their own Table Functions.

@alamb
Copy link
Contributor Author

alamb commented Nov 14, 2023

I think implementing something like read_parquet woudl be great (as an example perhaps in https://github.com/apache/arrow-datafusion/tree/main/datafusion-examples/examples)

@yukkit
Copy link
Contributor

yukkit commented Nov 15, 2023

I would like to be able to have a table function that supported everything a TableProvider does, including filter and projection pushdown. One way to do so would be to actually return a TableProvider:

Agreed, that sounds like a feasible approach.

trait TableUDF {
  /// Return a  `TableProvider` for executing this  table function, given the specified
  /// arguments
  fn invoke(name: &str, args: &[Expr]) -> Result<Arc<dyn TableProvider>>;
}

FYI. Should table-valued function inputs allow variables or only constants (like 'literal value' or 'now()')?

Add to SchemaProvider:

We could also add Table Functions to datafusion::catalog::schema::SchemaProvider

This might make sense given how similar TableFunctions are to TableProviders

trait SchemaProvider  { 
...
/// Return a  `TableProvider` for executing the `name` table function, given the specified
/// arguments
fn table_function(name: &str) -> Result<Arc<dyn TableProvider>>;

/// Register the `TableFunction` with the specified name, returning the previously registered function, if any
fn register_table_function(name: &str) -> Result<Option<Arc<dyn TableProvider>>>;
...
}

I prefer this approach because table-valued functions resemble a table more and the processing logic is also more similar.

@alamb
Copy link
Contributor Author

alamb commented Nov 15, 2023

FYI. Should table-valued function inputs allow variables or only constants (like 'literal value' or 'now()')?

It depends on what you mean by 'variables'. I think it would be very reasonable to accept @@var type variables from the VariableProvider

However, I think it would be much harder to accept things like Expr::Column as it is not clear to me from where such input would come? Would it be other tables?

@yukkit
Copy link
Contributor

yukkit commented Nov 16, 2023

However, I think it would be much harder to accept things like Expr::Column as it is not clear to me from where such input would come? Would it be other tables?

Sorry for not being clear earlier, what I meant is exactly how you understood it.
I saw some use cases for correlated subquery in the PostgreSQL documentation

CREATE TABLE foo (fooid int, foosubid int, fooname text);

CREATE FUNCTION getfoo(int) RETURNS setof foo AS '
    SELECT * FROM foo WHERE fooid = $1;
' LANGUAGE SQL;

SELECT * FROM getfoo(1) AS t1;

SELECT * FROM foo
WHERE foosubid in (select foosubid from getfoo(foo.fooid) z
                   where z.fooid = foo.fooid);

CREATE VIEW vw_getfoo AS SELECT * FROM getfoo(1);
SELECT * FROM vw_getfoo;

@alamb
Copy link
Contributor Author

alamb commented Nov 16, 2023

Sorry for not being clear earlier, what I meant is exactly how you understood it.
I saw some use cases for correlated subquery in the PostgreSQL documentation

That is interesting -- I am not sure how we would make this work - maybe sketching out what the plan would look like would help. It seems like it would effectively have to make a TableProvider instance for each matching row 🤔 Our typical trick of rewriting to a join probably would not work in this case)

@Veeupup
Copy link
Contributor

Veeupup commented Nov 21, 2023

I'm going to make a draft PR this week : )

@lewiszlw
Copy link
Member

This is a very useful feature. Really looking forward it can be landed next release.

@Jesse-Bakker
Copy link
Contributor

Jesse-Bakker commented Nov 24, 2023

A valuable specialization for table-valued function is being able to use them as de-aggregating expressions, if the output is single-column. An example of this is, for example, the unnest function in postgres. The first form takes a single argument and returns setof anyelement and can therefore be used in a select list to deaggregate:

# select v, unnest(v) FROM (SELECT ARRAY[1,2,3] as v);
    v    | unnest
---------+--------
 {1,2,3} |      1
 {1,2,3} |      2
 {1,2,3} |      3
(3 rows)

The second form takes multiple arrays as arguments and returns records. Therefore, this form can only be used in table-factor position

select * from unnest(array[1,2], array['a', 'b']);
 unnest | unnest
--------+--------
      1 | a
      2 | b
(2 rows)

@alamb
Copy link
Contributor Author

alamb commented Nov 28, 2023

@Veeupup has a nice PR up that I think looks pretty close to implementing TableFunctions -- #8306 if you are interested please take a look

@alamb
Copy link
Contributor Author

alamb commented Nov 30, 2023

Follow on work is tracked in #8383

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants