docs: update udf docs for udtf #8546
Thank you @tshauck -- this is great. I think it might help to add a little more motivation about why UDTFs are so cool and what types of things you can do with them, but we can also do that as a follow-on PR. This PR is a great step forward
🚀
A User-Defined Table Function (UDTF) is a function that takes parameters and returns a `TableProvider`.
Because we're returning a `TableProvider`, in this example we'll use the `MemTable` data source to represent a table. This is a simple struct that holds a set of `RecordBatch`es in memory and treats them as a table. In your case, this would be replaced with your own struct that implements `TableProvider`. See the [example][4] for a working version that reads from a CSV file.
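To make the shape of "arguments in, table out" concrete, here is a minimal std-only sketch. It is not DataFusion's actual API: the `Arg` enum, the simplified `TableProvider` and `TableFunction` traits, this toy `MemTable`, and the `RangeTable` function are all hypothetical stand-ins that only mirror the idea described above.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an argument passed to a table function.
#[derive(Debug, Clone, PartialEq)]
pub enum Arg {
    Int(i64),
    Text(String),
}

/// Simplified stand-in for a table: column names plus rows of integers.
/// (An assumption for illustration; the real trait is far richer.)
pub trait TableProvider {
    fn schema(&self) -> Vec<String>;
    fn scan(&self) -> Vec<Vec<i64>>;
}

/// Toy in-memory table, loosely analogous to the `MemTable` idea:
/// it holds its rows in memory and exposes them as a table.
pub struct MemTable {
    columns: Vec<String>,
    rows: Vec<Vec<i64>>,
}

impl TableProvider for MemTable {
    fn schema(&self) -> Vec<String> {
        self.columns.clone()
    }

    fn scan(&self) -> Vec<Vec<i64>> {
        self.rows.clone()
    }
}

/// A table function takes arguments and returns a table provider.
pub trait TableFunction {
    fn call(&self, args: &[Arg]) -> Result<Arc<dyn TableProvider>, String>;
}

/// Hypothetical UDTF: `range_table(n)` yields the rows 0..n in a single
/// column named `value`.
pub struct RangeTable;

impl TableFunction for RangeTable {
    fn call(&self, args: &[Arg]) -> Result<Arc<dyn TableProvider>, String> {
        // Validate the arguments before building the table.
        let n = match args {
            [Arg::Int(n)] if *n >= 0 => *n,
            _ => return Err("range_table expects one non-negative integer".to_string()),
        };
        let rows = (0..n).map(|v| vec![v]).collect();
        Ok(Arc::new(MemTable {
            columns: vec!["value".to_string()],
            rows,
        }))
    }
}
```

In this sketch, `RangeTable.call(&[Arg::Int(3)])` returns a provider whose `scan()` yields three single-value rows, while a non-integer argument returns an error. Replacing the toy `MemTable` with another type that implements the trait corresponds to swapping in your own `TableProvider`.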
Maybe we can add some other examples of things one could do, for example `parse_url('http://foo.com')`.
Or point at the `parquet_metadata` function in datafusion-cli and note that the output of the table function can be processed like the output of any other table.
For example:
❯ select filename, row_group_id, row_group_num_rows, row_group_bytes, stats_min, stats_max from parquet_metadata('./benchmarks/data/hits.parquet') where column_id = 17 limit 10;
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
| filename | row_group_id | row_group_num_rows | row_group_bytes | stats_min | stats_max |
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
| ./benchmarks/data/hits.parquet | 0 | 450560 | 188921521 | 0 | 73256 |
| ./benchmarks/data/hits.parquet | 1 | 612174 | 210338885 | 0 | 109827 |
| ./benchmarks/data/hits.parquet | 2 | 344064 | 161242466 | 0 | 122484 |
| ./benchmarks/data/hits.parquet | 3 | 606208 | 235549898 | 0 | 121073 |
| ./benchmarks/data/hits.parquet | 4 | 335872 | 137103898 | 0 | 108996 |
| ./benchmarks/data/hits.parquet | 5 | 311296 | 145453612 | 0 | 108996 |
| ./benchmarks/data/hits.parquet | 6 | 303104 | 138833963 | 0 | 108996 |
| ./benchmarks/data/hits.parquet | 7 | 303104 | 191140113 | 0 | 73256 |
| ./benchmarks/data/hits.parquet | 8 | 573440 | 208038598 | 0 | 95823 |
| ./benchmarks/data/hits.parquet | 9 | 344064 | 147838157 | 0 | 73256 |
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
Thanks for the feedback! I just pushed a090783 which expands a bit on why they're nice and adds the parquet metadata use-case since it shows why they're nice for interactive analysis.
Thanks again @tshauck
Which issue does this PR close?
Closes #8545
Rationale for this change
Updating the docs to match the exciting UDTF addition.
What changes are included in this PR?
Updates the UDF library doc and makes a minor style update to the example.
Are these changes tested?
Are there any user-facing changes?
Yes, public docs update.