docs: update udf docs for udtf #8546

Merged on Dec 15, 2023 (5 commits)

@tshauck tshauck commented Dec 14, 2023

Which issue does this PR close?

Closes #8545

Rationale for this change

Updating the docs to match the exciting UDTF addition.

What changes are included in this PR?

Updates the UDF library doc and makes a minor style update to the example.

Are these changes tested?

[two images]

Are there any user-facing changes?

Yes, public docs update.

@alamb alamb left a comment

Thank you @tshauck -- this is great. I think it might help to add a little more motivation about why UDTFs are so cool and what types of things you can do with them, but we can also do that as a follow-on PR. This PR is a great step forward.

🚀


A User-Defined Table Function (UDTF) is a function that takes parameters and returns a `TableProvider`.

Because we're returning a `TableProvider`, in this example we'll use the `MemTable` data source to represent a table. This is a simple struct that holds a set of RecordBatches in memory and treats them as a table. In your case, this would be replaced with your own struct that implements `TableProvider`. See the [example][4] for a working example that reads from a CSV file.
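
To make the "parameters in, table out" shape concrete, here is a minimal sketch that uses only the Rust standard library. It is not DataFusion's API: `Batch`, `Table`, `InMemoryTable`, `TableFunction`, and `RangeFunc` are hypothetical stand-ins for record batches, `TableProvider`, `MemTable`, and a table function, and the integer-only arguments are a simplifying assumption.

```rust
/// Stand-in for a record batch: one Vec<i64> per column, all of equal length.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<Vec<i64>>,
}

/// Hypothetical, much-simplified analogue of `TableProvider`.
pub trait Table {
    fn column_names(&self) -> Vec<String>;
    fn scan(&self) -> Vec<Batch>;
}

/// Analogue of `MemTable`: keeps batches in memory and serves them as a table.
pub struct InMemoryTable {
    names: Vec<String>,
    batches: Vec<Batch>,
}

impl Table for InMemoryTable {
    fn column_names(&self) -> Vec<String> {
        self.names.clone()
    }

    fn scan(&self) -> Vec<Batch> {
        self.batches.clone()
    }
}

/// Analogue of a table function: takes parameters and returns a table.
pub trait TableFunction {
    fn call(&self, args: &[i64]) -> Result<Box<dyn Table>, String>;
}

/// Hypothetical UDTF `range(n)`: a single column "value" holding 0..n.
pub struct RangeFunc;

impl TableFunction for RangeFunc {
    fn call(&self, args: &[i64]) -> Result<Box<dyn Table>, String> {
        // Validate the arguments before building the table.
        let n = match args {
            [n] if *n >= 0 => *n,
            _ => return Err("range expects one non-negative integer".to_string()),
        };
        let batch = Batch {
            columns: vec![(0..n).collect()],
        };
        Ok(Box::new(InMemoryTable {
            names: vec!["value".to_string()],
            batches: vec![batch],
        }))
    }
}
```

The caller only ever sees a `Box<dyn Table>`, so the result can be scanned like any other table, and `InMemoryTable` can be swapped for any other type that implements the trait.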

Maybe we can add some other examples of things one could do, for example:

parse_url('http://foo.com')

Or point at the parquet_metadata function in datafusion-cli and note that the output of the table function can be processed like the output of any other table.

For example:

❯ select filename, row_group_id, row_group_num_rows, row_group_bytes, stats_min, stats_max from parquet_metadata('./benchmarks/data/hits.parquet') where  column_id = 17 limit 10;
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
| filename                       | row_group_id | row_group_num_rows | row_group_bytes | stats_min | stats_max |
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+
| ./benchmarks/data/hits.parquet | 0            | 450560             | 188921521       | 0         | 73256     |
| ./benchmarks/data/hits.parquet | 1            | 612174             | 210338885       | 0         | 109827    |
| ./benchmarks/data/hits.parquet | 2            | 344064             | 161242466       | 0         | 122484    |
| ./benchmarks/data/hits.parquet | 3            | 606208             | 235549898       | 0         | 121073    |
| ./benchmarks/data/hits.parquet | 4            | 335872             | 137103898       | 0         | 108996    |
| ./benchmarks/data/hits.parquet | 5            | 311296             | 145453612       | 0         | 108996    |
| ./benchmarks/data/hits.parquet | 6            | 303104             | 138833963       | 0         | 108996    |
| ./benchmarks/data/hits.parquet | 7            | 303104             | 191140113       | 0         | 73256     |
| ./benchmarks/data/hits.parquet | 8            | 573440             | 208038598       | 0         | 95823     |
| ./benchmarks/data/hits.parquet | 9            | 344064             | 147838157       | 0         | 73256     |
+--------------------------------+--------------+--------------------+-----------------+-----------+-----------+

Contributor Author

Thanks for the feedback! I just pushed a090783 which expands a bit on why they're nice and adds the parquet metadata use-case since it shows why they're nice for interactive analysis.

@alamb added the documentation and devrel labels on Dec 15, 2023
@github-actions bot removed the documentation label on Dec 15, 2023
@alamb merged commit b7fde3c into apache:main on Dec 15, 2023
23 checks passed

alamb commented Dec 15, 2023

Thanks again @tshauck

Linked issue: Update UDF Library Docs with UDTFs