docs: add example for custom file format with COPY TO #11174

Merged
merged 10 commits into from
Jul 2, 2024

Conversation

tshauck
Contributor

@tshauck tshauck commented Jun 29, 2024

Which issue does this PR close?

Closes #11079

Rationale for this change

Adds an example of how to COPY a table TO a custom file format.

What changes are included in this PR?

Created a mock file format factory and file format that simply wrap the CSV ones, configured to write TSV.
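
To illustrate the wrapping idea, here is a minimal standard-library sketch. It does not use DataFusion's API: `SimpleFormat`, `DelimitedFormat`, `TsvFormat`, and `render` are hypothetical names invented for this illustration, and the real example in `custom_file_format.rs` implements DataFusion's own traits instead. The point is only the shape of the approach: the custom format holds an existing format configured with a tab delimiter, overrides what differs (here, the extension), and delegates the rest.

```rust
use std::io::{self, Write};

/// Hypothetical, simplified stand-in for a file format.
/// This is NOT DataFusion's `FileFormat` trait.
trait SimpleFormat {
    /// File extension associated with the format.
    fn extension(&self) -> &str;
    /// Write every row as one line to `out`.
    fn write_rows(&self, rows: &[Vec<String>], out: &mut dyn Write) -> io::Result<()>;
}

/// Delimiter-separated writer, standing in for an existing CSV format.
/// Assumption: no quoting or escaping, to keep the sketch short.
struct DelimitedFormat {
    delimiter: char,
}

impl SimpleFormat for DelimitedFormat {
    fn extension(&self) -> &str {
        "csv"
    }

    fn write_rows(&self, rows: &[Vec<String>], out: &mut dyn Write) -> io::Result<()> {
        let sep = self.delimiter.to_string();
        for row in rows {
            writeln!(out, "{}", row.join(sep.as_str()))?;
        }
        Ok(())
    }
}

/// The "custom" format: wraps the delimited format with a tab delimiter.
struct TsvFormat {
    inner: DelimitedFormat,
}

impl TsvFormat {
    fn new() -> Self {
        Self {
            inner: DelimitedFormat { delimiter: '\t' },
        }
    }
}

impl SimpleFormat for TsvFormat {
    // Only the extension differs from the wrapped format.
    fn extension(&self) -> &str {
        "tsv"
    }

    // Writing is delegated unchanged to the wrapped format.
    fn write_rows(&self, rows: &[Vec<String>], out: &mut dyn Write) -> io::Result<()> {
        self.inner.write_rows(rows, out)
    }
}

/// Render rows into a `String` using any format (hypothetical helper).
fn render(format: &dyn SimpleFormat, rows: &[Vec<String>]) -> io::Result<String> {
    let mut buf: Vec<u8> = Vec::new();
    format.write_rows(rows, &mut buf)?;
    Ok(String::from_utf8(buf).expect("writer only emits valid UTF-8"))
}
```

Calling `render(&TsvFormat::new(), &rows)` yields tab-separated lines, while the same rows passed through `DelimitedFormat { delimiter: ',' }` yield comma-separated ones. Delegation keeps the custom type small, which is why it works well for an example.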

Are these changes tested?

yes, I've run the example

Are there any user-facing changes?

no

@tshauck tshauck changed the title feat: add example for copy to feat: add example for custom file format with COPY TO Jun 29, 2024
@tshauck tshauck changed the title feat: add example for custom file format with COPY TO docs: add example for custom file format with COPY TO Jun 29, 2024
@tshauck
Contributor Author

tshauck commented Jun 29, 2024

This seems to be failing in CI because the worker runs out of disk space. It may be fixed by a change to the build script.


@@ -29,5 +29,6 @@ do
   # Skip tests that rely on external storage and flight
   if [ ! -d $filename ]; then
     cargo run --example $example_name
+    cargo clean -p datafusion-examples
Contributor Author

@tshauck tshauck Jun 30, 2024

I ran into consistent build issues where the runner would run out of disk space. This is the best solution I could come up with. It seems to add about 5-10 seconds to the example action, which takes about 12 minutes overall.

Certainly open to alternatives.

Contributor

I am hitting similar issues when trying to add new examples in #11089

This seems like a good idea to me

My best hope is to move some of the example binaries into inline examples in the docs instead: #11178

Hopefully that will free up some additional space as well as make the examples easier to navigate

Contributor Author

@tshauck tshauck Jul 2, 2024

er, sorry this came up. There are a few other tricks here (actions/runner-images#2840 (comment)) to clear up some disk space that may help also.

Contributor

No worries! I think this is a good excuse / reason to take another pass through the examples directory (and the library guide that you started however long ago).

@tshauck tshauck marked this pull request as ready for review June 30, 2024 20:01
Contributor

@devinjdangelo devinjdangelo left a comment

Thank you @tshauck for putting this example together! I think that wrapping an existing FileFormat was a clever way to demonstrate this without requiring tons of boilerplate code for a working example. It ran locally for me with no issues.

I left just one small suggestion.

datafusion-examples/examples/custom_file_format.rs (outdated, resolved)
Contributor

@alamb alamb left a comment

Thank you so much @tshauck -- I also think this looks really nice.

Thank you for the review @devinjdangelo

I agree the idea of wrapping an existing format to show the API is clever. I was thinking it would be awesome to somehow cook up a simple-to-implement custom format that would handle things entirely end to end, but I can't think of any format that is suitably simple.

@tshauck
Contributor Author

tshauck commented Jul 1, 2024

Yeah, I started down the road of something custom, but it turned out to be a non-trivial amount of code.

Perhaps adding a link to something like https://github.com/datafusion-contrib/datafusion-orc would show a more complex example, though it doesn't implement create_writer_physical_plan yet.

@alamb alamb merged commit ab8761d into apache:main Jul 2, 2024
23 checks passed
@alamb
Contributor

alamb commented Jul 2, 2024

Thanks again @tshauck and @devinjdangelo

@tshauck tshauck deleted the add-example-for-copy-to branch July 2, 2024 12:40
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: add example for copy to

* better docs plus tempdir

* build: clean examples if over 10GB

* only 1GB

* build: try clearing some disk space before running

* build: remove sudo

* build: try clean

* build: run clean

* build: only clean examples

* docs: better output for example
This pull request was closed.
Development

Successfully merging this pull request may close these issues.

Add example for writing a FileFormat
3 participants