
Implement Support for Copy To Logical and Physical plans #7283

Merged: 4 commits into apache:main on Aug 16, 2023

Conversation

@devinjdangelo (Contributor) commented Aug 14, 2023

Which issue does this PR close?

closes #5076
closes #6539
Part of #5654

Rationale for this change

In many cases, we want to be able to export data to file(s) in an ObjectStore without first registering an external table. This is possible with COPY ... TO ... statements. We can leverage the FileSinks created to support inserting to ListingTables for part of the implementation for this.
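
For illustration, a minimal sketch (assuming the datafusion and tokio crates) of issuing such a statement through SessionContext::sql; the SQL mirrors the CLI example later in this thread and the output path is a placeholder:

```rust
use datafusion::error::Result;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Export query results directly to a file, without registering an
    // external table first.
    let df = ctx
        .sql("COPY (VALUES (1), (2)) TO '/tmp/foo.parquet'")
        .await?;
    // The result is a single row reporting the number of rows written.
    df.show().await?;
    Ok(())
}
```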

What changes are included in this PR?

  • Implement a logical plan for Copy To statements
  • Generalize name of InsertExec to FileSinkExec
  • Implement a physical plan for Copy To statements relying on FileSinkExec
  • Expand sqllogictests in copy.slt; add support for an automatically cleared scratch directory in sqllogictests so tests that write files always start fresh
  • Reimplement DataFrame::write_* methods to use Copy To
  • Add support for a per_thread_output setting in FileSinks and Copy To so the user can specify whether they want exactly one output file or whether multiple files are acceptable

Note that this PR does not add support for most statement level settings / overrides yet. That will be important to implement before closing out #5654.

This graphic shows how all of the write-related code is wired up after this PR:

[diagram: datafusion_writes]

Are these changes tested?

Yes, see expanded copy.slt for new tests.

I also have plans to expand insert.slt to improve testing of the recently added INSERT INTO support.

Are there any user-facing changes?

Copy To statements (excluding most statement-level overrides) are now supported.

The DataFrame write_* APIs have some small changes and will need further changes as support for statement-level overrides is added for Copy To.

@github-actions bot added the sql (SQL Planner), logical-expr (Logical plan and expressions), optimizer (Optimizer rules), core (Core DataFusion crate), and sqllogictest (SQL Logic Tests (.slt)) labels on Aug 14, 2023
@alamb (Contributor) commented Aug 15, 2023

Thank you @devinjdangelo -- this looks epic -- I plan to review it tomorrow


#Explain copy queries not currently working
@devinjdangelo (Contributor, Author) commented Aug 15, 2023

I noticed that EXPLAIN <copy statement> currently does not work. When prefixed by EXPLAIN, the subsequent COPY token is parsed as the COPY statement defined in the sqlparser crate. We haven't implemented a logical plan for that, and our expected syntax for COPY is different from sqlparser's, so this leads to various errors.

When the COPY token starts the statement, it is correctly parsed as a DFStatement defined within DataFusion.

devinjdangelo (Contributor, Author):

Here is where the parsing diverges. We might need to add some special case for parsing explain copy, but I'm not sure if there is a better way.

https://github.com/apache/arrow-datafusion/blob/6ad79165f6554a66aa5ed4c5d432401c2c162f69/datafusion/sql/src/parser.rs#L296-L331

Contributor:

I think a special case for explain copy is probably the right thing

Contributor:

BTW I coded up support for EXPLAIN COPY in #7291
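
For illustration, a minimal self-contained sketch of the "special case for EXPLAIN COPY" idea: peek at the token that follows EXPLAIN and route COPY to our own COPY parsing instead of sqlparser's. The toy classifier below uses placeholder types, not DataFusion's actual parser (see #7291 for the real change):

```rust
// Placeholder statement kinds for the sketch; not DataFusion types.
#[derive(Debug)]
enum Stmt {
    ExplainCopy(String),
    ExplainOther(String),
    Copy(String),
    Other(String),
}

fn classify(sql: &str) -> Stmt {
    let trimmed = sql.trim_start();
    let mut words = trimmed.split_whitespace();
    match words.next().map(|w| w.to_ascii_uppercase()).as_deref() {
        Some("EXPLAIN") => {
            // Peek at the token that follows EXPLAIN.
            match words.next().map(|w| w.to_ascii_uppercase()).as_deref() {
                // Special case: route EXPLAIN COPY to our own COPY handling
                // rather than letting sqlparser parse its (different) COPY syntax.
                Some("COPY") => Stmt::ExplainCopy(trimmed.to_string()),
                _ => Stmt::ExplainOther(trimmed.to_string()),
            }
        }
        Some("COPY") => Stmt::Copy(trimmed.to_string()),
        _ => Stmt::Other(trimmed.to_string()),
    }
}

fn main() {
    println!("{:?}", classify("EXPLAIN COPY (VALUES (1)) TO '/tmp/foo.parquet'"));
    println!("{:?}", classify("COPY (VALUES (1), (2)) TO '/tmp/foo.parquet'"));
    println!("{:?}", classify("EXPLAIN SELECT 1"));
}
```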

@alamb (Contributor) left a comment:

🏆 work @devinjdangelo -- it is amazing to see COPY finally working in DataFusion.

I tried this out locally and it worked great 👍 -- thank you so much!

In terms of what to do with this PR, I have a few comments, but nothing that I think is required prior to merge, so I am approving it.

Here are some of the tickets:

  • Add documentation for copy statement
  • Implement additional parameters
  • Explain support

I am happy to file these if you would like. Just let me know.

Example

DataFusion CLI v29.0.0
❯ copy (values (1), (2)) to '/tmp/foo.parquet';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.050 seconds.

❯
\q
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ file /tmp/foo.parquet
/tmp/foo.parquet: Apache Parquet
(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2/datafusion-cli$ datafusion-cli -c "select * from '/tmp/foo.parquet'";
DataFusion CLI v28.0.0
+---------+
| column1 |
+---------+
| 1       |
| 2       |
+---------+
2 rows in set. Query took 0.041 seconds.

This is so cool

copy (select * from '/Users/alamb/Software/clickbench_hits_compatible/hits.parquet' limit 100) to '/tmp/hits.10.parquet';
+-------+
| count |
+-------+
| 100   |
+-------+
1 row in set. Query took 2.809 seconds.

.gitignore Outdated
@@ -104,3 +104,6 @@ datafusion/CHANGELOG.md.bak

# Generated tpch data
datafusion/core/tests/sqllogictests/test_files/tpch/data/*

# Scratch temp dir for sqllogictests
datafusion/core/tests/sqllogictests/test_files/scratch*
Contributor:

this will likely have a logical conflict with #7284, FYI

devinjdangelo (Contributor, Author):

Yes, I rebased locally and then updated these relative paths throughout the PR to fix the pipeline.

};

// TODO: implement statement level overrides for each file type
// E.g. CsvFormat::from_options(options)
Contributor:

👍

@@ -58,10 +59,26 @@ pub async fn main() -> Result<()> {
run_tests().await
}

/// Sets up an empty directory at tests/sqllogictests/test_files/scratch/
Contributor:

In #7312
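
A sketch of what the scratch-directory setup described in the doc comment above could look like (illustrative, not the exact code in this PR): wipe anything a previous run left behind, then recreate the directory empty so tests that write files start fresh.

```rust
use std::fs;
use std::io;
use std::path::Path;

fn setup_scratch_dir(path: &Path) -> io::Result<()> {
    if path.exists() {
        // Remove leftovers from previous test runs.
        fs::remove_dir_all(path)?;
    }
    // Recreate the (now empty) scratch directory.
    fs::create_dir_all(path)
}

fn main() -> io::Result<()> {
    setup_scratch_dir(Path::new("tests/sqllogictests/test_files/scratch"))
}
```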


statement error DataFusion error: This feature is not implemented: `COPY \.\. TO \.\.` statement is not yet supported
COPY source_table to '/tmp/table.parquet' (row_group_size 55);
query IT
COPY source_table to 'tests/sqllogictests/test_files/scratch/table.json' (row_group_size 55);
Contributor:

row_group_size for json is somewhat surprising to me as I expect it is a parquet thing

devinjdangelo (Contributor, Author):

Ah, setting that as JSON with row_group_size is a mistake. It doesn't cause any issue because options are ignored right now.

This raises an interesting question about the desired behavior in this scenario. If the options specify an irrelevant setting (e.g. row_group_size for a JSON output), should DataFusion:

  1. Ignore the irrelevant setting (current behavior)
  2. Ignore the irrelevant setting but emit a warning
  3. Raise an error and refuse to execute the query entirely

Contributor:

I think raising an error and refusing to execute the query entirely is the most sensible thing -- I left a comment on #7298 (comment)
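
Below is a sketch of that option (raise an error and refuse to execute): validate the statement-level options against a per-format whitelist before running the query. The format variants and allowed keys are assumptions for illustration, not DataFusion's actual API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OutputFileFormat {
    Csv,
    Json,
    Parquet,
}

// Hypothetical per-format whitelists of statement-level options.
fn allowed_keys(format: OutputFileFormat) -> &'static [&'static str] {
    match format {
        OutputFileFormat::Parquet => &["row_group_size", "compression"],
        OutputFileFormat::Csv => &["header", "delimiter"],
        OutputFileFormat::Json => &[],
    }
}

fn validate_options(
    format: OutputFileFormat,
    options: &[(String, String)],
) -> Result<(), String> {
    for (key, _value) in options {
        if !allowed_keys(format).contains(&key.as_str()) {
            // Refuse to run the query, surfacing typos and format mismatches
            // such as row_group_size on a JSON output.
            return Err(format!("option '{key}' is not valid for {format:?} output"));
        }
    }
    Ok(())
}

fn main() {
    let opts = vec![("row_group_size".to_string(), "55".to_string())];
    assert!(validate_options(OutputFileFormat::Json, &opts).is_err());
    assert!(validate_options(OutputFileFormat::Parquet, &opts).is_ok());
}
```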

input: LogicalPlan,
output_url: String,
file_format: OutputFileFormat,
per_thread_output: bool,
Contributor:

Can we add some more documentation about what each of the parameters means (specifically per_thread_output)?

Also, do you envision other options here (like overwrite vs append)? If so, maybe it makes sense to make a struct like

CopyOptions {
  output_url: String,
  file_format: OutputFileFormat,
  per_thread_output: bool,
  other: Vec<(String, String)>,
}

Making a config struct like that would not only allow additional options to be easily added without an API change, it would also provide a natural location to document the options and what they mean.

devinjdangelo (Contributor, Author):

There is documentation for these parameters here:
https://github.com/devinjdangelo/arrow-datafusion/blob/27e062a880d8960a9aef4657bc7416d98c9a744c/datafusion/expr/src/logical_plan/dml.rs#L30-L43

I envision most additional options being passed in optionally via the options Vec<>. I think the list supported by DuckDB would be a good starting point.

It might make sense to implement a CopyOptions struct as you show, with a CopyOptions::from(Vec<(String, String)>) method. Then, for the DataFrame::write_* API, the user could construct CopyOptions directly rather than passing a Vec<(String, String)>, which is an awkward interface for use directly from Rust code.

Contributor:

Filed #7322 to track this idea
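
A sketch of the CopyOptions idea discussed above (tracked in #7322). The field set mirrors the parameters in this PR; the from_pairs constructor and its defaults are assumptions for illustration, not DataFusion's actual API.

```rust
#[derive(Debug, Clone)]
pub struct CopyOptions {
    /// Destination URL or path, e.g. "s3://bucket/out/" or "/tmp/foo.parquet"
    pub output_url: String,
    /// Output format name, kept as a plain string here for brevity
    pub file_format: String,
    /// If true, each output partition may write its own file
    pub per_thread_output: bool,
    /// Remaining statement-level overrides, passed through to the file format
    pub other: Vec<(String, String)>,
}

impl CopyOptions {
    /// Build options from raw key/value pairs, pulling out the keys we know
    /// about and keeping the rest for the file format to interpret.
    pub fn from_pairs(
        output_url: String,
        file_format: String,
        pairs: Vec<(String, String)>,
    ) -> Self {
        let mut per_thread_output = false;
        let mut other = Vec::new();
        for (k, v) in pairs {
            if k.eq_ignore_ascii_case("per_thread_output") {
                per_thread_output = v.eq_ignore_ascii_case("true");
            } else {
                other.push((k, v));
            }
        }
        Self { output_url, file_format, per_thread_output, other }
    }
}

fn main() {
    let opts = CopyOptions::from_pairs(
        "/tmp/foo.parquet".to_string(),
        "parquet".to_string(),
        vec![("per_thread_output".to_string(), "true".to_string())],
    );
    println!("{opts:?}");
}
```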

pub options: Vec<(String, String)>,
}

/// The file formats that CopyTo can output
Contributor:

This looks very similar to the existing FileType enum: https://docs.rs/datafusion/latest/datafusion/datasource/file_format/file_type/enum.FileType.html

Perhaps we could move FileType into datafusion_common so it could be used by both the logical plan and datasource?
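
For illustration, a sketch of what a single shared file-type enum could look like if it lived in a common crate; the variant list and the FromStr impl below are assumptions rather than the existing FileType API.

```rust
use std::str::FromStr;

// A shared enum that both the logical plan and the datasource layer could
// depend on, instead of two parallel definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FileType {
    Parquet,
    Csv,
    Json,
    Avro,
}

impl FromStr for FileType {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "parquet" => Ok(FileType::Parquet),
            "csv" => Ok(FileType::Csv),
            "json" => Ok(FileType::Json),
            "avro" => Ok(FileType::Avro),
            other => Err(format!("unknown file type: {other}")),
        }
    }
}

fn main() {
    assert_eq!("parquet".parse::<FileType>(), Ok(FileType::Parquet));
    assert!("orc".parse::<FileType>().is_err());
}
```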

Comment on lines +1102 to +1110
let mut op_str = String::new();
op_str.push('(');
for (key, val) in options {
if !op_str.is_empty() {
op_str.push(',');
}
op_str.push_str(&format!("{key} {val}"));
}
op_str.push(')');
Contributor:

I think you could use join() here to be more concise: https://doc.rust-lang.org/std/primitive.slice.html#method.join

But then you would probably have to collect the (key, val) pairs into a separate Vec, resulting in another copy.

devinjdangelo (Contributor, Author):

Maybe a .map + .join to do this more elegantly? I'll take another look at it.

devinjdangelo (Contributor, Author):

Took a stab at making this more concise in #7294
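
For reference, a standalone sketch of the .map + .join approach mentioned above (see #7294 for the actual change); the function name here is illustrative.

```rust
// Build the "(key1 val1,key2 val2)" option string without the manual loop.
fn format_options(options: &[(String, String)]) -> String {
    let body = options
        .iter()
        .map(|(key, val)| format!("{key} {val}"))
        .collect::<Vec<_>>()
        .join(",");
    format!("({body})")
}

fn main() {
    let opts = vec![
        ("row_group_size".to_string(), "55".to_string()),
        ("compression".to_string(), "zstd".to_string()),
    ];
    assert_eq!(format_options(&opts), "(row_group_size 55,compression zstd)");
}
```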

@alamb (Contributor) commented Aug 15, 2023

I took the liberty of merging up from main to resolve conflicts on this PR

@alamb (Contributor) commented Aug 16, 2023

Thanks @devinjdangelo -- I am merging this PR in and will file follow on PRs / tickets for the remaining items we have identified. Thanks again 🙏

@alamb merged commit 7d77448 into apache:main on Aug 16, 2023
21 checks passed
@timrobertson100:
Thanks @devinjdangelo !
