feat: Added DataFrameWriteOptions option when writing as csv, json, p… #857

allinux · 2024-09-06T00:14:15Z

…arquet.

Which issue does this PR close?

N/A

Rationale for this change

Adds DataFrameWriteOptions support to the write_csv, write_json, and write_parquet functions.

Are there any user-facing changes?

No

timsaucer

This is a very nice addition! Thank you!

timsaucer · 2024-09-06T12:29:34Z

python/datafusion/dataframe.py

+        write_options_overwrite: bool = False,
+        write_options_single_file_output: bool = False,
+        write_options_partition_by: List = [],


I think it's okay to remove the write_options_ prefixes here.

with_header: bool = False, overwrite: bool = False, single_file_output: bool = False,

Also, for partition_by: I took a very quick look at the code and it looks like it takes a list of strings, which I think would surprise our users because all other uses of partition_by take a list of expressions. So I think we want to add a little documentation on how to use it.

My understanding is that it's bad form in Python to use [] as a default argument, but I'm no expert. I bet we could change the type hint to partition_by: Optional[List[str]] = None and make the appropriate change to the call in the lines below.
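
For context on the `[]` default: Python evaluates default values once, when the function is defined, so a list default is shared by every call that omits the argument. Below is a minimal sketch of the pitfall and of the `Optional[List[str]] = None` pattern suggested above. The function names are hypothetical and are not part of datafusion-python.

```python
from typing import List, Optional


def collect_shared(column: str, partition_by: List[str] = []) -> List[str]:
    # The default list is created once at definition time, so this append
    # mutates the same object on every call that omits partition_by.
    partition_by.append(column)
    return partition_by


def collect_safe(column: str, partition_by: Optional[List[str]] = None) -> List[str]:
    # None is a sentinel; a fresh list is built inside each call instead.
    columns = [] if partition_by is None else list(partition_by)
    columns.append(column)
    return columns


first = collect_shared("year")
second = collect_shared("month")
print(second)  # ['year', 'month']: state leaked from the first call
print(collect_safe("year"), collect_safe("month"))  # ['year'] ['month']
```

In the wrapper itself the same idea would presumably amount to forwarding `partition_by or []` to the underlying call, so the Rust side still receives a list.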

Edit has been completed. Thank you for the review.

timsaucer · 2024-09-06T12:30:13Z

python/datafusion/dataframe.py

+        write_options_overwrite: bool = False,
+        write_options_single_file_output: bool = False,
+        write_options_partition_by: List = [],


Same recommendation on parameter names and partition_by as above

Edit has been completed. Thank you for the review.

timsaucer · 2024-09-06T12:30:47Z

python/datafusion/dataframe.py

        """Execute the :py:class:`DataFrame`  and write the results to a CSV file.

        Args:
            path: Path of the CSV file to write.
            with_header: If true, output the CSV header row.
+            write_options_overwrite: Controls if existing data should be overwritten
+            write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
+            write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.


"empty vec![]" mixes Rust and Python terminology

The comment is a copy of the Rust comment. The lines containing vec![] have been removed.
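
For readers unfamiliar with the term used in the docstring, hive-style partitioning encodes each partition column as a `name=value` directory level. The following is a rough sketch in plain Python of how rows would map to directories when partitioning by column name. `partition_dir` is a hypothetical helper for illustration only, not a datafusion function, and the file names inside each directory are left to the writer.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def partition_dir(base: str, row: Dict[str, object], partition_by: List[str]) -> str:
    # Build one "column=value" directory level per partition column,
    # in the order the column names are given.
    path = PurePosixPath(base)
    for name in partition_by:
        path = path / f"{name}={row[name]}"
    return str(path)


rows = [
    {"year": 2023, "country": "KR", "amount": 10},
    {"year": 2024, "country": "US", "amount": 20},
]
dirs = [partition_dir("out", row, ["year", "country"]) for row in rows]
print(dirs)  # ['out/year=2023/country=KR', 'out/year=2024/country=US']
```

With an empty list of partition columns the helper returns the base directory unchanged, which corresponds to a non-partitioned write.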

timsaucer · 2024-09-06T12:31:12Z

python/datafusion/dataframe.py

    ) -> None:
        """Execute the :py:class:`DataFrame` and write the results to a Parquet file.

        Args:
            path: Path of the Parquet file to write.
            compression: Compression type to use.
            compression_level: Compression level to use.
+            write_options_overwrite: Controls if existing data should be overwritten
+            write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
+            write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.


vec![] is Rust, not Python

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.

timsaucer · 2024-09-06T12:31:33Z

python/datafusion/dataframe.py

+        write_options_overwrite: bool = False,
+        write_options_single_file_output: bool = False,
+        write_options_partition_by: List = [],
+    ) -> None:


Same comment as above on naming and partition_by

timsaucer · 2024-09-06T12:32:46Z

src/dataframe.rs

+    #[pyo3(signature = (
+        path,
+        with_header=false,
+
+        write_options_overwrite=false,
+        write_options_single_file_output=false,
+        write_options_partition_by=vec![],
+    ))]


Since we're setting all the type hints in the wrappers, you don't have to include this here. It's up to you, but it can lead to duplicated effort and long-term maintainability problems.

As a note to myself, we need to document our best practice for this in the developer documentation (and also decide as a group whether we want these signatures in the Rust code at all).

timsaucer · 2024-09-06T12:34:36Z

src/dataframe.rs

        ))]
    fn write_parquet(
        &self,
        path: &str,
        compression: &str,
        compression_level: Option<u32>,
+


formatting: extra blank line

Edit has been completed. Thank you for the review.

timsaucer · 2024-09-06T12:35:10Z

src/dataframe.rs

+    fn write_json(
+        &self, 
+        path: &str, 
+


formatting: extra blank line

Edit has been completed. Thank you for the review.

timsaucer · 2024-09-07T12:28:59Z

python/datafusion/dataframe.py

+            write_options_overwrite: Controls if existing data should be overwritten
+            write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
+            write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name.
        """
-        self.df.write_csv(str(path), with_header)
+        self.df.write_csv(str(path), with_header, write_options_overwrite, write_options_single_file_output, write_options_partition_by)


Since we've updated the argument names we need to update the documentation and the function call. We should add in unit tests so we can catch these errors in CI.
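
As a rough illustration of the kind of test that would catch this, the sketch below checks that a wrapper forwards the renamed arguments to the underlying object. Everything here is an assumption made for illustration: `FakeInnerFrame` and `FrameWrapper` are hypothetical stand-ins, not the project's classes, and a real test in the repository would more likely write to a temporary directory and read the files back.

```python
from typing import List, Optional


class FakeInnerFrame:
    """Stand-in for the Rust-backed object; it only records what it receives."""

    def __init__(self) -> None:
        self.calls = []

    def write_csv(self, path, with_header, overwrite, single_file_output, partition_by):
        self.calls.append((path, with_header, overwrite, single_file_output, partition_by))


class FrameWrapper:
    """Hypothetical Python wrapper using the shortened parameter names."""

    def __init__(self, inner) -> None:
        self.df = inner

    def write_csv(
        self,
        path,
        with_header: bool = False,
        overwrite: bool = False,
        single_file_output: bool = False,
        partition_by: Optional[List[str]] = None,
    ) -> None:
        # A stale name here (e.g. write_options_overwrite) would raise
        # NameError on the first call, which is what the test catches.
        self.df.write_csv(
            str(path), with_header, overwrite, single_file_output, partition_by or []
        )


def test_write_csv_forwards_options() -> None:
    inner = FakeInnerFrame()
    FrameWrapper(inner).write_csv("out.csv", with_header=True, partition_by=["year"])
    assert inner.calls == [("out.csv", True, False, False, ["year"])]


test_write_csv_forwards_options()
```

Even a test this small exercises the forwarding call once, so a mismatch between the signature and the call body fails in CI instead of at a user's first write.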

timsaucer · 2024-09-21T14:35:03Z

I was hoping we could get this into DF42. Would you be willing to add unit tests?

timsaucer requested changes Sep 6, 2024

allinux force-pushed the main branch from 7b47717 to 610b11d Compare September 7, 2024 10:28

timsaucer reviewed Sep 7, 2024

allinux force-pushed the main branch 2 times, most recently from 1278955 to ff45187 Compare September 7, 2024 15:11

allinux closed this Oct 10, 2024

allinux force-pushed the main branch from ff45187 to cdec202 Compare October 10, 2024 03:00
