
feat: Added DataFrameWriteOptions option when writing as csv, json, parquet #857

Closed
wants to merge 0 commits into from

Conversation

@allinux commented Sep 6, 2024

Which issue does this PR close?

N/A

Rationale for this change

Added DataFrameWriteOptions support to the write_csv, write_json, and write_parquet functions.

Are there any user-facing changes?

No

@timsaucer (Contributor) left a comment

This is a very nice addition! Thank you!

Comment on lines 416 to 418
write_options_overwrite: bool = False,
write_options_single_file_output: bool = False,
write_options_partition_by: List = [],
@timsaucer (Contributor) commented Sep 6, 2024

I think it's okay to remove the write_options_ prefixes here.

        with_header: bool = False,
        overwrite: bool = False,
        single_file_output: bool = False,

Also, for the partition by: I took a very quick look at the code, and it looks like partition_by takes a list of strings, which I think would surprise our users because all other uses of partition_by take a list of expressions. So I think we want to add a tiny bit of documentation about how to use that.

My understanding is that it's bad form in Python to pass [] as a default, but I'm no expert. I bet we could change the type hint to partition_by: Optional[List[str]] = None and make the appropriate change to the call in the lines below.
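The mutable-default pitfall the reviewer mentions is easy to demonstrate. A small standalone sketch (the function names here are hypothetical, not part of the PR):

```python
from typing import List, Optional

def append_bad(item: str, acc: List[str] = []) -> List[str]:
    # The [] default is created once, at function definition time,
    # so every call that omits `acc` shares the same list object.
    acc.append(item)
    return acc

def append_good(item: str, acc: Optional[List[str]] = None) -> List[str]:
    # Defaulting to None and creating a fresh list per call avoids that.
    if acc is None:
        acc = []
    acc.append(item)
    return acc

print(append_bad("a"), append_bad("b"))    # shared default has grown: ['a', 'b'] ['a', 'b']
print(append_good("a"), append_good("b"))  # fresh list each call: ['a'] ['b']
```

This is why the `Optional[List[str]] = None` pattern is the conventional fix for list-valued defaults like partition_by.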

@allinux (Author) replied

Edit has been completed. Thank you for the review.

Comment on lines 436 to 438
write_options_overwrite: bool = False,
write_options_single_file_output: bool = False,
write_options_partition_by: List = [],
@timsaucer (Contributor) commented

Same recommendation on parameter names and partition_by as above

@allinux (Author) replied

Edit has been completed. Thank you for the review.

"""Execute the :py:class:`DataFrame` and write the results to a CSV file.

Args:
path: Path of the CSV file to write.
with_header: If true, output the CSV header row.
write_options_overwrite: Controls if existing data should be overwritten
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.
@timsaucer (Contributor) commented

"empty vec![]" is mixing Rust and Python terminology.

@allinux (Author) replied

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.

) -> None:
"""Execute the :py:class:`DataFrame` and write the results to a Parquet file.

Args:
path: Path of the Parquet file to write.
compression: Compression type to use.
compression_level: Compression level to use.
write_options_overwrite: Controls if existing data should be overwritten
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes.
@timsaucer (Contributor) commented

vec![] is Rust, not Python.

@allinux (Author) replied

The comment is a copy of Rust's comment. Lines containing vec![] have been removed.

Comment on lines 455 to 458
write_options_overwrite: bool = False,
write_options_single_file_output: bool = False,
write_options_partition_by: List = [],
) -> None:
@timsaucer (Contributor) commented

Same comment as above on naming and partition_by

python/datafusion/dataframe.py (Outdated; resolved)
src/dataframe.rs Outdated
Comment on lines 405 to 411
#[pyo3(signature = (
path,
with_header=false,

write_options_overwrite=false,
write_options_single_file_output=false,
write_options_partition_by=vec![],
))]
@timsaucer (Contributor) commented

Since we're setting all the type hints in the wrappers, you don't have to include this here. It's up to you, but it can lead to duplicated effort and long-term maintainability issues.
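One way to avoid duplicating defaults is to keep them only in the Python wrapper and pass fully resolved values down to the binding. A minimal sketch under that assumption; the inner object below is a stand-in, not the actual datafusion-python code:

```python
from typing import Any, List, Optional

class DataFrame:
    """Thin wrapper: defaults live here, not in the pyo3 signature."""

    def __init__(self, df: Any) -> None:
        self.df = df  # the Rust-backed object (a stand-in in this sketch)

    def write_csv(
        self,
        path: str,
        with_header: bool = False,
        overwrite: bool = False,
        single_file_output: bool = False,
        partition_by: Optional[List[str]] = None,
    ) -> None:
        # Resolve the Optional default before crossing into Rust,
        # so the binding can simply require every argument.
        cols = partition_by if partition_by is not None else []
        self.df.write_csv(str(path), with_header, overwrite,
                          single_file_output, cols)
```

With this split, the wrapper always supplies every argument, so the Rust side needs no default values of its own.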

@timsaucer (Contributor) commented

As a note to myself: we need to include our best practice in our developer documentation (and also decide as a group whether we want these signatures in the Rust code at all).

src/dataframe.rs Outdated
))]
fn write_parquet(
&self,
path: &str,
compression: &str,
compression_level: Option<u32>,

@timsaucer (Contributor) commented

formatting: extra blank line

@allinux (Author) replied

Edit has been completed. Thank you for the review.

src/dataframe.rs Outdated
fn write_json(
&self,
path: &str,

@timsaucer (Contributor) commented

formatting: extra blank line

@allinux (Author) replied

Edit has been completed. Thank you for the review.

Comment on lines 425 to 429
write_options_overwrite: Controls if existing data should be overwritten
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true.
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name.
"""
self.df.write_csv(str(path), with_header)
self.df.write_csv(str(path), with_header, write_options_overwrite, write_options_single_file_output, write_options_partition_by)
@timsaucer (Contributor) commented

Since we've updated the argument names, we need to update the documentation and the function call. We should add unit tests so we can catch these errors in CI.
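The kind of round-trip unit test being asked for could look something like the sketch below. It uses a small stand-in writer built on the csv module rather than the real DataFrame.write_csv, since the point here is the test shape: write, read back, compare, and check the overwrite flag.

```python
import csv
import os
import tempfile
from typing import Dict, List, Optional

def write_csv(rows: List[Dict[str, str]], path: str, with_header: bool = False,
              overwrite: bool = False,
              partition_by: Optional[List[str]] = None) -> None:
    # Stand-in for the write under test; honors the overwrite flag.
    if os.path.exists(path) and not overwrite:
        raise FileExistsError(path)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        if with_header:
            writer.writeheader()
        writer.writerows(rows)

def test_write_csv_roundtrip() -> None:
    rows = [{"a": "1", "b": "x"}, {"a": "2", "b": "y"}]
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "out.csv")
        write_csv(rows, path, with_header=True)
        with open(path, newline="") as f:
            assert list(csv.DictReader(f)) == rows
        # Writing again without overwrite=True should fail.
        try:
            write_csv(rows, path)
            raise AssertionError("expected FileExistsError")
        except FileExistsError:
            pass

test_write_csv_roundtrip()
```

A real test for the PR would build a small DataFrame, call write_csv with each new option, and read the output back, which is exactly the error class CI would then catch.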

@timsaucer (Contributor) commented

I was hoping we could get this into DF42. Would you be willing to add unit tests?
