-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added DataFrameWriteOptions option when writing as csv, json, p… #857
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very nice addition! Thank you!
python/datafusion/dataframe.py
Outdated
write_options_overwrite: bool = False, | ||
write_options_single_file_output: bool = False, | ||
write_options_partition_by: List = [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's okay to remove the write_options_
prefixes here.
with_header: bool = False,
overwrite: bool = False,
single_file_output: bool = False,
Also for the partition by, I took a very quick look at the code and it looks like partition_by
takes a list of strings, which I think our users would be surprised because all other uses of partition_by
takes a list of expressions. So I think we want to add to the documentation a tiny bit about how to use that.
My understanding is that it's bad form in python to pass in a []
as a default, but I'm no expert. I bet we could change the type hint to partition_by: Optional[List[str]] = None
and make the appropriate change on the call in the lines below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit has been completed. Thank you for the review.
python/datafusion/dataframe.py
Outdated
write_options_overwrite: bool = False, | ||
write_options_single_file_output: bool = False, | ||
write_options_partition_by: List = [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same recommendation on parameter names and partition_by as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit has been completed. Thank you for the review.
python/datafusion/dataframe.py
Outdated
"""Execute the :py:class:`DataFrame` and write the results to a CSV file. | ||
|
||
Args: | ||
path: Path of the CSV file to write. | ||
with_header: If true, output the CSV header row. | ||
write_options_overwrite: Controls if existing data should be overwritten | ||
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true. | ||
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty vec![]
is mixing rust and python terminology
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment는 rust 의 comment 를 복사한 것 입니다. vec![] 이 포함한 라인은 제거했습니다.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is a copy of Rust's comment. Lines containing vec![] have been removed.
python/datafusion/dataframe.py
Outdated
) -> None: | ||
"""Execute the :py:class:`DataFrame` and write the results to a Parquet file. | ||
|
||
Args: | ||
path: Path of the Parquet file to write. | ||
compression: Compression type to use. | ||
compression_level: Compression level to use. | ||
write_options_overwrite: Controls if existing data should be overwritten | ||
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true. | ||
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. Can be set to empty vec![] for non-partitioned writes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vec![]
is rust not python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is a copy of Rust's comment. Lines containing vec![] have been removed.
python/datafusion/dataframe.py
Outdated
write_options_overwrite: bool = False, | ||
write_options_single_file_output: bool = False, | ||
write_options_partition_by: List = [], | ||
) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above on naming and partition_by
src/dataframe.rs
Outdated
#[pyo3(signature = ( | ||
path, | ||
with_header=false, | ||
|
||
write_options_overwrite=false, | ||
write_options_single_file_output=false, | ||
write_options_partition_by=vec![], | ||
))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're setting all the type hints in the wrappers, you don't have to include this here. It's up to you but can lead to duplicate effort and long term maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a note to myself, we need to include in our developer's documentation our best practice (and also decide as a group if we want these signatures in the rust code at all)
src/dataframe.rs
Outdated
))] | ||
fn write_parquet( | ||
&self, | ||
path: &str, | ||
compression: &str, | ||
compression_level: Option<u32>, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting: extra blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit has been completed. Thank you for the review.
src/dataframe.rs
Outdated
fn write_json( | ||
&self, | ||
path: &str, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
formatting: extra blank line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit has been completed. Thank you for the review.
python/datafusion/dataframe.py
Outdated
write_options_overwrite: Controls if existing data should be overwritten | ||
write_options_single_file_output: Controls if all partitions should be coalesced into a single output file. Generally will have slower performance when set to true. | ||
write_options_partition_by: Sets which columns should be used for hive-style partitioned writes by name. | ||
""" | ||
self.df.write_csv(str(path), with_header) | ||
self.df.write_csv(str(path), with_header, write_options_overwrite, write_options_single_file_output, write_options_partition_by) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we've updated the argument names we need to update the documentation and the function call. We should add in unit tests so we can catch these errors in CI.
1278955
to
ff45187
Compare
I was hoping we could add this in to DF42. Would you be willing to add unit tests? |
…arquet.
Which issue does this PR close?
N/A
Rationale for this change
Added DataFrameWriteOptions when using write_csv, write_json, write_parquet functions.
Are there any user-facing changes?
No