-
Notifications
You must be signed in to change notification settings - Fork 13
docs: Document that the write_parquet
methods use mkdir
in polars
#142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #142 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 49 50 +1
Lines 2818 2851 +33
=========================================
+ Hits 2818 2851 +33 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
@AndreasAlbertQC I did not consider delta storage in this PR. If there is a more elegant way to fix this, feel free to amend/close this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mh, I thought that write_parquet
takes care of this, unfortunately not... 😅 I think we should be consistent with write_parquet
from polars though and not create parents (and document it accordingly).
dataframely/_storage/parquet.py
Outdated
) -> None: | ||
file = kwargs.pop("file") | ||
metadata = kwargs.pop("metadata", {}) | ||
file.parent.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we shouldn't do this here but rather in the specialized functions for collections. Also, I wouldn't set parents=True
(see main comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm fair point. I looked into this more. pl.write_parquet
has a parameter mkdir
(default False
). So to get the exact same behavior, I guess we would have to add this. But I see that the additional complexity is not worth it and we can just always not do it.
We currently rely on the fact that write_parquet
sets mkdir
to True
for partitioned writes and creates all parent directories here because we add an extra layer of directories that is then going to be created by polars.
I think if we do not want an mkdir
argument, we should not create anything anywhere, except for partitioned writes. That would not require any code changes, just updates to the docs. Is this what you have in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to just delegate this to the mkdir
parameter in polars.
@borchero I just realized that the mkdir
parameter is only exposed from polars 1.33
onwards. So I think we should wait with this PR until we bump there anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will bump the version requirement as part of #139, do you want to wait until then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that works for me. I will redraft this for now then.
write_parquet
methods use mkdir
in polars
Motivation
We would like to clarify in the docs that the
write_parquet
methods use themkdir
kwarg to control whether directories should be created.Changes
mkdir
)