Allow for writing hive-style partitions with datatypes beyond just Utf8 and Dictionary #12221
Comments
I think this is a good idea. The only thing to watch out for is what happens when you read the partitions back using DataFusion. E.g. if you write out a date column as the partition into 'some/path/some_col=2023-01-01/...', will DataFusion parse this back into a Date32 column? I am not as familiar with the read path, but we would want to make sure type handling is consistent between reading and writing partitioned columns.
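To make the round-trip concern concrete, here is a minimal std-only sketch of what "parsing back" a date partition could involve. It is not DataFusion's read path (which this thread leaves unverified); `parse_partition_segment`, `parse_date32`, and `days_from_civil` are hypothetical helpers, and the only assumption taken from Arrow is that Date32 stores days since 1970-01-01.

```rust
/// Split a hive-style path segment such as `some_col=2023-01-01`
/// into (column name, raw value). Hypothetical helper.
fn parse_partition_segment(segment: &str) -> Option<(&str, &str)> {
    segment.split_once('=')
}

/// Days since 1970-01-01 for a proleptic Gregorian date.
fn days_from_civil(year: i64, month: u32, day: u32) -> i64 {
    // Treat January and February as the tail of the previous year,
    // so the leap day falls at the end of the shifted year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400); // year of era, 0..=399
    let mp = (month as i64 + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + day as i64 - 1; // day of shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468
}

/// Parse a `yyyy-MM-dd` string into Date32-style days since the epoch.
/// Only range-checks month and day; it does not reject dates such as
/// 2023-02-31, which a real implementation would need to handle.
fn parse_date32(value: &str) -> Option<i32> {
    let mut parts = value.splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    i32::try_from(days_from_civil(year, month, day)).ok()
}
```

If the writer formats Date32 as 'yyyy-MM-dd', a reader along these lines would recover the same day count, which is the consistency property the comment above asks about.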
Ah, good point. I'll verify that when I put in a PR for this.
take
I've made the changes on a branch in my repo for this; however, I can't seem to find any place where this fn (or hive partitioning at all) is actually tested. Any suggestions as to where to add a test for this would be appreciated. The demux functionality is quite complex, so I was thinking that a test at a much higher level would be best.
The partitioning is tested in sqllogictests. See https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt#L83 The files are written to partitioned directories and then a specific directory is queried to make sure it was partitioned as expected.
Is your feature request related to a problem or challenge?
Currently in demux::compute_partition_keys_by_row the only supported types for writing out partitions seem to be DataType::Utf8 and DataType::Dictionary(_, _). I think there is an opportunity to support a number of other DataTypes such as int/uint 8/32/64 types, Date32 (with a fixed format 'yyyy-MM-dd') and bool.
Describe the solution you'd like
Code and tests for writing out hive-style partitions that cover additional datatypes beyond just Utf8 and Dictionary.
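As a rough illustration of the requested behaviour, the sketch below renders a few scalar kinds into hive-style directory names using only the standard library. It is not the demux code: `PartitionValue`, `civil_from_days`, and `partition_dir` are hypothetical names, the real function operates on Arrow arrays rather than an enum, and escaping of special characters in values is deliberately left out.

```rust
/// Hypothetical stand-in for the scalar types named in this issue.
#[derive(Debug, Clone, PartialEq)]
enum PartitionValue {
    Utf8(String),
    Int64(i64),
    UInt64(u64),
    Boolean(bool),
    /// Days since 1970-01-01, matching Arrow's Date32 representation.
    Date32(i32),
}

/// Convert days since the Unix epoch to (year, month, day) in the
/// proleptic Gregorian calendar.
fn civil_from_days(days: i32) -> (i64, u32, u32) {
    let z = days as i64 + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of shifted year
    let mp = (5 * doy + 2) / 153; // March = 0, ..., February = 11
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Render `column=value` as a hive-style directory name.
/// Date32 uses the fixed 'yyyy-MM-dd' format proposed in the issue.
fn partition_dir(column: &str, value: &PartitionValue) -> String {
    let rendered = match value {
        PartitionValue::Utf8(s) => s.clone(),
        PartitionValue::Int64(v) => v.to_string(),
        PartitionValue::UInt64(v) => v.to_string(),
        PartitionValue::Boolean(v) => v.to_string(),
        PartitionValue::Date32(days) => {
            let (y, m, d) = civil_from_days(*days);
            format!("{:04}-{:02}-{:02}", y, m, d)
        }
    };
    format!("{}={}", column, rendered)
}
```

With these definitions, `partition_dir("some_col", &PartitionValue::Date32(19358))` yields `some_col=2023-01-01`, the same shape as the path in the comment thread.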
Describe alternatives you've considered
Cast field to utf8 prior to output.
Additional context
No response