Allow for writing hive-style partitions with datatypes beyond just Utf8 and Dictionary #12221

Closed
Omega359 opened this issue Aug 28, 2024 · 5 comments · Fixed by #12283
Labels
enhancement New feature or request

Comments

@Omega359
Contributor

Is your feature request related to a problem or challenge?

Currently in demux::compute_partition_keys_by_row the only supported types for writing out partitions seem to be DataType::Utf8 and DataType::Dictionary(_, _). I think there is an opportunity to support a number of other DataTypes, such as the int/uint 8/32/64 types, Date32 (with a fixed 'yyyy-MM-dd' format), and bool.
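
To make the proposal concrete, here is a minimal standalone sketch of how typed values could be rendered into hive-style `column=value` path segments. It deliberately avoids the arrow and datafusion crates: `PartitionValue`, `civil_from_days`, and `partition_segment` are hypothetical names invented for illustration, not DataFusion APIs, and the date conversion uses Howard Hinnant's well-known civil-from-days algorithm rather than whatever the real implementation would use.

```rust
/// Hypothetical stand-in for the values a partition column might hold.
/// These variants are illustrative only; they are not DataFusion types.
#[derive(Debug, Clone, PartialEq)]
pub enum PartitionValue {
    Utf8(String),
    Int64(i64),
    UInt64(u64),
    Boolean(bool),
    /// Days since 1970-01-01, mirroring how Arrow's Date32 stores dates.
    Date32(i32),
}

/// Convert days since the Unix epoch to (year, month, day) in the
/// proleptic Gregorian calendar (Hinnant's civil_from_days algorithm).
fn civil_from_days(days: i32) -> (i64, i64, i64) {
    // Shift the epoch to 0000-03-01 so leap days fall at the end of a year.
    let z = days as i64 + 719_468;
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // March-based month, 0..=11
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Render one `column=value` path segment for a partition value.
/// Date32 uses the fixed 'yyyy-MM-dd' format suggested in this issue.
pub fn partition_segment(column: &str, value: &PartitionValue) -> String {
    let rendered = match value {
        PartitionValue::Utf8(s) => s.clone(),
        PartitionValue::Int64(v) => v.to_string(),
        PartitionValue::UInt64(v) => v.to_string(),
        PartitionValue::Boolean(v) => v.to_string(),
        PartitionValue::Date32(days) => {
            let (y, m, d) = civil_from_days(*days);
            format!("{y:04}-{m:02}-{d:02}")
        }
    };
    format!("{column}={rendered}")
}
```

For example, `partition_segment("some_col", &PartitionValue::Date32(19358))` yields `some_col=2023-01-01`. The sketch does not escape characters such as `/` or `=` inside string values, which a real writer would need to consider.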

Describe the solution you'd like

Code and tests for writing out hive-style partitions that cover additional datatypes beyond just Utf8 and Dictionary.

Describe alternatives you've considered

Cast the field to Utf8 prior to output.

Additional context

No response

Omega359 added the enhancement (New feature or request) label on Aug 28, 2024
@devinjdangelo
Contributor

devinjdangelo commented Aug 29, 2024

I think this is a good idea. The only thing to watch out for is what happens when you read the partitions back using datafusion.

E.g. if you write out a date column as the partition into 'some/path/some_col=2023-01-01/...', will datafusion parse this back into a date32 column? I am not as familiar with the read path, but we would want to make sure type handling is consistent between reading and writing partitioned columns.
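
The round-trip concern can be illustrated with a small standalone sketch: a path segment only carries text, so a reader has to decide how to turn `2023-01-01` back into a typed value. The helpers below (`parse_hive_partitions`, `parse_date32`) are hypothetical names written for this example and do not reflect DataFusion's actual read path; the file name used in the usage note is likewise made up.

```rust
/// Split a hive-style path into (column, value) pairs, skipping any
/// path components that are not of the form `key=value`.
pub fn parse_hive_partitions(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .filter(|(key, _)| !key.is_empty())
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}

/// Parse a 'yyyy-MM-dd' string into days since 1970-01-01 (the Date32
/// encoding). Returns None for malformed input. This is a sketch: it
/// only accepts years 0..=9999 and does not reject dates such as 02-31.
pub fn parse_date32(text: &str) -> Option<i32> {
    let mut parts = text.splitn(3, '-');
    let year: i64 = parts.next()?.parse().ok()?;
    let month: i64 = parts.next()?.parse().ok()?;
    let day: i64 = parts.next()?.parse().ok()?;
    if !(0..=9999).contains(&year)
        || !(1..=12).contains(&month)
        || !(1..=31).contains(&day)
    {
        return None;
    }
    // Hinnant's days_from_civil: count from a March-based year so that
    // the leap day is the last day of the year.
    let y = if month <= 2 { year - 1 } else { year };
    let era = y.div_euclid(400);
    let yoe = y.rem_euclid(400);
    let mp = if month > 2 { month - 3 } else { month + 9 };
    let doy = (153 * mp + 2) / 5 + day - 1;
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    // The year range check above guarantees the result fits in an i32.
    Some((era * 146_097 + doe - 719_468) as i32)
}
```

Calling `parse_hive_partitions("some/path/some_col=2023-01-01/part-0.parquet")` returns a single pair whose value is the string `2023-01-01`; only an explicit step such as `parse_date32` turns it back into a Date32-style day count. Whatever format the writer chooses has to match what the reader is prepared to parse, which is the consistency point raised above.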

@Omega359
Contributor Author

> I think this is a good idea. The only thing to watch out for is what happens when you read the partitions back using datafusion.
>
> E.g. if you write out a date column as the partition into 'some/path/some_col=2023-01-01/...', will datafusion parse this back into a date32 column? I am not as familiar with the read path, but we would want to make sure type handling is consistent between reading and writing partitioned columns.

Ah, good point. I'll verify that when I put in a PR for this.

@Omega359
Contributor Author

take

@Omega359
Contributor Author

Omega359 commented Sep 1, 2024

I've made the changes on a branch in my repo, but I can't find anywhere that this fn (or hive partitioning at all) is actually tested. Any suggestions on where to add a test for this would be appreciated. The demux functionality is quite complex, so I was thinking that a test at a much higher level might be best.

@devinjdangelo
Contributor

The partitioning is tested in sqllogictests. See https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/copy.slt#L83

The files are written to partitioned directories and then a specific directory is queried to make sure it was partitioned as expected.
