Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Utf8View, Boolean, Date32/64, int32/64 for writing hive style partitions #12283

Merged
merged 4 commits into from
Sep 11, 2024

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Sep 2, 2024

Which issue does this PR close?

Closes #12221

Rationale for this change

Support hive style partitions for additional data types.

What changes are included in this PR?

Code, slt test coverage.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 2, 2024
@Omega359 Omega359 marked this pull request as ready for review September 2, 2024 14:01
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Omega359 and @devinjdangelo and I apoligize for the delay in reviewing this PR 😢

My only concern is about the seeming change to allocate a new String for each row. Otherwise this PR looks great ✨

@@ -320,9 +324,11 @@ async fn hive_style_partitions_demuxer(
fn compute_partition_keys_by_row<'a>(
rb: &'a RecordBatch,
partition_by: &'a [(String, DataType)],
) -> Result<Vec<Vec<&'a str>>> {
) -> Result<Vec<Vec<String>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I wonder if computing new strings for each row will be unnecessarily slow 🤔 The current code only allocates a string for each distinct partition value (in the final take map) but this code now creates a new string for each row in the output record batch, just to match them up

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely recall I had to make this change because of an issue with borrowing temp values with the dates but I'd have to switch back to see exactly where the cause was. You are obviously correct though in that there is a fairly obvious overhead to using String. I'll take another look and see if there is something I can come up with.

Copy link
Contributor

@alamb alamb Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could use Cow to avoid causing a regression for StringArrays 🤔

I think it would be fine if existing functionality was unaffected but new features (aka partitioning on newly supported types) was not as fast as it could be

We could treat the subsequent optimization of such new features as a follow on project

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to use Cow.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it -- thank you @Omega359

cc @devinjdangelo in case you have additional thoughts

@@ -320,9 +325,11 @@ async fn hive_style_partitions_demuxer(
fn compute_partition_keys_by_row<'a>(
rb: &'a RecordBatch,
partition_by: &'a [(String, DataType)],
) -> Result<Vec<Vec<&'a str>>> {
) -> Result<Vec<Vec<Cow<'a, str>>>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ 🐮

@alamb alamb merged commit 6aae2ee into apache:main Sep 11, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 11, 2024

Thanks again @Omega359

@Omega359 Omega359 deleted the feature/hive_partition_types branch September 11, 2024 14:35
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for writing hive-style partitions with datatypes beyond just Utf8 and Dictionary
2 participants