Add support for Utf8View, Boolean, Date32/64, int32/64 for writing hive style partitions #12283

Omega359 · 2024-09-02T01:20:16Z

Which issue does this PR close?

Closes #12221

Rationale for this change

Support hive style partitions for additional data types.

What changes are included in this PR?

Code, slt test coverage.

Are these changes tested?

Yes

Are there any user-facing changes?

No


alamb

Thank you @Omega359 and @devinjdangelo, and I apologize for the delay in reviewing this PR 😢

My only concern is about the seeming change to allocate a new String for each row. Otherwise this PR looks great ✨

alamb · 2024-09-06T22:11:14Z

datafusion/core/src/datasource/file_format/write/demux.rs

@@ -320,9 +324,11 @@ async fn hive_style_partitions_demuxer(
 fn compute_partition_keys_by_row<'a>(
    rb: &'a RecordBatch,
    partition_by: &'a [(String, DataType)],
-) -> Result<Vec<Vec<&'a str>>> {
+) -> Result<Vec<Vec<String>>> {


🤔 I wonder if computing new strings for each row will be unnecessarily slow. The current code only allocates a string for each distinct partition value (in the final take map), but this code now creates a new string for each row in the output record batch, just to match them up.

I vaguely recall I had to make this change because of an issue with borrowing temporary values for the dates, but I'd have to switch back to see exactly what the cause was. You are correct, though, that there is a fairly obvious overhead to using String. I'll take another look and see if there is something I can come up with.

Maybe you could use Cow to avoid causing a regression for StringArrays 🤔

I think it would be fine if existing functionality were unaffected but new features (i.e. partitioning on newly supported types) were not as fast as they could be.

We could treat the subsequent optimization of such new features as a follow-on project.

Updated to use Cow.
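
For readers following the thread, here is a minimal sketch of the idea behind the `Cow` suggestion. It is not the code from this PR: the `Column` enum and the `partition_keys` function are hypothetical stand-ins for Arrow arrays and `compute_partition_keys_by_row`, and only the standard library is used. String columns hand out borrowed keys with no allocation, while types that must be formatted (integers and booleans here) fall back to owned strings.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for an Arrow column; not a DataFusion or Arrow type.
#[allow(dead_code)]
enum Column {
    Utf8(Vec<String>),
    Int32(Vec<i32>),
    Boolean(Vec<bool>),
}

/// Returns one partition key per row.
/// String columns are borrowed (no allocation per row); other types
/// are formatted into a new owned String for each row.
fn partition_keys<'a>(col: &'a Column) -> Vec<Cow<'a, str>> {
    match col {
        Column::Utf8(values) => values
            .iter()
            .map(|s| Cow::Borrowed(s.as_str()))
            .collect(),
        Column::Int32(values) => values
            .iter()
            .map(|v| Cow::Owned(v.to_string()))
            .collect(),
        Column::Boolean(values) => values
            .iter()
            .map(|v| Cow::Owned(v.to_string()))
            .collect(),
    }
}
```

Under these assumptions, the existing string path keeps its no-allocation behavior, and only the newly supported types pay for a per-row `String`, which matches the trade-off discussed above.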


alamb

Love it -- thank you @Omega359

cc @devinjdangelo in case you have additional thoughts

alamb · 2024-09-10T10:55:09Z

datafusion/core/src/datasource/file_format/write/demux.rs

@@ -320,9 +325,11 @@ async fn hive_style_partitions_demuxer(
 fn compute_partition_keys_by_row<'a>(
    rb: &'a RecordBatch,
    partition_by: &'a [(String, DataType)],
-) -> Result<Vec<Vec<&'a str>>> {
+) -> Result<Vec<Vec<Cow<'a, str>>>> {


❤️ 🐮

alamb · 2024-09-11T13:31:41Z

Thanks again @Omega359

Add support for Utf8View, Boolean, Date32/64, int32/64 for writing ou…

c7e84b5

…t hive style partitioning.

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Sep 2, 2024

Omega359 marked this pull request as ready for review September 2, 2024 14:01

alamb reviewed Sep 6, 2024


Omega359 added 3 commits September 9, 2024 14:20

Merge remote-tracking branch 'upstream/main' into feature/hive_partit…

cb66414

…ion_types

Swith to Cow vs String to reduce instances of string allocation.

18d9802

Cargo fmt update.

b5fed20

alamb approved these changes Sep 10, 2024


alamb merged commit 6aae2ee into apache:main Sep 11, 2024
24 checks passed

Omega359 deleted the feature/hive_partition_types branch September 11, 2024 14:35

alamb mentioned this pull request Sep 16, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 16, 2024 #12494

Open

8 tasks

This pull request was closed.
