
Conversation


@David-N-Perkins commented Aug 5, 2024

Change Logs

Added support for Arrays of Rows and Maps with Row values in Flink. Nesting is supported only one level deep: the nested Row cannot itself contain another Array or Map.
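For illustration, a hypothetical Flink SQL table using the newly supported shapes (the table name, field names, and connector options are made up for this sketch):

```sql
-- Array of Rows and Map with Row values, each nested one level deep.
CREATE TABLE t1 (
  id INT,
  f_array ARRAY<ROW<f0 INT, f1 STRING>>,       -- Row of basic types inside an Array
  f_map   MAP<STRING, ROW<f0 INT, f1 STRING>>  -- Row of basic types as a Map value
) WITH (
  'connector' = 'hudi'
);
```

Per the change log above, the nested ROW may not itself contain another ARRAY or MAP.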

I've only applied the changes to Flink 1.18.x, but will copy to other versions if no changes are requested.

Impact

No change to the public API. As far as I could tell, the limitation of only allowing non-container types was not stated in the documentation.

Risk level (write none, low, medium, or high below)

Low, assuming adequate unit tests. I added unit tests for my two use cases, but I did not see many existing unit tests. I'm not sure whether there are unit tests covering Arrays and Maps of the basic types, which my changes could affect.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

As far as I could tell, this limitation was not in the documentation. But it would be nice to note somewhere that, as of this change, Array and Map values support basic types and Rows, but not additional Arrays or Maps nested within the Row.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Aug 5, 2024
@danny0405 danny0405 changed the title HUDI-7930 Flink Support for Array of Row and Map of Row value [HUDI-7930] Flink Support for Array of Row and Map of Row value Aug 6, 2024
@yihua
Contributor

yihua commented Sep 11, 2024

@David-N-Perkins any updates on addressing the comments on the PR?

@David-N-Perkins
Author

I added unit tests to ITTestHoodieDataSource and uncovered an issue with inconsistent Parquet schemas between insert and upsert. I spoke with @danny0405 about this; the schemas need to be fixed so they are both the same. I've investigated the issue but haven't fully implemented a fix. I've been pretty busy with work lately and haven't had much time.

@danny0405
Contributor

Hmm, there is a test failure:

TestParquetSchemaConverter.testConvertComplexTypes:72 
Expected: is "message converted {\n  optional group f_array (LIST) {\n    repeated group list {\n      optional binary element (STRING);\n    }\n  }\n  optional group f_map (MAP) {\n    repeated group key_value {\n      required int32 key;\n      optional binary value (STRING);\n    }\n  }\n  optional group f_row {\n    optional int32 f_row_f0;\n    optional binary f_row_f1 (STRING);\n    optional group f_row_f2 {\n      optional int32 f_row_f2_f0;\n      optional binary f_row_f2_f1 (STRING);\n    }\n  }\n}\n"
     but: was "message converted {\n  optional group f_array (LIST) {\n    repeated group array {\n 
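The expected schema in this failure flattens nested row fields into underscore-joined leaf names such as f_row_f2_f0. A minimal sketch of that naming scheme, with a plain map standing in for Flink's RowType (a hypothetical helper for illustration, not the converter's actual code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LeafNames {
  // Joins nested field names with '_' to form flat leaf names, mirroring the
  // f_row -> f_row_f0, f_row_f2_f0 pattern visible in the expected schema.
  @SuppressWarnings("unchecked")
  static void collect(String prefix, Object field, List<String> out) {
    if (field instanceof Map) {
      // A map value models a nested row: recurse, extending the prefix.
      for (Map.Entry<String, Object> e : ((Map<String, Object>) field).entrySet()) {
        collect(prefix + "_" + e.getKey(), e.getValue(), out);
      }
    } else {
      out.add(prefix); // primitive leaf: record the accumulated name
    }
  }

  public static void main(String[] args) {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("f0", "INT");
    row.put("f1", "STRING");
    Map<String, Object> nested = new LinkedHashMap<>();
    nested.put("f0", "INT");
    nested.put("f1", "STRING");
    row.put("f2", nested);

    List<String> names = new ArrayList<>();
    collect("f_row", row, names);
    System.out.println(names);
    // prints: [f_row_f0, f_row_f1, f_row_f2_f0, f_row_f2_f1]
  }
}
```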

@danny0405 danny0405 self-assigned this Sep 22, 2024
int fieldIndex = getFieldIndexInPhysicalType(rowType.getFields().get(i).getName(), groupType);
if (fieldIndex < 0) {
  columnVectors[i] = (WritableColumnVector) createVectorFromConstant(rowType.getTypeAt(i), null, batchSize);
  if (groupType.getRepetition().equals(Type.Repetition.REPEATED) && !rowType.getTypeAt(i).is(LogicalTypeRoot.ARRAY)) {
Author


I'm not sure this will work in all cases. Since there isn't a descriptor here, it can't be used to check the repetition level. I did add a unit test for adding a new field to an Array of Rows and a Map of Rows.

@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Sep 22, 2024
@David-N-Perkins
Author

Do I need to copy these changes to the other Flink versions?

@danny0405
Contributor

Do I need to copy these changes to the other Flink versions?

Yeah, we should, but in a separate PR; let's make this one solid first.

@danny0405
Contributor

@David-N-Perkins I have no access to your forked repo, here is a patch to fix the checkstyle error:
Fix_checkstyle_errors.patch.zip

@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Sep 23, 2024
@danny0405
Contributor

There is still a compile error:

Error:  src/main/java/org/apache/hudi/io/storage/row/parquet/ParquetSchemaConverter.java:[636,38] (whitespace) ParenPad: '(' is followed by whitespace.

final String expected = "message converted {\n"
    + "  optional group f_array (LIST) {\n"
    + "    repeated group list {\n"
    + "      repeated group array {\n"
Contributor


Can we add a test case for a nested row in an array type?

columnVectors[i] =
    createWritableColumnVector(
        batchSize,
        new ArrayType(rowType.getTypeAt(i).isNullable(), rowType.getTypeAt(i)),
Contributor


Can you explain why we use the array logical type here if the row field's type is not an array?

Author


This is done here, at line 470, and in some other files in order to match Parquet's record-shredding algorithm, which pushes repetition and structure down to the individual leaf fields. In Parquet, an array of rows is effectively stored as a separate array column for each field.
This approach does have some limitations: it won't work for multiply nested arrays and maps. The main problem is that the Flink classes and interfaces don't follow that pattern.
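To make that shape concrete, an array of rows lands in the Parquet schema roughly like this (standard three-level LIST encoding; field names are illustrative). Each leaf becomes its own column in the file, which is the per-field storage described above:

```
optional group f_array (LIST) {
  repeated group list {
    optional group element {
      optional int32 f0;
      optional binary f1 (STRING);
    }
  }
}
```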

@hudi-bot
Collaborator

hudi-bot commented Oct 5, 2024

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-runs the last Azure build

@danny0405
Contributor

@David-N-Perkins Can you also cherry-pick the changes to Flink 1.19.x? I think we can just drop this support for releases before Flink 1.18.

@danny0405 danny0405 merged commit 78dcff7 into apache:master Oct 7, 2024
@David-N-Perkins
Author

Sure, I can do that.

I was also chasing down a potential issue where a null array of rows was being read back as an array containing a single row full of nulls.

@danny0405
Contributor

Yeah, be careful with the null-value handling.

@empcl
Contributor

empcl commented Mar 18, 2025

@David-N-Perkins Hello, why do we need to change the written structure? Why do we change the list and element groups into array structures? Thank you for your answer.
[attached screenshot of the written Parquet schema]


@David-N-Perkins
Copy link
Author

@empcl If I remember correctly, it was needed to get consistent names and structure in the Parquet files. I was seeing differences depending on whether the operation was "insert", "upsert", or "bulk_insert".
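For background, Parquet has two common list encodings in the wild, and the difference is exactly what the earlier test failure surfaced (expected repeated group list, got repeated group array). The standard three-level form looks like this (field names illustrative):

```
optional group f_array (LIST) {
  repeated group list {
    optional group element {
      optional int32 f0;
      optional binary f1 (STRING);
    }
  }
}
```

while the legacy two-level form emitted by some older writers names the repeated group array and puts the row fields directly inside it:

```
optional group f_array (LIST) {
  repeated group array {
    optional int32 f0;
    optional binary f1 (STRING);
  }
}
```

Readers distinguish these via backward-compatibility rules, so a writer that emits different forms for the same table on different code paths produces the kind of schema inconsistency described above.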
