Conversation

@voonhous (Member) commented on Oct 11, 2022:

Change Logs

Added a more comprehensive validation routine for checking the primary key and preCombineField, allowing nested fields.

Ensures that the primaryKey and preCombineField definitions refer to fields that actually exist in the table schema.

Note: Flink's primaryKey and preCombineField validation is stricter than Spark's. Spark's validation only checks whether the uppermost parent field exists, without validating the nested path down to its lowest level.
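
To make the difference concrete, here is a minimal sketch (not code from this patch) of a root-level-only check of the kind described above; getRootLevelFieldName mirrors the helper referenced in the diff below, and the rest is hypothetical:

import java.util.Arrays;
import java.util.List;

public class RootLevelCheckSketch {

  // "a.b.c" -> "a": keep only the uppermost parent of a nested field path.
  static String getRootLevelFieldName(String field) {
    int dot = field.indexOf('.');
    return dot < 0 ? field : field.substring(0, dot);
  }

  // Root-level-only check: passes as long as the uppermost parent is a known column,
  // even if the nested part of the path does not actually exist.
  static boolean rootLevelExists(List<String> columnNames, String field) {
    return columnNames.contains(field)
        || columnNames.contains(getRootLevelFieldName(field));
  }

  public static void main(String[] args) {
    List<String> columns = Arrays.asList("f0", "ts");
    System.out.println(rootLevelExists(columns, "f0.id"));         // true, even if f0 has no "id"
    System.out.println(rootLevelExists(columns, "missing.field")); // false
  }
}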

Impact

No public API changed

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

// From the patch: find any record key field that neither matches a top-level column
// nor has its root-level parent among the columns.
Arrays.stream(recordKeys)
    .filter(field -> !fields.contains(field))
    .filter(field -> !fields.contains(getRootLevelFieldName(field)))
    .findAny()
Contributor:

Do we also support nested primary keys in this patch?

@voonhous (Member Author) replied on Oct 11, 2022:

Uhm, from what I understand, Spark SQL supports nested primaryKey and preCombineField.

The changes I made here are to standardize the validations between Spark and Flink.
https://issues.apache.org/jira/browse/HUDI-4051
https://github.com/apache/hudi/pull/5517/files

// From the patch: fall back to the root-level field name before treating the preCombine
// field as missing; the innermost branch applies only when the default HoodieRecordPayload is configured.
String preCombineField = conf.get(FlinkOptions.PRECOMBINE_FIELD);
if (!fields.contains(preCombineField)) {
  if (!fields.contains(getRootLevelFieldName(preCombineField))) {
    if (OptionsResolver.isDefaultHoodieRecordPayloadClazz(conf)) {
Contributor:

We may also need to validate the nested field names.

@voonhous (Member Author):

I agree. This change is to standardize the validations between Flink and Spark. As such, no checks on the nested field names were made.

@voonhous (Member Author):

@danny0405 I have added the feature to validate nested fields for Flink.

Can you please help to review this PR again?

Thank you.
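
For context, a minimal sketch of what validating a dotted field path against an Avro schema could look like; the helper names here are hypothetical and are not the actual code in this PR:

import org.apache.avro.Schema;

public final class NestedFieldValidationSketch {

  // Returns true if the dotted path (e.g. "f0.id") resolves to a field in the schema.
  static boolean nestedFieldExists(Schema schema, String dottedPath) {
    Schema current = schema;
    for (String part : dottedPath.split("\\.")) {
      current = unwrapNullableUnion(current);
      if (current.getType() != Schema.Type.RECORD || current.getField(part) == null) {
        return false;
      }
      current = current.getField(part).schema();
    }
    return true;
  }

  // Avro encodes nullable fields as UNION(null, actualType); unwrap to the non-null branch.
  static Schema unwrapNullableUnion(Schema schema) {
    if (schema.getType() == Schema.Type.UNION) {
      for (Schema branch : schema.getTypes()) {
        if (branch.getType() != Schema.Type.NULL) {
          return branch;
        }
      }
    }
    return schema;
  }
}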

// nested pk field is allowed
ResolvedSchema schema6 = SchemaBuilder.instance()
    .field("f0",
        DataTypes.ROW(DataTypes.FIELD("id", DataTypes.INT()), DataTypes.FIELD("date", DataTypes.VARCHAR(20))))
Contributor:

Can we write an IT test in ITTestHoodieDataSource?

@voonhous (Member Author) replied on Oct 11, 2022:

Hmmm, do you mean add IT tests?

Or remove the UTs I included here and rewrite them as IT tests?

@yihua added the writer-core and engine:flink (Flink integration) labels on Oct 15, 2022
@voonhous force-pushed the HUDI-5004 branch 5 times, most recently from f4336b7 to a0baf96 on November 29, 2022, 09:31
@hudi-bot (Collaborator) commented:

CI report:

@hudi-bot supports the following bot commands:
  • @hudi-bot run azure: re-run the last Azure build

private void sanityCheck(Configuration conf, ResolvedSchema schema) {
  List<String> fields = schema.getColumnNames();
  Schema inferredSchema = AvroSchemaConverter.convertToSchema(schema.toPhysicalRowDataType().notNull().getLogicalType());

Contributor:

There is no need to convert the ResolvedSchema to an Avro schema for validation; ResolvedSchema#getColumnDataTypes can fetch the data type of each field.

We also need to fix RowDataKeyGen#getRecordKey for nested primary keys.
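
As a rough sketch of this suggestion (assumed helper names, not the actual change), a nested field path could be validated directly against the ResolvedSchema by walking RowType instead of converting to an Avro schema:

import java.util.List;
import org.apache.flink.table.catalog.ResolvedSchema;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;

public final class ResolvedSchemaCheckSketch {

  // Returns true if a dotted path such as "f0.id" resolves to a field of the schema.
  static boolean nestedFieldExists(ResolvedSchema schema, String dottedPath) {
    String[] parts = dottedPath.split("\\.");
    List<String> names = schema.getColumnNames();
    int idx = names.indexOf(parts[0]);
    if (idx < 0) {
      return false;
    }
    LogicalType current = schema.getColumnDataTypes().get(idx).getLogicalType();
    for (int i = 1; i < parts.length; i++) {
      if (!(current instanceof RowType)) {
        return false;
      }
      RowType rowType = (RowType) current;
      int fieldIdx = rowType.getFieldNames().indexOf(parts[i]);
      if (fieldIdx < 0) {
        return false;
      }
      current = rowType.getTypeAt(fieldIdx);
    }
    return true;
  }
}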

@voonhous (Member Author) replied on Dec 2, 2022:

Sure, I will take a look.

The main reasons for doing this are:

  1. AvroSchemaConverter.convertToSchema was already imported and used elsewhere in the code, so it was simply reused.
  2. Converting to an Avro schema means the helper functions can be written in HoodieAvroUtils, where the validation used when creating tables with Spark as the entrypoint can be reused.

Let me try to see if we can use the ResolvedSchema instead, will get back to you.

> Also we need to fix the RowDataKeyGen#getRecordKey for nested primary keys.

Got it!
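
For illustration only, a minimal sketch of what extracting a nested record key from Flink's RowData might involve for the RowDataKeyGen#getRecordKey point above; this is not the actual Hudi implementation, and the traversal is purely illustrative:

import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;

public final class NestedKeyExtractionSketch {

  // Resolves a dotted path such as "f0.id" against a row and returns the leaf value.
  static Object getNestedFieldValue(RowData row, RowType rowType, String dottedPath) {
    String[] parts = dottedPath.split("\\.");
    RowData currentRow = row;
    RowType currentType = rowType;
    for (int i = 0; i < parts.length; i++) {
      int pos = currentType.getFieldNames().indexOf(parts[i]);
      if (pos < 0 || currentRow == null || currentRow.isNullAt(pos)) {
        return null;
      }
      LogicalType fieldType = currentType.getTypeAt(pos);
      Object value = RowData.createFieldGetter(fieldType, pos).getFieldOrNull(currentRow);
      if (i == parts.length - 1) {
        return value; // reached the leaf field
      }
      if (!(fieldType instanceof RowType)) {
        return null; // path goes deeper than the schema allows
      }
      currentRow = (RowData) value;
      currentType = (RowType) fieldType;
    }
    return null;
  }
}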

@voonhous (Member Author) commented:

Closing this PR as @hbgstc123 has already fixed this issue by disabling schema sanity checks prior to creating the source in this PR


Labels

engine:flink Flink integration
