
Conversation

@lokeshj1703 (Collaborator) commented Jan 13, 2023

Change Logs

Recently we added support for auto generation of record keys for hudi tables, but some configs need to be tweaked when such auto generation of record keys is used.

  1. Disallow de-dup. If de-dup is enabled (combine before insert), we will fail the write.
  2. Fail if "upsert" is explicitly set as the operation type. With auto generation of record keys, we can't support upsert.
  3. Fail if "hoodie.merge.allow.duplicate.on.inserts" is not enabled, so that hudi does not unintentionally de-dup due to small file handling.
  4. Fail if someone chooses the MOR table type, for two reasons. a: there are no updates, so there is no point in choosing MOR. b: preCombine is a mandatory field w/ MOR tables, but for tables w/ auto-generated record keys, preCombine is not required to be set.
  5. Fail if the preCombine or record key field is set.

This patch also fixes auto generation of record keys w/ spark-sql writes. Added tests for the same (create table + insert, and CTAS).
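Taken together, the checks above could be sketched roughly as follows. This is a hypothetical illustration, not the patch's actual code: the class name is invented, and the config key strings are illustrative stand-ins rather than Hudi's exact constants.

```java
import java.util.Map;

// Hypothetical sketch of the five validations above; the class name and
// config key strings are illustrative, not Hudi's actual constants.
public class AutoKeyGenValidations {
    public static void validate(Map<String, String> cfg) {
        // 1. Disallow de-dup (combine before insert).
        if (Boolean.parseBoolean(cfg.getOrDefault("hoodie.combine.before.insert", "false"))) {
            throw new IllegalArgumentException("de-dup (combine before insert) is not supported with auto-generated record keys");
        }
        // 2. Disallow an explicit upsert operation.
        if ("upsert".equals(cfg.get("hoodie.datasource.write.operation"))) {
            throw new IllegalArgumentException("upsert is not supported with auto-generated record keys");
        }
        // 3. Require duplicates to be allowed on inserts (small file handling).
        if (!Boolean.parseBoolean(cfg.getOrDefault("hoodie.merge.allow.duplicate.on.inserts", "false"))) {
            throw new IllegalArgumentException("hoodie.merge.allow.duplicate.on.inserts must be enabled");
        }
        // 4. Disallow MOR tables.
        if ("MERGE_ON_READ".equals(cfg.get("hoodie.datasource.write.table.type"))) {
            throw new IllegalArgumentException("MOR is not supported with auto-generated record keys; use COW");
        }
        // 5. Disallow an explicit record key or preCombine field.
        if (cfg.containsKey("hoodie.datasource.write.recordkey.field")
                || cfg.containsKey("hoodie.datasource.write.precombine.field")) {
            throw new IllegalArgumentException("record key / preCombine field must not be set");
        }
    }
}
```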

Impact

Enables smoother usage of auto record key generation.

Risk level (write none, low medium or high below)

Medium (Added tests)

Documentation Update

Adds support for KeylessGenerator for insert operations. This ensures users don't need to configure a record key for inserts into an immutable dataset.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@lokeshj1703 (Collaborator, Author)

@nsivabalan Can you please take a look?

@danny0405 danny0405 added writer-core priority:high Significant impact; potential bugs and removed priority:high Significant impact; potential bugs labels Jan 13, 2023
private var asyncCompactionTriggerFnDefined: Boolean = false
private var asyncClusteringTriggerFnDefined: Boolean = false

def changeOperationToInsertIfRequired(writeOperationType: WriteOperationType, hoodieConfig: HoodieConfig)
Contributor:

afaik in this case, when changing the op from upsert to insert, one should also make sure that the insert mode (SQL_INSERT_MODE) is set to "upsert"; otherwise either a duplicate record will be created (non-strict mode) or an error will be thrown (strict mode)
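The mode behavior being described could be modeled in miniature like this; the enum and method names are hypothetical, and the real handling lives in Hudi's Spark SQL write path.

```java
// Hypothetical model of the SQL insert-mode behavior described in the
// comment above; names are illustrative, not Hudi's actual enums.
enum SqlInsertMode { NON_STRICT, STRICT, UPSERT }

class InsertModeBehavior {
    // Outcome when an incoming record's key already exists in the table
    // and the operation has been switched from upsert to plain insert.
    static String onDuplicateKey(SqlInsertMode mode) {
        switch (mode) {
            case NON_STRICT: return "duplicate record written";
            case STRICT:     return "error thrown";
            default:         return "existing record updated"; // UPSERT
        }
    }
}
```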

Contributor:

@kazdy: Would you be able to help us w/ a contribution on this? You could take this branch and work on top of it. As of this patch, we have tested that auto generation of record keys works for spark-sql (CTAS and create table + inserts). So, would appreciate it if you can put up a fix for the strict/non-strict mode.

@lokeshj1703 lokeshj1703 force-pushed the HUDI-2681 branch 2 times, most recently from 304400b to a586f65 Compare January 14, 2023 07:29
@kazdy (Contributor) commented Jan 18, 2023

@lokeshj1703 I also noticed that in flink no_precombine is already supported and is guarded behind a config.
Maybe it would be good to do the same in spark ds?

public static final String NO_PRE_COMBINE = "no_precombine";

@nsivabalan nsivabalan changed the title [HUDI-2681] Make hoodie record_key and preCombine_key optional [HUDI-2681] Some fixes and tweaks to configs when auto generation of record keys is enabled Jan 18, 2023
@lokeshj1703 lokeshj1703 changed the title [HUDI-2681] Some fixes and tweaks to configs when auto generation of record keys is enabled [HUDI-2681] Some fixes and config validation when auto generation of record keys is enabled Jan 22, 2023
throw new HoodieKeyGeneratorException(s"Config ${DataSourceWriteOptions.TABLE_TYPE.key()} should be set to " +
s"COW_TABLE_TYPE_OPT_VAL when $autoGenerateRecordKey is used")
}
if (hoodieConfig.getString(OPERATION) == UPSERT_OPERATION_OPT_VAL) {
Collaborator (Author):

Should we change the condition to succeed only if the operation is insert or bulk insert? Is it possible for some operation other than {insert, upsert, bulk insert} to reach this function?

@nsivabalan (Contributor)

Few high-level points:

Dis-allow de-dup. If de-dup is enabled (combine before insert), we will fail the write.
Siva: looks ok to me.

Fail if "upsert" is set. With auto generation of record keys, we can't support upsert. I mean, every record is treated as a new record, so doing an index lookup only adds unnecessary overhead.
Siva: I feel we can automatically switch to insert. It's an implementation detail. We will anyways document that auto generation of record keys is meant to be used only for immutable use-cases. So, rather than failing, I would prefer to auto switch to "insert".

Fail if "hoodie.merge.allow.duplicate.on.inserts" is not enabled so that hudi does not unintentionally de-dup due to small file handling.
Siva: we should automatically enable this since this is more of an implementation detail. I mean, we should not fail if the user does not set this.

Fail if someone chooses to use the MOR table type, for two reasons. a: there are no updates, so there is no point in choosing MOR. b: preCombine is a mandatory field w/ MOR tables, but for tables w/ auto-generated record keys, preCombine is not required to be set.
Siva: seems ok.

Fail if the preCombine or record key field is set.
Siva: seems ok to me.
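The "auto switch rather than fail" suggestion above could be sketched like this; the helper class and method names are hypothetical, while the config keys are the ones discussed in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the auto-switch approach suggested above: silently adjust the
// offending configs instead of failing. Class and method names are hypothetical.
class AutoKeyGenAdjuster {
    static final String OP_KEY = "hoodie.datasource.write.operation";

    static Map<String, String> adjust(Map<String, String> params) {
        Map<String, String> out = new HashMap<>(params);
        if ("upsert".equals(out.get(OP_KEY))) {
            out.put(OP_KEY, "insert"); // auto switch to insert rather than fail
        }
        // keep small-file handling from silently de-duping inserts
        out.put("hoodie.merge.allow.duplicate.on.inserts", "true");
        return out;
    }
}
```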

@xushiyan xushiyan added priority:blocker Production down; release blocker and removed priority:high Significant impact; potential bugs labels Jan 24, 2023
…record keys is enabled

(cherry picked from commit 6616b7afc7ba5d4b8721582f8e553cefbf07fd88)
(cherry picked from commit 2b0fe973c7fa19eda39ecf52622dd07b0976f38e)
private static void validateRecordKey(String recordKeyField) {
checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
private static void validateRecordKey(String recordKeyField, boolean isAutoGenerateRecordKeyEnabled) {
checkArgument(recordKeyField == null || !recordKeyField.isEmpty() || isAutoGenerateRecordKeyEnabled,
Member:

Should the validation on the record key field be:

  1. It is not empty.
  2. Or, isAutoGenerateRecordKeyEnabled is true.

I am assuming the record key field can be null only when isAutoGenerateRecordKeyEnabled is true. We can take it as a follow-up after confirming why a null record key field was allowed in the first place.
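The stricter check being proposed would read roughly as follows; the class and method names are hypothetical, not the patch's actual code.

```java
// Sketch of the stricter validation proposed above: the record key field
// must be a non-empty string unless auto record key generation is enabled.
// Class and method names are hypothetical.
class RecordKeyValidation {
    static void validateRecordKeyStrict(String recordKeyField, boolean isAutoGenerateRecordKeyEnabled) {
        boolean nonEmpty = recordKeyField != null && !recordKeyField.isEmpty();
        if (!(nonEmpty || isAutoGenerateRecordKeyEnabled)) {
            throw new IllegalArgumentException(
                "record key field must be non-empty unless auto record key generation is enabled");
        }
    }
}
```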


import org.apache.hudi.exception.HoodieException
import org.apache.hudi.{HoodieSparkSqlWriter, SparkAdapterSupport}
import org.apache.hudi.{DataSourceWriteOptions, HoodieSparkSqlWriter, SparkAdapterSupport}
Member:

nit: unused import

@hudi-bot (Collaborator)

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan (Contributor)

We are closing this for now; will put up a proper fix later.


Labels

priority:blocker Production down; release blocker


7 participants