[HUDI-2681] Some fixes and config validation when auto generation of record keys is enabled #7668
Conversation
@nsivabalan Can you please take a look?
private var asyncCompactionTriggerFnDefined: Boolean = false
private var asyncClusteringTriggerFnDefined: Boolean = false

def changeOperationToInsertIfRequired(writeOperationType: WriteOperationType, hoodieConfig: HoodieConfig)
afaik in this case, when changing the op from upsert to insert, one should also make sure that the insert mode (SQL_INSERT_MODE) is set to "upsert"; otherwise either duplicate records will be created (non-strict mode) or an error will be thrown (strict mode).
@kazdy : Would you be able to help us w/ a contribution on this? You could take this branch and work on top of it. As of this patch, we have tested that auto generation of record keys works for spark-sql (CTAS and create table + inserts). So, we would appreciate it if you can put up a fix for the strict/non-strict mode.
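To make the ask concrete, here is a minimal sketch of the kind of guard being discussed, assuming the hoodie.sql.insert.mode config exposed as DataSourceWriteOptions.SQL_INSERT_MODE; this is not the actual patch, and the downgrade/validation behavior shown is only an assumption:

import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.config.HoodieConfig
import org.apache.hudi.common.model.WriteOperationType

def changeOperationToInsertIfRequired(writeOperationType: WriteOperationType,
                                      hoodieConfig: HoodieConfig): WriteOperationType = {
  if (writeOperationType == WriteOperationType.UPSERT) {
    // Assumed guard: only downgrade upsert to insert when the SQL insert mode will not
    // duplicate rows (non-strict) or reject them (strict).
    val insertMode = Option(hoodieConfig.getString(DataSourceWriteOptions.SQL_INSERT_MODE))
      .getOrElse("upsert")
    require(insertMode.equalsIgnoreCase("upsert"),
      s"Set ${DataSourceWriteOptions.SQL_INSERT_MODE.key()} to 'upsert' when record keys are " +
        s"auto generated; found '$insertMode'")
    WriteOperationType.INSERT
  } else {
    writeOperationType
  }
}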
Force-pushed 304400b to a586f65
@lokeshj1703 I also noticed that in flink no_precombine is already supported and is guarded behind a config (see hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java, line 107 at c9bc03e).
Resolved review threads on:
...spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala
Force-pushed a586f65 to 155d8af
throw new HoodieKeyGeneratorException(s"Config ${DataSourceWriteOptions.TABLE_TYPE.key()} should be set to " +
  s"COW_TABLE_TYPE_OPT_VAL when $autoGenerateRecordKey is used")
}
if (hoodieConfig.getString(OPERATION) == UPSERT_OPERATION_OPT_VAL) {
Should we change the condition to succeed only if insert or bulk insert is enabled? Is it possible for some operation other than {insert, upsert, bulk insert} to reach this function?
A few high level points:
- Dis-allow de-dup: if de-dup (combine before insert) is enabled, we will fail the write.
- Fail if "hoodie.merge.allow.duplicate.on.inserts" is not enabled, so that hudi does not unintentionally de-dup due to small file handling.
- Fail if someone chooses the MOR table type, for two reasons: (a) there are no updates, so there is no point in choosing MOR; (b) preCombine is a mandatory field w/ a MOR table, but for a table w/ auto generated record keys, preCombine is not required to be set.
- Fail if preCombine or record key field is set.
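For reference, a rough sketch of what these checks could look like, written against a raw option map; the config key strings and defaults below are assumptions for illustration, and the real patch would route them through HoodieConfig / HoodieWriterUtils:

def validateAutoKeyGenConfigs(opts: Map[String, String]): Unit = {
  def boolOpt(key: String, default: Boolean): Boolean =
    opts.get(key).map(_.toBoolean).getOrElse(default)

  // 1. Dis-allow de-dup: combine-before-insert would silently drop duplicates.
  require(!boolOpt("hoodie.combine.before.insert", default = false),
    "De-dup (hoodie.combine.before.insert) cannot be enabled with auto generated record keys")

  // 2. Small file handling must not de-dup either.
  require(boolOpt("hoodie.merge.allow.duplicate.on.inserts", default = false),
    "hoodie.merge.allow.duplicate.on.inserts must be enabled with auto generated record keys")

  // 3. Only COW makes sense: there are no updates, and MOR mandates a preCombine field.
  require(opts.getOrElse("hoodie.datasource.write.table.type", "COPY_ON_WRITE") == "COPY_ON_WRITE",
    "Only COPY_ON_WRITE tables are supported with auto generated record keys")

  // 4. The user must not set preCombine or record key fields explicitly.
  require(!opts.contains("hoodie.datasource.write.precombine.field") &&
      !opts.contains("hoodie.datasource.write.recordkey.field"),
    "preCombine / record key fields must not be set when record keys are auto generated")
}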
Force-pushed 155d8af to dbc0a70
Resolved review threads on:
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieWriterUtils.scala
…record keys is enabled (cherry picked from commit 6616b7afc7ba5d4b8721582f8e553cefbf07fd88)
(cherry picked from commit 2b0fe973c7fa19eda39ecf52622dd07b0976f38e)
Force-pushed dbc0a70 to ce82da8
Before:
private static void validateRecordKey(String recordKeyField) {
  checkArgument(recordKeyField == null || !recordKeyField.isEmpty(),
After:
private static void validateRecordKey(String recordKeyField, boolean isAutoGenerateRecordKeyEnabled) {
  checkArgument(recordKeyField == null || !recordKeyField.isEmpty() || isAutoGenerateRecordKeyEnabled,
Should the validation on the record key field instead be: it is not empty, or isAutoGenerateRecordKeyEnabled is true? I am assuming the record key field can be null only when isAutoGenerateRecordKeyEnabled is true. We can take it up as a follow-up after confirming why a null record key field was allowed in the first place.
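As a Scala sketch of the stricter check suggested here (the real helper is Java, and the error message is made up), the null allowance would be tied to auto generation:

def validateRecordKey(recordKeyField: String, isAutoGenerateRecordKeyEnabled: Boolean): Unit = {
  // Assumed behavior: a missing record key is only acceptable when keys are auto generated.
  require((recordKeyField != null && recordKeyField.nonEmpty) || isAutoGenerateRecordKeyEnabled,
    "Record key field must be non-empty unless auto generation of record keys is enabled")
}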
import org.apache.hudi.exception.HoodieException
import org.apache.hudi.{HoodieSparkSqlWriter, SparkAdapterSupport}
import org.apache.hudi.{DataSourceWriteOptions, HoodieSparkSqlWriter, SparkAdapterSupport}
nit: unused import
We are closing this for now; will put up a proper fix later.
Change Logs
Recently we added support for auto generation of record keys for hudi tables, but some configs need to be tweaked when such auto generation of record keys is used. This patch also fixes auto generation of record keys w/ spark-sql writes and adds tests for the same (create table + insert, and CTAS).
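For illustration, a spark-sql flow of the kind covered by the new tests; the table names, columns, paths and session settings below are made up, and the only point is that no primaryKey / record key field is configured anywhere:

import org.apache.spark.sql.SparkSession

// Assumes the Hudi Spark bundle is on the classpath.
val spark = SparkSession.builder()
  .appName("hudi-auto-keygen-example")
  .master("local[*]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
  .getOrCreate()

// Create table + insert, with no primaryKey in TBLPROPERTIES.
spark.sql(
  """CREATE TABLE hudi_events (id INT, msg STRING, ts BIGINT)
    |USING hudi
    |TBLPROPERTIES (type = 'cow')
    |LOCATION '/tmp/hudi_events'""".stripMargin)
spark.sql("INSERT INTO hudi_events VALUES (1, 'hello', 1000)")

// CTAS variant.
spark.sql(
  """CREATE TABLE hudi_events_ctas
    |USING hudi
    |TBLPROPERTIES (type = 'cow')
    |LOCATION '/tmp/hudi_events_ctas'
    |AS SELECT 2 AS id, 'world' AS msg, 2000L AS ts""".stripMargin)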
Impact
Enables smoother usage of auto record key generation.
Risk level (write none, low, medium or high below)
Medium (Added tests)
Documentation Update
Adds support for KeylessGenerator for insert operations. This ensures the user doesn't need to configure a record key for inserts into an immutable dataset.
Contributor's checklist