-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi #7726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...ient/hudi-client-common/src/test/java/org/apache/hudi/keygen/TestAutoRecordKeyGenerator.java
Show resolved
Hide resolved
...ient/hudi-client-common/src/test/java/org/apache/hudi/keygen/TestAutoRecordKeyGenerator.java
Show resolved
Hide resolved
.../hudi-client-common/src/main/java/org/apache/hudi/keygen/TimestampBasedAvroKeyGenerator.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/ComplexKeyGenerator.java
Show resolved
Hide resolved
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/CustomKeyGenerator.java
Show resolved
Hide resolved
...spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestAutoRecordKeyGeneration.scala
Show resolved
Hide resolved
...-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/ComplexAvroKeyGenerator.java
Show resolved
Hide resolved
...-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/ComplexAvroKeyGenerator.java
Show resolved
Hide resolved
6561f2b to
efe71f4
Compare
|
@codope : addressed all feedback. |
codope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the comments. Ready to land once the CI is green.
efe71f4 to
7de21a1
Compare
|
Merging it as all tests are fixed and only |
| this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key()) | ||
| .split(",")).map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList()); | ||
| this.partitionPathFields = EMPTY_PARTITION_FIELD_LIST; | ||
| instantiateAutoRecordKeyGenerator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please assign var inside the ctor, make it final
| protected final boolean hiveStylePartitioning; | ||
| protected final boolean consistentLogicalTimestampEnabled; | ||
| protected final boolean autoGenerateRecordKeys; | ||
| protected AutoRecordKeyGenerator autoRecordKeyGenerator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wrap it into Option
| protected final boolean encodePartitionPath; | ||
| protected final boolean hiveStylePartitioning; | ||
| protected final boolean consistentLogicalTimestampEnabled; | ||
| protected final boolean autoGenerateRecordKeys; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a separate boolean, let's hold auto-gen as an option
| + "The class attempts to get sufficient uniqueness for keys to prevent frequent collisions by choosing the fields it uses in order of decreasing " | ||
| + "likelihood for uniqueness."); | ||
|
|
||
| public static final ConfigProperty<Integer> NUM_FIELDS_IN_AUTO_RECORDKEY_GENERATION = ConfigProperty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should expose this kind of configuration -- it's very confusing in terms of what exactly it affects, and how setting it will translate into desired outcome
| public void createKeyWithoutPartitionColumn() { | ||
| KeylessKeyGenerator keyGenerator = new KeylessKeyGenerator(getKeyGenProperties("", 3)); | ||
| GenericRecord record = createRecord("partition1", "value1", 123, 456L, TIME, null); | ||
| ComplexAvroKeyGenerator keyGenerator = new ComplexAvroKeyGenerator(getKeyGenProperties("partition_field", 3)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we testing ComplexKeyGenerator here?
We need to test 2 concerns:
AutoRecordKeyGeneratoritself- All existing key-generators in auto-gen mode (just one parameterized test for
getRecordKeyshould be fine)
| tryInitRowAccessor(schema); | ||
| return combineCompositeRecordKeyUnsafe(rowAccessor.getRecordKeyParts(internalRow)); | ||
| if (autoGenerateRecordKeys) { | ||
| return super.getRecordKey(internalRow, schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we redirecting to BuiltinKeyGenerator here?
On top of that i don't see that we've changed it either, so how is this supposed to work?
|
@nsivabalan are @alexeykudinkin 's review feedback addressed? |
…udi (apache#7726)" This reverts commit 2fc20c1.
…che#7726) * Adding auto generation of record keys w/ hudi. Added support to all key gen classes * addressing feedback * Fix tests Co-authored-by: Sagar Sumit <[email protected]>
…udi (apache#7726)" (apache#7747) * Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)" This reverts commit 2fc20c1. * Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)" This reverts commit eacae1e.
…che#7726) * Adding auto generation of record keys w/ hudi. Added support to all key gen classes * addressing feedback * Fix tests Co-authored-by: Sagar Sumit <[email protected]>
…udi (apache#7726)" (apache#7747) * Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)" This reverts commit 2fc20c1. * Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)" This reverts commit eacae1e.
Change Logs
As of now, record key generation and partition path generation are tightly coupled based on the key gen class used. Recently we added auto generation of record keys, its not flexible to be used w/ any key gen class as its a separate key gen class in itself.
This patch fixes that so that users can enable auto generation of record keys along w/ any key gen class. Users have to enable a config called
hoodie.auto.generate.record.keysto let hudi auto generate record keys for them. In such cases, users don't need to set any value forhoodie.datasource.write.recordkey.fieldwhich otherwise is a mandatory field. When this new config is enabled, the key gen class config (hoodie.datasource.write.keygenerator.class) will determine the partition path generation logic as before.Added a new class called AutoRecordKeyGenerator which will be leveraged in all key gen classes for record key related methods.
Impact
Enables users enable auto generation of record keys along w/ any key gen class.
Risk level (write none, low medium or high below)
low.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist