@nsivabalan
Contributor

@nsivabalan nsivabalan commented Jan 21, 2023

Change Logs

As of now, record key generation and partition path generation are tightly coupled through the key gen class in use. We recently added auto generation of record keys, but it's not flexible enough to be used w/ any key gen class, because it is implemented as a separate key gen class in itself.

This patch fixes that so that users can enable auto generation of record keys along w/ any key gen class. Users have to enable a config called hoodie.auto.generate.record.keys to let hudi auto generate record keys for them. In such cases, users don't need to set any value for hoodie.datasource.write.recordkey.field which otherwise is a mandatory field. When this new config is enabled, the key gen class config (hoodie.datasource.write.keygenerator.class) will determine the partition path generation logic as before.
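
A minimal sketch of the writer properties this description implies, assuming the config key keeps the name stated above. `SimpleKeyGenerator` and the `partition_field` column are illustrative choices, and `AutoKeyGenConfigSketch` is a hypothetical class that is not part of this patch. Since this change was later reverted, the key may not exist under this name in a released Hudi version.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not part of the patch.
class AutoKeyGenConfigSketch {

    static Properties buildWriteProps() {
        Properties props = new Properties();
        // Config introduced by this PR (name taken from the description above).
        props.setProperty("hoodie.auto.generate.record.keys", "true");
        // The key gen class still decides how partition paths are generated.
        // SimpleKeyGenerator is just one example choice.
        props.setProperty("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.SimpleKeyGenerator");
        // Assumed partition column name, for illustration.
        props.setProperty("hoodie.datasource.write.partitionpath.field", "partition_field");
        // Note: hoodie.datasource.write.recordkey.field is intentionally left unset.
        return props;
    }

    public static void main(String[] args) {
        buildWriteProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The point of the sketch is only that the record key field is absent while the key gen class and partition path field are set as usual.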

Added a new class called AutoRecordKeyGenerator which will be leveraged in all key gen classes for record key related methods.

Impact

Lets users enable auto generation of record keys along w/ any key gen class.

Risk level (write none, low, medium or high below)

low.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added the priority:blocker (Production down; release blocker) and writer-core labels Jan 21, 2023
@nsivabalan nsivabalan changed the title from "[HUDI-5575] Adding auto generation of record keys w/ hudi. Added support to all key gen clases" to "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi" Jan 21, 2023
@nsivabalan nsivabalan force-pushed the autoGenerationRecKeys branch from 6561f2b to efe71f4 Compare January 23, 2023 22:31
@nsivabalan
Contributor Author

@codope : addressed all feedback.

Member

@codope codope left a comment

Thanks for addressing the comments. Ready to land once the CI is green.

@codope codope force-pushed the autoGenerationRecKeys branch from efe71f4 to 7de21a1 Compare January 24, 2023 11:04
@codope
Member

codope commented Jan 24, 2023

Merging it as all tests are fixed and only testUpsertsContinuousModeWithMultipleWritersForConflicts is flaky, which is expected to be fixed by #7720

@codope codope merged commit 2fc20c1 into apache:master Jan 24, 2023
this.recordKeyFields = Arrays.stream(props.getString(KeyGeneratorOptions.RECORDKEY_FIELD_NAME.key())
.split(",")).map(String::trim).filter(s -> !s.isEmpty()).collect(Collectors.toList());
this.partitionPathFields = EMPTY_PARTITION_FIELD_LIST;
instantiateAutoRecordKeyGenerator();
Contributor

Please assign var inside the ctor, make it final

protected final boolean hiveStylePartitioning;
protected final boolean consistentLogicalTimestampEnabled;
protected final boolean autoGenerateRecordKeys;
protected AutoRecordKeyGenerator autoRecordKeyGenerator;
Contributor

Please wrap it into Option

protected final boolean encodePartitionPath;
protected final boolean hiveStylePartitioning;
protected final boolean consistentLogicalTimestampEnabled;
protected final boolean autoGenerateRecordKeys;
Contributor

We don't need a separate boolean, let's hold auto-gen as an option
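
One way to read the three review comments above (final fields assigned in the constructor, the generator wrapped in an option, no separate boolean) is sketched below. This is an illustrative reading, not the patch's code: `SketchAutoRecordKeyGenerator` and `SketchKeyGenerator` are hypothetical stand-ins, the counter-based key is a placeholder for whatever scheme the real generator uses, and `java.util.Optional` is used in place of Hudi's own `Option` type to keep the example self-contained.

```java
import java.util.Optional;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for AutoRecordKeyGenerator; the real key scheme may differ.
class SketchAutoRecordKeyGenerator {
    private final AtomicLong counter = new AtomicLong();

    String nextKey() {
        return "auto_" + counter.getAndIncrement();
    }
}

// Hypothetical stand-in for a key generator base class.
class SketchKeyGenerator {
    // Final and assigned once in the constructor; presence of the value
    // replaces a separate autoGenerateRecordKeys boolean.
    private final Optional<SketchAutoRecordKeyGenerator> autoRecordKeyGenerator;

    SketchKeyGenerator(Properties props) {
        boolean enabled = Boolean.parseBoolean(
            props.getProperty("hoodie.auto.generate.record.keys", "false"));
        this.autoRecordKeyGenerator = enabled
            ? Optional.of(new SketchAutoRecordKeyGenerator())
            : Optional.empty();
    }

    // Uses the auto generator when present, otherwise the configured key value.
    String getRecordKey(String configuredKeyValue) {
        return autoRecordKeyGenerator
            .map(SketchAutoRecordKeyGenerator::nextKey)
            .orElse(configuredKeyValue);
    }
}
```

With this shape, subclasses never branch on a flag; they only ask whether the optional generator is present.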

+ "The class attempts to get sufficient uniqueness for keys to prevent frequent collisions by choosing the fields it uses in order of decreasing "
+ "likelihood for uniqueness.");

public static final ConfigProperty<Integer> NUM_FIELDS_IN_AUTO_RECORDKEY_GENERATION = ConfigProperty
Contributor

I don't think we should expose this kind of configuration -- it's very confusing in terms of what exactly it affects, and how setting it will translate into the desired outcome

public void createKeyWithoutPartitionColumn() {
KeylessKeyGenerator keyGenerator = new KeylessKeyGenerator(getKeyGenProperties("", 3));
GenericRecord record = createRecord("partition1", "value1", 123, 456L, TIME, null);
ComplexAvroKeyGenerator keyGenerator = new ComplexAvroKeyGenerator(getKeyGenProperties("partition_field", 3));
Contributor

Why are we testing ComplexKeyGenerator here?

We need to test 2 concerns:

  • AutoRecordKeyGenerator itself
  • All existing key-generators in auto-gen mode (just one parameterized test for getRecordKey should be fine)
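
The second concern could look roughly like the table-driven check below. It is a sketch under assumptions: `SketchKeyGen` and `AutoGenModeCheck` are hypothetical names, the UUID-based key is a placeholder rather than the patch's key scheme, and the plain loop stands in for what would more likely be a JUnit parameterized test in the Hudi code base.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical minimal view of a key generator, for illustration only.
interface SketchKeyGen {
    String getRecordKey(String record);
}

class AutoGenModeCheck {

    // Placeholder for "any existing key generator running in auto-gen mode".
    static SketchKeyGen autoGenMode() {
        return record -> UUID.randomUUID().toString();
    }

    // A generator that always returns the same key, used to show the check failing.
    static SketchKeyGen constantKey() {
        return record -> "same-key";
    }

    // The single shared assertion: keys are non-empty and unique across n records.
    static boolean keysAreUniqueAndNonEmpty(SketchKeyGen gen, int n) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < n; i++) {
            String key = gen.getRecordKey("record-" + i);
            if (key == null || key.isEmpty() || !seen.add(key)) {
                return false;
            }
        }
        return true;
    }

    // Runs the same check over every generator, like one parameterized test.
    static boolean checkAll(List<SketchKeyGen> generators, int n) {
        return generators.stream().allMatch(g -> keysAreUniqueAndNonEmpty(g, n));
    }
}
```

Each real key generator would be one entry in the list, so adding a new generator only adds a parameter, not a new test.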

tryInitRowAccessor(schema);
return combineCompositeRecordKeyUnsafe(rowAccessor.getRecordKeyParts(internalRow));
if (autoGenerateRecordKeys) {
return super.getRecordKey(internalRow, schema);
Contributor

Why are we redirecting to BuiltinKeyGenerator here?
On top of that, I don't see that we've changed it either, so how is this supposed to work?

@vinothchandar
Member

@nsivabalan has @alexeykudinkin's review feedback been addressed?

codope added a commit to codope/hudi that referenced this pull request Jan 25, 2023
codope added a commit that referenced this pull request Jan 25, 2023
…udi (#7726)" (#7747)

* Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (#7726)"

This reverts commit 2fc20c1.

* Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (#7640)"

This reverts commit eacae1e.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…che#7726)

* Adding auto generation of record keys w/ hudi. Added support to all key gen classes

* addressing feedback

* Fix tests

Co-authored-by: Sagar Sumit <[email protected]>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Jan 31, 2023
…udi (apache#7726)" (apache#7747)

* Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)"

This reverts commit 2fc20c1.

* Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)"

This reverts commit eacae1e.
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…che#7726)

* Adding auto generation of record keys w/ hudi. Added support to all key gen classes

* addressing feedback

* Fix tests

Co-authored-by: Sagar Sumit <[email protected]>
fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…udi (apache#7726)" (apache#7747)

* Revert "[HUDI-5575] Adding/Fixing auto generation of record keys w/ hudi (apache#7726)"

This reverts commit 2fc20c1.

* Revert "[HUDI-5514] Add in support for a keyless workflow by building an ID based off of values within the record (apache#7640)"

This reverts commit eacae1e.