Skip to content

Conversation

@codope
Copy link
Member

@codope codope commented Dec 3, 2021

What is the purpose of the pull request

#3944 handled logical timestamp type for complex keygen. This PR fixes HUDI-2909 by handing it for timestamp based keygen.

When the new config hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled is set to true, consistent value will be generated for a logical timestamp type column like timestamp-millis and timestamp-micros, irrespective of whether row-writer is enabled. Disabled by default so as not to break the pipeline that deploy either fully row-writer path or non row-writer path. For example, if it is kept disabled then record key of timestamp type with value 2016-12-29 09:54:00 will be written as timestamp 2016-12-29 09:54:00.0 in row-writer path, while it will be written as long value 1483023240000000 in non row-writer path. If enabled, then the timestamp value will be written in both the cases.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

Added UT and manually verified with spark datasource.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codope codope added the priority:blocker Production down; release blocker label Dec 3, 2021
@codope codope removed the priority:blocker Production down; release blocker label Dec 4, 2021
@nsivabalan
Copy link
Contributor

@codope : is this good to review. If there are any pending work, I can take it up as I am focusing on all issues and jiras. let me know.

@nsivabalan nsivabalan self-assigned this Dec 15, 2021
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Dec 15, 2021
@codope codope force-pushed the hudi-2909-timestamp-keygen branch from e12df0f to 5b68cad Compare December 31, 2021 15:50
@codope
Copy link
Member Author

codope commented Dec 31, 2021

@codope : is this good to review. If there are any pending work, I can take it up as I am focusing on all issues and jiras. let me know.

@nsivabalan It is now ready to review. Both row-writer and non-row writer path will emit same value for logical timestamp type column if the config is enabled.

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can avoid changing public apis in KeyGenerator.

}

public boolean isConsistentLogicalTimestampEnabled() {
if (getBoolean(KeyGeneratorOptions.KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't getBoolean return the default if not found?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@YannByron
Copy link
Contributor

@nsivabalan @codope
I have a discussion related to this implement. In this pr, most of work is just to pass isConsistentLogicalTimestampEnabled to the method HoodieAvroUtils.convertValueForAvroLogicalTypes. What if we have another config need to do this in the future?

@codope
Copy link
Member Author

codope commented Jan 6, 2022

@nsivabalan @codope I have a discussion related to this implement. In this pr, most of work is just to pass isConsistentLogicalTimestampEnabled to the method HoodieAvroUtils.convertValueForAvroLogicalTypes. What if we have another config need to do this in the future?

@YannByron You bring up a good point. Adding another config in future is tedious. However, the intention behind adding a new config was to avoid discrepancy in existing pipelines. @nsivabalan has explained this in more detail on the jira HUDI-2909. I do not expect such changes to be frequent. Nevertheless, i'll try to avoid making incompatible changes to public APIs.

@codope codope force-pushed the hudi-2909-timestamp-keygen branch from 5b68cad to 2c0565c Compare January 6, 2022 15:57
@nsivabalan
Copy link
Contributor

@codope : TestGenericRecordAndRowConsistency test failed in CI. Can you please check.

@codope codope force-pushed the hudi-2909-timestamp-keygen branch 2 times, most recently from fde98bb to cb2d16c Compare January 7, 2022 07:19
@codope
Copy link
Member Author

codope commented Jan 7, 2022

@hudi-bot run azure

Add config to enable consistent logical timestamp values

Address review comments

Add an example in the config doc

Cleanup spark context in test
@codope codope force-pushed the hudi-2909-timestamp-keygen branch from cb2d16c to 9c22d6c Compare January 7, 2022 08:27
@hudi-bot
Copy link
Collaborator

hudi-bot commented Jan 8, 2022

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@nsivabalan nsivabalan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@nsivabalan nsivabalan merged commit 8275499 into apache:master Jan 8, 2022
nsivabalan pushed a commit that referenced this pull request Jan 10, 2022
* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
*/
public static String getNestedFieldValAsString(GenericRecord record, String fieldName, boolean returnNullIfNotFound) {
Object obj = getNestedFieldVal(record, fieldName, returnNullIfNotFound);
public static String getNestedFieldValAsString(GenericRecord record, String fieldName, boolean returnNullIfNotFound, boolean consistentLogicalTimestampEnabled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codope @nsivabalan folks, i don't believe this is the right fix to address the problem: bubbling down this config value we're now exposing every user of this API to be aware of it, which, very likely, will have very little to do with it.

@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
…#4203)

* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
…#4203)

* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…#4203)

* [HUDI-2909] Handle logical type in TimestampBasedKeyGenerator

Timestampbased key generator was returning diff values for row writer and non row writer path. this patch fixes it and is guarded by a config flag (`hoodie.datasource.write.keygenerator.consistent.logical.timestamp.enabled`)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:critical Production degraded; pipelines stalled

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants