[HUDI-3178] Fixing metadata table compaction so as to not include uncommitted data by nsivabalan · Pull Request #4530 · apache/hudi

nsivabalan · 2022-01-07T01:57:51Z

What is the purpose of the pull request

While committing any write, we commit to the metadata table first and then to the data table. At the end of the metadata table commit, we also trigger compaction if its conditions are met. If the write later fails in the data table, that compaction may already have included the uncommitted data, and once compacted, that data can no longer be ignored when reading from the metadata table. This patch fixes the bug by triggering metadata table compaction before applying the commit to the metadata table.
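
The ordering change described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Hudi's implementation: `ToyMetadataTable`, `applyDeltaCommit`, `compact`, `read`, and `run` are hypothetical names, and the model assumes a reader can skip uncompacted log entries whose data table commit never completed but cannot filter anything already folded into the compacted base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MdtCompactionOrdering {

    // Hypothetical toy model: a compacted base plus a log of delta commits.
    static final class ToyMetadataTable {
        final Set<String> compactedBase = new TreeSet<>();
        final List<String> deltaLog = new ArrayList<>();

        void applyDeltaCommit(String instant) {
            deltaLog.add(instant);
        }

        // Folds every logged delta commit into the base, committed or not.
        void compact() {
            compactedBase.addAll(deltaLog);
            deltaLog.clear();
        }

        // Assumption: log entries can be filtered against the data table
        // timeline at read time, compacted entries cannot.
        Set<String> read(Set<String> completedInDataTable) {
            Set<String> visible = new TreeSet<>(compactedBase);
            for (String instant : deltaLog) {
                if (completedInDataTable.contains(instant)) {
                    visible.add(instant);
                }
            }
            return visible;
        }
    }

    // Applies C1 and C2 successfully, then C3, which fails in the data table.
    static Set<String> run(boolean compactBeforeApply) {
        ToyMetadataTable mdt = new ToyMetadataTable();
        Set<String> completed = new TreeSet<>();
        for (String instant : List.of("C1", "C2")) {
            mdt.applyDeltaCommit(instant);
            completed.add(instant);
        }
        if (compactBeforeApply) {
            mdt.compact();            // ordering after this patch
        }
        mdt.applyDeltaCommit("C3");
        if (!compactBeforeApply) {
            mdt.compact();            // ordering before this patch
        }
        // C3 never completes in the data table, so it is not added to 'completed'.
        return mdt.read(completed);
    }

    public static void main(String[] args) {
        System.out.println("compact after apply:  " + run(false));
        System.out.println("compact before apply: " + run(true));
    }
}
```

In this model, compacting after the apply step makes the failed C3 visible to readers, while compacting before it leaves C3 in the log where it can still be skipped.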

Brief change log

(for example:)

Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


hudi-bot · 2022-01-07T06:10:53Z

CI report:

f67f3a7 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

nsivabalan · 2022-01-07T12:55:28Z

@manojpec: Can you review the patch too?

prashantwason

LGTM

nsivabalan

Will go ahead and land this. Let me know if you have any more feedback. Will take it up in a follow up.

manojpec · 2022-01-09T23:46:42Z

        .get().getTimestamp();
    List<HoodieInstant> pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested()
-        .findInstantsBefore(latestDeltacommitTime).getInstants().collect(Collectors.toList());
+        .findInstantsBefore(instantTime).getInstants().collect(Collectors.toList());


compactionInstantTime at line 703 has to be based off instantTime and not latestDeltaCommitTime. Latest delta commit time is not part of the compaction yet. Otherwise we are changing the meaning of the current compaction timeline with this change.

nsivabalan · 2022-01-10

Let me try to explain.
Let's say we have 10 commits, C1 through C10.
Prior to this patch, we compacted immediately after C10, so the compaction commit was C10 + "001".

With this patch, we compact just before C11 is applied to the MDT.
So I am basing the compaction commit on the latest delta commit time, which is C10, and not on the instant time, which is C11.
The compaction commit is therefore still C10 + "001". If I went with instantTime, we would change the behavior. In fact, we can't do that, since the compaction time would then be greater than the delta commit that will eventually be created when we apply C11 to the MDT.

Let me know if this makes sense.

manojpec · 2022-01-10

Right, I was actually asking for the compaction time to be C10 and not C11. I misread line 689. Looks good then.
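
The ordering argument in this thread can be checked with plain string comparison. The following is a minimal sketch, assuming instant times are fixed-width timestamp strings that sort lexicographically; the timestamps, the `CompactionInstantSketch` class, and the `compactionInstantTime` helper are hypothetical and not taken from the patch. Only the "001" suffix comes from the discussion above.

```java
import java.util.List;

public class CompactionInstantSketch {

    // Suffix mentioned in the review thread (C10 + "001").
    static final String COMPACTION_SUFFIX = "001";

    // Hypothetical helper: derive the compaction instant from the latest
    // delta commit already applied to the metadata table.
    static String compactionInstantTime(List<String> appliedDeltaCommits) {
        String latest = appliedDeltaCommits.get(appliedDeltaCommits.size() - 1);
        return latest + COMPACTION_SUFFIX;
    }

    public static void main(String[] args) {
        // Hypothetical instants standing in for C9, C10 and the incoming C11.
        List<String> applied = List.of("20220107010000", "20220107020000");
        String incoming = "20220107030000";

        String fromLatest = compactionInstantTime(applied);
        String fromIncoming = incoming + COMPACTION_SUFFIX;

        // Based on C10: sorts after C10 and before C11.
        System.out.println(fromLatest.compareTo(applied.get(1)) > 0);
        System.out.println(fromLatest.compareTo(incoming) < 0);

        // Based on C11: sorts after the delta commit that C11 will create.
        System.out.println(fromIncoming.compareTo(incoming) > 0);
    }
}
```

Under these assumptions, a compaction instant derived from the latest applied delta commit falls strictly between that commit and the incoming one, whereas one derived from the incoming instant would sort after the incoming delta commit.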

…ommitted data (#4530) - There is a chance that the actual write eventually failed in data table but commit was successful in Metadata table, and if compaction was triggered in MDT, compaction could have included the uncommitted data. But once compacted, it may never be ignored while reading from metadata table. So, this patch fixes the bug. Metadata table compaction is triggered before applying the commit to metadata table to circumvent this issue.


alexeykudinkin · 2022-06-29T20:33:24Z


    try (SparkRDDWriteClient writeClient = new SparkRDDWriteClient(engineContext, metadataWriteConfig, true)) {
+      if (canTriggerTableService) {
+        // trigger compaction before doing the delta commit. this is to ensure, if this delta commit succeeds in metadata table, but failed in data table,


Why is it the case that MT commit could succeed, while Data Table commit could fail?
MT table should only be updated after we're done with the Data Table changes, and right before we complete the txn, right?

nsivabalan added the priority:critical Production degraded; pipelines stalled label Jan 7, 2022

nsivabalan requested a review from prashantwason January 7, 2022 02:09

[HUDI-3178] Fixing metadata table compaction so as to not include uncommitted data

f67f3a7

nsivabalan force-pushed the metadataCompactionIgnoreFailedCommits branch from 33af796 to f67f3a7 Compare January 7, 2022 04:34

prashantwason reviewed Jan 8, 2022

Comment thread ...di-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

prashantwason approved these changes Jan 8, 2022

nsivabalan commented Jan 8, 2022

nsivabalan merged commit 98ec215 into apache:master Jan 8, 2022

manojpec reviewed Jan 9, 2022

vinishjail97 mentioned this pull request Jan 24, 2022

FixIgnoreKey nsivabalan/hudi#11

Closed

5 tasks

alexeykudinkin reviewed Jun 29, 2022

nsivabalan merged 1 commit into apache:master from nsivabalan:metadataCompactionIgnoreFailedCommits

5 participants
