@prashantwason
Member

[HUDI-6151] Rollback previously applied commits to MDT when operations are retried.

Change Logs

Operations such as clean and compaction are retried after a failure with the same instant time. If the previous run of the operation committed successfully to the MDT but failed to commit to the dataset, the operation is retried later with the same instantTime, which causes duplicate updates to be applied to the MDT.

Currently, we simply delete the completed deltacommit without rolling back the changes it wrote.

To handle this, we detect a replay of an operation and roll back any changes from that operation in the MDT.
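The replay handling described above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `MdtReplaySketch`, its `apply` and `rollback` methods, and the record layout are hypothetical and are not Hudi classes or APIs. The sketch shows that rolling back a previously applied instant before re-applying it keeps the MDT free of duplicates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the MDT; none of these names are Hudi APIs.
class MdtReplaySketch {
    enum State { INFLIGHT, COMPLETED }

    // Instant time -> state of the deltacommit on the toy MDT timeline.
    private final Map<String, State> timeline = new HashMap<>();
    // Records written to the toy MDT, each tagged with the instant that wrote it.
    private final List<String[]> records = new ArrayList<>();

    /** Applies an operation's updates, rolling back first if the instant was seen before. */
    public void apply(String instantTime, List<String> updates) {
        if (timeline.containsKey(instantTime)) {
            // Replay detected: an earlier attempt already (or partially) wrote this instant.
            rollback(instantTime);
        }
        timeline.put(instantTime, State.INFLIGHT);
        for (String update : updates) {
            records.add(new String[] {instantTime, update});
        }
        timeline.put(instantTime, State.COMPLETED);
    }

    /** Removes the timeline entry and every record written by the given instant. */
    public void rollback(String instantTime) {
        records.removeIf(r -> r[0].equals(instantTime));
        timeline.remove(instantTime);
    }

    public int recordCount() {
        return records.size();
    }

    public static void main(String[] args) {
        MdtReplaySketch mdt = new MdtReplaySketch();
        // First attempt: the MDT commit succeeds, then the dataset commit fails.
        mdt.apply("005", Arrays.asList("file-a", "file-b"));
        // Retry with the same instant time.
        mdt.apply("005", Arrays.asList("file-a", "file-b"));
        System.out.println(mdt.recordCount()); // prints 2
    }
}
```

Without the `containsKey` check, the retry would leave four records for instant `005` instead of two.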

Impact

Fixes the issue of duplicate log blocks being written to the MDT. This is detrimental for indexes where duplicates are not allowed.

Risk level (write none, low medium or high below)

None. Unit test has been added.

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

// if this is a new commit being applied to metadata for the first time
writeClient.startCommitWithTime(instantTime);
LOG.info("New commit at " + instantTime + " being applied to MDT");
} else {
Contributor

to MDT -> to MDT.

Member Author

Fixed

Contributor

Can we also sync the logic to FlinkHoodieBackedTableMetadataWriter?

alreadyCompletedInstant.isPresent() ? "Already" : "Partially", instantTime));

// Rollback the previous committed commit
if (!writeClient.rollback(instantTime)) {
Contributor

Trying to gauge whether we really need this.

I guess that in the next couple of patches you are going to add the change below:

  • Any rollback in DT will be an actual rollback in MDT as well.

Having said that, let's go through this use case.

Compaction commit C5 is inflight in DT and succeeded in MDT, but crashed in DT.
So on restart, a rollback is triggered in DT which, when it gets into MDT territory, will rollback the succeeded commit in MDT. So, it will be automatically taken care of.

After the rollback of C5 is completed, C5 will be re-attempted in DT, and when it gets into MDT territory, there won't be any traces of DC5 at all. So I am wondering when exactly we will hit this case.

Contributor

If we are talking about a partially failed commit in MDT:

Compaction commit C5 is inflight in DT, and DC5 in MDT was also partially committed before the crash.
On restart, when any new operation in DT gets into MDT territory and detects a partial commit in MDT, a rollback will be triggered eagerly. Ref:

So, this case is also taken care of.

Contributor

will rollback the succeeded commit in MDT. So, it will be automatically taken care of.

The current code on master just removes the .complete metadata file. Is that the rollback you are referring to? To stay in sync with the regular rollback, I think triggering a real rollback action is necessary.

Contributor

The older solution of removing the completed action and reattempting won't work in all scenarios. We have to consider the following:
(1) c1.commit failed on the main dataset; on MDT, c1.deltacommit was completed.
    (a) With record index enabled, a new log block was added to the log file by c1.deltacommit. Simply removing the deltacommit may not be enough; an additional action is required to roll back the log block and keep the log file consistent.
(2) c1.clean was attempted and c1.deltacommit was completed. When the clean is retried, the second attempt could bring in some of the files that were in the "failed" list of the first attempt (vs. the "success" list).
(3) c1.rollback was attempted and c1.deltacommit was completed. (We fixed an issue in the scenario of an incomplete rollback where the MDT had already been updated with the deltacommit. This change played a role in that scenario as well.)
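Scenarios (1a) and (2) can be illustrated with a toy model that contrasts removing only the completion marker with a real rollback. Everything here is an assumption made for illustration: `MarkerDeleteVsRollbackSketch`, `LogBlock`, and the method names are hypothetical, and the sketch drops rolled-back blocks from a list, which is a simplification and says nothing about how Hudi actually invalidates log blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical toy model of an MDT log file plus completion markers; not Hudi code.
class MarkerDeleteVsRollbackSketch {
    static final class LogBlock {
        final String instantTime;
        final List<String> files;

        LogBlock(String instantTime, List<String> files) {
            this.instantTime = instantTime;
            this.files = files;
        }
    }

    private final List<LogBlock> logFile = new ArrayList<>();
    private final Set<String> completedInstants = new HashSet<>();

    /** Appends a log block for the instant and marks the deltacommit as completed. */
    public void commit(String instantTime, List<String> files) {
        logFile.add(new LogBlock(instantTime, files));
        completedInstants.add(instantTime);
    }

    /** Older approach: only the completion marker is removed; the log block stays. */
    public void deleteMarkerOnly(String instantTime) {
        completedInstants.remove(instantTime);
    }

    /** Real rollback (simplified): the marker and the instant's log blocks both go away. */
    public void rollback(String instantTime) {
        completedInstants.remove(instantTime);
        logFile.removeIf(b -> b.instantTime.equals(instantTime));
    }

    /** Counts the log blocks that were written by the given instant. */
    public int blocksFor(String instantTime) {
        int count = 0;
        for (LogBlock b : logFile) {
            if (b.instantTime.equals(instantTime)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Marker-only removal: the retried clean adds a second block for c1,
        // and its file list differs from the first attempt.
        MarkerDeleteVsRollbackSketch markerOnly = new MarkerDeleteVsRollbackSketch();
        markerOnly.commit("c1", Arrays.asList("f1"));
        markerOnly.deleteMarkerOnly("c1");
        markerOnly.commit("c1", Arrays.asList("f1", "f2"));
        System.out.println("marker-only blocks for c1: " + markerOnly.blocksFor("c1"));

        // Real rollback: only the retried attempt's block remains.
        MarkerDeleteVsRollbackSketch rolledBack = new MarkerDeleteVsRollbackSketch();
        rolledBack.commit("c1", Arrays.asList("f1"));
        rolledBack.rollback("c1");
        rolledBack.commit("c1", Arrays.asList("f1", "f2"));
        System.out.println("rollback blocks for c1: " + rolledBack.blocksFor("c1"));
    }
}
```

In the marker-only run, two blocks with different file lists exist for the same instant, which is the kind of inconsistency described in the scenarios above.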

Contributor

Yeah, keeping the rollback in sync with the normal DT rollback can avoid many potential bugs. +1 for this direction.

@nsivabalan nsivabalan self-assigned this May 2, 2023
@nsivabalan nsivabalan added the release-0.14.0 and priority:critical (Production degraded; pipelines stalled) labels May 2, 2023
@nbalajee
Contributor

nbalajee commented May 4, 2023

LGTM.

Contributor

@nbalajee nbalajee left a comment

LGTM

@danny0405
Contributor

+1, we are good if the test failures are resolved.

@nsivabalan
Contributor

once the CI is green, we can land

@danny0405
Contributor

@prashantwason can you rebase with the latest master code and re-trigger the Azure CI tests?

@yihua yihua force-pushed the pw_fix_mdt_rollback_on_fail branch from f1653d9 to f0cf8b8 Compare May 23, 2023 20:31
@yihua
Contributor

yihua commented May 23, 2023

@danny0405 @nsivabalan @prashantwason I rebased the PR on the latest master. Once CI passes, we can land this.

Contributor

@danny0405 danny0405 left a comment

Hi @prashantwason, can we also sync the fix to FlinkHoodieBackedTableMetadataWriter?

@xushiyan xushiyan self-assigned this Jun 19, 2023
@prashantwason prashantwason force-pushed the pw_fix_mdt_rollback_on_fail branch from 7612c1e to 2faefb4 Compare June 28, 2023 06:55
@prashantwason
Member Author

@danny0405 I have synced the fix to FlinkHoodieBackedTableMetadataWriter. Please take a look.

@prashantwason
Member Author

@hudi-bot run azure

Contributor

@danny0405 danny0405 left a comment

+1

@danny0405
Contributor

@codope codope force-pushed the pw_fix_mdt_rollback_on_fail branch from e6b3312 to eb39bc7 Compare June 29, 2023 04:39
Member

@codope codope left a comment

Change looks good. Have fixed the failing tests. We can land once the CI is green.

@danny0405
Contributor

Change looks good. Have fixed the failing tests. We can land once the CI is green.

Nice job! @codope ~

@danny0405 danny0405 merged commit b95248e into apache:master Jun 29, 2023