@Zouxxyy Zouxxyy commented Aug 10, 2023

Change Logs

The current getCleanInstantsToArchive filters clean instants and rollback instants separately, each according to maxInstantsToKeep and minInstantsToKeep. This has two disadvantages:

  1. If a user has only a few rollback instants (too few to qualify for archiving), they stay in the active timeline forever, even after the commit instants have moved far past them, which can be very confusing for users.

e.g.:

// user: what is 202101.rollback?
202101.rollback, 202301.commit, 202302.commit ...

  2. Archiving clean and rollback instants separately will cause holes in the active timeline.

e.g.:

archive:
2.clean, 3.clean, 4.clean

active:
1.commit, 5.commit ...

A continuous active timeline is our goal.
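
To make the notion of a "hole" concrete, here is a minimal sketch that is not part of the PR. It assumes instants are plain `timestamp.action` strings already sorted by timestamp; `TimelineHoleSketch` and `leavesHole` are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Hudi code: detects whether archiving a set of
// instants leaves a hole in the active timeline.
final class TimelineHoleSketch {

  // Returns true if some retained instant is older than an archived instant,
  // i.e. the archived instants do not form a prefix of the sorted timeline.
  static boolean leavesHole(List<String> sortedTimeline, Set<String> archived) {
    boolean seenRetained = false;
    for (String instant : sortedTimeline) {
      if (archived.contains(instant)) {
        if (seenRetained) {
          return true; // an archived instant sits after a retained one
        }
      } else {
        seenRetained = true;
      }
    }
    return false;
  }

  // The example above: only the clean instants are archived.
  static boolean demoSeparateArchiving() {
    List<String> timeline = Arrays.asList("1.commit", "2.clean", "3.clean", "4.clean", "5.commit");
    Set<String> archived = new HashSet<>(Arrays.asList("2.clean", "3.clean", "4.clean"));
    return leavesHole(timeline, archived);
  }

  // Archiving a contiguous prefix instead leaves no hole.
  static boolean demoPrefixArchiving() {
    List<String> timeline = Arrays.asList("1.commit", "2.clean", "3.clean", "4.clean", "5.commit");
    Set<String> archived = new HashSet<>(Arrays.asList("1.commit", "2.clean", "3.clean", "4.clean"));
    return leavesHole(timeline, archived);
  }
}
```

In this sketch, `demoSeparateArchiving()` reports a hole because 1.commit is retained while later clean instants are archived, whereas `demoPrefixArchiving()` does not.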

Therefore, this PR modifies the logic of getCleanAndRollbackInstantsToArchive as follows:

// Since the commit instants to archive are continuous, we can use the newest commit instant to archive as the
// right boundary to collect the clean or rollback instants to archive.
//
//                                                  newestCommitInstantToArchive
//                                                               v
//  | commit1 clean1 commit2 commit3 clean2 commit4 rollback1 commit5 | commit6 clean3 commit7 ...
//  | <------------------  instants to archive  --------------------> |
//
//  CommitInstantsToArchive: commit1, commit2, commit3, commit4, commit5
//  CleanAndRollbackInstantsToArchive: clean1, clean2, rollback1
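
The boundary rule in the comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual Hudi implementation: `SketchInstant` is a hypothetical stand-in for HoodieInstant, timestamps are zero-padded strings compared lexicographically, and the commit instants to archive are simply passed in by the caller.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class and method names are assumptions.
final class ArchiveBoundarySketch {

  // Hypothetical stand-in for HoodieInstant: a timestamp plus an action name.
  static final class SketchInstant {
    final String timestamp;
    final String action;

    SketchInstant(String timestamp, String action) {
      this.timestamp = timestamp;
      this.action = action;
    }

    boolean isCleanOrRollback() {
      return action.equals("clean") || action.equals("rollback");
    }

    @Override
    public String toString() {
      return timestamp + "." + action;
    }
  }

  // Collects the given commit instants plus every clean/rollback instant that
  // is older than the newest commit instant to archive (the right boundary).
  static List<SketchInstant> instantsToArchive(List<SketchInstant> activeTimeline,
                                               List<SketchInstant> commitsToArchive) {
    List<SketchInstant> result = new ArrayList<>(commitsToArchive);
    if (commitsToArchive.isEmpty()) {
      return result; // no commit to archive means no boundary, so nothing is collected
    }
    String boundary = commitsToArchive.get(commitsToArchive.size() - 1).timestamp;
    for (SketchInstant instant : activeTimeline) {
      if (instant.isCleanOrRollback() && instant.timestamp.compareTo(boundary) < 0) {
        result.add(instant);
      }
    }
    result.sort((a, b) -> a.timestamp.compareTo(b.timestamp));
    return result;
  }

  // Rebuilds the timeline from the comment above with timestamps 01..11,
  // where commit1..commit5 (timestamps 01..08) are the commits to archive.
  static List<SketchInstant> demo() {
    String[] actions = {"commit", "clean", "commit", "commit", "clean", "commit",
                        "rollback", "commit", "commit", "clean", "commit"};
    List<SketchInstant> timeline = new ArrayList<>();
    List<SketchInstant> commits = new ArrayList<>();
    for (int i = 0; i < actions.length; i++) {
      SketchInstant instant = new SketchInstant(String.format("%02d", i + 1), actions[i]);
      timeline.add(instant);
      if (i < 8 && actions[i].equals("commit")) {
        commits.add(instant);
      }
    }
    return instantsToArchive(timeline, commits);
  }
}
```

With this input, `demo()` returns the eight instants from commit1 through commit5, including clean1, clean2 and rollback1, so the archived instants form one contiguous block.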

Impact

  • Modify the logic of getCleanAndRollbackInstantsToArchive to fix the two disadvantages above.
  • Optimize the processing of getCommitInstantsToArchive by dividing it into two parts: 1) handling oldestInstantToRetain, 2) handling savepoints.
  • Fix all test cases.

Risk level (write none, low medium or high below)

medium

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@Zouxxyy Zouxxyy changed the title [HUDI-6678] Fix the acquisition of clean and rollback instants to arc… [HUDI-6678] Fix the acquisition of clean&rollback instants to archive Aug 10, 2023

Zouxxyy commented Aug 11, 2023

@suryaprasanna @yihua @prashantwason can you help with a review?

@danny0405 danny0405 added the area:table-service Table services label Aug 18, 2023
@yihua yihua self-assigned this Aug 18, 2023
@danny0405
Contributor

Looks good from my side, cc @yihua for the final review.

@yihua yihua left a comment

Overall looks ok to me. Will take a second pass to inspect details.

// beyond savepoint) and the earliest inflight instant (all actions).
// This is required by metadata table, see HoodieTableMetadataUtil#processRollbackMetadata
// for details.
// Todo: Remove #7580

+1 on @danny0405's comment. Before this PR, the logic was to keep at least one clean and rollback instant on the active timeline. Archiving all rollback instants in the active timeline is OK, but there is a caveat with archiving all clean instants (e.g., when all completed clean instants are before the earliest commit to retain; one such case is a user turning off cleaning for some time and then turning it on again): incremental cleaning may not be able to get the latest clean instant.

However, keeping at least one clean instant in the active timeline is not the right approach under the new logic either, because that can block the archival of commits for the sake of a contiguous block of instants.

@Zouxxyy could you double check whether there is any issue with respect to the different cleaning modes under the new archival logic? E.g., incremental cleaning should fall back to full cleaning if there is no clean instant on the active timeline.

@Zouxxyy Zouxxyy Sep 14, 2023

@yihua see #9416 (comment). It explains why incremental cleaning always works even with this PR.


yihua commented Sep 14, 2023

@Zouxxyy could you also rebase the PR on the master?


Zouxxyy commented Sep 14, 2023

> @Zouxxyy could you also rebase the PR on the master?

Thanks for the review. I will rebase tonight; the conflicts are fairly large.


yihua commented Sep 15, 2023

Taking a final pass now.

@yihua yihua left a comment

LGTM. Finally, the archival logic is intuitive and easy to follow. Thanks for improving it :)

Comment on lines +285 to +293
List<HoodieInstant> instantsToArchive = getCommitInstantsToArchive();
if (!instantsToArchive.isEmpty()) {
  HoodieInstant latestCommitInstantToArchive = instantsToArchive.get(instantsToArchive.size() - 1);
  // Then get clean and rollback instants to archive.
  List<HoodieInstant> cleanAndRollbackInstantsToArchive =
      getCleanAndRollbackInstantsToArchive(latestCommitInstantToArchive);
  instantsToArchive.addAll(cleanAndRollbackInstantsToArchive);
  instantsToArchive.sort(HoodieInstant.COMPARATOR);
}

nit: I think the logic can be further simplified by treating all instants to archive the same, i.e., getting all instants before the earliest commit to retain, without differentiating the action types. We can follow up in a separate PR.
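
One way to read this suggestion is sketched below. It is only an illustration of the idea, not code from this PR or a follow-up: instants are modeled as `timestamp.action` strings with zero-padded timestamps, and `UniformArchiveSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: archive every instant older than the earliest
// commit to retain, regardless of its action type.
final class UniformArchiveSketch {

  static List<String> instantsToArchive(List<String> activeTimeline, String earliestCommitToRetain) {
    List<String> result = new ArrayList<>();
    for (String instant : activeTimeline) {
      // Assumed format: "<timestamp>.<action>", e.g. "07.rollback".
      String timestamp = instant.substring(0, instant.indexOf('.'));
      if (timestamp.compareTo(earliestCommitToRetain) < 0) {
        result.add(instant);
      }
    }
    return result;
  }

  // Same timeline as in the PR description; the earliest commit to retain is
  // assumed to be the one at timestamp 09 (commit6).
  static List<String> demo() {
    List<String> timeline = Arrays.asList(
        "01.commit", "02.clean", "03.commit", "04.commit", "05.clean", "06.commit",
        "07.rollback", "08.commit", "09.commit", "10.clean", "11.commit");
    return instantsToArchive(timeline, "09");
  }
}
```

For this timeline the single filter selects the same eight instants as the two-step approach in the PR, with no per-action handling.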

@yihua yihua merged commit 7863348 into apache:master Sep 16, 2023

yihua commented Mar 25, 2024

@Zouxxyy the changes have not been cherry-picked to branch-0.x. It would be great to have this feature in the upcoming release 0.15.0. Do you have time to do that?


Zouxxyy commented Mar 27, 2024

> @Zouxxyy the changes have not been cherry-picked to branch-0.x. It would be great to have this feature in the upcoming release 0.15.0. Do you have time to do that?

@yihua OK, but I don’t seem to find the 0.15.0 branch


yihua commented Mar 29, 2024

> > @Zouxxyy the changes have not been cherry-picked to branch-0.x. It would be great to have this feature in the upcoming release 0.15.0. Do you have time to do that?
>
> @yihua OK, but I don’t seem to find the 0.15.0 branch

@Zouxxyy there is no 0.15.0 branch yet. 0.15.0 release branch will be forked out of branch-0.x, so you can raise a PR against branch-0.x.
