-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-6574] Fix the problem that incremental clean cannot be executed when the earliest ActiveTimeline is a pending commit. #9243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return Option.of(nthInstant); | ||
| } else { | ||
| return completedCommitsTimeline.findInstantsBefore(earliestPendingCommits.get().getTimestamp()).lastInstant(); | ||
| return earliestPendingCommits.map(earliestCommit -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could rename earliestPendingCommits -> earliestPendingCommit since it is singular
| return Option.of(nthInstant); | ||
| } else { | ||
| return completedCommitsTimeline.findInstantsBefore(earliestPendingCommits.get().getTimestamp()).lastInstant(); | ||
| return earliestPendingCommits.map(earliestCommit -> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return earliestPendingCommits.map(earliestCommit -> | |
| return completedCommitsTimeline .findInstantsBefore(earliestPendingCommits.get().getTimestamp()).lastInstant().or(earliestPendingCommits); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine!
983b362 to
42068ed
Compare
voonhous
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When performing a clean in Hudi, the earliest commit to be retained is obtained by calling the getEarliestCommitToRetain method in CleanPlanner. However, if a pending commit takes a long time and becomes the earliest active timeline, calling getEarliestCommitToRetain will return empty because there is no earlier commit than the pending commit
Please correct me if i am wrong, the issue here is not with cleaner, but with archival.
The PRs below ensures that async-table service actions that are still in the pending stage will block the progress of the incremental-clean window.
As such, if there is a pending table service action, the earliestCommitToRetain will return the completed commit before it. The earliestCommitToRetain will only move forward after pending table service action is completed.
However, your fix here is addressing an issue where the active timeline has a pending table service action as the earliest commit. I assume this happened because the commit before it is archived.
Perhaps a better fix is to ensure archival behaves correct such that the assumptions in cleaner are valid?
| JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1); | ||
| writeClient.startCommitWithTime(newCommitTime); | ||
| JavaRDD<WriteStatus> writeStatues = writeClient.insert(writeRecords, newCommitTime); | ||
| // Assuming the first commit is pending, simulating the situation where all instants before the first pending commit have been achieved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
achieved -> archived
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your understanding of the premise is correct, but a pending commit may be caused by delayed heartbeats or the execution of a replace commit that takes a long time, resulting in many commits on the timeline. In this case, I think it is reasonable for the archive to remove all commits before this pending commit. However, this can lead to the problem I mentioned, that is, when performing incremental clean, the endpoint of the last clean record cannot be found as the starting point of this clean, which will result in a full clean. If we mark the pending commit record as the endpoint of this case in getPartitionPathsForIncrementalCleaning, we can find a timeline greater than or equal to the starting point and less than the endpoint, which means that the pending commit will not be removed. This operation maintains the original intention without changing other situations and solves this problem. I'm not sure if I've expressed my understanding clearly, do you think what I said makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in the next incremental clean, if this clean is still pending, neither the starting nor the ending point of this clean will have any commits cleaned. However, if this pending commit is rolled back or executed, then the starting point will be this pending commit and the ending point will be the normal commit. In this case, the clean will be executed normally and no files will be missed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, I think it is reasonable for the archive to remove all commits before this pending commit
I don't think this is reasonable.
/**
* A FileSlice is considered committed, if one of the following is true - There is a committed data file - There are
* some log files, that are based off a commit or delta commit.
*/
private boolean isFileSliceCommitted(FileSlice slice) {
if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
return false;
}
return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
}Given your above assumption, a fileSlice that may be the output of an inflight replacecommit/clustering will be flagged as committed, when in actuality, it is not...
… when the earliest ActiveTimeline is a pending commit.
42068ed to
d52245c
Compare
After #8783, the active timeline would keep at least one completed instant. |
Will revisit this PR to check if this assumption is correct in abit. |
When performing a clean in Hudi, the earliest commit to be retained is obtained by calling the getEarliestCommitToRetain method in CleanPlanner. However, if a pending commit takes a long time and becomes the earliest active timeline, calling getEarliestCommitToRetain will return empty because there is no earlier commit than the pending commit. This can cause a problem during an incremental clean, as the previous endpoint (i.e., the last commit retained in the previous clean) is used as the starting point. If this starting point is empty, a full clean will be triggered, which is very resource-intensive.
To solve this problem without affecting normal clean, we set the EarliestCommitToRetain obtained in this situation to the earliest pending commit. Since the endpoint will not be cleaned in the current clean, this approach can solve the aforementioned problem without affecting normal clean operations.
Change Logs
Modify the method of getEarliestCommitToRetain in CleanPlanner when the first commit is a pending commit during a clean operation, to ensure that incremental clean can be executed normally.
Impact
None
Risk level (write none, low medium or high below)
None
Documentation Update
None
Contributor's checklist