Skip to content

Conversation

@majian1998
Copy link
Contributor

@majian1998 majian1998 commented Jul 20, 2023

When performing a clean in Hudi, the earliest commit to be retained is obtained by calling the getEarliestCommitToRetain method in CleanPlanner. However, if a pending commit takes a long time and becomes the earliest active timeline, calling getEarliestCommitToRetain will return empty because there is no earlier commit than the pending commit. This can cause a problem during an incremental clean, as the previous endpoint (i.e., the last commit retained in the previous clean) is used as the starting point. If this starting point is empty, a full clean will be triggered, which is very resource-intensive.
To solve this problem without affecting normal clean, we set the EarliestCommitToRetain obtained in this situation to the earliest pending commit. Since the endpoint will not be cleaned in the current clean, this approach can solve the aforementioned problem without affecting normal clean operations.

Change Logs

Modify the method of getEarliestCommitToRetain in CleanPlanner when the first commit is a pending commit during a clean operation, to ensure that incremental clean can be executed normally.

Impact

None

Risk level (write none, low medium or high below)

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

return Option.of(nthInstant);
} else {
return completedCommitsTimeline.findInstantsBefore(earliestPendingCommits.get().getTimestamp()).lastInstant();
return earliestPendingCommits.map(earliestCommit ->
Copy link
Contributor

@stream2000 stream2000 Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could rename earliestPendingCommits -> earliestPendingCommit since it is singular

return Option.of(nthInstant);
} else {
return completedCommitsTimeline.findInstantsBefore(earliestPendingCommits.get().getTimestamp()).lastInstant();
return earliestPendingCommits.map(earliestCommit ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return earliestPendingCommits.map(earliestCommit ->
return completedCommitsTimeline .findInstantsBefore(earliestPendingCommits.get().getTimestamp()).lastInstant().or(earliestPendingCommits);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine!

@majian1998 majian1998 force-pushed the fix-earliestCommitToRetain branch from 983b362 to 42068ed Compare July 21, 2023 02:31
Copy link
Member

@voonhous voonhous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When performing a clean in Hudi, the earliest commit to be retained is obtained by calling the getEarliestCommitToRetain method in CleanPlanner. However, if a pending commit takes a long time and becomes the earliest active timeline, calling getEarliestCommitToRetain will return empty because there is no earlier commit than the pending commit

Please correct me if i am wrong, the issue here is not with cleaner, but with archival.

The PRs below ensures that async-table service actions that are still in the pending stage will block the progress of the incremental-clean window.

  1. #7568
  2. #9038

As such, if there is a pending table service action, the earliestCommitToRetain will return the completed commit before it. The earliestCommitToRetain will only move forward after pending table service action is completed.

However, your fix here is addressing an issue where the active timeline has a pending table service action as the earliest commit. I assume this happened because the commit before it is archived.

Perhaps a better fix is to ensure archival behaves correct such that the assumptions in cleaner are valid?

JavaRDD<HoodieRecord> writeRecords = jsc.parallelize(records, 1);
writeClient.startCommitWithTime(newCommitTime);
JavaRDD<WriteStatus> writeStatues = writeClient.insert(writeRecords, newCommitTime);
// Assuming the first commit is pending, simulating the situation where all instants before the first pending commit have been achieved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

achieved -> archived

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your understanding of the premise is correct, but a pending commit may be caused by delayed heartbeats or the execution of a replace commit that takes a long time, resulting in many commits on the timeline. In this case, I think it is reasonable for the archive to remove all commits before this pending commit. However, this can lead to the problem I mentioned, that is, when performing incremental clean, the endpoint of the last clean record cannot be found as the starting point of this clean, which will result in a full clean. If we mark the pending commit record as the endpoint of this case in getPartitionPathsForIncrementalCleaning, we can find a timeline greater than or equal to the starting point and less than the endpoint, which means that the pending commit will not be removed. This operation maintains the original intention without changing other situations and solves this problem. I'm not sure if I've expressed my understanding clearly, do you think what I said makes sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, in the next incremental clean, if this clean is still pending, neither the starting nor the ending point of this clean will have any commits cleaned. However, if this pending commit is rolled back or executed, then the starting point will be this pending commit and the ending point will be the normal commit. In this case, the clean will be executed normally and no files will be missed.

Copy link
Member

@voonhous voonhous Jul 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I think it is reasonable for the archive to remove all commits before this pending commit

I don't think this is reasonable.

  /**
   * A FileSlice is considered committed, if one of the following is true - There is a committed data file - There are
   * some log files, that are based off a commit or delta commit.
   */
  private boolean isFileSliceCommitted(FileSlice slice) {
    if (!compareTimestamps(slice.getBaseInstantTime(), LESSER_THAN_OR_EQUALS, lastInstant.get().getTimestamp())) {
      return false;
    }

    return timeline.containsOrBeforeTimelineStarts(slice.getBaseInstantTime());
  }

Given your above assumption, a fileSlice that may be the output of an inflight replacecommit/clustering will be flagged as committed, when in actuality, it is not...

… when the earliest ActiveTimeline is a pending commit.
@majian1998 majian1998 force-pushed the fix-earliestCommitToRetain branch from 42068ed to d52245c Compare July 21, 2023 03:25
@danny0405
Copy link
Contributor

However, your fix here is addressing an issue where the active timeline has a pending table service action as the earliest commit. I assume this happened because the commit before it is archived.

After #8783, the active timeline would keep at least one completed instant.

@voonhous
Copy link
Member

However, your fix here is addressing an issue where the active timeline has a pending table service action as the earliest commit. I assume this happened because the commit before it is archived.

After #8783, the active timeline would keep at least one completed instant.

Will revisit this PR to check if this assumption is correct in abit.

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants