@hbgstc123
Contributor

Change Logs

Originally, ClusteringUtils::getOldestInstantToRetainForClustering relied on the inflight clean instant. However, there may be a moment when the last clean has completed and the next clean plan has not yet been generated; if timeline archiving runs at that moment, no replace commit is retained.
This PR proposes deciding OldestInstantToRetainForClustering from the latest completed clean instant instead: return the first replace commit after the earliestInstantToRetain of the last completed clean, or the first replace commit after the last clean instant if earliestInstantToRetain is empty, and return the first replace commit in the active timeline if there is no clean instant.
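
The selection rule described above can be sketched in plain Java. This is a simplified illustration, not the Hudi implementation: the class and method names are hypothetical, instants are modelled as fixed-width timestamp strings compared lexicographically, and treating the lower bound as inclusive is an assumption (the description only says "after").

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the logic described in this PR; it does not use Hudi's timeline classes.
final class OldestReplaceCommitSketch {

    // replaceCommits: replace commit timestamps in the active timeline, sorted ascending.
    // lastCompletedClean: timestamp of the latest completed clean instant, if any.
    // earliestInstantToRetain: value recorded by that clean, if any.
    static Optional<String> oldestReplaceCommitToRetain(
            List<String> replaceCommits,
            Optional<String> lastCompletedClean,
            Optional<String> earliestInstantToRetain) {
        if (replaceCommits.isEmpty()) {
            return Optional.empty();
        }
        if (!lastCompletedClean.isPresent()) {
            // No clean instant at all: keep everything from the first replace commit.
            return Optional.of(replaceCommits.get(0));
        }
        // Prefer earliestInstantToRetain; fall back to the clean instant's own timestamp.
        String lowerBound = earliestInstantToRetain
                .filter(ts -> !ts.isEmpty())
                .orElse(lastCompletedClean.get());
        // Assumption: the bound is inclusive (>=).
        return replaceCommits.stream()
                .filter(ts -> ts.compareTo(lowerBound) >= 0)
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> replaceCommits = Arrays.asList("003", "006", "009");
        // Clean at 008 that recorded earliestInstantToRetain = 005.
        System.out.println(oldestReplaceCommitToRetain(
                replaceCommits, Optional.of("008"), Optional.of("005")));
        // Clean at 008 with no earliestInstantToRetain.
        System.out.println(oldestReplaceCommitToRetain(
                replaceCommits, Optional.of("008"), Optional.<String>empty()));
        // No clean instant in the timeline.
        System.out.println(oldestReplaceCommitToRetain(
                replaceCommits, Optional.<String>empty(), Optional.<String>empty()));
    }
}
```

The three calls in main cover the three branches of the rule: a clean with earliestInstantToRetain, a clean without it, and no clean at all.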

Impact

no

Risk level (write none, low, medium or high below)

low

Documentation Update

no

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

  if (!replaceTimeline.empty()) {
    Option<HoodieInstant> cleanInstantOpt =
-       activeTimeline.getCleanerTimeline().filter(instant -> !instant.isCompleted()).firstInstant();
+       activeTimeline.getCleanerTimeline().filterCompletedInstants().lastInstant();
Contributor

Nice catch, using completed clean instants is more reliable.

Contributor

Generally looks good, cc @bvaradar for another round of review~

@SteNicholas
Member

@hbgstc123, in which edge case can the last clean be complete while the next clean plan has not yet been generated, so that no replace commit is retained if timeline archiving runs at that moment?

@hbgstc123
Contributor Author

For example, in a Flink pipeline, clean is scheduled and executed in the class CleanFunction when snapshotState() is invoked at the beginning of a checkpoint. So after a clean completes and before the next checkpoint is triggered, there is no inflight clean instant in the timeline.
And even if the last clean operation completes just before the next checkpoint begins, generating the clean plan takes time, and the bigger the table, the longer it may take. During this planning time, there is no inflight clean instant in the timeline either.
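
The gap described here can be illustrated with a minimal model of the cleaner timeline. The sketch below is hypothetical (CleanInstant and both lookup methods are illustrative stand-ins, not Hudi classes); it only contrasts the two lookups from the diff: "first clean that is not completed" versus "last completed clean".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical model of the cleaner timeline; not Hudi's HoodieInstant/HoodieTimeline API.
final class CleanLookupSketch {

    static final class CleanInstant {
        final String timestamp;
        final boolean completed;

        CleanInstant(String timestamp, boolean completed) {
            this.timestamp = timestamp;
            this.completed = completed;
        }
    }

    // Old lookup: the first clean instant that is not yet completed.
    static Optional<CleanInstant> firstPendingClean(List<CleanInstant> cleans) {
        return cleans.stream().filter(c -> !c.completed).findFirst();
    }

    // New lookup: the last clean instant that is completed.
    static Optional<CleanInstant> lastCompletedClean(List<CleanInstant> cleans) {
        return cleans.stream().filter(c -> c.completed).reduce((first, second) -> second);
    }

    public static void main(String[] args) {
        // Between two checkpoints: the previous clean is done, the next one is not planned yet.
        List<CleanInstant> betweenCheckpoints = Arrays.asList(new CleanInstant("004", true));
        // The old lookup finds nothing to anchor on, the new lookup still finds clean 004.
        System.out.println(firstPendingClean(betweenCheckpoints).isPresent());
        System.out.println(lastCompletedClean(betweenCheckpoints).map(c -> c.timestamp));
    }
}
```

In this model, archiving that runs between checkpoints gets no anchor from the old lookup, which matches the situation where no replace commit is retained.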

@hbgstc123 hbgstc123 force-pushed the improve_getOldestInstantToRetainForClustering branch from 996608e to 0dff5a6 Compare April 14, 2023 02:38
@bvaradar bvaradar self-assigned this Apr 15, 2023
@hbgstc123
Contributor Author

@bvaradar, please take a look when convenient. Thanks.

Contributor

@bvaradar bvaradar left a comment

Good catch. Thanks for the fix.

@bvaradar bvaradar merged commit f3db222 into apache:master May 2, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 15, 2023
yihua pushed a commit to yihua/hudi that referenced this pull request May 17, 2023