-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5341] IncrementalCleaning consider clustering completed later #7405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
e52b948
ff0887a
ff02d94
b7a1071
1fb15d7
6779912
cea266d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -188,8 +188,12 @@ private List<String> getPartitionPathsForIncrementalCleaning(HoodieCleanMetadata | |
| + "since last cleaned at " + cleanMetadata.getEarliestCommitToRetain() | ||
| + ". New Instant to retain : " + newInstantToRetain); | ||
| return hoodieTable.getCompletedCommitsTimeline().getInstantsAsStream().filter( | ||
| instant -> HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.GREATER_THAN_OR_EQUALS, | ||
| cleanMetadata.getEarliestCommitToRetain()) && HoodieTimeline.compareTimestamps(instant.getTimestamp(), | ||
| instant -> (HoodieTimeline.compareTimestamps(instant.getTimestamp(), HoodieTimeline.GREATER_THAN_OR_EQUALS, | ||
| cleanMetadata.getEarliestCommitToRetain()) | ||
| || (instant.getMarkerFileModificationTimestamp().isPresent() | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't think the condition is needed for the commits which isn't the replacecommit.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Guarantee compatibility. Maybe other new kinds of instant need to clean later ?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If an out-of-order replace commit finished before the clean start and the instant time of the replace commit is before the earliest commit to retain, it won't be cleaned and left in the timeline. Archiver will then archive it since it's last modified time is earlier than the last clean in the timeline. What do you think?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
You are right,it still won't clean the clustering instant in this scenario. |
||
| && (HoodieTimeline.compareTimestamps(instant.getMarkerFileModificationTimestamp().get(), HoodieTimeline.GREATER_THAN_OR_EQUALS, | ||
| cleanMetadata.getStartCleanTime())))) | ||
| && HoodieTimeline.compareTimestamps(instant.getTimestamp(), | ||
| HoodieTimeline.LESSER_THAN, newInstantToRetain.get().getTimestamp())).flatMap(instant -> { | ||
| try { | ||
| if (HoodieTimeline.REPLACE_COMMIT_ACTION.equals(instant.getAction())) { | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -24,8 +24,10 @@ | |
| import org.apache.hudi.avro.model.HoodieCleanerPlan; | ||
| import org.apache.hudi.common.HoodieCleanStat; | ||
| import org.apache.hudi.common.model.CleanFileInfo; | ||
| import org.apache.hudi.common.model.HoodieAvroPayload; | ||
| import org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy; | ||
| import org.apache.hudi.common.table.HoodieTableMetaClient; | ||
| import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; | ||
| import org.apache.hudi.common.table.timeline.HoodieInstant; | ||
| import org.apache.hudi.common.table.timeline.HoodieTimeline; | ||
| import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; | ||
|
|
@@ -48,7 +50,7 @@ | |
| /** | ||
| * Utils for clean action. | ||
| */ | ||
| public class CleanerUtils { | ||
| public class CleanerUtils<T extends HoodieAvroPayload, I, K, O> { | ||
|
|
||
| private static final Logger LOG = LogManager.getLogger(CleanerUtils.class); | ||
|
|
||
|
|
@@ -161,4 +163,42 @@ public static void rollbackFailedWrites(HoodieFailedWritesCleaningPolicy cleanin | |
| throw new IllegalArgumentException("Unsupported action type " + actionType); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Get latest completed clean planner clean time. | ||
| * @param metaClient | ||
| * @return Latest clean planner clean time. | ||
| * @throws IOException | ||
| */ | ||
| public static Option<String> getLatestCompletedCleanInstantCleanTime(HoodieTableMetaClient metaClient) | ||
| throws IOException { | ||
| HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); | ||
| Option<HoodieInstant> lastCleanInstantOption = activeTimeline.getCleanerTimeline().filterCompletedInstants().lastInstant(); | ||
| Option<HoodieCleanMetadata> cleanMetadata = lastCleanInstantOption.isPresent() && ! activeTimeline.isEmpty(lastCleanInstantOption.get()) | ||
| ? Option.ofNullable(TimelineMetadataUtils.deserializeHoodieCleanMetadata(activeTimeline.getInstantDetails(lastCleanInstantOption.get()).get())) | ||
| : Option.empty(); | ||
| return cleanMetadata.isPresent() ? Option.ofNullable(cleanMetadata.get().getStartCleanTime()) : Option.empty(); | ||
| } | ||
|
|
||
| /** | ||
| * Get earliest unClean completed instant | ||
| * @param metaClient | ||
| * @return | ||
| * @throws IOException | ||
| */ | ||
| public static Option<HoodieInstant> getEarliestUnCleanCompletedInstant(HoodieTableMetaClient metaClient) throws IOException { | ||
| HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); | ||
| Option<HoodieInstant> firstCompletedClusteringInstant = activeTimeline.getCommitsTimeline().filterCompletedInstants() | ||
| .filter(hoodieInstant -> hoodieInstant.getAction().equals(HoodieTimeline.REPLACE_COMMIT_ACTION)).firstInstant(); | ||
| if (!firstCompletedClusteringInstant.isPresent()) { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If the firstCompletedClusteringInstant isn't present, it's no need to get the latest clean time from the clean metadata.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Move it behind else. |
||
| return Option.empty(); | ||
| } else { | ||
| Option<String> earliestInstantCleanTime = getLatestCompletedCleanInstantCleanTime(metaClient); | ||
| return ! earliestInstantCleanTime.isPresent() | ||
| ? firstCompletedClusteringInstant | ||
| : activeTimeline.getCommitsTimeline().filterCompletedInstants().filter(instant -> | ||
| instant.getMarkerFileModificationTimestamp().isPresent() | ||
| && HoodieTimeline.compareTimestamps(instant.getMarkerFileModificationTimestamp().get(), HoodieTimeline.GREATER_THAN_OR_EQUALS, earliestInstantCleanTime.get())).firstInstant(); | ||
| } | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We better check the timeline clean metadata files, only when all the replaced clustering files are cleaned, can the clustering instant be archived.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no completed clustering instant -> no need to check
firstCompletedClusteringInstant is not null && no completed clean instant -> return firstCompletedClusteringInstant
firstCompletedClusteringInstant is not null && earliestCleanInstant is not null -> return firstUncleanClusteringInstant