[HUDI-3032] Do not clean the log files right after compaction for met… #4336
```diff
@@ -706,7 +706,20 @@ protected void compactIfNecessary(AbstractHoodieWriteClient writeClient, String
     }
   }

-  protected void doClean(AbstractHoodieWriteClient writeClient, String instantTime) {
+  protected void cleanIfNecessary(AbstractHoodieWriteClient writeClient, String instantTime) {
+    Option<HoodieInstant> lastCompletedCompactionInstant = metadataMetaClient.reloadActiveTimeline()
+        .getCommitTimeline().filterCompletedInstants().lastInstant();
+    if (lastCompletedCompactionInstant.isPresent()
+        && metadataMetaClient.getActiveTimeline().filterCompletedInstants()
+            .findInstantsAfter(lastCompletedCompactionInstant.get().getTimestamp()).countInstants() < 3) {
```
Member: pull the …
```diff
+      // Do not clean the log files immediately after compaction; give the metadata table reader some buffer time,
```
Contributor: This can happen only if the cleaner is very aggressive, right? i.e. retains just 1 commit. For older file slices we should not encounter this. Correct me if I am wrong.

Contributor (author): Now the default compaction delta commits is 10, the min retained commits is 20, and the max retained commits is 30. When the second compaction schedules and triggers, there are about 22 commits on the timeline, which would then trigger the cleaning immediately.

Contributor: The min and max commits are for archival; the cleaner is controlled by hoodie.metadata.cleaner.commits.retained, which is 3. But I am good with the change.

Member: So this problem should also exist in the MOR table data path? Is there any solution there?

Contributor (author): I guess so. We need some protection logic for the MOR table log files, or even COW table parquet files, because if we write a COW table very frequently (like every 10 seconds), the reader would very probably encounter …
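For context, the defaults debated in this thread can be pinned down in one place. A minimal sketch, assuming the 0.10.x-era Hudi metadata table config keys: hoodie.metadata.cleaner.commits.retained is named by the reviewer, while the other three key names are my assumption (from HoodieMetadataConfig of that era) and worth verifying.

```java
import java.util.Properties;

// Sketch only: the defaults discussed in the thread, expressed as writer properties.
// hoodie.metadata.cleaner.commits.retained is named by the reviewer; the other three
// key names are my assumption and should be checked against HoodieMetadataConfig.
public class MetadataTableDefaults {

  public static Properties defaults() {
    Properties props = new Properties();
    props.setProperty("hoodie.metadata.compact.max.delta.commits", "10"); // compaction every ~10 delta commits
    props.setProperty("hoodie.metadata.keep.min.commits", "20");          // archival: min commits to keep
    props.setProperty("hoodie.metadata.keep.max.commits", "30");          // archival: max commits to keep
    props.setProperty("hoodie.metadata.cleaner.commits.retained", "3");   // cleaner: completed commits retained
    return props;
  }

  public static void main(String[] args) {
    defaults().forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

With these numbers, the author's scenario follows: by the second compaction roughly 22 commits exist on the timeline, well past the cleaner's retention of 3, so cleaning fires right after the compaction unless the new guard intervenes.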
```diff
+      // because the reader may have already opened the log file readers before the compaction completed,
+      // and if the cleaning triggers and deletes those files before/during the read,
+      // a FileNotFoundException (for LogFormatReader) or an NPE (for HFileReader) would be thrown.
+      // 3 is a value that I think is enough for the metadata table reader.
```
Member: What's this based on?

Contributor (author): Just a subjective value; I think the metadata table reads fast enough within 3 transaction intervals. Do you have a better suggestion?
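The thread leaves the buffer of 3 as a judgment call. One way to make such a magic number revisitable, purely my sketch and not something this PR does, is to read it from a JVM system property with the PR's hard-coded value as the fallback:

```java
// Hypothetical knob (not an actual Hudi option) for the post-compaction
// cleaning buffer discussed above; defaults to the PR's hard-coded 3.
public final class CleanBufferKnob {

  private CleanBufferKnob() {
  }

  public static int bufferCommits() {
    // Integer.getInteger reads a JVM system property, e.g.
    // -Dhoodie.metadata.clean.buffer.commits=5
    return Integer.getInteger("hoodie.metadata.clean.buffer.commits", 3);
  }
}
```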
```diff
+      return;
+    }
     // Trigger cleaning with suffixes based on the same instant time. This ensures that any future
     // delta commits synced over will not have an instant time lesser than the last completed instant on the
     // metadata table.
```
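To isolate the guard's logic from the Hudi types, here is a minimal self-contained sketch of the same check; shouldClean, BUFFER_COMMITS, and the plain-string timeline are hypothetical stand-ins for illustration, not the actual Hudi API:

```java
import java.util.List;
import java.util.Optional;

// Minimal sketch of the cleanIfNecessary guard, decoupled from Hudi's timeline types.
public class CleanGuardSketch {

  // The PR hard-codes 3; see the review thread above on how this value was chosen.
  private static final int BUFFER_COMMITS = 3;

  /**
   * @param completedInstants completed instant timestamps, in timeline order
   * @param lastCompaction    timestamp of the last completed compaction, if any
   * @return whether cleaning should run now
   */
  static boolean shouldClean(List<String> completedInstants, Optional<String> lastCompaction) {
    if (!lastCompaction.isPresent()) {
      return true; // no compaction has completed yet, so there is nothing to buffer for
    }
    // Instant times are sortable strings, so "after" is a lexicographic comparison.
    long commitsAfterCompaction = completedInstants.stream()
        .filter(ts -> ts.compareTo(lastCompaction.get()) > 0)
        .count();
    // Skip cleaning until enough commits have landed after the compaction, giving
    // readers that already opened pre-compaction log files time to finish.
    return commitsAfterCompaction >= BUFFER_COMMITS;
  }

  public static void main(String[] args) {
    List<String> timeline = List.of("0010", "0020", "0030", "0040");
    System.out.println(shouldClean(timeline, Optional.of("0030"))); // false: only 1 commit after compaction
    System.out.println(shouldClean(timeline, Optional.of("0010"))); // true: 3 commits after compaction
  }
}
```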
```diff
@@ -154,7 +154,7 @@ protected void commit(HoodieData<HoodieRecord> hoodieDataRecords, String partiti
     metadataMetaClient.reloadActiveTimeline();
```
Member: reloadActiveTimeline is called here, so it is not necessary in cleanIfNecessary.

Contributor (author): …
```diff
     if (canTriggerTableService) {
       compactIfNecessary(writeClient, instantTime);
-      doClean(writeClient, instantTime);
+      cleanIfNecessary(writeClient, instantTime);
       writeClient.archive();
     }
   }
```