-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-44] Adding support to preserve commit metadata for compaction #4428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
codope
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Do you see the need to save this config in compaction plan like how we do for clusteriing?
|
Probably we can skip adding it to plan. here is the use-case. |
|
@nsivabalan @vinothchandar @xushiyan @codope we need to make |
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's look into fixing this based on each meta field and have the behavior turned on by default. We need not worry about turning on and off kind of scenarios here. Lets just think about how backwards compatible this will be
|
|
||
| public static final ConfigProperty<Boolean> PRESERVE_COMMIT_METADATA = ConfigProperty | ||
| .key("hoodie.compaction.preserve.commit.metadata") | ||
| .defaultValue(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Compaction should not change the original _hoodie_commit_time or _hoodie_commit_seqno values at all. So we should look into making that the default behavior as @YannByron suggested.
| // Convert GenericRecord to GenericRecord with hoodie commit metadata in schema | ||
| IndexedRecord recordWithMetadataInSchema = rewriteRecord((GenericRecord) indexedRecord.get()); | ||
| fileWriter.writeAvroWithMetadata(recordWithMetadataInSchema, hoodieRecord); | ||
| if (preserveMetadata) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's see. _hoodie_file_name could technically change to the base file?
| .setExtraMetadata(getExtraMetadata()) | ||
| .setVersion(getPlanVersion()) | ||
| .setPreserveHoodieMetadata(getWriteConfig().isPreserveHoodieCommitMetadata()) | ||
| .setPreserveHoodieMetadata(getWriteConfig().isPreserveHoodieCommitMetadataForClustering()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clustering should not change the commit time either.
|
sure. have filed https://issues.apache.org/jira/browse/HUDI-3800 to track the work item for compaction. and this one https://issues.apache.org/jira/browse/HUDI-3801 for clustering |
What is the purpose of the pull request
hoodie.compaction.preserve.commit.metadatato guard this feature. When enabled, commit metadata will not be overwritten.to be discussed:
should entire meta fields be preserved or just the commit time.
Brief change log
(for example:)
Verify this pull request
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.