@nsivabalan
Contributor

@nsivabalan nsivabalan commented Dec 23, 2021

What is the purpose of the pull request

  • Adding support to maintain commit metadata with compaction.
  • Introduced a new config hoodie.compaction.preserve.commit.metadata to guard this feature. When enabled, commit metadata will not be overwritten.
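A minimal sketch of setting the flag, assuming the writer takes its options as plain key/value properties. Only the key name comes from this PR; `PreserveMetadataProps` and `writerProps` are hypothetical names for illustration, not Hudi APIs.

```java
import java.util.Properties;

public class PreserveMetadataProps {
    // Config key introduced by this PR.
    static final String PRESERVE_KEY = "hoodie.compaction.preserve.commit.metadata";

    // Hypothetical helper: builds writer properties with the flag set explicitly.
    static Properties writerProps(boolean preserve) {
        Properties props = new Properties();
        props.setProperty(PRESERVE_KEY, String.valueOf(preserve));
        return props;
    }

    public static void main(String[] args) {
        Properties props = writerProps(true);
        System.out.println(PRESERVE_KEY + "=" + props.getProperty(PRESERVE_KEY));
    }
}
```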

To be discussed:
Should all meta fields be preserved, or just the commit time?

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

This change added tests and can be verified as follows:

  • Fixed TestHoodieMergeOnReadTable.testLogFileCountsAfterCompaction to verify the new functionality.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan requested a review from codope December 23, 2021 03:06
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@nsivabalan nsivabalan changed the title from "[HUDI-44][WIP] Adding support to preserve commit metadata for compaction" to "[HUDI-44] Adding support to preserve commit metadata for compaction" Dec 27, 2021
@nsivabalan nsivabalan added the priority:critical Production degraded; pipelines stalled label Jan 6, 2022
Member

@codope codope left a comment

Looks good overall. Do you see the need to save this config in the compaction plan, like we do for clustering?

@nsivabalan
Contributor Author

Probably we can skip adding it to the plan. Here is the use case.
Let's say a compaction was triggered with preserve commit metadata enabled, and midway the user decides they do not want preserve commit metadata enabled.
So they cancel the ongoing compaction, change the write config to disable preserve commit metadata, and restart.
But if we had serialized the value to the plan, we would re-execute it from scratch with preserve commit metadata still enabled, right?
In that case we can't do much.
So it is better not to serialize the value to the plan, and to always honor the current write configs.
Let me know what you think.
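The trade-off in this comment can be sketched roughly as follows. This is an illustrative model only: `PlanVsConfigSketch` and both method names are hypothetical, and the two booleans stand in for the value captured in a serialized plan and the value in the current write config.

```java
public class PlanVsConfigSketch {
    // Option A: the flag is pinned in the serialized plan, so a retry of the
    // same plan ignores any later change to the write config.
    static boolean effectiveFlagFromPlan(boolean flagInPlan, boolean flagInCurrentConfig) {
        return flagInPlan;
    }

    // Option B (what the comment argues for): ignore the plan and always
    // honor the current write config.
    static boolean effectiveFlagFromCurrentConfig(boolean flagInPlan, boolean flagInCurrentConfig) {
        return flagInCurrentConfig;
    }

    public static void main(String[] args) {
        boolean flagInPlan = true;           // compaction was scheduled with the flag enabled
        boolean flagInCurrentConfig = false; // user cancels, disables the flag, and restarts

        System.out.println("from plan:           "
            + effectiveFlagFromPlan(flagInPlan, flagInCurrentConfig));
        System.out.println("from current config: "
            + effectiveFlagFromCurrentConfig(flagInPlan, flagInCurrentConfig));
    }
}
```

With option A the restarted compaction would still preserve commit metadata even though the user turned the flag off; with option B the user's latest setting wins.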

@codope codope merged commit b6891d2 into apache:master Jan 6, 2022
@xushiyan xushiyan removed the priority:critical Production degraded; pipelines stalled label Jan 11, 2022
@YannByron
Contributor

@nsivabalan @vinothchandar @xushiyan @codope
IMO, compaction is a very clear and specific operation which should not change any commit metadata, and it does not need a config to control this behavior. I feel this is a bug both when the config is disabled and in the behavior before this change. It also affects which data is returned by an incremental query.

We need to make compaction, and also incremental queries whose range includes a compaction, behave reasonably. What do you think?
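To make the incremental-query concern concrete, here is a simplified sketch. It assumes an incremental query returns rows whose `_hoodie_commit_time` is strictly after a begin instant, and that a non-preserving compaction stamps its own instant on every rewritten row. The class, the `Row` type, and the instants `001` to `003` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalQuerySketch {
    // Simplified row: record key plus the value of _hoodie_commit_time.
    static final class Row {
        final String key;
        final String commitTime;
        Row(String key, String commitTime) { this.key = key; this.commitTime = commitTime; }
    }

    // Simplified compaction: rewrites every row, keeping the original commit
    // time when preserve is true, otherwise stamping the compaction instant.
    static List<Row> compact(List<Row> rows, String compactionInstant, boolean preserve) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            out.add(new Row(r.key, preserve ? r.commitTime : compactionInstant));
        }
        return out;
    }

    // Simplified incremental query: keys of rows committed strictly after beginInstant.
    static List<String> incremental(List<Row> rows, String beginInstant) {
        List<String> keys = new ArrayList<>();
        for (Row r : rows) {
            if (r.commitTime.compareTo(beginInstant) > 0) {
                keys.add(r.key);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(new Row("a", "001"), new Row("b", "002"));
        // Compaction runs at instant 003; the reader asks for changes after 002.
        System.out.println(incremental(compact(rows, "003", false), "002")); // [a, b]
        System.out.println(incremental(compact(rows, "003", true), "002"));  // []
    }
}
```

In this model no data changed after instant `002`, yet the non-preserving compaction makes both rows look new to the reader.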

Member

@vinothchandar vinothchandar left a comment

Let's look into fixing this based on each meta field and have the behavior turned on by default. We need not worry about turning it on and off here. Let's just think about how backwards compatible this will be.


public static final ConfigProperty<Boolean> PRESERVE_COMMIT_METADATA = ConfigProperty
.key("hoodie.compaction.preserve.commit.metadata")
.defaultValue(false)
Member

Compaction should not change the original _hoodie_commit_time or _hoodie_commit_seqno values at all. So we should look into making that the default behavior as @YannByron suggested.

// Convert GenericRecord to GenericRecord with hoodie commit metadata in schema
IndexedRecord recordWithMetadataInSchema = rewriteRecord((GenericRecord) indexedRecord.get());
fileWriter.writeAvroWithMetadata(recordWithMetadataInSchema, hoodieRecord);
if (preserveMetadata) {
Member

let's see. _hoodie_file_name could technically change to the base file?
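A rough sketch of the per-field handling being discussed, assuming (as this comment suggests, not as confirmed behavior) that `_hoodie_file_name` is pointed at the new base file while `_hoodie_commit_time` and `_hoodie_commit_seqno` are kept when preservation is on. Meta fields are modeled as a plain map; `MetaFieldRewriteSketch`, `rewrite`, and the sample values are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class MetaFieldRewriteSketch {
    static final String COMMIT_TIME = "_hoodie_commit_time";
    static final String COMMIT_SEQNO = "_hoodie_commit_seqno";
    static final String FILE_NAME = "_hoodie_file_name";

    // Returns a copy of the record's meta fields as they would look after compaction.
    static Map<String, String> rewrite(Map<String, String> meta, String compactionInstant,
                                       String newSeqNo, String baseFileName, boolean preserve) {
        Map<String, String> out = new HashMap<>(meta);
        if (!preserve) {
            // Non-preserving path: commit time and sequence number are overwritten.
            out.put(COMMIT_TIME, compactionInstant);
            out.put(COMMIT_SEQNO, newSeqNo);
        }
        // Assumption: the file name always follows the record into the new base file.
        out.put(FILE_NAME, baseFileName);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put(COMMIT_TIME, "001");
        meta.put(COMMIT_SEQNO, "001_0_1");
        meta.put(FILE_NAME, "log-file-1");

        Map<String, String> kept = rewrite(meta, "003", "003_0_1", "base-file-3", true);
        System.out.println(kept.get(COMMIT_TIME) + " " + kept.get(FILE_NAME));
    }
}
```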

.setExtraMetadata(getExtraMetadata())
.setVersion(getPlanVersion())
.setPreserveHoodieMetadata(getWriteConfig().isPreserveHoodieCommitMetadata())
.setPreserveHoodieMetadata(getWriteConfig().isPreserveHoodieCommitMetadataForClustering())
Member

clustering should not change the commit time either.

@nsivabalan
Contributor Author

Sure. I have filed https://issues.apache.org/jira/browse/HUDI-3800 to track the work item for compaction, and https://issues.apache.org/jira/browse/HUDI-3801 for clustering.

6 participants