@nsivabalan
Contributor

@nsivabalan nsivabalan commented Feb 14, 2022

What is the purpose of the pull request

Flips the default value of preserve commit metadata for compaction to true. It stays disabled for the metadata table, since meta fields are not populated there.

To discuss:
When populate meta fields is disabled, should we check that preserve commit metadata can't be set to true? This might be an issue with Kafka Connect, where we directly create log files.
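A minimal sketch of what such a check could look like, assuming two boolean settings. The class and method names here are hypothetical and are not taken from Hudi's write config API:

```java
// Illustrative only: WriteConfigCheckSketch and validate(...) are hypothetical names,
// not part of Hudi. They show the shape of the guard being discussed.
class WriteConfigCheckSketch {

    /**
     * Rejects the combination under discussion: preserving commit metadata
     * makes no sense when meta fields are never populated in the first place.
     */
    static void validate(boolean populateMetaFields, boolean preserveCommitMetadata) {
        if (!populateMetaFields && preserveCommitMetadata) {
            throw new IllegalArgumentException(
                "preserve commit metadata cannot be true when populate meta fields is disabled");
        }
    }

    public static void main(String[] args) {
        validate(true, true);   // allowed: meta fields exist and are preserved
        validate(false, false); // allowed: nothing to preserve
        try {
            validate(false, true); // the combination this check is meant to reject
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether such a check should fail fast or silently fall back to not preserving is a design choice this sketch does not settle.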

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan nsivabalan changed the title from "[HUDI-3213][WIP] Making commit preserve metadata to true" to "[HUDI-3213][WIP] Making commit preserve metadata to true for compaction" Feb 14, 2022
@nsivabalan nsivabalan force-pushed the compactPreserveCommitMetadata branch from 9ae9222 to 7ed0a4a Compare February 15, 2022 21:21
@nsivabalan nsivabalan force-pushed the compactPreserveCommitMetadata branch from 480bf89 to 082ec6b Compare February 17, 2022 11:58
Comment on lines 296 to 299
if (preserveMetadata) {
// do not preserve FILENAME_METADATA_FIELD
Member

Is it always the case that we never preserve FILENAME, since compaction always creates new files? If so, it's an extra rule we have to remember for this config, whose name sounds like we preserve all meta columns.

Member

i don't have a better idea. just felt the confusion here.

Contributor Author

@nsivabalan nsivabalan Feb 21, 2022

Yes. By preserve commit metadata, we implicitly mean preserving every commit meta field (record key, partition path, commit time, commit seq no) except the file name. Every other meta field should be carried forward from the old record, but the file name has to represent the file where the record actually exists.
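The rule described above can be sketched as follows. This is a simplified illustration that uses plain maps in place of Avro records; the class and the `rewriteForCompaction` helper are hypothetical and are not the `rewriteRecord` method from this patch:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a Map stands in for an Avro record, and rewriteForCompaction
// is a hypothetical helper, not the method changed in this pull request.
class PreserveMetaSketch {
    static final String COMMIT_TIME = "_hoodie_commit_time";
    static final String COMMIT_SEQNO = "_hoodie_commit_seqno";
    static final String RECORD_KEY = "_hoodie_record_key";
    static final String PARTITION_PATH = "_hoodie_partition_path";
    static final String FILE_NAME = "_hoodie_file_name";

    // Meta fields carried forward from the old record; the file name is deliberately absent.
    static final List<String> CARRIED_FORWARD =
        List.of(COMMIT_TIME, COMMIT_SEQNO, RECORD_KEY, PARTITION_PATH);

    static Map<String, Object> rewriteForCompaction(
            Map<String, ?> dataFields, Map<String, ?> oldRecord, String newFileName) {
        Map<String, Object> rewritten = new LinkedHashMap<>();
        for (String field : CARRIED_FORWARD) {
            rewritten.put(field, oldRecord.get(field)); // preserved as-is
        }
        rewritten.put(FILE_NAME, newFileName); // always points at the newly written file
        rewritten.putAll(dataFields);          // merged payload columns
        return rewritten;
    }

    public static void main(String[] args) {
        Map<String, Object> old = new LinkedHashMap<>();
        old.put(COMMIT_TIME, "001");
        old.put(COMMIT_SEQNO, "001_0_1");
        old.put(RECORD_KEY, "key1");
        old.put(PARTITION_PATH, "2022/02/21");
        old.put(FILE_NAME, "old-file.parquet");

        Map<String, Object> data = new LinkedHashMap<>();
        data.put("value", 42);

        System.out.println(rewriteForCompaction(data, old, "new-file.parquet"));
    }
}
```

In this sketch the commit time of the compacted record stays at the original commit's value, while the file name reflects the file produced by compaction.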

@xushiyan xushiyan added priority:critical Production degraded; pipelines stalled writer-core labels Feb 20, 2022
@nsivabalan nsivabalan force-pushed the compactPreserveCommitMetadata branch 2 times, most recently from f80680b to b8f3071 Compare February 21, 2022 20:14
@nsivabalan
Contributor Author

@xushiyan: this might need some more work. I will sync up with you directly. The fix is not as trivial as I anticipated.

@nsivabalan nsivabalan force-pushed the compactPreserveCommitMetadata branch 2 times, most recently from 6ed7704 to ba8eca5 Compare February 22, 2022 23:35
@nsivabalan nsivabalan changed the title from "[HUDI-3213][WIP] Making commit preserve metadata to true for compaction" to "[HUDI-3213] Making commit preserve metadata to true for compaction" Feb 23, 2022
@nsivabalan
Contributor Author

@xushiyan : this is good to review.

@nsivabalan
Contributor Author

@xushiyan : if you are occupied, I can ask sagar or ethan to review this patch. let me know.

@nsivabalan nsivabalan force-pushed the compactPreserveCommitMetadata branch from ba8eca5 to 38cf843 Compare March 2, 2022 22:18
Member

@codope codope left a comment

Please check whether my comment below holds for this patch.

@nsivabalan nsivabalan force-pushed the compactPreserveCommitMetadata branch from 1b14058 to 8ccef8c Compare March 7, 2022 03:43
@hudi-bot
Collaborator

hudi-bot commented Mar 7, 2022

CI report:


@codope codope merged commit 3539578 into apache:master Mar 7, 2022
Member

@xushiyan xushiyan left a comment

@nsivabalan hey, sorry I failed to keep up with notifications. The changes look okay. The comments below are optional or can be handled as follow-ups.

IndexedRecord recordWithMetadataInSchema = rewriteRecord((GenericRecord) indexedRecord.get());
if (preserveMetadata) {
IndexedRecord recordWithMetadataInSchema = rewriteRecord((GenericRecord) indexedRecord.get(), preserveMetadata, oldRecord);
if (preserveMetadata && useWriterSchema) { // useWriteSchema will be true only incase of compaction.
Member

If this is the case, a better name could be useWriterSchemaForCompaction.

return newRecord;
}

public static GenericRecord rewriteRecord(GenericRecord genericRecord, Schema newSchema, boolean copyOverMetaFields, GenericRecord fallbackRecord) {
Member

It would be good to have at least one unit test covering this.

public static final String OPERATION_METADATA_FIELD = "_hoodie_operation";
public static final String HOODIE_IS_DELETED = "_hoodie_is_deleted";

public static int FILENAME_METADATA_FIELD_POS = 4;
Member

This could be a separate cleanup task: define constants for all meta fields and adopt them across the codebase.
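One possible shape for that cleanup, sketched as an enum whose ordinal doubles as the field position. The enum and class names are hypothetical; the only position confirmed by the diff above is the file name at 4, and the ordering of the remaining fields is an assumption:

```java
// Illustrative only: MetaFieldsSketch and MetaField are hypothetical names.
// Only FILE_NAME at position 4 is confirmed by the diff; the ordering of the
// remaining fields is an assumption made for this sketch.
class MetaFieldsSketch {

    enum MetaField {
        COMMIT_TIME("_hoodie_commit_time"),
        COMMIT_SEQNO("_hoodie_commit_seqno"),
        RECORD_KEY("_hoodie_record_key"),
        PARTITION_PATH("_hoodie_partition_path"),
        FILE_NAME("_hoodie_file_name");

        private final String fieldName;

        MetaField(String fieldName) {
            this.fieldName = fieldName;
        }

        String fieldName() {
            return fieldName;
        }

        // The declaration order defines the position, so name and position cannot drift apart.
        int pos() {
            return ordinal();
        }
    }

    public static void main(String[] args) {
        for (MetaField field : MetaField.values()) {
            System.out.println(field.pos() + " " + field.fieldName());
        }
    }
}
```

Keeping name and position in one declaration avoids a mutable standalone field such as a non-final position constant.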

vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…pache#4811)

* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
…pache#4811)

* Making commit preserve metadata to true

* Fixing integ tests

* Fixing preserve commit metadata for metadata table

* fixed bootstrap tests

* temp diff

* Fixing merge handle

* renaming fallback record

* fixing build issue

* Fixing test failures

Labels

priority:critical Production degraded; pipelines stalled

4 participants