-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field #4449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field #4449
Conversation
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to find a way to remove the reference to metadatapayload reference from HFileWriter and HFileReader. lets keep pushing in this direction
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
...i-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
dc9fe1b to
ce8a8d9
Compare
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to avoid these direct references at all costs. I/O layer cannot reference the metadata payload class directly
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
|
cc @prashantwason as well . I still think we need to avoid the references to HoodieMetadataPayload directly from HFileReader and HoodieHFileDataBlock level. The issue here that these layers currently don't take in any config object. For HFileReader - there is a hadoop Alternatively, a good Hudi citizen contributor (I just made that up), would have to plumb the interfaces more nicely to take a |
ce8a8d9 to
3d6e3e7
Compare
|
After discussions, made HoodieHFileReader the single source of truth for all HFile schema related fields. HFileReader already tracks other fields and it is meaningful to move the key field also here. MetadataPayload and HFileDataBlock will refer to HFileReader for the key field. Passing the properties to HFileDataBlock can come in separate PR as suggested. |
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileConfig.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
Outdated
Show resolved
Hide resolved
…d virtual keys - The backing log format for the metadata table is HFile, a KeyValue type. Since the key field in the metadata record payload is a duplicate of the Key in the Cell, the redundant key field in the record can be emptied to save on the cost. - HoodieHFileWriter and HoodieHFileDataBlock will now serialize records with the key field emptied by default. HFile writer tries to find if the record has metadata payload schema field 'key' and if so it does the key trimming from the record payload. - HoodieHFileReader when reading the serialized records back from disk, it materializes the missing keyFields if any. HFile reader tries to find if the record has metadata payload schema fiels 'key' and if so it does the key materialization in the record payload. - Tests have been added to verify the default virtual keys and key deduplication support for the metadata table records.
…d virtual keys - Added storage config property for hfile schema key field so that HFileWriter callers can pass in the field id explicitly for the key deduplication. - Updated the test issue after the latest merge with master which has compaction related changes.
…d virtual keys - Moving the metadata schema key field to HFileReader so that it can be the single source of truth for all the hfile schema related fields. HFileReader already tracks other schema fields. - HFileDataBlock, MetadataPayload will refer to HFileReader for the key field.
3d6e3e7 to
dd2fd12
Compare
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some edits. LGTM high level, please fix the remaining naming comments. Also look at the changes I made to avoid duplicate conversion of records
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileConfig.java
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileWriter.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
Outdated
Show resolved
Hide resolved
...i-spark-client/src/test/java/org/apache/hudi/client/functional/TestHoodieBackedMetadata.java
Show resolved
Hide resolved
|
LGTM. Nothing more to add to the existing open review comments. |
…d virtual keys - Renamed variables and refactored test to run for shorter duration
|
CI passed in the re-run - https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=5508&view=results |
nsivabalan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| public static final ConfigProperty<Boolean> POPULATE_META_FIELDS = ConfigProperty | ||
| .key(METADATA_PREFIX + ".populate.meta.fields") | ||
| .defaultValue(true) | ||
| .defaultValue(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only for metadata table. Its intentional that we are disabling meta fields, that is enabling virtual keys for metadata table.
…sed on hardcoded key field (apache#4449) * [HUDI-2763] Metadata table records - support for key deduplication and virtual keys - The backing log format for the metadata table is HFile, a KeyValue type. Since the key field in the metadata record payload is a duplicate of the Key in the Cell, the redundant key field in the record can be emptied to save on the cost. - HoodieHFileWriter and HoodieHFileDataBlock will now serialize records with the key field emptied by default. HFile writer tries to find if the record has metadata payload schema field 'key' and if so it does the key trimming from the record payload. - HoodieHFileReader when reading the serialized records back from disk, it materializes the missing keyFields if any. HFile reader tries to find if the record has metadata payload schema fiels 'key' and if so it does the key materialization in the record payload. - Tests have been added to verify the default virtual keys and key deduplication support for the metadata table records. Co-authored-by: Vinoth Chandar <[email protected]>
…sed on hardcoded key field (apache#4449) * [HUDI-2763] Metadata table records - support for key deduplication and virtual keys - The backing log format for the metadata table is HFile, a KeyValue type. Since the key field in the metadata record payload is a duplicate of the Key in the Cell, the redundant key field in the record can be emptied to save on the cost. - HoodieHFileWriter and HoodieHFileDataBlock will now serialize records with the key field emptied by default. HFile writer tries to find if the record has metadata payload schema field 'key' and if so it does the key trimming from the record payload. - HoodieHFileReader when reading the serialized records back from disk, it materializes the missing keyFields if any. HFile reader tries to find if the record has metadata payload schema fiels 'key' and if so it does the key materialization in the record payload. - Tests have been added to verify the default virtual keys and key deduplication support for the metadata table records. Co-authored-by: Vinoth Chandar <[email protected]>
What is the purpose of the pull request
The backing log format for the metadata table is HFile, a KeyValue type.
Since the key field in the metadata record payload is a duplicate of the
Key in the Cell, the redundant key field in the record can be emptied
to save on the cost.
HoodieHFileWriter and HoodieHFileDataBlock will now serialize records
with the key field emptied by default. HFile writer tries to find if
the record has metadata payload schema field 'key' and if so it does
the key trimming from the record payload.
HoodieHFileReader when reading the serialized records back from disk,
it materializes the missing keyFields if any. HFile reader tries to
find if the record has metadata payload schema fiels 'key' and if so
it does the key materialization in the record payload.
NOTE: There is a generic version of this PR at #4447
where the reader/writers of the log/base HFile format would say what the actual key field is
and the key deduplication and key materialization is done for the any such requested key field.
Verify this pull request
Tests have been added to verify the default virtual keys and key
deduplication support for the metadata table records.
Manually verified the serialized records on the disk are trimmed
off the key field
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.