@manojpec
Contributor

@manojpec manojpec commented Dec 27, 2021

What is the purpose of the pull request

  • The backing log format for the metadata table is HFile, a KeyValue type.
    Since the key field in the metadata record payload duplicates the key
    in the Cell, the redundant key field in the record can be emptied to
    save storage.

  • HoodieHFileWriter and HoodieHFileDataBlock now serialize records with
    the key field emptied by default. The HFile writer checks whether the
    record has the metadata payload schema field 'key' and, if so, trims
    the key from the record payload.

  • HoodieHFileReader, when reading the serialized records back from disk,
    materializes the missing key field, if any. The HFile reader checks
    whether the record has the metadata payload schema field 'key' and,
    if so, materializes the key in the record payload.
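
The write-side trimming and read-side materialization described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: a plain Map stands in for the Avro record, a String stands in for the key held in the HFile Cell, and the class and method names (`KeyDedupSketch`, `trimKey`, `materializeKey`) are hypothetical. Only the field name 'key' comes from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Map stands in for the Avro record, and cellKey
// stands in for the key stored in the HFile Cell.
public class KeyDedupSketch {
    // Field name taken from the PR description; everything else is assumed.
    public static final String KEY_FIELD = "key";

    // Write path: blank out the key field if the record has one.
    public static Map<String, Object> trimKey(Map<String, Object> record) {
        Map<String, Object> copy = new HashMap<>(record);
        if (copy.containsKey(KEY_FIELD)) {
            copy.put(KEY_FIELD, "");
        }
        return copy;
    }

    // Read path: restore the key field from the cell key if the record has one.
    public static Map<String, Object> materializeKey(String cellKey, Map<String, Object> stored) {
        Map<String, Object> copy = new HashMap<>(stored);
        if (copy.containsKey(KEY_FIELD)) {
            copy.put(KEY_FIELD, cellKey);
        }
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new HashMap<>();
        record.put(KEY_FIELD, "partition-1");
        record.put("type", 2);

        Map<String, Object> onDisk = trimKey(record);
        Map<String, Object> readBack = materializeKey("partition-1", onDisk);
        // The round trip should give back a record equal to the original.
        System.out.println(readBack.equals(record));
    }
}
```

Records without a 'key' field pass through both methods unchanged, which mirrors the "if so" condition in the bullets above.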

NOTE: There is a generic version of this PR at #4447,
where the readers/writers of the log/base HFile format specify what the actual key field is,
and key deduplication and key materialization are done for any such requested key field.

Verify this pull request

  • Tests have been added to verify the default virtual keys and key
    deduplication support for the metadata table records.

  • Manually verified that the key field is trimmed from the serialized
    records on disk.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@manojpec manojpec changed the title from "[HUDI-2763] Metadata table records - support for key deduplication and virtual keys" to "[HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field" on Dec 27, 2021
Member

@vinothchandar vinothchandar left a comment

Need to find a way to remove the reference to the metadata payload from HFileWriter and HFileReader. Let's keep pushing in this direction.

@manojpec manojpec force-pushed the fix/HUDI-2763-metadata-table-avoid-redundant-key-in-payload-3 branch from dc9fe1b to ce8a8d9 on January 10, 2022 02:28
Member

@vinothchandar vinothchandar left a comment

We need to avoid these direct references at all costs. The I/O layer cannot reference the metadata payload class directly.

@vinothchandar
Member

cc @prashantwason as well.

I still think we need to avoid direct references to HoodieMetadataPayload from the HFileReader and HoodieHFileDataBlock level. The issue here is that these layers currently don't take in any config object.

For HFileReader - there is a Hadoop configuration object that can be used to pass the field name.
For HFileDataBlock - the log format writer creation could take an additional param.

Alternatively, a good Hudi citizen contributor (I just made that up) would have to plumb the interfaces more nicely to take a Properties and pass the writeConfig/commonConfig all the way to the reader/writer. This will be useful for many things going forward. This has to happen in a separate PR.
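
For illustration only, the idea of handing the key field name to the I/O layer through a config object, rather than having it reference the payload class, might look like the sketch below. The class names and the property name `example.hfile.schema.key.field` are made up for this example and are not Hudi APIs or Hudi config keys.

```java
import java.util.Optional;
import java.util.Properties;

public class KeyFieldConfigSketch {
    // Hypothetical property name, invented for this sketch.
    public static final String KEY_FIELD_PROP = "example.hfile.schema.key.field";

    // Hypothetical writer: it learns the key field only from the Properties
    // it is given and never references a payload class.
    public static final class SketchWriter {
        private final Optional<String> keyField;

        public SketchWriter(Properties props) {
            this.keyField = Optional.ofNullable(props.getProperty(KEY_FIELD_PROP));
        }

        public Optional<String> keyField() {
            return keyField;
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(KEY_FIELD_PROP, "key");

        SketchWriter writer = new SketchWriter(props);
        // With no property set, the writer would simply skip key deduplication.
        System.out.println(writer.keyField().orElse("<none>"));
    }
}
```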

@manojpec manojpec force-pushed the fix/HUDI-2763-metadata-table-avoid-redundant-key-in-payload-3 branch from ce8a8d9 to 3d6e3e7 on January 15, 2022 00:27
@manojpec
Contributor Author

@prashantwason @vinothchandar

After discussions, I made HoodieHFileReader the single source of truth for all HFile schema-related fields. HFileReader already tracks other fields, so it makes sense to move the key field here as well. MetadataPayload and HFileDataBlock will refer to HFileReader for the key field. Passing the properties to HFileDataBlock can come in a separate PR as suggested.
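
As a rough sketch of that arrangement (all class and constant names here are placeholders, not the actual Hudi classes or fields), the reader owns the key field name and the other two components only refer to it:

```java
public class SingleSourceSketch {
    // Placeholder for the reader: the only place the key field name is defined.
    public static final class SketchHFileReader {
        public static final String KEY_FIELD_NAME = "key";
    }

    // Placeholder for the metadata payload: refers to the reader's constant.
    public static final class SketchMetadataPayload {
        public static String keyFieldName() {
            return SketchHFileReader.KEY_FIELD_NAME;
        }
    }

    // Placeholder for the data block: also refers to the reader's constant.
    public static final class SketchHFileDataBlock {
        public static String keyFieldName() {
            return SketchHFileReader.KEY_FIELD_NAME;
        }
    }

    public static void main(String[] args) {
        // Both consumers resolve to the same definition.
        System.out.println(SketchMetadataPayload.keyFieldName()
            .equals(SketchHFileDataBlock.keyFieldName()));
    }
}
```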

…d virtual keys

 - Added storage config property for hfile schema key field so that
   HFileWriter callers can pass in the field id explicitly for the
   key deduplication.

 - Updated the test issue after the latest merge with master which has
   compaction related changes.
…d virtual keys

 - Moving the metadata schema key field to HFileReader so that it can be
   the single source of truth for all the hfile schema related fields.
   HFileReader already tracks other schema fields.

 - HFileDataBlock, MetadataPayload will refer to HFileReader for the key
   field.
@manojpec manojpec force-pushed the fix/HUDI-2763-metadata-table-avoid-redundant-key-in-payload-3 branch from 3d6e3e7 to dd2fd12 on January 20, 2022 18:58
Member

@vinothchandar vinothchandar left a comment

Made some edits. LGTM at a high level; please fix the remaining naming comments. Also look at the changes I made to avoid duplicate conversion of records.

@prashantwason
Member

LGTM. Nothing more to add to the existing open review comments.

…d virtual keys

 - Renamed variables and refactored test to run for shorter duration

Contributor

@nsivabalan nsivabalan left a comment

LGTM

@nsivabalan nsivabalan merged commit f87c473 into apache:master Jan 26, 2022
  public static final ConfigProperty<Boolean> POPULATE_META_FIELDS = ConfigProperty
      .key(METADATA_PREFIX + ".populate.meta.fields")
-     .defaultValue(true)
+     .defaultValue(false)
Contributor

Why is this changing?

Contributor Author

This is only for the metadata table. It's intentional that we are disabling meta fields, that is, enabling virtual keys for the metadata table.

liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
…sed on hardcoded key field (apache#4449)

Co-authored-by: Vinoth Chandar <[email protected]>
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…sed on hardcoded key field (apache#4449)
