@prashantwason
Member

Brief change log

Adds a record level index which stores the mapping within the metadata table.

Pre-requisite: Metadata Table version 2: #3426

Verify this pull request

Unit tests to be added

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

1. Removed code which calls syncTableMetadata
2. Unit tests are broken because not all functionality has been implemented yet.
Reader does not need to merge the instants in memory. It simply opens the base and log files, validates which log blocks to read (should have completed instants on dataset timeline).
…inst file listing.

Validation does not work in several cases, especially with multi-writer, so it's best to remove it.
We cannot perform compaction if there are previous inflight operations on the dataset. This is because a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx.
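
The compaction precondition above can be sketched as a simple check. This is only an illustration, not the code in this PR: `CompactionGate` and `canCompact` are hypothetical names, and it assumes instant times are fixed-width `YYYYMMDDHHMMSS` strings, so that string order matches time order.

```java
import java.util.Collection;

// Hypothetical helper, not part of Hudi: decides whether the metadata table
// may be compacted at a given instant time.
final class CompactionGate {
    private CompactionGate() {}

    // Returns true only if no inflight dataset instant is earlier than the
    // proposed compaction instant. Assumes fixed-width YYYYMMDDHHMMSS strings,
    // for which lexicographic order equals chronological order.
    static boolean canCompact(Collection<String> inflightInstants, String compactionInstant) {
        return inflightInstants.stream()
                .noneMatch(t -> t.compareTo(compactionInstant) < 0);
    }
}
```

With an inflight instant at `20211103100000`, compaction at `20211103110000` would be refused under this sketch, since the base file could not yet represent that pending action.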
1. There will be a fixed number of shards for each Metadata Table partition.
2. Shards are implemented using filenames of format fileId00ABCD where ABCD is the shard number. This allows easy identification of the files and their order while still keeping the names unique.
3. Shards are pre-allocated at bootstrap time.
4. Currently the files partition has only 1 shard, but this code is required for the record-level index, so it is implemented here.
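
A minimal sketch of the shard naming described above, under stated assumptions: the class `ShardNaming`, its methods, and the `files-` prefix used in the example are hypothetical, and the layout is read literally as a prefix, then `00`, then a zero-padded four-digit shard number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Hudi: builds and parses shard file IDs
// of the form <prefix>00ABCD, where ABCD is the shard number.
final class ShardNaming {
    private ShardNaming() {}

    // Builds the file ID for one shard; four digits limit shards to 0..9999.
    static String shardFileId(String prefix, int shardIndex) {
        if (shardIndex < 0 || shardIndex > 9999) {
            throw new IllegalArgumentException("shard index out of range: " + shardIndex);
        }
        return prefix + "00" + String.format(Locale.ROOT, "%04d", shardIndex);
    }

    // Recovers the shard number from the last four characters of the file ID.
    static int shardIndexOf(String fileId) {
        return Integer.parseInt(fileId.substring(fileId.length() - 4));
    }

    // Pre-allocates the fixed set of shard file IDs for a partition,
    // as would happen once at bootstrap.
    static List<String> preallocate(String prefix, int shardCount) {
        List<String> ids = new ArrayList<>(shardCount);
        for (int i = 0; i < shardCount; i++) {
            ids.add(shardFileId(prefix, i));
        }
        return ids;
    }
}
```

For example, `ShardNaming.shardFileId("files-", 7)` yields `files-000007`, and sorting the generated names preserves shard order because the numeric suffix is fixed width.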
…han latest compaction on metadata table.

LogBlocks written to the log file of Metadata Table need to be validated - they are used only if they correspond to a completed action on the dataset.
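
The validation rule above might look roughly like the following. It is a sketch only: `LogBlockFilter` and its nested `LogBlock` type are hypothetical stand-ins rather than Hudi classes, and the set of completed instants is assumed to come from the dataset timeline.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch, not Hudi code: keeps only the log blocks whose
// instant corresponds to a completed action on the dataset.
final class LogBlockFilter {
    private LogBlockFilter() {}

    // Stand-in for a metadata-table log block: the instant that wrote it
    // plus an opaque payload.
    static final class LogBlock {
        final String instantTime;
        final String payload;

        LogBlock(String instantTime, String payload) {
            this.instantTime = instantTime;
            this.payload = payload;
        }
    }

    // Blocks written by inflight or failed actions are skipped by the reader.
    static List<LogBlock> validBlocks(List<LogBlock> blocks, Set<String> completedInstants) {
        return blocks.stream()
                .filter(b -> completedInstants.contains(b.instantTime))
                .collect(Collectors.toList());
    }
}
```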
…ing the table and re-bootstrapping.

The two versions differ in schema and it's complicated to check whether the table is in sync. So it's simpler to re-bootstrap, as only the file listing needs to be re-bootstrapped.
Since each operation on metadata table writes to the same files (file-listing partition has a single FileSlice), we can only allow single-writer access to the metadata table. For this, the Transaction Manager is used to lock the table before any updates.
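
The single-writer constraint can be illustrated with an in-process lock. This is a stand-in under loud assumptions: `SingleWriterGuard` is a hypothetical class built on `java.util.concurrent`, not Hudi's Transaction Manager, and a lock inside one JVM does not protect against writers in other processes.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical stand-in for a transaction manager: serializes updates so
// that only one writer touches the metadata table files at a time.
final class SingleWriterGuard {
    private final ReentrantLock lock = new ReentrantLock();

    // Runs the update while holding the lock and always releases it,
    // even if the update throws.
    <T> T withLock(Supplier<T> update) {
        lock.lock();
        try {
            return update.get();
        } finally {
            lock.unlock();
        }
    }

    // Exposed so callers can check that the lock is not leaked.
    boolean isLocked() {
        return lock.isLocked();
    }
}
```

The try/finally shape is the relevant part: every update to the shared file slice happens between acquiring and releasing the lock.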
Saving partition, fileID and instantTime.
 - The fileID is encoded into two longs (16 bytes) instead of a UUID string (36 chars).
 - Instant time is encoded into an int (4 bytes) rather than a YYYYMMDDHHMMSS string (14 bytes).
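
A sketch of how such a compact encoding could work. The assumptions are mine, not confirmed by this PR: the fileID is treated as a plain 36-character UUID string, and the instant time is stored as epoch seconds in UTC, which fits in an int until 2038. `CompactLocationEncoding` and its methods are hypothetical names.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

// Hypothetical sketch, not the PR's implementation: packs a UUID file ID
// into two longs and a YYYYMMDDHHMMSS instant into one int.
final class CompactLocationEncoding {
    private static final DateTimeFormatter INSTANT_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    private CompactLocationEncoding() {}

    // 36-char UUID string -> {most significant bits, least significant bits}.
    static long[] encodeFileId(String uuid) {
        UUID u = UUID.fromString(uuid);
        return new long[] {u.getMostSignificantBits(), u.getLeastSignificantBits()};
    }

    static String decodeFileId(long high, long low) {
        return new UUID(high, low).toString();
    }

    // Assumption: the int holds epoch seconds in UTC. Math.toIntExact
    // throws ArithmeticException instead of silently overflowing.
    static int encodeInstantTime(String instantTime) {
        long epochSeconds = LocalDateTime.parse(instantTime, INSTANT_FORMAT)
                .toEpochSecond(ZoneOffset.UTC);
        return Math.toIntExact(epochSeconds);
    }

    static String decodeInstantTime(int encoded) {
        return LocalDateTime.ofEpochSecond(encoded, 0, ZoneOffset.UTC)
                .format(INSTANT_FORMAT);
    }
}
```

Under these assumptions each location shrinks from 50 characters (36 + 14) to 20 bytes (16 + 4), before counting the partition.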
Improved key reading from metadata table by allowing pre loading of keys to a cache.
@loukey-lj
Contributor

org.apache.hudi.common.config.HoodieMetadataConfig#RECORD_LEVEL_INDEX_SHARD_COUNT_PROP key name incorrect?

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Nov 3, 2021
@hudi-bot
Collaborator

hudi-bot commented Nov 5, 2021

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@danny0405
Contributor

Hi @prashantwason, I see that this PR is marked as a release blocker, but it has been inactive for a long time. Should we consider moving it to release 0.11 or a later one?

@nsivabalan nsivabalan removed the priority:blocker Production down; release blocker label Nov 18, 2021
@prashantwason
Member Author

Closing in favor of #5581.
