[HUDI-53] Adding a record level index based on the Metadata Table v2 #3508
Closed
prashantwason wants to merge 15 commits into apache:master from prashantwason:pw_record_level_index_oss
1. Removed code that calls syncTableMetadata.
2. Unit tests are broken because not all functionality has been implemented yet.
The reader does not need to merge the instants in memory. It simply opens the base and log files and validates which log blocks to read (each block should correspond to a completed instant on the dataset timeline).
…inst file listing. Validation does not work in several cases, especially with multi-writer, so it's best to remove it.
We cannot perform compaction if there are previous inflight operations on the dataset. This is because a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx.
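The precondition above can be sketched as a simple check. This is a minimal illustration, not the code from this PR: the class `CompactionGate` and the method `canCompact` are hypothetical names, and the sketch assumes instant times are fixed-width timestamp strings, so that lexicographic order matches chronological order.

```java
import java.util.Collection;

final class CompactionGate {
    /**
     * Returns true only if no inflight dataset instant is earlier than the
     * proposed compaction instant. Assumes fixed-width timestamp strings,
     * so String.compareTo orders them chronologically.
     */
    static boolean canCompact(String compactionInstant, Collection<String> inflightInstants) {
        return inflightInstants.stream()
            .noneMatch(inflight -> inflight.compareTo(compactionInstant) < 0);
    }
}
```

An inflight instant later than the compaction instant does not block compaction in this sketch, because the base file at time Tx only has to cover actions up to Tx.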
1. There will be a fixed number of shards for each Metadata Table partition.
2. Shards are implemented using filenames of the format fileId00ABCD, where ABCD is the shard number. This allows easy identification of the files and their order while still keeping the names unique.
3. Shards are pre-allocated at bootstrap time.
4. Currently the files partition has only 1 shard, but this code is required for the record-level index, so it is implemented here.
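The naming scheme can be sketched as follows. This is an illustration under stated assumptions: the shard number is taken to be a four-digit, zero-padded decimal suffix, and the prefix `files-`, the class `ShardFileIds`, and its methods are hypothetical names rather than the identifiers used in this PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class ShardFileIds {
    // Assumption: the shard number is a four-digit, zero-padded decimal suffix.
    static final int SUFFIX_WIDTH = 4;

    /** Builds a file ID such as "files-0007" from a prefix and a shard index. */
    static String shardFileId(String prefix, int shardIndex) {
        if (shardIndex < 0 || shardIndex > 9999) {
            throw new IllegalArgumentException("shard index out of range: " + shardIndex);
        }
        return String.format(Locale.ROOT, "%s%04d", prefix, shardIndex);
    }

    /** Recovers the shard index from the fixed-width suffix. */
    static int shardIndexOf(String fileId) {
        return Integer.parseInt(fileId.substring(fileId.length() - SUFFIX_WIDTH));
    }

    /** Pre-allocates the file IDs for all shards of one partition, in shard order. */
    static List<String> preallocate(String prefix, int shardCount) {
        List<String> ids = new ArrayList<>(shardCount);
        for (int i = 0; i < shardCount; i++) {
            ids.add(shardFileId(prefix, i));
        }
        return ids;
    }
}
```

Because the suffix has a fixed width, sorting the file IDs as strings also sorts them by shard number.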
…han latest compaction on metadata table. LogBlocks written to the log file of Metadata Table need to be validated - they are used only if they correspond to a completed action on the dataset.
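The validation rule can be sketched as a filter over log blocks. The `LogBlock` type below is a hypothetical stand-in for illustration, not Hudi's log block class; the only assumption is that each block carries the instant time of the dataset action that produced it.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class LogBlockFilter {
    /** Hypothetical log block: the instant that wrote it plus an opaque payload. */
    static final class LogBlock {
        final String instantTime;
        final String payload;

        LogBlock(String instantTime, String payload) {
            this.instantTime = instantTime;
            this.payload = payload;
        }
    }

    /** Keeps only blocks whose instant is completed on the dataset timeline. */
    static List<LogBlock> readable(List<LogBlock> blocks, Set<String> completedInstants) {
        return blocks.stream()
            .filter(block -> completedInstants.contains(block.instantTime))
            .collect(Collectors.toList());
    }
}
```

Blocks written by inflight or failed actions stay in the log file but are skipped at read time.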
…ing the table and re-bootstrapping. The two versions differ in schema and it's complicated to check whether the table is in sync. So it's simpler to re-bootstrap, as it's only the file listing that needs to be re-bootstrapped.
Since each operation on metadata table writes to the same files (file-listing partition has a single FileSlice), we can only allow single-writer access to the metadata table. For this, the Transaction Manager is used to lock the table before any updates.
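The single-writer pattern can be illustrated with a plain in-process lock. This sketch does not use Hudi's Transaction Manager, whose API is not shown in this conversation; `SingleWriterGuard` is a hypothetical stand-in built on `java.util.concurrent.locks.ReentrantLock`, and it only serializes writers inside one JVM.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

final class SingleWriterGuard {
    private final ReentrantLock lock = new ReentrantLock();

    /** Runs the update while holding the lock and always releases it afterwards. */
    <T> T withLock(Supplier<T> update) {
        lock.lock();
        try {
            return update.get();
        } finally {
            lock.unlock();
        }
    }

    /** Exposed so callers (and tests) can check whether they are inside the guard. */
    boolean isHeldByCurrentThread() {
        return lock.isHeldByCurrentThread();
    }
}
```

A real deployment with writers in several processes would need a distributed lock in place of the in-process one.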
Saving partition, fileID and instantTime.
- The fileID is encoded into two longs (16 bytes) instead of a UUID string (36 chars).
- Instant time is encoded into an int (4 bytes) rather than a YYYYMMDDHHMMSS string (14 bytes).
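A rough sketch of this encoding follows. The commit message does not say how the instant time is mapped to an int; a 14-digit number does not fit into 32 bits, so this sketch assumes epoch seconds in UTC. It also assumes the fileID is a plain UUID string with no suffix. `CompactRecordLocation` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.UUID;

final class CompactRecordLocation {
    private static final DateTimeFormatter INSTANT_FORMAT =
        DateTimeFormatter.ofPattern("uuuuMMddHHmmss");

    /** Packs the fileID (16 bytes) and the instant time (4 bytes) into 20 bytes. */
    static byte[] encode(String fileId, String instantTime) {
        UUID uuid = UUID.fromString(fileId);
        // Assumption: the instant is stored as epoch seconds in UTC.
        long epochSeconds = LocalDateTime.parse(instantTime, INSTANT_FORMAT)
            .toEpochSecond(ZoneOffset.UTC);
        return ByteBuffer.allocate(20)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .putInt(Math.toIntExact(epochSeconds)) // throws if the value overflows an int
            .array();
    }

    /** Rebuilds the 36-character UUID string from the first 16 bytes. */
    static String decodeFileId(byte[] encoded) {
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        long mostSignificant = buffer.getLong();
        long leastSignificant = buffer.getLong();
        return new UUID(mostSignificant, leastSignificant).toString();
    }

    /** Rebuilds the YYYYMMDDHHMMSS string from the last 4 bytes. */
    static String decodeInstantTime(byte[] encoded) {
        int epochSeconds = ByteBuffer.wrap(encoded).getInt(16);
        return LocalDateTime.ofEpochSecond(epochSeconds, 0, ZoneOffset.UTC)
            .format(INSTANT_FORMAT);
    }
}
```

Under these assumptions each record location shrinks from 50 characters to 20 bytes, at the cost of the signed 32-bit epoch-second limit in 2038.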
Includes config
Improved key reading from the metadata table by allowing preloading of keys into a cache.
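The idea of preloading keys can be sketched as a small read-through cache. Everything here is hypothetical: `PreloadingKeyReader` is not a class from this PR, and the `bulkLookup` function stands in for whatever batched read the metadata table reader performs.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

final class PreloadingKeyReader {
    // Hypothetical batched read: maps the requested keys to the values that exist.
    private final Function<Collection<String>, Map<String, String>> bulkLookup;
    private final Map<String, String> cache = new HashMap<>();
    private int bulkReads = 0;

    PreloadingKeyReader(Function<Collection<String>, Map<String, String>> bulkLookup) {
        this.bulkLookup = bulkLookup;
    }

    /** Fetches all keys that are not cached yet with a single bulk read. */
    void preload(Collection<String> keys) {
        List<String> missing = keys.stream()
            .filter(key -> !cache.containsKey(key))
            .collect(Collectors.toList());
        if (!missing.isEmpty()) {
            bulkReads++;
            cache.putAll(bulkLookup.apply(missing));
        }
    }

    /** Serves from the cache, falling back to a single-key read on a miss. */
    Optional<String> get(String key) {
        if (!cache.containsKey(key)) {
            preload(Collections.singletonList(key));
        }
        return Optional.ofNullable(cache.get(key));
    }

    /** Number of bulk reads issued so far. */
    int bulkReads() {
        return bulkReads;
    }
}
```

In this sketch, keys that do not exist in the store are not cached, so repeated lookups of an absent key each trigger a read.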
Contributor
Is the key name of org.apache.hudi.common.config.HoodieMetadataConfig#RECORD_LEVEL_INDEX_SHARD_COUNT_PROP incorrect?
Contributor
Hi @prashantwason, I see that this PR is marked as a release blocker, but it has been inactive for a long time. Should we consider moving it to release 0.11 or something?
Member
Author
Closing in favor of #5581.
Brief change log
Adds a record level index which stores the mapping within the metadata table.
Prerequisite: Metadata Table version 2 (#3426)
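With a fixed shard count per partition, a record key has to be routed to one shard. The PR description does not say how; the sketch below assumes a simple hash-modulo scheme, and `RecordIndexSharding` and `shardFor` are hypothetical names.

```java
final class RecordIndexSharding {
    /**
     * Maps a record key to a shard in [0, shardCount).
     * Assumption: hash-modulo routing. Math.floorMod keeps the result
     * non-negative even when hashCode() is negative.
     */
    static int shardFor(String recordKey, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive: " + shardCount);
        }
        return Math.floorMod(recordKey.hashCode(), shardCount);
    }
}
```

`String.hashCode()` is defined by the Java specification, so the same key lands in the same shard across JVMs. Changing the shard count would remap keys, which fits with shards being fixed and allocated at bootstrap.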
Verify this pull request
Unit tests to be added
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.