-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-1072] Add replace metadata file to timeline #1853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| "fields": [ | ||
| {"name": "totalFilesReplaced", "type": "int"}, | ||
| {"name": "command", "type": "string"}, | ||
| {"name": "partitionMetadata", "type": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the partitionMetadata contain partitionPath -> new file groups ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be obtained from related commit file. Every t.replace has a corresponding t.[delta]commit (we decided to write two files to reduce overhead of reading all commit files in the query path)
Plus, it helps on query side to keep replace metadata files small. I also didn't want to repeat same information in two places.
Let me know if you think its important to store new file group info here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level Question : To make sure we are all on the same page : Is this metadata enough to achieve clustering ? Do you foresee any changes that needs to happen to this metadata to support clustering ? The PR mentions that this is for both clustering and overwrite. Hence, asking this question.
| } | ||
|
|
||
| /** | ||
| * Transition Clean State from inflight to Committed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/Clean/Replace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update. thanks
| INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION, | ||
| INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, INFLIGHT_CLEAN_EXTENSION, | ||
| INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION)); | ||
| INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the intention of this PR is to simply introduce a new action, do we need to change the VALID_ACTION in this PR ? Do the other methods you implemented in the ActiveTimeline require this ? I'm wondering if we can avoid this change in case we want to land this PR incrementally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this for tests. But can split this and related test out into a different review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its better to avoid this for rollout purpose. In case, this PR gets landed before the next and a release cut, then we need to worry about ordering of rollout between readers and writers.
| * | ||
| * @return | ||
| */ | ||
| HoodieTimeline getCompletedAndReplaceTimeline(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be you have used this API in a different PR, but where exactly are we going to use this specific method ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, this is used to build filesystem view in #1859, can move this to that PR
n3nash
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level, introducing replace action changes seem fine to me, interested in learning how old_file_group -> new_file_group mapping is stored and accessed. Yet to review the other PRs if they have this info..
This is not 1:1 mapping. So this is not being stored in replace metadata file. We can get new file groups created from corresponding commit metadata file. As far as I can see, we only need this information for auditing, debugging. So I'm planning to build this with CLI task. Let me know if you see this useful in other scenarios. |
| "fields": [ | ||
| {"name": "totalFilesReplaced", "type": "int"}, | ||
| {"name": "command", "type": "string"}, | ||
| {"name": "partitionMetadata", "type": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level Question : To make sure we are all on the same page : Is this metadata enough to achieve clustering ? Do you foresee any changes that needs to happen to this metadata to support clustering ? The PR mentions that this is for both clustering and overwrite. Hence, asking this question.
| INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENSION, SAVEPOINT_EXTENSION, | ||
| INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, REQUESTED_CLEAN_EXTENSION, INFLIGHT_CLEAN_EXTENSION, | ||
| INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION)); | ||
| INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its better to avoid this for rollout purpose. In case, this PR gets landed before the next and a release cut, then we need to worry about ordering of rollout between readers and writers.
| String[] VALID_ACTIONS_IN_TIMELINE = {COMMIT_ACTION, DELTA_COMMIT_ACTION, | ||
| CLEAN_ACTION, SAVEPOINT_ACTION, RESTORE_ACTION, ROLLBACK_ACTION, | ||
| COMPACTION_ACTION}; | ||
| COMPACTION_ACTION, REPLACE_ACTION}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
| * | ||
| * @return | ||
| */ | ||
| HoodieTimeline getCompletedAndReplaceTimeline(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the return timeline contains only replace timeline ? The naming is confusing to me. How about getValidReplaceTimeline or anything else to reflect the intention
|
@satishkotha is this WIP? |
|
This has been replaced by #2048 |
What is the purpose of the pull request
This is part of work required for RFC-18 and RFC-19. Add replace action to valid actions in the timeline.
To keep the diff small and get feedback, i am sending just the structure of metadata. For examples of how this will be used, see POC here satishkotha@45d8c26.
I'm happy to bring in other changes from POC if you think we can push large change.
Brief change log
Verify this pull request
This change added unit tests
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.