[WIP] [HUDI-1072] Use replace metadata file to filter excluded files in views #1859
ccd2611 to fd9291b
This name, getCommitsReplaceAndCompactionTimeline, irks me. We should introduce another hierarchy to group our actions: {commit, delta, compaction, replace} introduce new file groups, while {rollback, restore, clean} remove file groups, etc. Need to think more.
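One way the grouping suggested above could look, as a rough sketch only. `FileGroupEffect`, `TimelineAction` and `withEffect` are hypothetical names made up for illustration, not existing Hudi classes, and `DELTA_COMMIT` is my reading of "delta" in the comment.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical: whether an action introduces or removes file groups.
enum FileGroupEffect { ADDS_FILE_GROUPS, REMOVES_FILE_GROUPS }

// Hypothetical action enum, grouped exactly as proposed in the review comment.
enum TimelineAction {
    COMMIT(FileGroupEffect.ADDS_FILE_GROUPS),
    DELTA_COMMIT(FileGroupEffect.ADDS_FILE_GROUPS),
    COMPACTION(FileGroupEffect.ADDS_FILE_GROUPS),
    REPLACE(FileGroupEffect.ADDS_FILE_GROUPS),
    ROLLBACK(FileGroupEffect.REMOVES_FILE_GROUPS),
    RESTORE(FileGroupEffect.REMOVES_FILE_GROUPS),
    CLEAN(FileGroupEffect.REMOVES_FILE_GROUPS);

    private final FileGroupEffect effect;

    TimelineAction(FileGroupEffect effect) {
        this.effect = effect;
    }

    FileGroupEffect effect() {
        return effect;
    }

    // Collects every action that has the given effect on file groups.
    static Set<TimelineAction> withEffect(FileGroupEffect effect) {
        EnumSet<TimelineAction> result = EnumSet.noneOf(TimelineAction.class);
        for (TimelineAction action : values()) {
            if (action.effect == effect) {
                result.add(action);
            }
        }
        return result;
    }
}
```

With something like this, a timeline getter could take a `FileGroupEffect` instead of encoding every action name in the method name.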
+1 need a name to capture this more nicely.
Does this mean that we can never go back to querying the older file groups once they have been replaced? Can you still do time travel for insert-overwrite use cases?
Yes, for time travel, consider this scenario:
t0 -> insert
t1 -> insert overwrite1
t2 -> insert overwrite2
If we set the high watermark to t1 for time travel, visibleCommitTimeline would not have t2.commit or t2.replace, so the file groups written at t1 would still show as active file groups.
When we move to t2, visibleCommitTimeline will have t2.commit and t2.replace, so the file groups written at t1 will no longer show as active.
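The scenario above can be simulated with a small standalone sketch. Everything here is hypothetical and only mimics the idea: `Instant`, `activeFileGroups` and the `fg-*` ids are illustrative names, instants are assumed to be sorted by time, and the real filtering happens on visibleCommitTimeline inside the file system views.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TimeTravelSketch {
    // Hypothetical completed instant: file groups it wrote and file groups it replaced.
    static final class Instant {
        final String time;
        final Set<String> written;
        final Set<String> replaced;

        Instant(String time, Set<String> written, Set<String> replaced) {
            this.time = time;
            this.written = written;
            this.replaced = replaced;
        }
    }

    // Replays only the instants at or before the high watermark (timeline sorted by time).
    static Set<String> activeFileGroups(List<Instant> timeline, String highWatermark) {
        Set<String> active = new LinkedHashSet<>();
        for (Instant instant : timeline) {
            if (instant.time.compareTo(highWatermark) > 0) {
                continue; // instant is not visible, so its replace metadata is ignored
            }
            active.removeAll(instant.replaced);
            active.addAll(instant.written);
        }
        return active;
    }

    // t0 -> insert, t1 -> insert overwrite1, t2 -> insert overwrite2
    static List<Instant> exampleTimeline() {
        List<Instant> timeline = new ArrayList<>();
        timeline.add(new Instant("t0", Set.of("fg-0"), Set.of()));
        timeline.add(new Instant("t1", Set.of("fg-1"), Set.of("fg-0")));
        timeline.add(new Instant("t2", Set.of("fg-2"), Set.of("fg-1")));
        return timeline;
    }

    public static void main(String[] args) {
        System.out.println(activeFileGroups(exampleTimeline(), "t1")); // prints [fg-1]
        System.out.println(activeFileGroups(exampleTimeline(), "t2")); // prints [fg-2]
    }
}
```

With the watermark at t1, the replace done at t2 is never applied, so fg-1 stays active; moving the watermark to t2 drops it.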
+1 we should ideally have a test for this
The test here simulates rollback of a replace instant.
I can add another one that filters the timeline to move the high watermark.
n3nash
left a comment
Reviewed 50%. At a high level, I feel excludeFileGroups is being forced into many of the TableFileSystem implementations. Need to think more about whether there is a way to introduce the correct abstractions, so we avoid having to add excludeFileGroups everywhere.
Yes, the intent is to get early feedback. Appreciate any suggestions. The reason I added excludeFileGroups in all views is that in some cases this list may be huge, so having a configurable spillable view (or RocksDB view) can be useful. It is also possible to encapsulate all this in AbstractFileView and hide it from subclasses. Let me know if you think that is a better solution.
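For reference, the "encapsulate it in the abstract view" option could look roughly like the sketch below. This is not Hudi's actual view API: `AbstractViewSketch`, `InMemoryViewSketch`, `fetchAllFileGroupIds` and `getVisibleFileGroupIds` are hypothetical names, and the excluded ids are kept in a plain `HashSet` here, which ignores the concern above that the list may be too large for memory.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical base view: owns the exclusion list so subclasses never see it.
abstract class AbstractViewSketch {
    private final Set<String> excludedFileGroupIds = new HashSet<>();

    // Called when replace metadata is read from the timeline.
    final void excludeFileGroups(Set<String> fileGroupIds) {
        excludedFileGroupIds.addAll(fileGroupIds);
    }

    // Subclasses only describe how stored file group ids are fetched.
    protected abstract List<String> fetchAllFileGroupIds(String partition);

    // Filtering happens once, here, for every storage implementation.
    final List<String> getVisibleFileGroupIds(String partition) {
        List<String> visible = new ArrayList<>();
        for (String id : fetchAllFileGroupIds(partition)) {
            if (!excludedFileGroupIds.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }
}

// Hypothetical in-memory subclass; a spillable or RocksDB-backed one would differ only in storage.
final class InMemoryViewSketch extends AbstractViewSketch {
    private final Map<String, List<String>> fileGroupsByPartition;

    InMemoryViewSketch(Map<String, List<String>> fileGroupsByPartition) {
        this.fileGroupsByPartition = fileGroupsByPartition;
    }

    @Override
    protected List<String> fetchAllFileGroupIds(String partition) {
        return fileGroupsByPartition.getOrDefault(partition, List.of());
    }
}
```

The trade-off is that the base class then has to pick one storage for the exclusion list, unless that storage is itself made pluggable.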
fd9291b to 873854b
873854b to 16650d4
vinothchandar
left a comment
High level approach LGTM. Can do a more thorough review as a follow up.
Can you clarify what the state transitions are for REPLACE? Would it be like compaction:
t1.replace.requested, t1.replace.inflight, t1.commit
or
t1.replace.requested, t1.replace.inflight, t1.replace?
"type": "record",
"name": "HoodieReplaceMetadata",
"fields": [
  {"name": "totalFilesReplaced", "type": "int"},
rename: totalFileSlicesReplaced
  {"name": "partitionMetadata", "type": {
    "type": "map", "values": {
      "type": "array",
      "items": "string"
I was expecting this to contain the actual file slices being replaced. Seems like we just want to have the partitions here?
So, in the approach I implemented, we will have both t1.replace and t1.commit files, i.e., t1.replace.requested, t1.replace.inflight, t1.replace, t1.commit. There are a few reasons for doing this:
In short, t1.replace and t1.commit together define the changes made during the t1 instant. After consolidated metadata lands, I think this can be simplified quite a bit. I discussed this with a few others offline and implemented this approach, but let me know if you think there is a better way to do this. It's still early stages and I'm happy to implement a cleaner approach if there is one.
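To make the file sequence concrete, here is a tiny sketch that lists the timeline files named above for one instant. `ReplaceInstantSketch`, `expectedFiles` and `isFullyDescribed` are hypothetical helpers, and the rule that both t1.replace and t1.commit must be present is my assumption based on "together define the changes"; the PR may handle this differently.

```java
import java.util.List;
import java.util.Set;

final class ReplaceInstantSketch {
    // File names for one replace instant, in the order given in the comment above.
    static List<String> expectedFiles(String instantTime) {
        return List.of(
            instantTime + ".replace.requested",
            instantTime + ".replace.inflight",
            instantTime + ".replace",
            instantTime + ".commit");
    }

    // Assumption: the instant is only fully described once both completed files exist.
    static boolean isFullyDescribed(String instantTime, Set<String> timelineFiles) {
        return timelineFiles.contains(instantTime + ".replace")
            && timelineFiles.contains(instantTime + ".commit");
    }

    public static void main(String[] args) {
        System.out.println(expectedFiles("t1"));
        // prints [t1.replace.requested, t1.replace.inflight, t1.replace, t1.commit]
    }
}
```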
… replaced files as part of archival
16650d4 to 1217882
Moved to #2048
What is the purpose of the pull request
Follow up on #1853
Use replace metadata to filter excluded files from views.
Changed the base views. If the general approach looks good, I can update the RocksDB and spillable view implementations.
Brief change log
Add new methods in Abstract view to filter files excluded by replace commits
Verify this pull request
Added unit tests
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.