[HUDI-571] Add 'commits show archived' command to CLI#1274
[HUDI-571] Add 'commits show archived' command to CLI#1274n3nash merged 1 commit intoapache:masterfrom
Conversation
|
@satishkotha can you please look into the failing build ? |
@n3nash Fixed now, please take a look |
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePartitionMetadata.java
Outdated
Show resolved
Hide resolved
n3nash
left a comment
There was a problem hiding this comment.
High level I'm okay with the approach, will review more deeply by tomorrow
There was a problem hiding this comment.
Any reason to remove if (archivedTimeline == null) check ?
There was a problem hiding this comment.
archivedTimeline as instance variable does not make sense because we are creating archive timeline for small time window (depending on command line arguments). So we will create mutiple archivedTimelines when someone is paginating through archived commit files. null check prevents us from doing that.
In the long term, when we have lazy loading, we can consider creating one instance of ArchivedTimeline that stores all metadata. we can bring back instance variable at that time. Until we have that, this does not make sense.
Let me know if you think there is a better way to organize this.
There was a problem hiding this comment.
Isn't the instance variable loaded lazily only when getArchivedTimeline() is called in which case subsequent cases can use that instance variable. I think the concern here is that the only method we expose now is getArchivedTimeline(String startTs, String endTs) and caching the instance variable from that doesn't make sense as you are saying as well. I'm wondering if we should keep an instance variable and keep the same method as before getArchivedTimeline() and then the constructor below is also light. Finally, we expose a method on the archived timeline to filter(startTs, endTs). This way we don't have to create multiple objects for each window invocation. WDYT ?
There was a problem hiding this comment.
I initially wanted to do that. But HoodieDefaultTimeline(base class) keeps track of instants in private class variable. And we do all filtering operations on top of that. So, if we reuse same HoodieArchivedTimeline object, we will likely have to change structure of how instants are kept in memory for base class. I can talk to you more in person tomorrow. But right now, i prefer creating multiple instances of archived timeline because its less error prone IMO.
There was a problem hiding this comment.
After talking in person, changed behavior to create single instance of "HoodieArchivedTimeline" and load metadata for all commits.
Note that this means we read all archived files first. Then do a second pass for details of commits in specific time range. This increases overall time taken by first command. In the example dataset, it took ~20 minutes for initial metadata to load.
Then subsequent commands are few seconds each.
With previous approach we only do one pass on files. All commands are few seconds each.
So I think we need to improve metadata to reduce time taken by first step with new approach.
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTimeline.java
Outdated
Show resolved
Hide resolved
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
Outdated
Show resolved
Hide resolved
n3nash
left a comment
There was a problem hiding this comment.
Thanks, took a second pass, left some more comments, once addressed we should be able to merge this
n3nash
left a comment
There was a problem hiding this comment.
@satishkotha left 1 high level comment on the structuring of the constructor, if we can come to a conclusion on that, will merge as rest looks good.
There was a problem hiding this comment.
You can use a defaultValue during the cliOption itself so you don't have to use these checks ?
There was a problem hiding this comment.
CLIOption attribute can only be assigned a static value. i think default static value is not that friendly. I chose to pick '10 days before today' which is likely to be commonly used. Let me know if you prefer default constant date instead
There was a problem hiding this comment.
nit : s/removeInstantDetailsFromMemory/clearInstantDetailsFromMemory
There was a problem hiding this comment.
@vinothchandar can you ack that this is okay to move to this class ? This way the ArchivedTimeline can also use these methods (and is aligned with our thoughts on having the same operations on archived and active timeline..)
There was a problem hiding this comment.
@vinothchandar just pinging to resurface this since it might be lost in the large number of notifications :)
There was a problem hiding this comment.
@n3nash can't think of anything top of my mind.. Should be fine.
n3nash
left a comment
There was a problem hiding this comment.
@satishkotha 2 minor comments, rest looks good to me, can merge after it's addressed. Also, once you've addressed the comments, could you please left a comment on the PR that you've addressed it ?
|
Addressed comments and made some more improvements. Please take a look |
|
@satishkotha one conflict, please rebase and then I can merge this |
ff43a04 to
596f329
Compare
|
@n3nash Fixed conflict. Note that test has been moved from hudi-common to hudi-client because sequence file based writing for archive log is no longer supported. Please take a look after Travis completes building |
There was a problem hiding this comment.
Please avoid this change as part of this diff
There was a problem hiding this comment.
Discussed offline. Without this we are not able to read certain archived commits
There was a problem hiding this comment.
same here, we can address this as part of some other refactoring diff
There was a problem hiding this comment.
Discussed offline. Without this we are not able to read certain archived commits
10f6622 to
2c94b17
Compare
|
Hi guys, it seems that there 's a little problem with the regex pattern Just raise a PR #4521 trying to fix it. Wish you're interested and help me review? Thanks a lot. |
What is the purpose of the pull request
Add command to show archived commits. This is useful for debugging historical timeline.
Brief change log
Other items planned (to be part of later diffs):
Verify this pull request
This pull request is already covered by existing tests, such as:
`hoodie:$dataset->commits show archived