-
Notifications
You must be signed in to change notification settings - Fork 587
HDDS-10042. Show IDs of under-replicated and unclosed containers for decommissioning nodes #5929
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| return underReplicated == 0 && unclosed == 0; | ||
| } | ||
|
|
||
| public Map<String, List<ContainerID>> containersReplicatedOnNode(DatanodeDetails dn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this method to something more appropriate, like getContainersPendingReplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also if it doesn't turn out to be too complicated, let's refactor the common code between this method and the one above to another common method. That way we don't have to maintain the same logic in two places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have refactored the common code, Could you please take a look at it now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactor looks good, but I think you forgot to change the method's name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review! It is fixed in the latest code. Could you please review it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still see the method name as containersReplicatedOnNode in the latest commit... am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have addressed the above in this pr: #6293
siddhantsangwan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Tejaskriya Thanks for working on this. The draft looks good. Please continue adding tests. It'll also be good to see a sample output of the command.
...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java
Outdated
Show resolved
Hide resolved
| return (containerOnDn.get("UnderReplicated").size() == 0) && (containerOnDn.get("UnClosed").size() == 0); | ||
| } | ||
|
|
||
| public Map<String, List<ContainerID>> getContainersReplicatedOnNode(TrackedNode dn, boolean updateMetrics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, the client is going to trigger a somewhat expensive operation on SCM. There is potential for multiple clients to issue these calls at the same time, and we don't really have a way to throttle it.
I think it would be better, if checkContainersReplicatedOnNode saved the counts and ID lists, perhaps as a map inside TrackedNode. Then getContainersReplicatedOnNode can simply retrive the value stored inside tracked node. Note that there will be a period of time where the node is scheduled for decommission, but it has not been checked yet, so we would need to decide what to return in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have followed your suggestion to save the lists of ContainerIDs in TrackedNode as a Map at the end of checkContainersReplicatedOnNode and getContainersReplicatedOnNode only retrieves this Map.
Currently, for the time period in which the node is scheduled for decommissioning but hasn't been checked yet, an empty map is shown in the place for the container IDs.
Could you please review the PR with these recent changes?
...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
Show resolved
Hide resolved
| } | ||
|
|
||
| public void setContainersReplicatedOnNode(List<ContainerID> underReplicated, List<ContainerID> unClosed) { | ||
| this.containersReplicatedOnNode.put("UnderReplicated", ImmutableList.copyOf(underReplicated)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to make a copy of the lists here. The list are local variable so nothing can change them after we exit the method, so making a copy just adds expense. It would be find to wrap them in an Immutable list however.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review! I have changed this to Collections.unmodifiableList(containerList)
...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java
Outdated
Show resolved
Hide resolved
...ools/src/main/java/org/apache/hadoop/hdds/scm/cli/datanode/DecommissionStatusSubCommand.java
Show resolved
Hide resolved
|
Looks largely good - just a few minor things to fix for the comments I left inline. |
| public Map<String, List<ContainerID>> getContainersOnDecomNode(DatanodeDetails dn) throws IOException { | ||
| try { | ||
| return scm.getScmDecommissionManager().getContainersReplicatedOnNode( | ||
| new DatanodeAdminMonitorImpl.TrackedNode(dn, 0L)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about changing this method's definition in NodeDecommissionManager to accept DatanodeDetails instead of TrackedNode? That way SCMClientProtocolServer or other users don't need to know about TrackedNode at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed it now to use DatanodeDetails instead. Only inside DatanodeAdminMonitor it uses TrackedNode to find the required node
...-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestDatanodeAdminMonitor.java
Outdated
Show resolved
Hide resolved
|
@sodonnel @siddhantsangwan Thank you for the reviews! I have addressed all your comments with my latest push. Could you please review it another round and approve the workflows if everything seems good to go? |
sodonnel
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. We can commit after green CI and if @siddhantsangwan is happy too.
|
Merging this now since CI is green and we have approvals. The minor name update can be taken care of in another Jira. Thanks for the code and reviews! |
…decommissioning nodes (apache#5929)
What changes were proposed in this pull request?
In order to see which containers are blocking the progress of the decommissioning of a datanode, a list of the IDs of under-replicated and unclosed container present in the datanode are required. We already have
ozone admin datanode status decommissioncommand to view which datanodes are currently in decommissioning. Adding this information as a part of this command will be helpful.The DatanodeAdminMonitor already creates these lists when the datanodes in DECOMMISSIONING were being checked. In this patch, these lists are stored as a part of TrackedNodes in DatanoedAdminMonitor and updated at the end of each iteration of DatanodeAdminMonitor. The command utilises an API to fetch these lists and display it.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10042
How was this patch tested?
Added a unit test in TestDatanodeAdminMonitor, and extended existing tests in TestDecommissionStatusSubCommand to check if the container lists are printed.
Also Tested locally in docker set-up: