Skip to content

Conversation

@Tejaskriya
Copy link
Contributor

@Tejaskriya Tejaskriya commented Jan 5, 2024

What changes were proposed in this pull request?

In order to see which containers are blocking the progress of the decommissioning of a datanode, a list of the IDs of under-replicated and unclosed container present in the datanode are required. We already have ozone admin datanode status decommission command to view which datanodes are currently in decommissioning. Adding this information as a part of this command will be helpful.
The DatanodeAdminMonitor already creates these lists when the datanodes in DECOMMISSIONING were being checked. In this patch, these lists are stored as a part of TrackedNodes in DatanoedAdminMonitor and updated at the end of each iteration of DatanodeAdminMonitor. The command utilises an API to fetch these lists and display it.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10042

How was this patch tested?

Added a unit test in TestDatanodeAdminMonitor, and extended existing tests in TestDecommissionStatusSubCommand to check if the container lists are printed.
Also Tested locally in docker set-up:

$ ozone admin datanode status decommission
Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: 940864be-ef3d-4d89-8a7b-b17a57626cca (/default-rack/172.20.0.10/ozone-datanode-5.ozone_default)
{UnderReplicated=[#5,#6], UnClosed=[#10]}

return underReplicated == 0 && unclosed == 0;
}

public Map<String, List<ContainerID>> containersReplicatedOnNode(DatanodeDetails dn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this method to something more appropriate, like getContainersPendingReplication.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if it doesn't turn out to be too complicated, let's refactor the common code between this method and the one above to another common method. That way we don't have to maintain the same logic in two places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have refactored the common code, Could you please take a look at it now?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactor looks good, but I think you forgot to change the method's name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review! It is fixed in the latest code. Could you please review it again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see the method name as containersReplicatedOnNode in the latest commit... am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed the above in this pr: #6293

Copy link
Contributor

@siddhantsangwan siddhantsangwan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tejaskriya Thanks for working on this. The draft looks good. Please continue adding tests. It'll also be good to see a sample output of the command.

@Tejaskriya Tejaskriya marked this pull request as ready for review January 17, 2024 09:00
return (containerOnDn.get("UnderReplicated").size() == 0) && (containerOnDn.get("UnClosed").size() == 0);
}

public Map<String, List<ContainerID>> getContainersReplicatedOnNode(TrackedNode dn, boolean updateMetrics)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change, the client is going to trigger a somewhat expensive operation on SCM. There is potential for multiple clients to issue these calls at the same time, and we don't really have a way to throttle it.

I think it would be better, if checkContainersReplicatedOnNode saved the counts and ID lists, perhaps as a map inside TrackedNode. Then getContainersReplicatedOnNode can simply retrive the value stored inside tracked node. Note that there will be a period of time where the node is scheduled for decommission, but it has not been checked yet, so we would need to decide what to return in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have followed your suggestion to save the lists of ContainerIDs in TrackedNode as a Map at the end of checkContainersReplicatedOnNode and getContainersReplicatedOnNode only retrieves this Map.
Currently, for the time period in which the node is scheduled for decommissioning but hasn't been checked yet, an empty map is shown in the place for the container IDs.
Could you please review the PR with these recent changes?

}

public void setContainersReplicatedOnNode(List<ContainerID> underReplicated, List<ContainerID> unClosed) {
this.containersReplicatedOnNode.put("UnderReplicated", ImmutableList.copyOf(underReplicated));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to make a copy of the lists here. The list are local variable so nothing can change them after we exit the method, so making a copy just adds expense. It would be find to wrap them in an Immutable list however.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review! I have changed this to Collections.unmodifiableList(containerList)

@sodonnel
Copy link
Contributor

Looks largely good - just a few minor things to fix for the comments I left inline.

public Map<String, List<ContainerID>> getContainersOnDecomNode(DatanodeDetails dn) throws IOException {
try {
return scm.getScmDecommissionManager().getContainersReplicatedOnNode(
new DatanodeAdminMonitorImpl.TrackedNode(dn, 0L));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about changing this method's definition in NodeDecommissionManager to accept DatanodeDetails instead of TrackedNode? That way SCMClientProtocolServer or other users don't need to know about TrackedNode at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed it now to use DatanodeDetails instead. Only inside DatanodeAdminMonitor it uses TrackedNode to find the required node

@Tejaskriya
Copy link
Contributor Author

@sodonnel @siddhantsangwan Thank you for the reviews! I have addressed all your comments with my latest push. Could you please review it another round and approve the workflows if everything seems good to go?

Copy link
Contributor

@sodonnel sodonnel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. We can commit after green CI and if @siddhantsangwan is happy too.

@adoroszlai adoroszlai changed the title HDDS-10042. Show container IDs for under-replicated and unclosed containers for decommissioning nodes HDDS-10042. Show IDs of under-replicated and unclosed containers for decommissioning nodes Jan 23, 2024
@siddhantsangwan
Copy link
Contributor

Merging this now since CI is green and we have approvals. The minor name update can be taken care of in another Jira. Thanks for the code and reviews!

@siddhantsangwan siddhantsangwan merged commit 0f5de57 into apache:master Jan 24, 2024
Tejaskriya added a commit to Tejaskriya/ozone that referenced this pull request Jan 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants