HDDS-10042. Show IDs of under-replicated and unclosed containers for decommissioning nodes #5929

Tejaskriya · 2024-01-05T07:07:01Z

What changes were proposed in this pull request?

To see which containers are blocking the decommissioning of a datanode, we need the IDs of the under-replicated and unclosed containers present on that datanode. The ozone admin datanode status decommission command already shows which datanodes are currently decommissioning, so adding this information to its output will be helpful.
DatanodeAdminMonitor already builds these lists when it checks the datanodes in the DECOMMISSIONING state. In this patch, the lists are stored as part of TrackedNode in DatanodeAdminMonitor and updated at the end of each iteration of DatanodeAdminMonitor. The command uses an API to fetch these lists and display them.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10042

How was this patch tested?

Added a unit test in TestDatanodeAdminMonitor and extended the existing tests in TestDecommissionStatusSubCommand to check that the container lists are printed.
Also tested locally in a Docker setup:

$ ozone admin datanode status decommission
Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: 940864be-ef3d-4d89-8a7b-b17a57626cca (/default-rack/172.20.0.10/ozone-datanode-5.ozone_default)
{UnderReplicated=[#5,#6], UnClosed=[#10]}

siddhantsangwan · 2024-01-09T14:02:40Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

    return underReplicated == 0 && unclosed == 0;
  }

+  public Map<String, List<ContainerID>> containersReplicatedOnNode(DatanodeDetails dn)


Let's rename this method to something more appropriate, like getContainersPendingReplication.

Also if it doesn't turn out to be too complicated, let's refactor the common code between this method and the one above to another common method. That way we don't have to maintain the same logic in two places.

I have refactored the common code. Could you please take a look at it now?

The refactor looks good, but I think you forgot to change the method's name.

Thank you for the review! It is fixed in the latest code. Could you please review it again?

I still see the method name as containersReplicatedOnNode in the latest commit... am I missing something?

I have addressed the above in this PR: #6293
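The refactor discussed in this thread can be illustrated with a minimal sketch: one helper collects the pending container IDs, and the boolean check is derived from it, so the classification logic lives in one place. The class, the State enum, and the use of Long IDs are hypothetical stand-ins, not the code in DatanodeAdminMonitorImpl; only the map keys "UnderReplicated" and "UnClosed" are taken from the diff.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not the actual DatanodeAdminMonitorImpl code. */
final class ReplicationCheckSketch {
  /** Assumed per-container state used only for this illustration. */
  enum State { HEALTHY, UNDER_REPLICATED, UNCLOSED }

  /** Single place that classifies containers and collects the pending IDs. */
  static Map<String, List<Long>> collectPending(Map<Long, State> containers) {
    List<Long> underReplicated = new ArrayList<>();
    List<Long> unClosed = new ArrayList<>();
    for (Map.Entry<Long, State> e : containers.entrySet()) {
      if (e.getValue() == State.UNDER_REPLICATED) {
        underReplicated.add(e.getKey());
      } else if (e.getValue() == State.UNCLOSED) {
        unClosed.add(e.getKey());
      }
    }
    Map<String, List<Long>> pending = new LinkedHashMap<>();
    pending.put("UnderReplicated", underReplicated);
    pending.put("UnClosed", unClosed);
    return pending;
  }

  /** The boolean check reuses the helper instead of repeating its logic. */
  static boolean isFullyReplicated(Map<Long, State> containers) {
    Map<String, List<Long>> pending = collectPending(containers);
    return pending.get("UnderReplicated").isEmpty()
        && pending.get("UnClosed").isEmpty();
  }
}
```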

siddhantsangwan

@Tejaskriya Thanks for working on this. The draft looks good. Please continue adding tests. It'll also be good to see a sample output of the command.

...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java

sodonnel · 2024-01-17T14:53:55Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

+    return (containerOnDn.get("UnderReplicated").size() == 0) && (containerOnDn.get("UnClosed").size() == 0);
+  }
+
+  public Map<String, List<ContainerID>> getContainersReplicatedOnNode(TrackedNode dn, boolean updateMetrics)


With this change, the client is going to trigger a somewhat expensive operation on SCM. There is potential for multiple clients to issue these calls at the same time, and we don't really have a way to throttle it.

I think it would be better if checkContainersReplicatedOnNode saved the counts and ID lists, perhaps as a map inside TrackedNode. Then getContainersReplicatedOnNode can simply retrieve the value stored inside the tracked node. Note that there will be a period of time where the node is scheduled for decommission but has not been checked yet, so we would need to decide what to return in that case.

I have followed your suggestion: the lists of ContainerIDs are now saved in TrackedNode as a Map at the end of checkContainersReplicatedOnNode, and getContainersReplicatedOnNode only retrieves this Map.
Currently, while a node is scheduled for decommissioning but has not been checked yet, an empty map is shown in place of the container IDs.
Could you please review the PR with these recent changes?
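A minimal sketch of the caching approach described in this thread, assuming a simplified tracked-node class: the monitor publishes a snapshot at the end of each pass, readers only fetch it, and a node that has not been checked yet yields an empty map. TrackedNodeSketch, its method names, the Long IDs, and the volatile field are assumptions made for illustration and are not the PR's implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the tracked-node state discussed above. */
final class TrackedNodeSketch {
  // Starts empty, which covers the window before the first monitor pass.
  // Replaced wholesale on each pass; volatile (an assumption of this sketch)
  // lets reader threads see the latest published snapshot.
  private volatile Map<String, List<Long>> containers = Collections.emptyMap();

  /** Called by the monitor at the end of a check; cheap for later readers. */
  void setContainers(List<Long> underReplicated, List<Long> unClosed) {
    Map<String, List<Long>> snapshot = new HashMap<>();
    snapshot.put("UnderReplicated", Collections.unmodifiableList(underReplicated));
    snapshot.put("UnClosed", Collections.unmodifiableList(unClosed));
    containers = Collections.unmodifiableMap(snapshot);
  }

  /** Called on behalf of clients; does no container scanning at all. */
  Map<String, List<Long>> getContainers() {
    return containers;
  }
}
```

With this shape, any number of concurrent client calls cost only a field read, which addresses the throttling concern raised above.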

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

sodonnel · 2024-01-22T12:37:12Z

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

+    }
+
+    public void setContainersReplicatedOnNode(List<ContainerID> underReplicated, List<ContainerID> unClosed) {
+      this.containersReplicatedOnNode.put("UnderReplicated", ImmutableList.copyOf(underReplicated));


I don't think we need to make a copy of the lists here. The lists are local variables, so nothing can change them after we exit the method, and making a copy just adds expense. It would be fine to wrap them in an immutable list, however.

Thank you for the review! I have changed this to Collections.unmodifiableList(containerList)
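The difference behind this comment can be sketched with the JDK alone: Collections.unmodifiableList returns a read-only view over the same backing list without copying elements, whereas Guava's ImmutableList.copyOf generally copies them. The class and helper names below are hypothetical, and Long stands in for ContainerID.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical demo of a read-only view versus a copy. */
final class UnmodifiableViewSketch {
  /** Wraps the given list in a read-only view; no elements are copied. */
  static List<Long> wrap(List<Long> ids) {
    return Collections.unmodifiableList(ids);
  }

  /** Returns true if the list rejects an attempted write. */
  static boolean rejectsWrites(List<Long> ids) {
    try {
      ids.add(-1L);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    List<Long> local = new ArrayList<>();
    local.add(5L);
    local.add(6L);
    List<Long> view = wrap(local);
    System.out.println(rejectsWrites(view));
    // The view tracks the backing list, which is why wrapping is only safe
    // when nothing else keeps a reference to that list.
    local.add(10L);
    System.out.println(view.size());
  }
}
```

Because the view reflects later changes to the backing list, this is safe in the reviewed method only for the reason given above: the lists are local and no other reference to them survives.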

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

...ools/src/main/java/org/apache/hadoop/hdds/scm/cli/datanode/DecommissionStatusSubCommand.java

sodonnel · 2024-01-22T12:50:56Z

Looks largely good - just a few minor things to fix for the comments I left inline.

siddhantsangwan · 2024-01-23T07:02:08Z

...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java

+  public Map<String, List<ContainerID>> getContainersOnDecomNode(DatanodeDetails dn) throws IOException {
+    try {
+      return scm.getScmDecommissionManager().getContainersReplicatedOnNode(
+          new DatanodeAdminMonitorImpl.TrackedNode(dn, 0L));


What do you think about changing this method's definition in NodeDecommissionManager to accept DatanodeDetails instead of TrackedNode? That way SCMClientProtocolServer or other users don't need to know about TrackedNode at all.

I have changed it now to use DatanodeDetails instead. TrackedNode is used only inside DatanodeAdminMonitor to find the required node.
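A minimal sketch of the encapsulation discussed here, assuming the monitor indexes its internal tracking objects by datanode UUID: callers pass only the datanode identity, and the tracking type never appears in a public signature. All names below are hypothetical except getContainersPendingReplication, which is the name suggested earlier in this review; a String UUID stands in for DatanodeDetails.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; not the actual DatanodeAdminMonitorImpl code. */
final class AdminMonitorSketch {
  /** Internal bookkeeping type, invisible to callers. */
  private static final class Tracked {
    private final long startTime;
    private volatile Map<String, List<Long>> pending = Collections.emptyMap();

    Tracked(long startTime) {
      this.startTime = startTime;
    }
  }

  private final Map<String, Tracked> trackedByUuid = new ConcurrentHashMap<>();

  void startTracking(String datanodeUuid, long startTime) {
    trackedByUuid.putIfAbsent(datanodeUuid, new Tracked(startTime));
  }

  void recordPending(String datanodeUuid, Map<String, List<Long>> pending) {
    Tracked tracked = trackedByUuid.get(datanodeUuid);
    if (tracked != null) {
      tracked.pending = pending;
    }
  }

  /** Callers identify the node only; no dummy tracking object is needed. */
  Map<String, List<Long>> getContainersPendingReplication(String datanodeUuid) {
    Tracked tracked = trackedByUuid.get(datanodeUuid);
    if (tracked == null) {
      return Collections.emptyMap();
    }
    return tracked.pending;
  }
}
```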

...-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestDatanodeAdminMonitor.java

Tejaskriya · 2024-01-23T09:56:45Z

@sodonnel @siddhantsangwan Thank you for the reviews! I have addressed all your comments in my latest push. Could you please do another round of review and approve the workflows if everything looks good to go?

sodonnel

LGTM. We can commit after green CI and if @siddhantsangwan is happy too.

siddhantsangwan · 2024-01-24T05:44:39Z

Merging this now since CI is green and we have approvals. The minor name update can be taken care of in another Jira. Thanks for the code and reviews!

HDDS-10042. Show container IDs for under-replicated and unclosed containers

8c5fe3e

siddhantsangwan reviewed Jan 9, 2024

tejaskriya and others added 3 commits January 16, 2024 14:13

Refactor common code

3675e80

Checkstyle fix

743390b

Merge branch 'apache:master' into HDDS-10042

98f6e65

Tejaskriya marked this pull request as ready for review January 17, 2024 09:00

sodonnel reviewed Jan 17, 2024

...hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/SCMClientProtocolServer.java

sodonnel reviewed Jan 17, 2024

tejaskriya added 6 commits January 22, 2024 09:56

Merge remote-tracking branch 'origin' into HDDS-10042

0ffc941

Not exposing monitor and adding test case

42b4aa4

Not exposing monitor and adding test case

a8afded

Store ids in trackedNode

01accb4

Remove unneeded parameter for getContainers method

8edd010

findbugs fix

53c54a7

sodonnel reviewed Jan 22, 2024

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

sodonnel reviewed Jan 22, 2024

...-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java

sodonnel reviewed Jan 22, 2024

...ools/src/main/java/org/apache/hadoop/hdds/scm/cli/datanode/DecommissionStatusSubCommand.java

siddhantsangwan reviewed Jan 23, 2024

...-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/node/TestDatanodeAdminMonitor.java

tejaskriya added 2 commits January 23, 2024 14:06

Address DatanodeAdminMonitor review comments and extend tests in TestDecommissionStatusSubCommand

4d5d2da

Changing method parameter from TrackedNode to DatanodeDetails

fadf3c4

sodonnel approved these changes Jan 23, 2024

adoroszlai changed the title ~~HDDS-10042. Show container IDs for under-replicated and unclosed containers for decommissioning nodes~~ HDDS-10042. Show IDs of under-replicated and unclosed containers for decommissioning nodes Jan 23, 2024

siddhantsangwan merged commit 0f5de57 into apache:master Jan 24, 2024

Tejaskriya added a commit to Tejaskriya/ozone that referenced this pull request Jan 24, 2024

HDDS-10042. Show IDs of under-replicated and unclosed containers for decommissioning nodes (apache#5929)

a322aaf
