HDDS-9738. Display startTime, pipeline and container counts for decommissioning datanode #6185

Tejaskriya · 2024-02-07T09:02:13Z

What changes were proposed in this pull request?

To track the progress of a datanode's decommissioning, the decommission status command needs to show the number of pipelines associated with the datanode and the number of containers on the datanode that are blocking the decommissioning (i.e., unhealthy and under-replicated containers).
These counts, along with the time at which decommissioning was started for the datanode, are stored as metrics in NodeDecommissionMetrics. This PR introduces a class in scm-server, similar to JMXJsonServlet (from hadoop-common), that can accept a request for the metrics of a specific class. The response is parsed to display the counts and start time for each node currently in DECOMMISSIONING.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9738

How was this patch tested?

Updated tests in TestDecommissionStatusSubCommand. Also tested in a docker cluster.
When metrics haven't been updated yet:

bash-4.2$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: 554a15ca-5530-4d4a-8b3a-10d475725fe1 (/default-rack/172.19.0.11/ozone-datanode-1.ozone_default)
Error getting pipeline and container counts for ozone-datanode-1.ozone_default
{}

When metrics have been updated:

bash-4.2$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: 554a15ca-5530-4d4a-8b3a-10d475725fe1 (/default-rack/172.19.0.11/ozone-datanode-1.ozone_default)
Decommission started at : 13/02/2024 05:28:28 UTC
No. of Pipelines: 1
No. of UnderReplicated containers: 0.0
No. of Unclosed Containers: 0.0
{}
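The sample output above can be reproduced with a small formatting sketch. This is not the code from the PR: the class and method names are hypothetical, and it assumes the start time arrives as epoch milliseconds and the container counts as doubles (which would explain why they print as 0.0).

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the PR.
class DecommissionStatusFormatSketch {

  // Same layout as the sample output: dd/MM/yyyy HH:mm:ss followed by a literal "UTC".
  private static final DateTimeFormatter START_TIME_FORMAT =
      DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss 'UTC'").withZone(ZoneOffset.UTC);

  // Assumption: the metric stores the start time as epoch milliseconds.
  static String formatStartTime(long epochMillis) {
    return START_TIME_FORMAT.format(Instant.ofEpochMilli(epochMillis));
  }

  // Builds the four status lines shown for one decommissioning datanode.
  static List<String> formatStatus(long startMillis, int pipelines,
                                   double underReplicated, double unclosed) {
    List<String> lines = new ArrayList<>();
    lines.add("Decommission started at : " + formatStartTime(startMillis));
    lines.add("No. of Pipelines: " + pipelines);
    // Double values are concatenated as-is, so a zero count renders as "0.0".
    lines.add("No. of UnderReplicated containers: " + underReplicated);
    lines.add("No. of Unclosed Containers: " + unclosed);
    return lines;
  }

  public static void main(String[] args) {
    for (String line : formatStatus(1707802108000L, 1, 0.0, 0.0)) {
      System.out.println(line);
    }
  }
}
```

If the counts were cast to a long before printing, they would render as 0 rather than 0.0; whether that is desirable is outside the scope of this sketch.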


sodonnel · 2024-02-12T12:33:33Z

...ools/src/main/java/org/apache/hadoop/hdds/scm/cli/datanode/DecommissionStatusSubCommand.java

          decommissioningNodes.size() + " node(s)");
    }

+    String metricsJson = scmClient.getMetrics("Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics");


This is nice: you can filter the metrics server side with the query string. I thought we would have to do that client side, but this is better.
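To illustrate what filtering by bean name on the server side can look like, here is a minimal sketch using only the JDK's javax.management API. It is not the class added in this PR; `MetricsQuerySketch` and `readBeans` are hypothetical names. Because the NodeDecommissionMetrics bean exists only inside a running SCM, the demo queries the JVM's built-in `java.lang:type=Runtime` bean instead.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper, for illustration only; not the class introduced by the PR.
class MetricsQuerySketch {

  // Returns the readable attributes of every MBean matching the query, keyed by bean name.
  static Map<String, Map<String, Object>> readBeans(String query) {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    Map<String, Map<String, Object>> beans = new TreeMap<>();
    try {
      // The ObjectName acts as the filter: only matching beans are returned.
      for (ObjectName name : server.queryNames(new ObjectName(query), null)) {
        Map<String, Object> attributes = new TreeMap<>();
        for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
          if (!info.isReadable()) {
            continue;
          }
          try {
            attributes.put(info.getName(), server.getAttribute(name, info.getName()));
          } catch (Exception e) {
            // Some attributes cannot be read on every JVM; skip them.
          }
        }
        beans.put(name.getCanonicalName(), attributes);
      }
    } catch (JMException e) {
      throw new IllegalArgumentException("Cannot query MBeans for " + query, e);
    }
    return beans;
  }

  public static void main(String[] args) {
    // Built-in bean, present in every JVM.
    Map<String, Map<String, Object>> runtime = readBeans("java.lang:type=Runtime");
    System.out.println(runtime.get("java.lang:type=Runtime").get("VmName"));

    // The name from the diff above; outside an SCM process nothing is registered under it.
    String scmQuery = "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics";
    System.out.println(readBeans(scmQuery).size());
  }
}
```

The client only receives the beans it asked for, so it does not need to download and sift through the full metrics dump.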

sodonnel · 2024-02-12T12:34:43Z

The change LGTM. Have you tried it out in a docker-compose cluster and validated that it all works fine when one node is decommissioning, when multiple nodes are decommissioning, and when none are decommissioning?

Tejaskriya · 2024-02-13T06:49:37Z

I have tested it for all three cases in docker-compose:
Case-1: no nodes

bash-4.2$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 0 node(s)

Case-2: 1 node (first output is when metrics are not available yet, second is once the metrics are updated)

bash-4.2$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: 554a15ca-5530-4d4a-8b3a-10d475725fe1 (/default-rack/172.19.0.11/ozone-datanode-1.ozone_default)
Error getting pipeline and container counts for ozone-datanode-1.ozone_default
{}

bash-4.2$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: 554a15ca-5530-4d4a-8b3a-10d475725fe1 (/default-rack/172.19.0.11/ozone-datanode-1.ozone_default)
Decommission started at : 13/02/2024 05:28:28 UTC
No. of Pipelines: 1
No. of UnderReplicated containers: 0.0
No. of Unclosed Containers: 0.0
{}

Case-3: 2 nodes decommissioning

bash-4.2$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 2 node(s)

Datanode: 1a067845-b5a2-4f2a-b1c8-70a2140173ee (/default-rack/172.23.0.8/ozone-datanode-2.ozone_default)
Decommission started at : 13/02/2024 06:31:31 UTC
No. of Pipelines: 1
No. of UnderReplicated containers: 0.0
No. of Unclosed Containers: 0.0
{}

Datanode: 2df9e226-3e04-404c-8836-986231ab2b82 (/default-rack/172.23.0.10/ozone-datanode-1.ozone_default)
Decommission started at : 13/02/2024 06:31:27 UTC
No. of Pipelines: 2
No. of UnderReplicated containers: 0.0
No. of Unclosed Containers: 0.0
{}

(The empty braces at the end of each output are for the container lists; they can be ignored for this PR.)
The results are the same for secure and HA clusters as well.

tejaskriya added 7 commits February 6, 2024 11:04

HDDS-9738. Display pipeline and container counts for the decommission…

29fad58

…ing Datanode

Add license comment block

98d18f0

FindBugs fix

ea82eef

checkstyle fix

00aaafd

Move FetchMetrics to hdds.scm package

08f6e3f

Fix log messages and remove attribute filter

0b788de

Parse metrics and display cleanly

c30f97d

Tejaskriya changed the title ~~HDDS-9738.~~ HDDS-9738. Display startTime, pipeline and container counts for decommissioning datanode Feb 7, 2024

Add tests to TestDecommissionStatusSubCommand

e6a6a8d

Tejaskriya marked this pull request as ready for review February 8, 2024 18:53

Tejaskriya mentioned this pull request Feb 9, 2024

HDDS-9738. Display startTime, pipeline and container counts for decommissioning datanode #6083

Closed

checkstyle fix

390becb

sodonnel reviewed Feb 12, 2024

sodonnel approved these changes Feb 13, 2024

sodonnel merged commit 3c4683e into apache:master Feb 13, 2024
