@Tejaskriya
Contributor

What changes were proposed in this pull request?

To track the progress of a datanode's decommissioning, the decommission status command needs to show the number of pipelines associated with the datanode and the number of containers on it that are blocking decommissioning (i.e., unhealthy and under-replicated containers).
These counts, along with the time at which decommissioning started for the datanode, are stored as metrics in NodeDecommissionMetrics. In this PR, the JMX endpoint for SCM is queried for the NodeDecommissionMetrics class, and the response is parsed to display the counts and start time for each node currently in DECOMMISSIONING.
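
For readers unfamiliar with the JMX HTTP endpoint, here is a minimal sketch of the query-and-parse idea. It is not the code from this PR. The bean name, the port 9876, and the attribute names (`examplePipelineCount`, `exampleUnderReplicated`) are assumptions chosen for illustration, and the regex extraction stands in for a real JSON parser.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JmxMetricsSketch {

  // Builds a /jmx URI that asks the servlet for one bean via the "qry" parameter.
  static URI buildQueryUri(String host, int port, String beanName) {
    String encoded = URLEncoder.encode(beanName, StandardCharsets.UTF_8);
    return URI.create("http://" + host + ":" + port + "/jmx?qry=" + encoded);
  }

  // Extracts one integer attribute from the JSON body; empty if it is absent.
  static OptionalLong extractLong(String json, String attribute) {
    Pattern p = Pattern.compile("\"" + Pattern.quote(attribute) + "\"\\s*:\\s*(-?\\d+)");
    Matcher m = p.matcher(json);
    return m.find() ? OptionalLong.of(Long.parseLong(m.group(1))) : OptionalLong.empty();
  }

  public static void main(String[] args) {
    // Assumed bean name and port; adjust to the actual SCM deployment.
    URI uri = buildQueryUri("localhost", 9876,
        "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics");
    System.out.println(uri);

    // Hypothetical response body with made-up attribute names.
    String json = "{\"beans\":[{\"examplePipelineCount\":1,\"exampleUnderReplicated\":2}]}";
    System.out.println(extractLong(json, "examplePipelineCount"));
  }
}
```

The sketch only builds the request URI and parses a canned body, so it does not address authentication or choosing which SCM to query.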

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9738

How was this patch tested?

Tested locally in docker set-up:

$ ozone admin datanode status decommission

Decommission Status: DECOMMISSIONING - 1 node(s)

Datanode: e56afcce-f5b5-4980-8b1b-d55a5714ad3c (/default-rack/172.21.0.9/ozone-datanode-4.ozone_default)
Decommission started at : 170558246119118/01/2024 12:54:21 UTC
No. of Pipelines: 1
No. of UnderReplicated containers: 2
No. of Unclosed Containers: 1
{UnderReplicated=[#5,#6], UnClosed=[#10]}

@sodonnel
Contributor

This is still marked draft - is it ready for review now?

@Tejaskriya Tejaskriya marked this pull request as ready for review January 27, 2024 09:43
@Tejaskriya
Contributor Author

@sodonnel yes, it is ready for review now

@adoroszlai
Contributor

@Tejaskriya there is one more findbugs error:

H I Dm: Found reliance on default encoding in org.apache.hadoop.hdds.scm.cli.datanode.TestDecommissionStatusSubCommand$1.handle(HttpExchange): String.getBytes()  At TestDecommissionStatusSubCommand.java:[line 87]

https://github.com/Tejaskriya/ozone/actions/runs/7650664460/job/20847133955
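
The usual fix for this class of findbugs warning is to pass an explicit charset to `String.getBytes`. A minimal sketch follows; the class and method names are hypothetical and this is not the actual test code at line 87.

```java
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetSketch {

  // Encodes the response body with a fixed charset instead of the platform default.
  static byte[] toUtf8(String body) {
    return body.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] bytes = toUtf8("{\"beans\":[]}");
    System.out.println(bytes.length + " bytes");
  }
}
```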

InputStream inputStream;
int errorCode;

if (webPolicy.isHttpsEnabled()) {
Contributor

This is a good idea, to pull the metrics rather than having a special command for getting these details.

Does this work if Kerberos is enabled and the SCM webUI has kerberos authentication enabled too?

Also, what about HA SCM? We need to get the metrics from the active SCM, not the standbys as they will not have the correct metrics.

Contributor Author

To solve these issues, as you suggested during our discussions, a better approach would be to have something in Ozone similar to JMXJsonServlet from the hadoop-common library, which can return any filtered metrics through gRPC calls. This way we avoid the problems that come with HTTP calls, such as handling security and finding the SCM leader to get the right metrics. We would still be getting the metrics from the MBeanServer, without adding any significant overhead in SCM.
I have raised this PR with this new approach: #6185
Please do review it. Thank you!
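
As a rough illustration of reading filtered metrics straight from the MBeanServer in-process, here is a minimal sketch using only the standard `javax.management` API. It is not the code from #6185; the class name is hypothetical, and it queries the JVM's own `java.lang:type=Runtime` bean because that bean exists in every JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.JMException;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServer;
import javax.management.ObjectName;

final class MBeanFilterSketch {

  // Returns readable attributes of every MBean matching the pattern,
  // keyed by canonical bean name and then by attribute name.
  static Map<String, Map<String, Object>> readAttributes(String namePattern) {
    MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    Map<String, Map<String, Object>> result = new TreeMap<>();
    try {
      for (ObjectName name : server.queryNames(new ObjectName(namePattern), null)) {
        Map<String, Object> attrs = new TreeMap<>();
        for (MBeanAttributeInfo info : server.getMBeanInfo(name).getAttributes()) {
          if (!info.isReadable()) {
            continue;
          }
          try {
            attrs.put(info.getName(), server.getAttribute(name, info.getName()));
          } catch (Exception e) {
            // Some attributes are unsupported on a given JVM; skip them.
          }
        }
        result.put(name.getCanonicalName(), attrs);
      }
    } catch (JMException e) {
      throw new IllegalArgumentException("Cannot query MBeans for " + namePattern, e);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Object> runtime =
        readAttributes("java.lang:type=Runtime").get("java.lang:type=Runtime");
    System.out.println("Uptime (ms): " + runtime.get("Uptime"));
  }
}
```

A server-side handler could return such a map over whatever RPC layer is in use, so the client would not need to make HTTP calls at all.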

@Tejaskriya Tejaskriya closed this Feb 9, 2024