HDDS-2860. Cluster disk space metrics should reflect decommission and maintenance states #433

sodonnel · 2020-01-10T17:55:27Z

This needs HDDS-2113 committed before this one.

What changes were proposed in this pull request?

Now that we have decommission states, we need to adjust the cluster capacity, space used and space available metrics that are exposed via JMX.

For a decommissioning node, the space used on the node effectively needs to be transferred to other nodes via container replication before decommission can complete, but this is difficult to track from a space usage perspective. When a node completes decommission, we can assume it provides no capacity to the cluster and uses none. Therefore, for decommissioning + decommissioned nodes, the simplest calculation is to exclude the node completely, in a similar way to a dead node.

For maintenance nodes, things are even less clear. A maintenance node is read-only, so it cannot provide capacity to the cluster, but it is expected to return to service, so excluding it completely probably does not make sense. However, perhaps the simplest solution is to do the following:

For any node not IN_SERVICE, do not include its usage or space in the cluster capacity totals.
Introduce some new metrics to account for the maintenance and perhaps decommission capacity, so it is not lost, e.g.:

# Existing metrics
"DiskCapacity" : 62725623808,
"DiskUsed" : 4096,
"DiskRemaining" : 50459619328,

# Suggested additional new ones, with the above only considering IN_SERVICE nodes:
"MaintenanceDiskCapacity" : 0,
"MaintenanceDiskUsed" : 0,
"MaintenanceDiskRemaining" : 0,
"DecommissionedDiskCapacity" : 0,
"DecommissionedDiskUsed" : 0,
"DecommissionedDiskRemaining" : 0,
...

That way, the cluster totals cover only what is currently "online", but we have the other metrics to track what has been removed etc. The key advantage of this is that it is easy to understand.
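The aggregation rule proposed above can be sketched roughly as follows. This is an illustration only, not the code in this patch: the `NodeStat` type, the `disk_metrics` helper, and the exact set of operational state names in `PREFIX_BY_STATE` are assumptions made for the example, and it assumes dead nodes have already been filtered out of the input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStat:
    """Hypothetical per-node disk report; not a type from the patch."""
    op_state: str
    capacity: int
    used: int
    remaining: int


# Assumed mapping from operational state to metric-name prefix.
# IN_SERVICE nodes feed the existing, unprefixed cluster totals.
PREFIX_BY_STATE = {
    "IN_SERVICE": "",
    "ENTERING_MAINTENANCE": "Maintenance",
    "IN_MAINTENANCE": "Maintenance",
    "DECOMMISSIONING": "Decommissioned",
    "DECOMMISSIONED": "Decommissioned",
}

SUFFIXES = ("DiskCapacity", "DiskUsed", "DiskRemaining")


def disk_metrics(nodes):
    """Sum disk stats into one bucket per metric prefix."""
    # Pre-seed every metric so that empty buckets are reported as 0.
    totals = {
        prefix + suffix: 0
        for prefix in set(PREFIX_BY_STATE.values())
        for suffix in SUFFIXES
    }
    for node in nodes:
        prefix = PREFIX_BY_STATE[node.op_state]
        totals[prefix + "DiskCapacity"] += node.capacity
        totals[prefix + "DiskUsed"] += node.used
        totals[prefix + "DiskRemaining"] += node.remaining
    return totals


if __name__ == "__main__":
    sample = [
        NodeStat("IN_SERVICE", 100, 10, 90),
        NodeStat("IN_MAINTENANCE", 50, 5, 45),
        NodeStat("DECOMMISSIONING", 30, 30, 0),
    ]
    for name, value in sorted(disk_metrics(sample).items()):
        print(f'"{name}" : {value}')
```

With this shape, the unprefixed totals never include a node that is not IN_SERVICE, while the capacity of such nodes remains visible under its own prefix.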

There could also be an argument that the new DecommissionedDisk* metrics are not needed, as that capacity is technically lost from the cluster forever.

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-2860

How was this patch tested?

An additional unit test was added, along with manual inspection of the new metrics.

elek

+1, thanks for the patch @sodonnel

One note: instead of using cross-product names, we can also use tags:

disk_used{type="ssd", usageState="EnteringMaintenance", state="Healthy"}
disk_remaining{type="ssd", usageState="EnteringMaintenance", state="Healthy"}

Hadoop metrics and Prometheus both support tags, but it might require a bigger refactor. (And I am not sure if it solves the aggregation problem. Do we need an aggregated, cluster-wide value?)
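To make the tag idea and the aggregation question concrete, here is a small sketch in plain Python with no metrics library involved. The helpers `render` and `sum_by` and the sample values are made up for illustration, and the label value "InService" is an assumption. It shows that a cluster-wide value can still be derived from tagged samples by summing across label values, which is what a query such as `sum(disk_used)` or `sum by (usageState) (disk_used)` would do on the Prometheus side.

```python
from collections import defaultdict


def render(name, labels, value):
    """Format one sample in the tagged style shown above."""
    label_text = ", ".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_text}}} {value}"


def sum_by(samples, keep=()):
    """Sum sample values, grouping only by the label names in `keep`."""
    totals = defaultdict(int)
    for labels, value in samples:
        key = tuple((k, labels[k]) for k in keep)
        totals[key] += value
    return dict(totals)


if __name__ == "__main__":
    # Made-up disk_used samples, one per (type, usageState, state) combination.
    disk_used = [
        ({"type": "ssd", "usageState": "InService", "state": "Healthy"}, 4096),
        ({"type": "ssd", "usageState": "EnteringMaintenance", "state": "Healthy"}, 1024),
    ]
    for labels, value in disk_used:
        print(render("disk_used", labels, value))
    print(sum_by(disk_used))                        # one cluster-wide total
    print(sum_by(disk_used, keep=("usageState",)))  # one total per usageState
```

The trade-off is that the aggregation moves from the metrics source to whoever queries it, so a JMX consumer without a query layer would need to do the summing itself.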

But let's fix the problem first with committing this patch...

elek

+1, thanks for the patch.

See my comments under #433

S O'Donnell added 4 commits January 10, 2020 17:44

Report the number of nodes the in_service, decommission and maintenance states seperately in JMX and Prom metrics

21d46ad

Fixed failing integration test

f23408f

Introduce the new disk metrics for decommission and maintenance nodes

441df2a

Fixed compile issue caused by rebase

2ad92c0

elek changed the title ~~Hdds 2860 Cluster disk space metrics should reflect decommission and maintenance states~~ HDDS-2860. Cluster disk space metrics should reflect decommission and maintenance states Feb 10, 2020

elek approved these changes Feb 10, 2020

elek pushed a commit that referenced this pull request Feb 10, 2020

HDDS-2860. Cluster disk space metrics should reflect decommission and maintenance states #433

9506aa9

elek closed this Feb 18, 2020
