HDDS-2860. Cluster disk space metrics should reflect decommission and maintenance states #433
This needs HDDS-2113 committed before this one.
What changes were proposed in this pull request?
Now that we have decommission states, we need to adjust the cluster capacity, space used, and space available metrics that are exposed via JMX.
For a decommissioning node, the space used on the node effectively needs to be transferred to other nodes via container replication before decommission can complete, but this is difficult to track from a space usage perspective. When a node completes decommission, we can assume it provides no capacity to the cluster and uses none. Therefore, for decommissioning and decommissioned nodes, the simplest calculation is to exclude the node completely, in the same way as a dead node.
For maintenance nodes, things are even less clear. A maintenance node is read-only, so it cannot provide capacity to the cluster, but it is expected to return to service, so excluding it completely probably does not make sense. However, perhaps the simplest solution is to do the following:
That way, the cluster totals are only what is currently "online", but we have the other metrics to track what has been removed, etc. The key advantage of this is that it is easy to understand.
There could also be an argument that the new decommissionedDisk metrics are not needed as that capacity is technically lost from the cluster forever.
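As a rough illustration of the idea above, the following is a minimal sketch of how space could be aggregated per operational state, so that the cluster totals only cover nodes that are currently online. It is not the code in this patch: the class `ClusterSpaceSketch`, the `OpState` enum, and the `NodeUsage` and `SpaceTotals` types are hypothetical names invented for the example. The split into separate maintenance and decommission buckets is an assumption inferred from the description, as is the choice to drop dead nodes from every bucket.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class ClusterSpaceSketch {

  // Hypothetical operational states, named for illustration only.
  // DECOMMISSION covers both decommissioning and decommissioned nodes.
  enum OpState { IN_SERVICE, MAINTENANCE, DECOMMISSION }

  // Hypothetical per-node usage report, in bytes.
  static final class NodeUsage {
    final OpState state;
    final boolean healthy;
    final long capacity;
    final long used;

    NodeUsage(OpState state, boolean healthy, long capacity, long used) {
      this.state = state;
      this.healthy = healthy;
      this.capacity = capacity;
      this.used = used;
    }
  }

  // Running totals for one operational state.
  static final class SpaceTotals {
    long capacity;
    long used;

    long remaining() {
      return capacity - used;
    }
  }

  // Sums capacity and used space per operational state.
  // Assumption: dead (unhealthy) nodes are skipped in every bucket.
  static Map<OpState, SpaceTotals> aggregate(List<NodeUsage> nodes) {
    Map<OpState, SpaceTotals> totals = new EnumMap<>(OpState.class);
    for (OpState state : OpState.values()) {
      totals.put(state, new SpaceTotals());
    }
    for (NodeUsage node : nodes) {
      if (!node.healthy) {
        continue;
      }
      SpaceTotals bucket = totals.get(node.state);
      bucket.capacity += node.capacity;
      bucket.used += node.used;
    }
    return totals;
  }
}
```

In this sketch the `IN_SERVICE` bucket would back the cluster-wide capacity, used and available figures, while the other two buckets would back separate metrics that track what has been taken out of service.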
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-2860
How was this patch tested?
An additional unit test was added, and the new metrics were inspected manually.