HDDS-13093. Add metrics for the cumulative state of volumes #8609
errose28
left a comment
Thanks for adding this @ptlrs. Can you add some tests as well?
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
…trics-to-count-volumes-by-health-state-per-datanode # Conflicts: # hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
Hi @errose28 @Tejaskriya, I have updated the PR with some new tests and refactoring. Could you please take a look?
errose28
left a comment
Thanks for the updates @ptlrs.
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...e/src/test/java/org/apache/hadoop/ozone/container/common/volume/TestVolumeHealthMetrics.java
…rtions into TestVolumeSet.
Tejaskriya
left a comment
The changes look good to me overall @ptlrs, just one suggestion. Could we add some metrics assertions in TestPeriodicVolumeChecker, so that some tests tie volume scans to these metrics too?
Thanks for the updates, left some comments on the latest changes. Looks like there's some cleanup around MutableVolumeSet which I filed HDDS-13545 for.
...-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...ner-service/src/test/java/org/apache/hadoop/ozone/container/common/volume/TestVolumeSet.java
...-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
@errose28 @Tejaskriya Can you please take another look?
errose28
left a comment
Thanks for the updates.
...-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...rvice/src/main/java/org/apache/hadoop/ozone/container/common/volume/VolumeHealthMetrics.java
...ner-service/src/test/java/org/apache/hadoop/ozone/container/common/volume/TestVolumeSet.java
@errose28 I have pushed the changes for the latest comments. Could you please take another look?
errose28
left a comment
Mostly LGTM. Cursor found one small issue that I missed in my previous review though: if MutableVolumeSet#initializeVolumeSet throws, we don't unregister the metrics. We've seen these types of issues lead to bugs in corner cases like #4966 so we should fix that here.
This should address the issue:
diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
index 8a6cc2b9c4..c6d28f58a4 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
@@ -105,7 +105,6 @@ public MutableVolumeSet(String dnUuid, String clusterID,
this.volumeChecker.registerVolumeSet(this);
}
this.volumeType = volumeType;
- this.volumeHealthMetrics = VolumeHealthMetrics.create(volumeType);
SpaceUsageCheckFactory usageCheckFactory =
SpaceUsageCheckFactory.create(conf);
@@ -125,7 +124,14 @@ public MutableVolumeSet(String dnUuid, String clusterID,
maxVolumeFailuresTolerated = dnConf.getFailedDataVolumesTolerated();
}
- initializeVolumeSet();
+ // Ensure metrics are unregistered if the volume set initialization fails.
+ this.volumeHealthMetrics = VolumeHealthMetrics.create(volumeType);
+ try {
+ initializeVolumeSet();
+ } catch (IOException ex) {
+ volumeHealthMetrics.unregister();
+ throw ex;
+ }
}
public void setFailedVolumeListener(CheckedRunnable<IOException> runnable) {
Moving the metrics creation next to the try/catch, as in the diff shared above, together with the comment explaining why, is an improvement over the latest commit: it minimizes the chance of a new throwing call being placed between the metrics creation and the cleanup. I don't think catching all exceptions is necessary, since runtime exceptions should crash the datanode, but I don't think it causes harm in this case either.
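For readers skimming the thread, here is a minimal, self-contained sketch of the pattern the diff applies: register the metrics immediately before the guarded call and unregister them if initialization throws. `SketchMetrics`, `SketchVolumeSet`, and the `REGISTERED` set are hypothetical stand-ins invented for this illustration; they are not the Ozone classes or the Hadoop metrics API.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class MetricsCleanupSketch {
  /** Hypothetical stand-in for a metrics registry (not the Hadoop metrics system). */
  static final Set<String> REGISTERED = new HashSet<>();

  /** Hypothetical stand-in for VolumeHealthMetrics. */
  static final class SketchMetrics {
    private final String name;

    private SketchMetrics(String name) {
      this.name = name;
    }

    static SketchMetrics create(String name) {
      REGISTERED.add(name);
      return new SketchMetrics(name);
    }

    void unregister() {
      REGISTERED.remove(name);
    }
  }

  /** Hypothetical stand-in for MutableVolumeSet. */
  static final class SketchVolumeSet {
    private final SketchMetrics metrics;

    SketchVolumeSet(String metricsName, boolean failInit) throws IOException {
      // Create the metrics right before the guarded call so that no other
      // throwing statement can slip in between creation and cleanup.
      this.metrics = SketchMetrics.create(metricsName);
      try {
        initialize(failInit);
      } catch (IOException ex) {
        // Initialization failed: undo the registration, then propagate.
        metrics.unregister();
        throw ex;
      }
    }

    private void initialize(boolean fail) throws IOException {
      if (fail) {
        throw new IOException("simulated volume set initialization failure");
      }
    }
  }

  public static void main(String[] args) {
    try {
      new SketchVolumeSet("VolumeHealthMetrics-DATA_VOLUME", true);
    } catch (IOException expected) {
      // The constructor failed, so no instance exists to clean up later.
    }
    // Prints: registered after failed init: []
    System.out.println("registered after failed init: " + REGISTERED);
  }
}
```

Without the catch block, the failed constructor would leave the name registered with no object left to unregister it, which is the kind of leak the review comment describes.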
errose28
left a comment
Thanks for the updates, LGTM.
Thanks for the review @errose28 and @Tejaskriya.
Please describe your PR in detail:
This PR:
Adds VolumeHealthMetrics to capture cumulative metrics of volumes on a datanode
Adds the metrics TotalVolumes, NumHealthyVolumes and NumFailedVolumes
{
  "name" : "Hadoop:service=HddsDatanode,name=VolumeHealthMetrics-DATA_VOLUME",
  "modelerType" : "VolumeHealthMetrics-DATA_VOLUME",
  "tag.Context" : "ozone",
  "tag.Hostname" : "bf73c953196f",
  "TotalVolumes" : 1,
  "NumHealthyVolumes" : 1,
  "NumFailedVolumes" : 0
},
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13093
How was this patch tested?
Manually tested by observing the jmx values in a docker cluster
CI: https://github.com/ptlrs/ozone/actions/runs/16435851356
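The manual JMX check described under "How was this patch tested?" could be scripted roughly as follows. This is only a sketch: `VolumeHealthJmxCheck`, `queryUri`, and `readMetric` are names invented here, the regex extraction is a shortcut rather than a real JSON parser, and the `/jmx?qry=` endpoint and the `host:port` argument are assumptions about how the datanode's JMX servlet is exposed in a given cluster. Run without arguments, it only parses the sample bean copied from the PR description.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VolumeHealthJmxCheck {
  // Bean name as it appears in the PR description.
  static final String BEAN =
      "Hadoop:service=HddsDatanode,name=VolumeHealthMetrics-DATA_VOLUME";

  // Sample bean copied from the PR description; used when no host is given.
  static final String SAMPLE = "{ \"name\" : \"" + BEAN + "\", "
      + "\"TotalVolumes\" : 1, \"NumHealthyVolumes\" : 1, \"NumFailedVolumes\" : 0 }";

  /** Builds the query URL; the /jmx?qry= path is an assumption about the deployment. */
  static URI queryUri(String hostPort) {
    return URI.create("http://" + hostPort + "/jmx?qry=" + BEAN);
  }

  /** Extracts one integer-valued metric from the JSON text with a regex. */
  static long readMetric(String json, String metric) {
    Matcher m = Pattern
        .compile("\"" + Pattern.quote(metric) + "\"\\s*:\\s*(\\d+)")
        .matcher(json);
    if (!m.find()) {
      throw new IllegalArgumentException("metric not found: " + metric);
    }
    return Long.parseLong(m.group(1));
  }

  public static void main(String[] args) throws Exception {
    String json = SAMPLE;
    if (args.length > 0) {
      // args[0] is the datanode HTTP address as host:port (cluster-specific).
      HttpResponse<String> response = HttpClient.newHttpClient().send(
          HttpRequest.newBuilder(queryUri(args[0])).GET().build(),
          HttpResponse.BodyHandlers.ofString());
      json = response.body();
    }
    for (String name
        : new String[] {"TotalVolumes", "NumHealthyVolumes", "NumFailedVolumes"}) {
      System.out.println(name + " = " + readMetric(json, name));
    }
  }
}
```

With no arguments this prints `TotalVolumes = 1`, `NumHealthyVolumes = 1`, and `NumFailedVolumes = 0`, matching the sample output above.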