Skip to content

Conversation

@ptlrs
Copy link
Contributor

@ptlrs ptlrs commented Jun 12, 2025

Please describe your PR in detail:
This PR:

  1. Creates a new JMX metrics type VolumeHealthMetrics to capture cumulative metrics of volumes on a datanode
  2. It publishes three metrics: TotalVolumes, NumHealthyVolumes and NumFailedVolumes
 {
    "name" : "Hadoop:service=HddsDatanode,name=VolumeHealthMetrics-DATA_VOLUME",
    "modelerType" : "VolumeHealthMetrics-DATA_VOLUME",
    "tag.Context" : "ozone",
    "tag.Hostname" : "bf73c953196f",
    "TotalVolumes" : 1,
    "NumHealthyVolumes" : 1,
    "NumFailedVolumes" : 0
  },

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13093

How was this patch tested?

Manually tested by observing the jmx values in a docker cluster
CI: https://github.com/ptlrs/ozone/actions/runs/16435851356

@errose28 errose28 added the scanners Changes related to datanode container and volume scanners label Jun 12, 2025
Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this @ptlrs. Can you add some tests as well?

ptlrs added 4 commits July 21, 2025 21:15
…trics-to-count-volumes-by-health-state-per-datanode

# Conflicts:
#	hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/ozoneimpl/OzoneContainer.java
@ptlrs ptlrs requested a review from errose28 July 22, 2025 05:52
@ptlrs
Copy link
Contributor Author

ptlrs commented Jul 22, 2025

Hi @errose28 @Tejaskriya, I have updated the PR with some new tests and refactoring. Could you please take a look.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @ptlrs.

@Tejaskriya Tejaskriya self-requested a review July 23, 2025 08:31
@ptlrs ptlrs requested a review from errose28 July 31, 2025 08:07
Copy link
Contributor

@Tejaskriya Tejaskriya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me over all @ptlrs , just one suggestion. Could we add some metrics assertions in the TestPeriodicVolumeChecker to have some tests binding volume scans with these metrics too?

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, left some comments on the latest changes. Looks like there's some cleanup around MutableVolumeSet which I filed HDDS-13545 for.

@ptlrs ptlrs requested review from Tejaskriya and errose28 August 12, 2025 02:18
@ptlrs
Copy link
Contributor Author

ptlrs commented Aug 12, 2025

@errose28 @Tejaskriya can you please take another look.
The TestPeriodicVolumeChecker has been updated.
Other comments have been addressed or a reasoning has been provided for the current approach. Let me know if we prefer to change the implementation.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates.

@ptlrs ptlrs requested a review from errose28 August 14, 2025 03:03
@ptlrs
Copy link
Contributor Author

ptlrs commented Aug 14, 2025

@errose28 I have pushed the changes for the latest comments. Could you please take another look?

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM. Cursor found one small issue that I missed in my previous review though: if MutableVolumeSet#initializeVolumeSet throws, we don't unregister the metrics. We've seen these types of issues lead to bugs in corner cases like #4966 so we should fix that here.

This should address the issue:

diff --git a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
index 8a6cc2b9c4..c6d28f58a4 100644
--- a/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/volume/MutableVolumeSet.java
@@ -105,7 +105,6 @@ public MutableVolumeSet(String dnUuid, String clusterID,
       this.volumeChecker.registerVolumeSet(this);
     }
     this.volumeType = volumeType;
-    this.volumeHealthMetrics = VolumeHealthMetrics.create(volumeType);
 
     SpaceUsageCheckFactory usageCheckFactory =
         SpaceUsageCheckFactory.create(conf);
@@ -125,7 +124,14 @@ public MutableVolumeSet(String dnUuid, String clusterID,
       maxVolumeFailuresTolerated = dnConf.getFailedDataVolumesTolerated();
     }
 
-    initializeVolumeSet();
+    // Ensure metrics are unregistered if the volume set initialization fails.
+    this.volumeHealthMetrics = VolumeHealthMetrics.create(volumeType);
+    try {
+      initializeVolumeSet();
+    } catch (IOException ex) {
+      volumeHealthMetrics.unregister();
+      throw ex;
+    }
   }
 
   public void setFailedVolumeListener(CheckedRunnable<IOException> runnable) {

@ptlrs ptlrs requested a review from errose28 August 14, 2025 17:01
@errose28
Copy link
Contributor

Moving metrics create to the same area as the try/catch like the diff shared above with the comment explaining why is an improvement over the latest commit because it minimizes the chances of a new throwing call being placed between the metrics create and cleanup. I don't think catching all exceptions is necessary since runtime exceptions should crash the datanode but I don't think it causes harm in this case either.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, LGTM.

@ptlrs
Copy link
Contributor Author

ptlrs commented Aug 19, 2025

Thanks for the review @errose28 and @Tejaskriya.

@ptlrs ptlrs marked this pull request as ready for review August 19, 2025 18:05
@errose28 errose28 merged commit e280fbb into apache:master Aug 19, 2025
92 of 93 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

scanners Changes related to datanode container and volume scanners

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants