-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-13125. Add metrics for monitoring the SST file pruning threads. #8764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds comprehensive metrics collection for SST file pruning within the RocksDB checkpoint differ, enabling monitoring of queue size, pruned/skipped files, processed compactions, and failure counts.
- Introduce SSTFilePruningMetrics to define and register counters/gauges.
- Instrument RocksDBCheckpointDiffer to update metrics on queue changes, batch processing, and failures, and expose them for tests.
- Update tests to assert metric values and adjust hadoop-common dependency scope.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/main/java/org/apache/ozone/rocksdiff/SSTFilePruningMetrics.java | New metrics source defining counters/gauges for pruning operations |
| src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java | Integrated metrics updates (queue size, batch metrics, failures), unregister on close, added test getter |
| src/test/java/org/apache/ozone/rocksdiff/TestRocksDBCheckpointDiffer.java | Added assertions to validate initial and post-pruning metric values; typo fix |
| pom.xml | Changed hadoop-common scope from test to provided |
Comments suppressed due to low confidence (4)
hadoop-hdds/rocksdb-checkpoint-differ/src/test/java/org/apache/ozone/rocksdiff/TestRocksDBCheckpointDiffer.java:1937
- No test covers the pruningFailures metric. Consider adding a scenario that triggers an IOException in pruneSstFileValues and asserts that getPruningFailures() increments.
public void testPruneSSTFileValues() throws Exception {
hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java:1454
- [nitpick] Add JavaDoc to clarify that this getter is intended only for testing and should not be used in production code paths.
public SSTFilePruningMetrics getPruningMetrics() {
hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java:540
- Potential NullPointerException if sstFilePruningMetrics is ever uninitialized. Consider guarding this call or ensuring metrics are initialized in every constructor.
sstFilePruningMetrics.updateQueueSize(pruneQueue.size());
hadoop-hdds/rocksdb-checkpoint-differ/pom.xml:87
- Changing the scope of hadoop-common from 'test' to 'provided' may remove it from the test classpath and break existing tests. Verify that all test dependencies are still satisfied.
<scope>provided</scope>
| public void updateBatchLevelMetrics(long filesPruned, long filesSkipped, int compactions, long queueSize) { | ||
| filesPrunedTotal.incr(filesPruned); | ||
| filesPrunedLast.set(filesPruned); | ||
| filesRemovedTotal.incr(filesSkipped); |
Copilot
AI
Jul 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Incrementing 'filesRemovedTotal' with 'filesSkipped' may be confusing. Either rename the metric/parameter for clarity or update logic to reflect actual removed files separately.
| filesRemovedTotal.incr(filesSkipped); | |
| filesSkippedTotal.incr(filesSkipped); |
| @Metric("No. of SST files pruned in the last batch") | ||
| private MutableGaugeLong filesPrunedLast; | ||
| @Metric("Total no. of SST files removed") | ||
| private MutableCounterLong filesRemovedTotal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filesSkippedTotal would make more sense
| * @return SSTFilePruningMetrics | ||
| */ | ||
| public static SSTFilePruningMetrics create() { | ||
| return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME, "SST File Pruning Metrics", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems there can be more than one RocksDBCheckpointDifferHolder, and therefore more than one SSTFilePruningMetrics instances.
In which case, the unRegister() could cause resource leak -- when two SSTFilePruningMetrics are created, the first one registered to DefaultMetricsSystem is replaced by the second reference. In unRegister(), only the second one will be unregistered, and the first one leaks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The easiest solution is to associate the instance with a unique name. Check out the example made by Google Gemini: jojochuang@51c32f7
* master: (90 commits) HDDS-13308. OM should expose Ratis config for increasing pending write limits (apache#8668) HDDS-8903. Add validation for ozone.om.snapshot.db.max.open.files. (apache#8787) HDDS-13429. Custom metadata headers with uppercase characters are not supported (apache#8805) HDDS-13448. DeleteBlocksCommandHandler thread stop for normal exception (apache#8816) HDDS-13346. Intermittent failure in TestCloseContainer#testContainerChecksumForClosedContainer (apache#8771) HDDS-13125. Add metrics for monitoring the SST file pruning threads. (apache#8764) HDDS-13367. [Docs] User doc for container balancer. (apache#8726) HDDS-13200. OM RocksDB Grafana Dashbroad shows no data on all panels (apache#8577) HDDS-13428. Recon - Retrigger of build whole NSSummary tree task submission inconsistency. (apache#8793) HDDS-13378. [Docs] Add a Production page under Getting Started (apache#8734) HDDS-13403. [Docs] Make feature proposal process more visible. (apache#8758) HDDS-11797. Remove cyclic dependency between SCMSafeModeManager and SafeModeRules (apache#8782) HDDS-13213. KeyDeletingService should limit task size by both key count and serialized size. (apache#8757) HDDS-13387. OMSnapshotCreateRequest logs invalid warning about DefaultReplicationConfig (apache#8760) HDDS-13405. ozone admin container create runs forever without kinit (apache#8765) HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766) HDDS-13376. Add server-side limit note to ozone sh snapshot diff --page-size option (apache#8791) HDDS-11679. Support multiple S3Gs in MiniOzoneCluster (apache#8733) HDDS-13424. Use lsof instead of fuser to find if file is used in AbstractTestChunkManager (apache#8790) HDDS-13427. Bump awssdk to 2.31.78 (apache#8792) ...
…pache#8764) Co-authored-by: saketa <[email protected]>
…ng threads. (apache#8764) Co-authored-by: saketa <[email protected]> (cherry picked from commit 9d96442) Conflicts: hadoop-hdds/rocksdb-checkpoint-differ/pom.xml hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java Change-Id: If8973acdd5e1605d973bd03447669d9a98ca21de
What changes were proposed in this pull request?
Added metrics to monitor backup SST file pruning. Metrics being collected
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13125
How was this patch tested?
Unit Test.