Skip to content

Conversation

@SaketaChalamchala
Copy link
Contributor

What changes were proposed in this pull request?

Added metrics to monitor backup SST file pruning. Metrics being collected

  • Total no. of backup SST files pruned.
  • No. of backup SST files pruned in the last batch.
  • Total no. of backup SST files removed by pruning threads.
  • Total no. of compactions processed for pruning.
  • No. of pending compaction logs in pruning queue.
  • No. of pruning job failures.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13125

How was this patch tested?

Unit Test.

Copilot AI review requested due to automatic review settings July 8, 2025 20:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds comprehensive metrics collection for SST file pruning within the RocksDB checkpoint differ, enabling monitoring of queue size, pruned/skipped files, processed compactions, and failure counts.

  • Introduce SSTFilePruningMetrics to define and register counters/gauges.
  • Instrument RocksDBCheckpointDiffer to update metrics on queue changes, batch processing, and failures, and expose them for tests.
  • Update tests to assert metric values and adjust hadoop-common dependency scope.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/main/java/org/apache/ozone/rocksdiff/SSTFilePruningMetrics.java New metrics source defining counters/gauges for pruning operations
src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java Integrated metrics updates (queue size, batch metrics, failures), unregister on close, added test getter
src/test/java/org/apache/ozone/rocksdiff/TestRocksDBCheckpointDiffer.java Added assertions to validate initial and post-pruning metric values; typo fix
pom.xml Changed hadoop-common scope from test to provided
Comments suppressed due to low confidence (4)

hadoop-hdds/rocksdb-checkpoint-differ/src/test/java/org/apache/ozone/rocksdiff/TestRocksDBCheckpointDiffer.java:1937

  • No test covers the pruningFailures metric. Consider adding a scenario that triggers an IOException in pruneSstFileValues and asserts that getPruningFailures() increments.
  public void testPruneSSTFileValues() throws Exception {

hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java:1454

  • [nitpick] Add JavaDoc to clarify that this getter is intended only for testing and should not be used in production code paths.
  public SSTFilePruningMetrics getPruningMetrics() {

hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java:540

  • Potential NullPointerException if sstFilePruningMetrics is ever uninitialized. Consider guarding this call or ensuring metrics are initialized in every constructor.
          sstFilePruningMetrics.updateQueueSize(pruneQueue.size());

hadoop-hdds/rocksdb-checkpoint-differ/pom.xml:87

  • Changing the scope of hadoop-common from 'test' to 'provided' may remove it from the test classpath and break existing tests. Verify that all test dependencies are still satisfied.
      <scope>provided</scope>

public void updateBatchLevelMetrics(long filesPruned, long filesSkipped, int compactions, long queueSize) {
filesPrunedTotal.incr(filesPruned);
filesPrunedLast.set(filesPruned);
filesRemovedTotal.incr(filesSkipped);
Copy link

Copilot AI Jul 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Incrementing 'filesRemovedTotal' with 'filesSkipped' may be confusing. Either rename the metric/parameter for clarity or update logic to reflect actual removed files separately.

Suggested change
filesRemovedTotal.incr(filesSkipped);
filesSkippedTotal.incr(filesSkipped);

Copilot uses AI. Check for mistakes.
@adoroszlai adoroszlai added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jul 9, 2025
@Metric("No. of SST files pruned in the last batch")
private MutableGaugeLong filesPrunedLast;
@Metric("Total no. of SST files removed")
private MutableCounterLong filesRemovedTotal;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filesSkippedTotal would make more sense

* @return SSTFilePruningMetrics
*/
public static SSTFilePruningMetrics create() {
return DefaultMetricsSystem.instance().register(METRICS_SOURCE_NAME, "SST File Pruning Metrics",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems there can be more than one RocksDBCheckpointDifferHolder, and therefore more than one SSTFilePruningMetrics instances.

In which case, the unRegister() could cause resource leak -- when two SSTFilePruningMetrics are created, the first one registered to DefaultMetricsSystem is replaced by the second reference. In unRegister(), only the second one will be unregistered, and the first one leaks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The easiest solution is to associate the instance with a unique name. Check out the example made by Google Gemini: jojochuang@51c32f7

@jojochuang jojochuang merged commit 9d96442 into apache:master Jul 15, 2025
80 of 82 checks passed
errose28 added a commit to errose28/ozone that referenced this pull request Jul 22, 2025
* master: (90 commits)
  HDDS-13308. OM should expose Ratis config for increasing pending write limits (apache#8668)
  HDDS-8903. Add validation for ozone.om.snapshot.db.max.open.files. (apache#8787)
  HDDS-13429. Custom metadata headers with uppercase characters are not supported (apache#8805)
  HDDS-13448. DeleteBlocksCommandHandler thread stop for normal exception (apache#8816)
  HDDS-13346. Intermittent failure in TestCloseContainer#testContainerChecksumForClosedContainer (apache#8771)
  HDDS-13125. Add metrics for monitoring the SST file pruning threads. (apache#8764)
  HDDS-13367. [Docs] User doc for container balancer. (apache#8726)
  HDDS-13200. OM RocksDB Grafana Dashbroad shows no data on all panels (apache#8577)
  HDDS-13428. Recon - Retrigger of build whole NSSummary tree task submission inconsistency. (apache#8793)
  HDDS-13378. [Docs] Add a Production page under Getting Started (apache#8734)
  HDDS-13403. [Docs] Make feature proposal process more visible. (apache#8758)
  HDDS-11797. Remove cyclic dependency between SCMSafeModeManager and SafeModeRules (apache#8782)
  HDDS-13213. KeyDeletingService should limit task size by both key count and serialized size. (apache#8757)
  HDDS-13387. OMSnapshotCreateRequest logs invalid warning about DefaultReplicationConfig (apache#8760)
  HDDS-13405. ozone admin container create runs forever without kinit (apache#8765)
  HDDS-11514. Set optimal default values for delete configurations based on live cluster testing. (apache#8766)
  HDDS-13376. Add server-side limit note to ozone sh snapshot diff --page-size option (apache#8791)
  HDDS-11679. Support multiple S3Gs in MiniOzoneCluster (apache#8733)
  HDDS-13424. Use lsof instead of fuser to find if file is used in AbstractTestChunkManager (apache#8790)
  HDDS-13427. Bump awssdk to 2.31.78 (apache#8792)
  ...
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Jul 31, 2025
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025
…ng threads. (apache#8764)

Co-authored-by: saketa <[email protected]>
(cherry picked from commit 9d96442)

 Conflicts:
	hadoop-hdds/rocksdb-checkpoint-differ/pom.xml
	hadoop-hdds/rocksdb-checkpoint-differ/src/main/java/org/apache/ozone/rocksdiff/RocksDBCheckpointDiffer.java

Change-Id: If8973acdd5e1605d973bd03447669d9a98ca21de
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants