HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside lock again. by sadanand48 · Pull Request #9131 · apache/ozone

sadanand48 · 2025-10-09T13:58:17Z

What changes were proposed in this pull request?

Snapshot paths are now re-read inside the lock from the OM checkpoint DB, using the checkpoint DB's metadataManager instance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13772

How was this patch tested?

unit test.

… lock again. (cherry picked from commit 3650d00e8f49e668b936644df79d859664564d3f)

Copilot

Pull Request Overview

This PR addresses a concurrency issue where snapshot paths were read from the live OM metadata manager instead of the checkpoint's metadata manager, potentially including stale or purged snapshots during checkpoint transfers. The fix ensures snapshot paths are re-read from the checkpoint database inside the lock to maintain consistency.

Re-reads snapshot paths from checkpoint metadata manager instead of live metadata manager
Adds proper resource cleanup for the checkpoint metadata manager
Changes method visibility from private to package-private for testing
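
The difference between reading snapshot paths from the live metadata manager and from the checkpoint can be illustrated with a small, self-contained sketch. This is not the Ozone implementation: `ToySnapshotStore`, `SnapshotPathDemo`, `pathsForTransfer`, and the `/db.snapshots/` path prefix are all hypothetical names chosen for illustration, and an in-memory map stands in for the snapshot table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a metadata store holding a snapshot table (name -> path). */
final class ToySnapshotStore {
  private final Map<String, String> snapshotTable = new TreeMap<>();

  void createSnapshot(String name) {
    // Hypothetical path layout, used only for this illustration.
    snapshotTable.put(name, "/db.snapshots/" + name);
  }

  void purgeSnapshot(String name) {
    snapshotTable.remove(name);
  }

  /** Returns an independent frozen copy, standing in for a DB checkpoint. */
  ToySnapshotStore checkpoint() {
    ToySnapshotStore copy = new ToySnapshotStore();
    copy.snapshotTable.putAll(snapshotTable);
    return copy;
  }

  List<String> snapshotPaths() {
    return new ArrayList<>(snapshotTable.values());
  }
}

final class SnapshotPathDemo {
  /** Resolves the paths to transfer from the checkpoint, not from the live store. */
  static List<String> pathsForTransfer(ToySnapshotStore checkpoint) {
    return checkpoint.snapshotPaths();
  }

  public static void main(String[] args) {
    ToySnapshotStore live = new ToySnapshotStore();
    live.createSnapshot("snapshot1");
    live.createSnapshot("snapshot2");

    ToySnapshotStore checkpoint = live.checkpoint();

    // The live store keeps changing after the checkpoint is taken.
    live.purgeSnapshot("snapshot2");
    live.createSnapshot("snapshot3");

    System.out.println("live:       " + live.snapshotPaths());
    System.out.println("checkpoint: " + pathsForTransfer(checkpoint));
  }
}
```

In this toy model the live store and the checkpoint disagree as soon as a snapshot is purged or created after the checkpoint, which is why the paths to transfer are taken from the checkpoint's own view.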

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
OMDBCheckpointServletInodeBasedXfer.java	Adds code to re-read snapshot paths from checkpoint metadata manager and changes method visibility for testing
TestOMDbCheckpointServletInodeBasedXfer.java	Adds unit test to verify snapshot paths are correctly read from checkpoint after purge scenarios

swamirishi

@sadanand48 thanks for the patch. I left a few review comments.

smengcl · 2025-10-15T03:40:36Z

Thanks @sadanand48. Could you rebase the patch?

swamirishi

@sadanand48 there are still some minor nitpicks on this patch. Once they are addressed, we can merge this.

swamirishi · 2025-10-27T12:23:22Z

...n-test/src/test/java/org/apache/hadoop/ozone/om/TestOMDbCheckpointServletInodeBasedXfer.java

+   * 5. Servlet processes checkpoint - should still include S2 data
+   */
+  @Test
+  public void testCheckpointIncludesSnapshotsFromFrozenState() throws Exception {


nit: Do we need a mini Ozone cluster test case for this? I believe a plain unit test would suffice here instead of a full mini Ozone cluster test. We can consider moving this into a unit test later.

swamirishi

This needs a few more changes

swamirishi · 2025-10-27T14:43:12Z

...n-test/src/test/java/org/apache/hadoop/ozone/om/TestOMDbCheckpointServletInodeBasedXfer.java

+    when(spyDbStore.getCheckpoint(true)).thenAnswer(invocation -> {
+      DBCheckpoint checkpoint = spy(dbStore.getCheckpoint(true));
+      doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
+      capturedCheckpoint.set(checkpoint);
+      return checkpoint;
+    });


Suggested change

-    when(spyDbStore.getCheckpoint(true)).thenAnswer(invocation -> {
-      DBCheckpoint checkpoint = spy(dbStore.getCheckpoint(true));
-      doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
-      capturedCheckpoint.set(checkpoint);
-      return checkpoint;
-    });
+    when(spyDbStore.getCheckpoint(eq(true))).thenAnswer(invocation -> {
+      client.getObjectStore().deleteSnapshot(volumeName, bucketName, "snapshot2");
+      client.getObjectStore().createSnapshot(volumeName, bucketName, "snapshot3");
+      // wait for snapshot2 to get purged and snapshot3 to get created.
+      DBCheckpoint checkpoint = spy(Mockito.callRealMethod());
+      doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
+      capturedCheckpoint.set(checkpoint);
+      return checkpoint;
+    });

@sadanand48 Let us purge the snapshot just before we take the checkpoint, which means we should purge the snapshot in this block.

The checkpoint is taken after the bootstrap lock is acquired, so once the bootstrap lock is acquired the purge won't succeed in this block, because SDS needs to acquire this lock to do the purge. We need to do it outside the lock itself.

Can we do the bootstrap on the follower OM? The bootstrap lock won't mean anything when this runs on the follower. To get into this condition, maybe we should mock the bootstrap lock to be a no-op here.
Actually, what we should do is this: during BootstrapLock, pause the double buffer thread, delete the snapshot, and acquire the bootstrap lock. Then, just before the checkpoint, unpause the double buffer thread and let the snapshot purge get flushed. That would do it.

swamirishi · 2025-11-01T17:43:28Z

...n-test/src/test/java/org/apache/hadoop/ozone/om/TestOMDbCheckpointServletInodeBasedXfer.java

+    when(requestMock.getParameter(OZONE_DB_CHECKPOINT_INCLUDE_SNAPSHOT_DATA)).thenReturn("true");
+    // custom lock because the original lock waits for double buffer flush.
+    BootstrapStateHandler.Lock customLock = new BootstrapStateHandler.Lock() {
+      private final List<BootstrapStateHandler.Lock> serviceLocks;


If that is the case, I don't think the purge snapshot race condition is even possible. Then we should just deal with the create snapshot race condition.

swamirishi · 2025-11-01T17:48:56Z

...e/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ratis/OzoneManagerDoubleBuffer.java

  private final Daemon daemon;
  /** Is the {@link #daemon} running? */
  private final AtomicBoolean isRunning = new AtomicBoolean(false);
+  private final AtomicBoolean isPaused = new AtomicBoolean(false);


Why can't we use pauseDeamon() and resume() again?

Just saw the code; we cannot use that.
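
The hunk above adds an `isPaused` flag next to `isRunning`. As a rough illustration only, and not the actual `OzoneManagerDoubleBuffer` code, a flag of this kind can gate a flush loop as sketched here. `PausableBufferSketch` and all of its methods are hypothetical names, and the flush loop is reduced to a single `flushOnce()` call so the behaviour stays deterministic.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical buffer whose flush step is skipped while a pause flag is set. */
final class PausableBufferSketch {
  private final Queue<String> pending = new ArrayDeque<>();
  private final AtomicBoolean isPaused = new AtomicBoolean(false);
  private int flushed;

  synchronized void add(String transaction) {
    pending.add(transaction);
  }

  void pause() {
    isPaused.set(true);
  }

  void resume() {
    isPaused.set(false);
  }

  /** One iteration of the flush loop; returns how many entries were flushed. */
  synchronized int flushOnce() {
    if (isPaused.get()) {
      return 0; // Leave pending entries untouched while paused.
    }
    int count = pending.size();
    pending.clear();
    flushed += count;
    return count;
  }

  synchronized int flushedCount() {
    return flushed;
  }

  public static void main(String[] args) {
    PausableBufferSketch buffer = new PausableBufferSketch();
    buffer.pause();
    buffer.add("purge snapshot2");
    System.out.println("flushed while paused: " + buffer.flushOnce());
    buffer.resume();
    System.out.println("flushed after resume: " + buffer.flushOnce());
  }
}
```

A test can use such a hook to hold a pending purge back until a chosen point, which is the ordering described earlier in this thread.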

swamirishi

LGTM. I believe we can remove the race condition testing here, and maybe add a test case to check that the bootstrapHandler lock actually waits for the double buffer flush. I see that we don't have a unit test for this class. Please add a unit test for this class which ensures that we take a lock on all the background services and that a double buffer wait happens.

ozone/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java

Lines 657 to 698 in de683aa

    
  static class Lock extends BootstrapStateHandler.Lock {
    private final List<BootstrapStateHandler.Lock> locks;
    private final OzoneManager om;

    Lock(OzoneManager om) {
      Preconditions.checkNotNull(om);
      Preconditions.checkNotNull(om.getKeyManager());
      Preconditions.checkNotNull(om.getMetadataManager());
      Preconditions.checkNotNull(om.getMetadataManager().getStore());

      this.om = om;

      locks = Stream.of(
          om.getKeyManager().getDeletingService(),
          om.getKeyManager().getDirDeletingService(),
          om.getKeyManager().getSnapshotSstFilteringService(),
          om.getKeyManager().getSnapshotDeletingService(),
          om.getMetadataManager().getStore().getRocksDBCheckpointDiffer()
      )
          .filter(Objects::nonNull)
          .map(BootstrapStateHandler::getBootstrapStateLock)
          .collect(Collectors.toList());
    }

    @Override
    public BootstrapStateHandler.Lock lock()
        throws InterruptedException {
      // First lock all the handlers.
      for (BootstrapStateHandler.Lock lock : locks) {
        lock.lock();
      }

      // Then wait for the double buffer to be flushed.
      om.awaitDoubleBufferFlush();
      return this;
    }

    @Override
    public void unlock() {
      locks.forEach(BootstrapStateHandler.Lock::unlock);
    }
  }
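
The pattern in the class above (acquire every service lock, then wait for pending writes to flush) can be tried out in isolation with plain JDK types. The following is a minimal sketch under stated assumptions, not Ozone code: `CompositeLockSketch` is a hypothetical name, `ReentrantLock` stands in for the per-service bootstrap locks, and a `Runnable` stands in for `awaitDoubleBufferFlush()`. Unlike the original, the sketch releases locks in reverse order and rolls back already-acquired locks if the flush wait fails; both are illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical composite lock: take every service lock, then wait for a flush. */
final class CompositeLockSketch {
  private final List<ReentrantLock> serviceLocks;
  private final Runnable awaitFlush;

  CompositeLockSketch(List<ReentrantLock> serviceLocks, Runnable awaitFlush) {
    this.serviceLocks = new ArrayList<>(serviceLocks);
    this.awaitFlush = awaitFlush;
  }

  CompositeLockSketch lock() {
    int acquired = 0;
    try {
      // First lock all the services so none of them can mutate state.
      for (ReentrantLock serviceLock : serviceLocks) {
        serviceLock.lock();
        acquired++;
      }
      // Then wait until already-queued writes have been flushed.
      awaitFlush.run();
      return this;
    } catch (RuntimeException e) {
      // Roll back whatever was acquired so a failed lock() leaks nothing.
      for (int i = acquired - 1; i >= 0; i--) {
        serviceLocks.get(i).unlock();
      }
      throw e;
    }
  }

  void unlock() {
    for (int i = serviceLocks.size() - 1; i >= 0; i--) {
      serviceLocks.get(i).unlock();
    }
  }

  public static void main(String[] args) {
    ReentrantLock deleting = new ReentrantLock();
    ReentrantLock snapshotDeleting = new ReentrantLock();
    List<ReentrantLock> locks = new ArrayList<>();
    locks.add(deleting);
    locks.add(snapshotDeleting);

    CompositeLockSketch lock = new CompositeLockSketch(locks,
        () -> System.out.println("flush awaited with all locks held: "
            + (deleting.isLocked() && snapshotDeleting.isLocked())));

    lock.lock();
    try {
      System.out.println("taking checkpoint");
    } finally {
      lock.unlock();
    }
  }
}
```

Because the flush wait runs only after every service lock is held, no background service can queue a new change between the flush and the checkpoint in this model.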

sadanand48 · 2025-11-02T07:01:44Z

I see that we don't have a unit test for this class. Please add a unit test for this class which ensure we are taking a lock on all the background services and a double buffer wait happens

We do have a unit test that verifies this, and two tests that verify bootstrap lock coordination, i.e. testBootstrapLockCoordination and testBootstrapLockBlocksMultipleServices.

ozone/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMDbCheckpointServletInodeBasedXfer.java

Line 498 in 833e955

verify(mockOM).awaitDoubleBufferFlush();

I have updated the patch to remove the purge snapshot condition as you suggested.

swamirishi · 2025-11-02T14:10:12Z

Should we also write a test verifying that OMDBCheckpointServletInodeBasedXfer is actually using OMDBCheckpointServlet.Lock and not something else? Here we are creating a new instance of the lock:

ozone/hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMDbCheckpointServletInodeBasedXfer.java

Line 488 in 833e955

    
           OMDBCheckpointServlet.Lock bootstrapLock = new OMDBCheckpointServlet.Lock(mockOM);

swamirishi

LGTM thanks @sadanand48 for the patch

swamirishi · 2025-11-02T14:18:56Z

@sadanand48 Let us create a follow-up Jira for changing the test to initialize the lock from the servlet object itself instead of creating a new instance. Here we are not testing whether the InodeBasedCheckpointServlet instance is actually using the correct implementation of BootstrapLock.

adoroszlai · 2025-11-03T08:20:10Z

@swamirishi When merging PRs, please remove co-author information if it's the same person with different email address.

Author: Sadanand Shenoy <sadanand.shenoy4898@...>
Date:   Sun Nov 2 19:49:07 2025 +0530

    HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside lock again. (#9131)
    
    Co-authored-by: Sadanand Shenoy <sadanand.shenoy@...>

Also, please set fix version when resolving Jira issue after PR merge.

sadanand48 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Oct 9, 2025

HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside…

d0d6669

… lock again. (cherry picked from commit 3650d00e8f49e668b936644df79d859664564d3f)

sadanand48 force-pushed the HDDS-13772 branch from f0151a4 to d0d6669 Compare October 10, 2025 08:26

fix leaks

3e09146

sadanand48 marked this pull request as ready for review October 10, 2025 19:37

jojochuang requested review from Copilot, jojochuang and swamirishi and removed request for Copilot and swamirishi October 13, 2025 16:23

Copilot AI reviewed Oct 13, 2025

jojochuang reviewed Oct 13, 2025

jojochuang reviewed Oct 13, 2025

jojochuang requested review from smengcl and swamirishi October 13, 2025 22:23

swamirishi requested changes Oct 13, 2025

Sadanand Shenoy added 2 commits October 21, 2025 15:12

address comment

6d748e2

Merge branch 'master' into HDDS-13772

30aad81

jojochuang requested a review from swamirishi October 21, 2025 18:26

jojochuang reviewed Oct 22, 2025

jojochuang reviewed Oct 22, 2025

Sadanand Shenoy added 4 commits October 23, 2025 21:24

Merge branch 'master' into HDDS-13772

cb4cb8c

fix compile

1aa7867

fix checkstyle

ca862a8

Merge branch 'master' into HDDS-13772

73fc89a

swamirishi reviewed Oct 27, 2025

swamirishi requested changes Oct 27, 2025

use try with resources

d909182

swamirishi requested changes Oct 27, 2025

address comments

7f14170

swamirishi reviewed Oct 27, 2025

swamirishi requested changes Oct 27, 2025

Sadanand Shenoy added 2 commits October 31, 2025 18:00

address comment

a3c74f6

address comment

30c5a89

swamirishi requested changes Oct 31, 2025

address comment

fe24137

swamirishi reviewed Nov 1, 2025

swamirishi approved these changes Nov 1, 2025

swamirishi requested changes Nov 1, 2025

remove race condition for purge

4a38371

swamirishi approved these changes Nov 2, 2025

swamirishi merged commit 4d6f3a5 into apache:master Nov 2, 2025
43 checks passed
