Skip to content

HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside lock again.#9131

Merged
swamirishi merged 14 commits intoapache:masterfrom
sadanand48:HDDS-13772
Nov 2, 2025
Merged

HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside lock again.#9131
swamirishi merged 14 commits intoapache:masterfrom
sadanand48:HDDS-13772

Conversation

@sadanand48
Copy link
Contributor

What changes were proposed in this pull request?

Snapshot Paths to be re read from om checkpoint db inside lock again from the checkpoint DB's metadataManager instance.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13772

How was this patch tested?

unit test.

@sadanand48 sadanand48 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Oct 9, 2025
… lock again.

(cherry picked from commit 3650d00e8f49e668b936644df79d859664564d3f)
@sadanand48 sadanand48 marked this pull request as ready for review October 10, 2025 19:37
@jojochuang jojochuang requested review from Copilot, jojochuang and swamirishi and removed request for Copilot and swamirishi October 13, 2025 16:23
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a concurrency issue where snapshot paths were read from the live OM metadata manager instead of the checkpoint's metadata manager, potentially including stale or purged snapshots during checkpoint transfers. The fix ensures snapshot paths are re-read from the checkpoint database inside the lock to maintain consistency.

  • Re-reads snapshot paths from checkpoint metadata manager instead of live metadata manager
  • Adds proper resource cleanup for the checkpoint metadata manager
  • Changes method visibility from private to package-private for testing

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
OMDBCheckpointServletInodeBasedXfer.java Adds code to re-read snapshot paths from checkpoint metadata manager and changes method visibility for testing
TestOMDbCheckpointServletInodeBasedXfer.java Adds unit test to verify snapshot paths are correctly read from checkpoint after purge scenarios

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadanand48 thanks for the patch left a few review comments

@smengcl
Copy link
Contributor

smengcl commented Oct 15, 2025

Thanks @sadanand48 . Could you rebase the patch?

@jojochuang jojochuang requested a review from swamirishi October 21, 2025 18:26
Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadanand48 there are still some minor nitpicky changes on this patch once it is addressed we can merge this

* 5. Servlet processes checkpoint - should still include S2 data
*/
@Test
public void testCheckpointIncludesSnapshotsFromFrozenState() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we need a mini ozone cluster test case for this? I believe we can do with a unit test case here instead of a full mini ozone cluster test. We can think about moving this test into unit test later

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a few more changes

Comment on lines 656 to 661
when(spyDbStore.getCheckpoint(true)).thenAnswer(invocation -> {
DBCheckpoint checkpoint = spy(dbStore.getCheckpoint(true));
doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
capturedCheckpoint.set(checkpoint);
return checkpoint;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
when(spyDbStore.getCheckpoint(true)).thenAnswer(invocation -> {
DBCheckpoint checkpoint = spy(dbStore.getCheckpoint(true));
doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
capturedCheckpoint.set(checkpoint);
return checkpoint;
});
when(spyDbStore.getCheckpoint(eq(true))).thenAnswer(invocation -> {
client.getObjectStore().deleteSnapshot(volumeName, bucketName, "snapshot2");
client.getObjectStore().createSnapshot(volumeName, bucketName, "snapshot3");
// wait for snapshot2 to get purged and snapshot3 to get created.
DBCheckpoint checkpoint = spy(Mockito.callRealMethod());
doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
capturedCheckpoint.set(checkpoint);
return checkpoint;
});

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadanand48 Let us purge the snapshot just before we are trying to take a checkpoint which means that we should purge the snapshot in this block

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the checkpoint is taken after the bootstrap lock is acquired so once bootstrap lock is aquired the purge won't be succesful in this block as SDS needs to acquire this lock to do the purge. We need to do it outside the lock itself

Copy link
Contributor

@swamirishi swamirishi Oct 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do the bootstrap on the follower OM? Bootstrap lock there won't mean anything when this runs on the follower. To get into this condition maybe we should mock bootstrap lock to do a noop here.
Actually what we should do is during BootstrapLock we should pause the double buffer thread delete the snapshot and acquire the bootstrap lock. Now just before the checkpoint we should unpause the double buffer thread and let the snapshot purge get flushed this would do it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

when(requestMock.getParameter(OZONE_DB_CHECKPOINT_INCLUDE_SNAPSHOT_DATA)).thenReturn("true");
// custom lock because the original lock waits for double buffer flush.
BootstrapStateHandler.Lock customLock = new BootstrapStateHandler.Lock() {
private final List<BootstrapStateHandler.Lock> serviceLocks;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that is the case I don't think purge snapshot race condition is even possible. Then we should just deal with the create snapshot race condition

private final Daemon daemon;
/** Is the {@link #daemon} running? */
private final AtomicBoolean isRunning = new AtomicBoolean(false);
private final AtomicBoolean isPaused = new AtomicBoolean(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we use pauseDeamon() and resume() again?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw the code we cannot use that

swamirishi
swamirishi approved these changes Nov 1, 2025
Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I believe we can remove the testing race condition here and maybe add a test case to check if the bootstrapHandler lock actually waits for the double buffer flush. I see that we don't have a unit test for this class. Please add a unit test for this class which ensure we are taking a lock on all the background services and a double buffer wait happens

static class Lock extends BootstrapStateHandler.Lock {
private final List<BootstrapStateHandler.Lock> locks;
private final OzoneManager om;
Lock(OzoneManager om) {
Preconditions.checkNotNull(om);
Preconditions.checkNotNull(om.getKeyManager());
Preconditions.checkNotNull(om.getMetadataManager());
Preconditions.checkNotNull(om.getMetadataManager().getStore());
this.om = om;
locks = Stream.of(
om.getKeyManager().getDeletingService(),
om.getKeyManager().getDirDeletingService(),
om.getKeyManager().getSnapshotSstFilteringService(),
om.getKeyManager().getSnapshotDeletingService(),
om.getMetadataManager().getStore().getRocksDBCheckpointDiffer()
)
.filter(Objects::nonNull)
.map(BootstrapStateHandler::getBootstrapStateLock)
.collect(Collectors.toList());
}
@Override
public BootstrapStateHandler.Lock lock()
throws InterruptedException {
// First lock all the handlers.
for (BootstrapStateHandler.Lock lock : locks) {
lock.lock();
}
// Then wait for the double buffer to be flushed.
om.awaitDoubleBufferFlush();
return this;
}
@Override
public void unlock() {
locks.forEach(BootstrapStateHandler.Lock::unlock);
}
}

@sadanand48
Copy link
Contributor Author

I see that we don't have a unit test for this class. Please add a unit test for this class which ensure we are taking a lock on all the background services and a double buffer wait happens

We do have a unit test that verifies this and 2 tests that verify bootstrap lock co-ordinationi.e testBootstrapLockCoordination and testBootstrapLockBlocksMultipleServices

I have updated the patch to remove the purge snapshot condition as you suggested.

@swamirishi
Copy link
Contributor

I see that we don't have a unit test for this class. Please add a unit test for this class which ensure we are taking a lock on all the background services and a double buffer wait happens

We do have a unit test that verifies this and 2 tests that verify bootstrap lock co-ordinationi.e testBootstrapLockCoordination and testBootstrapLockBlocksMultipleServices

I have updated the patch to remove the purge snapshot condition as you suggested.

Should we also write a test for OmDBCheckpointServletInodeBasedXfer is actually using OMDbCheckpointServlet.Lock and not something else? Here we are creating a new instance of the lock

OMDBCheckpointServlet.Lock bootstrapLock = new OMDBCheckpointServlet.Lock(mockOM);

Copy link
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks @sadanand48 for the patch

@swamirishi
Copy link
Contributor

@sadanand48 Let us create a follow up jira for changing the test to initialize the lock from the Servlet object itself instead of creating a new instance. Here we are not testing whether the InodeBasedCheckpointServlet instance is actually using the correct implementation of BootstrapLock or not

@swamirishi swamirishi merged commit 4d6f3a5 into apache:master Nov 2, 2025
43 checks passed
@adoroszlai
Copy link
Contributor

@swamirishi When merging PRs, please remove co-author information if it's the same person with different email address.

Author: Sadanand Shenoy <sadanand.shenoy4898@...>
Date:   Sun Nov 2 19:49:07 2025 +0530

    HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside lock again. (#9131)
    
    Co-authored-by: Sadanand Shenoy <sadanand.shenoy@...>

Also, please set fix version when resolving Jira issue after PR merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants