HDDS-13772. Snapshot Paths to be re read from om checkpoint db inside lock again. #9131
swamirishi merged 14 commits into apache:master
… lock again. (cherry picked from commit 3650d00e8f49e668b936644df79d859664564d3f)
f0151a4 to
d0d6669
Pull Request Overview
This PR addresses a concurrency issue where snapshot paths were read from the live OM metadata manager instead of the checkpoint's metadata manager, potentially including stale or purged snapshots during checkpoint transfers. The fix ensures snapshot paths are re-read from the checkpoint database inside the lock to maintain consistency.
- Re-reads snapshot paths from checkpoint metadata manager instead of live metadata manager
- Adds proper resource cleanup for the checkpoint metadata manager
- Changes method visibility from private to package-private for testing
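A minimal sketch of the idea behind this fix, assuming a toy in-memory store rather than the real OM classes: `LiveStore`, `freeze()`, `pathsToTransfer()`, and `BOOTSTRAP_LOCK` are hypothetical names used only for illustration. It contrasts a path list read from live state before the lock with one read from the frozen checkpoint copy inside the lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.locks.ReentrantLock;

/** Illustration only: none of these names come from the Ozone code base. */
final class CheckpointSnapshotPathsSketch {

  /** Hypothetical live store: snapshot name -> directory path. */
  static final class LiveStore {
    private final Map<String, String> snapshots = new TreeMap<>();

    synchronized void put(String name, String path) {
      snapshots.put(name, path);
    }

    /** Models a background purge that removes a snapshot entry. */
    synchronized void purge(String name) {
      snapshots.remove(name);
    }

    /** Models taking a checkpoint: an immutable-by-convention copy. */
    synchronized Map<String, String> freeze() {
      return new TreeMap<>(snapshots);
    }

    /** Reads paths from live state; the result can go stale. */
    synchronized List<String> paths() {
      return new ArrayList<>(snapshots.values());
    }
  }

  /** Hypothetical stand-in for the bootstrap lock. */
  static final ReentrantLock BOOTSTRAP_LOCK = new ReentrantLock();

  /** Takes the checkpoint under the lock and lists paths from the checkpoint itself. */
  static List<String> pathsToTransfer(LiveStore live) {
    BOOTSTRAP_LOCK.lock();
    try {
      Map<String, String> checkpoint = live.freeze();
      return new ArrayList<>(checkpoint.values());
    } finally {
      BOOTSTRAP_LOCK.unlock();
    }
  }

  /** Returns the stale and the re-read path lists, joined for easy inspection. */
  static String demo() {
    LiveStore live = new LiveStore();
    live.put("snapshot1", "/snaps/s1");
    live.put("snapshot2", "/snaps/s2");
    List<String> stale = live.paths();          // read before taking the lock
    live.purge("snapshot2");                    // purge lands before the checkpoint
    List<String> fresh = pathsToTransfer(live); // re-read from the checkpoint copy
    return stale + " vs " + fresh;
  }
}
```

In this toy model the list read before the lock still names the purged snapshot, while the list read from the checkpoint copy does not, which is the inconsistency the PR overview describes.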
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| OMDBCheckpointServletInodeBasedXfer.java | Adds code to re-read snapshot paths from checkpoint metadata manager and changes method visibility for testing |
| TestOMDbCheckpointServletInodeBasedXfer.java | Adds unit test to verify snapshot paths are correctly read from checkpoint after purge scenarios |
swamirishi left a comment
@sadanand48 thanks for the patch. I left a few review comments.
Thanks @sadanand48. Could you rebase the patch?
swamirishi left a comment
@sadanand48 there are still some minor nitpicky changes needed on this patch. Once they are addressed we can merge this.
 * 5. Servlet processes checkpoint - should still include S2 data
 */
@Test
public void testCheckpointIncludesSnapshotsFromFrozenState() throws Exception {
nit: Do we need a mini ozone cluster test case for this? I believe a unit test would be enough here instead of a full mini ozone cluster test. We can think about moving this test into a unit test later.
swamirishi left a comment
This needs a few more changes.
when(spyDbStore.getCheckpoint(true)).thenAnswer(invocation -> {
  DBCheckpoint checkpoint = spy(dbStore.getCheckpoint(true));
  doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
  capturedCheckpoint.set(checkpoint);
  return checkpoint;
});
Suggested change:
when(spyDbStore.getCheckpoint(eq(true))).thenAnswer(invocation -> {
  client.getObjectStore().deleteSnapshot(volumeName, bucketName, "snapshot2");
  client.getObjectStore().createSnapshot(volumeName, bucketName, "snapshot3");
  // wait for snapshot2 to get purged and snapshot3 to get created.
  DBCheckpoint checkpoint = spy(Mockito.callRealMethod());
  doNothing().when(checkpoint).cleanupCheckpoint(); // Don't cleanup for verification
  capturedCheckpoint.set(checkpoint);
  return checkpoint;
});
@sadanand48 Let us purge the snapshot just before we take the checkpoint, which means we should purge the snapshot in this block.
The checkpoint is taken after the bootstrap lock is acquired, so once the bootstrap lock is acquired the purge won't succeed in this block, because SDS needs to acquire this lock to do the purge. We need to do it outside the lock itself.
Can we do the bootstrap on the follower OM? The bootstrap lock won't mean anything when this runs on the follower. To get into this condition, maybe we should mock the bootstrap lock to be a no-op here.
Actually, what we should do is this: during BootstrapLock, pause the double buffer thread, delete the snapshot, and acquire the bootstrap lock. Then, just before the checkpoint, unpause the double buffer thread and let the snapshot purge get flushed. That would do it.
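The pause/unpause sequence proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PausableFlusherSketch` and all of its methods are hypothetical and are not the OM double buffer API. A queued purge stays unapplied while the flusher is paused and becomes visible once it is resumed and the flush is awaited.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical pausable flusher; not the OM double buffer implementation. */
final class PausableFlusherSketch {
  private final Queue<Runnable> pending = new ArrayDeque<>();
  private boolean paused;
  private boolean stopped;

  /** Starts the background thread that applies queued operations in order. */
  void start() {
    Thread worker = new Thread(this::flushLoop, "flusher");
    worker.setDaemon(true);
    worker.start();
  }

  synchronized void submit(Runnable op) {
    pending.add(op);
    notifyAll();
  }

  synchronized void pause() {
    paused = true;
  }

  synchronized void resume() {
    paused = false;
    notifyAll();
  }

  synchronized void stop() {
    stopped = true;
    notifyAll();
  }

  /** Blocks until every queued operation has been applied. */
  synchronized void awaitFlush() throws InterruptedException {
    while (!pending.isEmpty()) {
      wait();
    }
  }

  private synchronized void flushLoop() {
    try {
      while (!stopped) {
        if (paused || pending.isEmpty()) {
          wait();               // releases the monitor until notified
        } else {
          pending.poll().run(); // apply one queued operation
          notifyAll();          // wake any thread blocked in awaitFlush()
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Returns "visible while paused, visible after resume" for a queued purge. */
  static String demo() {
    Set<String> snapshots = ConcurrentHashMap.newKeySet();
    snapshots.add("snapshot2");
    PausableFlusherSketch flusher = new PausableFlusherSketch();
    flusher.start();
    flusher.pause();
    flusher.submit(() -> snapshots.remove("snapshot2")); // purge is queued only
    boolean visibleWhilePaused = snapshots.contains("snapshot2");
    flusher.resume();
    try {
      flusher.awaitFlush(); // purge is flushed before the "checkpoint" step
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    boolean visibleAfterResume = snapshots.contains("snapshot2");
    flusher.stop();
    return visibleWhilePaused + "," + visibleAfterResume;
  }
}
```

A test built this way controls exactly when the purge lands relative to the checkpoint, instead of relying on timing.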
when(requestMock.getParameter(OZONE_DB_CHECKPOINT_INCLUDE_SNAPSHOT_DATA)).thenReturn("true");
// custom lock because the original lock waits for double buffer flush.
BootstrapStateHandler.Lock customLock = new BootstrapStateHandler.Lock() {
  private final List<BootstrapStateHandler.Lock> serviceLocks;
If that is the case, I don't think the purge snapshot race condition is even possible. Then we should just deal with the create snapshot race condition.
private final Daemon daemon;
/** Is the {@link #daemon} running? */
private final AtomicBoolean isRunning = new AtomicBoolean(false);
private final AtomicBoolean isPaused = new AtomicBoolean(false);
Why can't we use pauseDeamon() and resume() again?
Just saw the code; we cannot use that.
swamirishi left a comment
LGTM. I believe we can remove the race condition test here and maybe add a test case to check whether the bootstrapHandler lock actually waits for the double buffer flush. I see that we don't have a unit test for this class. Please add a unit test for this class which ensures that we take a lock on all the background services and that a double buffer wait happens.
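A rough sketch of what such a check might look like, assuming a simplified composite lock: `CompositeLock`, its constructor arguments, and the `awaitFlush` callback are hypothetical and do not mirror the actual `BootstrapStateHandler.Lock` signature. The check records whether every service lock is already held when the flush wait runs, and whether all locks are released afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

/** Illustration only: a simplified composite lock, not the Ozone implementation. */
final class CompositeBootstrapLockSketch {

  /** Acquires every service lock, then waits for pending writes to flush. */
  static final class CompositeLock implements AutoCloseable {
    private final List<ReentrantLock> serviceLocks;
    private final Runnable awaitFlush;

    CompositeLock(List<ReentrantLock> serviceLocks, Runnable awaitFlush) {
      this.serviceLocks = serviceLocks;
      this.awaitFlush = awaitFlush;
    }

    CompositeLock lock() {
      for (ReentrantLock serviceLock : serviceLocks) {
        serviceLock.lock();
      }
      awaitFlush.run(); // hypothetical hook standing in for the double buffer wait
      return this;
    }

    @Override
    public void close() {
      // Release in reverse acquisition order.
      for (int i = serviceLocks.size() - 1; i >= 0; i--) {
        serviceLocks.get(i).unlock();
      }
    }
  }

  /** Records what happened while locking and unlocking, joined with ';'. */
  static String demo() {
    List<ReentrantLock> locks =
        List.of(new ReentrantLock(), new ReentrantLock(), new ReentrantLock());
    List<String> events = new ArrayList<>();
    Runnable awaitFlush = () -> {
      boolean allHeld = locks.stream().allMatch(ReentrantLock::isHeldByCurrentThread);
      events.add("flush-awaited allHeld=" + allHeld);
    };
    try (CompositeLock held = new CompositeLock(locks, awaitFlush).lock()) {
      events.add("in-critical-section");
    }
    boolean anyHeld = locks.stream().anyMatch(ReentrantLock::isLocked);
    events.add("released=" + !anyHeld);
    return String.join(";", events);
  }
}
```

Injecting the flush wait as a callback keeps the ordering assertion deterministic without needing a running cluster.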
We do have a unit test that verifies this, and 2 tests that verify bootstrap lock coordination, i.e. testBootstrapLockCoordination and testBootstrapLockBlocksMultipleServices. I have updated the patch to remove the purge snapshot condition as you suggested.
Should we also write a test that checks whether OmDBCheckpointServletInodeBasedXfer is actually using OMDbCheckpointServlet.Lock and not something else? Here we are creating a new instance of the lock.
swamirishi left a comment
LGTM, thanks @sadanand48 for the patch.
@sadanand48 Let us create a follow-up Jira for changing the test to initialize the lock from the Servlet object itself instead of creating a new instance. Here we are not testing whether the InodeBasedCheckpointServlet instance is actually using the correct implementation of BootstrapLock.
@swamirishi When merging PRs, please remove co-author information if it's the same person with a different email address. Also, please set the fix version when resolving the Jira issue after the PR is merged.
What changes were proposed in this pull request?
Snapshot paths are re-read inside the lock from the checkpoint DB's metadataManager instance, instead of from the live OM metadata manager.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-13772
How was this patch tested?
Unit test.