
Conversation

@aswinshakil (Member)

What changes were proposed in this pull request?

Add tests for SnapshotDeletingService to cover all the Snapshot GC code paths.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8207

How was this patch tested?

This is a test change.

@aswinshakil aswinshakil added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Apr 14, 2023
@aswinshakil aswinshakil self-assigned this Apr 14, 2023
@aswinshakil aswinshakil marked this pull request as draft April 14, 2023 20:30
@aswinshakil aswinshakil marked this pull request as ready for review April 18, 2023 18:16
@GeorgeJahad (Contributor)

It looks like checkDirReclaimable() returns true if the dir is reclaimable, but checkKeyReclaimable() returns false if the key is reclaimable. Is that correct?

private boolean checkDirReclaimable(
    Table.KeyValue<String, OmKeyInfo> deletedDir,
    Table<String, OmDirectoryInfo> previousDirTable) throws IOException {
  if (previousDirTable == null) {
    return true;
  }

private boolean checkKeyReclaimable(
    Table<String, OmKeyInfo> previousKeyTable,
    Table<String, String> renamedKeyTable,
    OmKeyInfo deletedKeyInfo, OmBucketInfo bucketInfo,
    long volumeId, HddsProtos.KeyValue.Builder renamedKeyBuilder)
    throws IOException {
  String dbKey;
  // Handle case when the deleted snapshot is the first snapshot.
  if (previousKeyTable == null) {
    return false;
  }

@GeorgeJahad (Contributor)

GeorgeJahad commented Apr 19, 2023

Do we need to change the snapshotInfo status to SNAPSHOT_RECLAIMED here?

or maybe here:

if (checkSnapshotReclaimable(snapshotDeletedTable,
    snapshotDeletedDirTable, snapshotBucketKey, dbBucketKeyForDir)) {
  purgeSnapshotKeys.add(snapInfo.getTableKey());

@GeorgeJahad (Contributor)

I'd like to see unit tests that exercise all the if/else paths in the following methods:

checkSnapshotReclaimable
checkDirReclaimable
checkKeyReclaimable
handleDirectoryCleanUp

In addition, I'd like a unit test that confirms that we are correctly starting and stopping within the bucket scope.

@aswinshakil (Member Author)

It looks like checkDirReclaimable() returns true if the dir is reclaimable, but checkKeyReclaimable() returns false if the key is reclaimable. Is that correct?

Yes, you are correct. I'm going to make this consistent, either in this patch or in a follow-up.
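For illustration, a consistent "true means reclaimable" convention could look like the following sketch. Plain `Map`s stand in for Ozone's `Table` interfaces, and the method names are hypothetical, not the actual patch:

```java
import java.util.Map;

public class ReclaimCheckSketch {
  // True if the deleted dir is reclaimable: no previous snapshot still tracks it.
  static boolean isDirReclaimable(String dirKey, Map<String, String> previousDirTable) {
    if (previousDirTable == null) {
      return true; // deleted snapshot is the first one; nothing earlier references the dir
    }
    return !previousDirTable.containsKey(dirKey);
  }

  // Same convention for keys: true also means reclaimable here.
  static boolean isKeyReclaimable(String dbKey, Map<String, String> previousKeyTable) {
    if (previousKeyTable == null) {
      return true; // deleted snapshot is the first one
    }
    return !previousKeyTable.containsKey(dbKey);
  }

  public static void main(String[] args) {
    System.out.println(isDirReclaimable("/vol/buck/dir1", null));   // true
    System.out.println(isKeyReclaimable("/vol/buck/key1",
        Map.of("/vol/buck/key1", "keyInfo")));                      // false
  }
}
```

With both helpers answering the same question the same way, callers no longer have to remember which method inverts the result.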

@aswinshakil (Member Author)

Do we need to change the snapshotInfo status to SNAPSHOT_RECLAIMED here?

Do we need to store the intermediate result here? The snapshot is removed from the snapshotInfoTable immediately in the next step.

I'd like to see unit tests that exercise all the if/else paths in the following methods:

checkSnapshotReclaimable
checkDirReclaimable
checkKeyReclaimable
handleDirectoryCleanUp

In addition, I'd like a unit test that confirms that we are correctly starting and stopping within the bucket scope.

Sure, I will add more tests to this patch.

@GeorgeJahad (Contributor)

Do we need to store the intermediate result here? The snapshot is removed from the snapshotInfoTable immediately in the next step.

That is a good question. So maybe we can remove the RECLAIMED status from the enum, if it isn't used anywhere?

@GeorgeJahad (Contributor)

Yes, you are correct. I'm going to make this consistent, either in this patch or in a follow-up.

Thanks, I found it very confusing.

@aswinshakil aswinshakil changed the title HDDS-8207. [Snapshot] Add tests for SnapshotDeletingService. HDDS-8207. [Snapshot] Fix bugs and add tests for SnapshotDeletingService. May 17, 2023
@aswinshakil aswinshakil requested review from GeorgeJahad and smengcl and removed request for GeorgeJahad May 17, 2023 16:42
Comment on lines 124 to 133
// If both next global and path snapshot are same, it may overwrite
// nextPathSnapInfo.setPathPreviousSnapshotID(), adding this check
// will prevent it.
if (nextGlobalSnapInfo != null && nextGlobalSnapInfo.getSnapshotID()
    .equals(nextPathSnapInfo.getSnapshotID())) {
  nextPathSnapInfo.setGlobalPreviousSnapshotID(
      snapInfo.getPathPreviousSnapshotID());
  metadataManager.getSnapshotInfoTable().putWithBatch(batchOperation,
      nextPathSnapInfo.getTableKey(), nextPathSnapInfo);
} else if (nextGlobalSnapInfo != null) {
Contributor

In the case described in the comment, we need to update both nextPathSnapInfo and nextGlobalSnapInfo, right? If so, we shouldn't use else if, in my opinion.

Member Author

The reason for using else if is that nextGlobalSnapInfo can be null when we are deleting the last snapshot, so we need that check.
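The three-way branch can be sketched in isolation like this (snapshot IDs as strings stand in for `SnapshotInfo` records; the method name and return labels are illustrative only, not the actual code):

```java
public class SnapshotChainSketch {
  // Models the pointer-update decision when a snapshot is deleted.
  static String updatePointers(String nextGlobalId, String nextPathId) {
    if (nextGlobalId != null && nextGlobalId.equals(nextPathId)) {
      return "update-merged"; // same record: one write must carry both pointer updates
    } else if (nextGlobalId != null) {
      return "update-both";   // distinct records: update each separately
    }
    return "no-op";           // deleting the last snapshot: nothing to re-point
  }

  public static void main(String[] args) {
    System.out.println(updatePointers(null, "s1")); // no-op (last snapshot)
    System.out.println(updatePointers("s1", "s1")); // update-merged
    System.out.println(updatePointers("s2", "s1")); // update-both
  }
}
```

The null guard in the else if is what keeps the last-snapshot case from dereferencing a missing record.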


public static void checkSnapshotActive(SnapshotInfo snapInfo)
public static void checkSnapshotActive(SnapshotInfo snapInfo,
    boolean override)
Contributor

Suggested change
boolean override)
boolean skipCheck)

Contributor

Though it sounds a bit weird for the check method to take a parameter that skips the check. If we fix the callers instead, how many places would we have to change? If not many, we should do that instead.

Member Author

I initially considered this, but it looks like we would have to change it in many places. I kept getting exceptions from multiple requests.

Comment on lines +147 to +149
BucketArgs bucketArgs = new BucketArgs.Builder()
    .setBucketLayout(BucketLayout.LEGACY)
    .build();
Contributor

This should work for all 3 bucket types? Parameterize this later?

Member Author

There is a separate test for FSO. Maybe we can parameterize it later. Anyway, I'm planning to add the additional tests George requested in a separate JIRA.

@smengcl (Contributor) left a comment

Thanks @aswinshakil for the fixes and new tests. Looks good mostly.

@smengcl (Contributor)

smengcl commented May 24, 2023

Looks like it's timing out on a relevant test:

https://github.com/apache/ozone/actions/runs/5060501663/jobs/9088775175

 Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 209.808 s <<< FAILURE! - in org.apache.hadoop.fs.ozone.TestSnapshotDeletingService
Error:  org.apache.hadoop.fs.ozone.TestSnapshotDeletingService.testSnapshotWithFSO  Time elapsed: 158.164 s  <<< ERROR!
java.util.concurrent.TimeoutException: 
Timed out waiting for condition. Thread diagnostics:
Timestamp: 2023-05-24 12:20:12,141

"ContainerOp-2104f031-94b1-4091-a926-8a477d189e0d-9"  prio=5 tid=493 in Object.wait()
java.lang.Thread.State: WAITING (on object monitor)
        at sun.misc.Unsafe.park(Native Method)

This could be related to the bug @hemantk-12 just pointed out:

#4244 (comment)

@aswinshakil (Member Author)

Thanks @smengcl and @hemantk-12 for pointing this out; I have fixed it. The test doesn't look like it's failing on that bug, though. I checked the log, and it is failing on checkpoint creation.

2023-05-23 19:10:33,535 [OMDoubleBufferFlushThread] ERROR db.RDBCheckpointManager (RDBCheckpointManager.java:createCheckpoint(101)) - Unable to create RocksDB Snapshot.
java.io.IOException: org.apache.hadoop.hdds.utils.db.RocksDatabase$RocksCheckpoint@36eabcd1: Failed to createCheckpoint /home/runner/work/ozone/ozone/hadoop-ozone/integration-test/target/test-dir/MiniOzoneClusterImpl-f735872c-86ff-4001-aead-24a4ea2dd355/ozone-meta/db.snapshots/checkpointState/om.db-c9c89178-1f5c-4a38-981e-7730888542d5; status : InvalidArgument; message : Directory exists
	at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:588)
	at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:88)
	at org.apache.hadoop.hdds.utils.db.RocksDatabase$RocksCheckpoint.createCheckpoint(RocksDatabase.java:239)
	at org.apache.hadoop.hdds.utils.db.RDBCheckpointManager.createCheckpoint(RDBCheckpointManager.java:83)
	at org.apache.hadoop.hdds.utils.db.RDBStore.getSnapshot(RDBStore.java:325)
	at org.apache.hadoop.ozone.om.OmSnapshotManager.createOmSnapshotCheckpoint(OmSnapshotManager.java:392)
	at org.apache.hadoop.ozone.om.response.snapshot.OMSnapshotCreateResponse.addToDBBatch(OMSnapshotCreateResponse.java:81)
	at org.apache.hadoop.ozone.om.response.OMClientResponse.checkAndUpdateDB(OMClientResponse.java:73)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$5(OzoneManagerDoubleBuffer.java:414)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatchWithTrace(OzoneManagerDoubleBuffer.java:237)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.addToBatch(OzoneManagerDoubleBuffer.java:412)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushBatch(OzoneManagerDoubleBuffer.java:333)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushCurrentBuffer(OzoneManagerDoubleBuffer.java:312)
	at org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:277)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.rocksdb.RocksDBException: Directory exists
	at org.rocksdb.Checkpoint.createCheckpoint(Native Method)
	at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51)
	at org.apache.hadoop.hdds.utils.db.RocksDatabase$RocksCheckpoint.createCheckpoint(RocksDatabase.java:236)
	... 12 more

I'm not sure why the directory exists in the first place. Checking this out.

@prashantpogde (Contributor)

Checkpoint creation is not an idempotent operation; it can also fail during log replay. We should figure this out, but we need to make sure the checkpoint directory doesn't already exist. If it does exist, the checkpoint must have been created in a previous incarnation of log replay.
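One way to tolerate replay is to treat an existing checkpoint directory as success. This is a hedged sketch only: a plain directory creation stands in for RocksDB's native `Checkpoint.createCheckpoint` call, and `IdempotentCheckpoint` is a hypothetical name, not the actual fix:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IdempotentCheckpoint {
  static void createCheckpoint(Path checkpointDir) throws IOException {
    if (Files.exists(checkpointDir)) {
      // A leftover directory means a previous incarnation of log replay
      // already materialized this checkpoint; treat the call as a no-op
      // instead of surfacing "Directory exists".
      return;
    }
    // The real code would invoke RocksDB's native checkpoint here;
    // creating the directory stands in for that side effect.
    Files.createDirectories(checkpointDir);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("om-cp-test").resolve("cp1");
    createCheckpoint(dir);
    createCheckpoint(dir); // replayed call must not throw
    System.out.println(Files.exists(dir)); // true
  }
}
```

Whether skipping is actually safe depends on whether a partially written checkpoint directory can be trusted, which is the question the reviewers are raising.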

@smengcl (Contributor)

smengcl commented May 25, 2023

The test timeout issue seems fixed now. Thanks @aswinshakil .

Some test failures still look relevant (flaky?):

https://github.com/apache/ozone/actions/runs/5075772329/jobs/9117315098

[INFO] Running org.apache.hadoop.fs.ozone.TestRootedOzoneFileSystem
Error:  Tests run: 245, Failures: 3, Errors: 0, Skipped: 9, Time elapsed: 381.878 s <<< FAILURE! - in org.apache.hadoop.fs.ozone.TestRootedOzoneFileSystem
Error:  TestRootedOzoneFileSystem.testSnapshotRead  Time elapsed: 0.15 s  <<< FAILURE!
java.lang.AssertionError: Failed to read/list on snapshotPath, exception: java.io.FileNotFoundException: Unable to load snapshot. Snapshot directory for snapshot '/volume54269/bucket05827/snap1' does not exists.
	at org.junit.Assert.fail(Assert.java:89)
	at org.apache.hadoop.fs.ozone.TestRootedOzoneFileSystem.testSnapshotRead(TestRootedOzoneFileSystem.java:2450)
...
[INFO] Running org.apache.hadoop.fs.ozone.TestRootedOzoneFileSystemWithFSO
Error:  Tests run: 104, Failures: 1, Errors: 0, Skipped: 16, Time elapsed: 152.944 s <<< FAILURE! - in org.apache.hadoop.fs.ozone.TestRootedOzoneFileSystemWithFSO
Error:  TestRootedOzoneFileSystemWithFSO.testSnapshotRead  Time elapsed: 0.126 s  <<< FAILURE!
java.lang.AssertionError: Failed to read/list on snapshotPath, exception: java.io.FileNotFoundException: Unable to load snapshot. Snapshot directory for snapshot '/volume22951/bucket62128/snap1' does not exists.
	at org.junit.Assert.fail(Assert.java:89)
	at org.apache.hadoop.fs.ozone.TestRootedOzoneFileSystem.testSnapshotRead(TestRootedOzoneFileSystem.java:2450)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
...

This might be irrelevant:

Error:  Tests run: 32, Failures: 0, Errors: 1, Skipped: 4, Time elapsed: 51.412 s <<< FAILURE! - in org.apache.hadoop.fs.ozone.contract.ITestOzoneContractCreate
Error:  ITestOzoneContractCreate.testSyncable  Time elapsed: 0.308 s  <<< ERROR!
java.io.IOException: Inconsistent read for blockID=conID: 1 locID: 111677748019200007 bcsId: 0 length=2 position=1 numBytesToRead=1 numBytesRead=-1
	at org.apache.hadoop.ozone.client.io.KeyInputStream.checkPartBytesRead(KeyInputStream.java:176)
...

@aswinshakil (Member Author)

Thanks for pointing this out, @smengcl. I updated the PR.


assertTableRowCount(snapshotInfoTable, 2);
GenericTestUtils.waitFor(() -> snapshotDeletingService
    .getSuccessfulRunCount() >= 1, 1000, 10000);
Contributor

nit: indentation. Can fix this in the other PR: https://github.com/apache/ozone/pull/4571/files#r1199182257

Suggested change
.getSuccessfulRunCount() >= 1, 1000, 10000);
.getSuccessfulRunCount() >= 1, 1000, 10000);

@smengcl smengcl merged commit 8e5b4bc into apache:master May 25, 2023
@smengcl (Contributor)

smengcl commented May 25, 2023

Thanks @aswinshakil for the fixes and tests. Thanks @GeorgeJahad for reviewing this as well.
