Skip to content

Conversation

@devmadhuu
Copy link
Contributor

What changes were proposed in this pull request?

This PR fixes the intermittent failure of TestStorageContainerManager#testBlockDeletionTransactions test case.

This test case creates 5 new keys and 5 containers as well. Then close all 5 containers and creates the delete transaction log and immediately verifies the number of valid delete transactions is > 0 (delLog.getNumOfValidTransactions() > 0), But after sometime SCM Block Deleting service thread will send the DELETE BLOCK command to datanodes and commits the cleanup the delete block transaction log from DeletedBlocksTransaction table. So after few seconds this assertion of delLog.getNumOfValidTransactions() == 0 should pass. But there was an issue in test case. that after closing all containers, some containers were left OPEN due to which delete block transactions still kept lying in DeletedBlocksTransaction table even after 25 seconds. So this PR updates few configurations related to DN heartbeats interval and add a sleep to give sometime for containers to get closed.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8492

How was this patch tested?

This patch is tested by running repeated CI runs of multiple iterations (200 iterations in CI flaky workflow). Here is the green CI link.

@devmadhuu
Copy link
Contributor Author

@adoroszlai Pls review.

@SaketaChalamchala
Copy link
Contributor

@DaveTeng0 can you please take a look?

@adoroszlai
Copy link
Contributor

@xichen01 can you please review, too?

Copy link
Contributor

@xichen01 xichen01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmadhuu The change looks good. Just a few comment inlined.

cluster.getStorageContainerManager());
}

Thread.sleep(3000);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use this method to wait for the Container to close but need get the Container list first.

public static void waitForContainerClose(MiniOzoneCluster cluster,
Long... containerIdList)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xichen01 , After I updated configs for MiniOzoneCluster, this Thread.sleep is not needed for waiting to get containers closed. After removing the sleep, I re-ran 500 times repeated runs for this test class and all passwed 100%. Here is the green CI link.

Pls re-review.

// Wait for container report
Thread.sleep(1000);
for (OmKeyInfo keyInfo : keyLocations.values()) {
OzoneTestUtils.closeContainers(keyInfo.getKeyLocationVersions(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the other tests have similar logic (such as: testBlockDeletingThrottling, it close Container too). Is it possible to have the same bugs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this test also passes in repeated runs, I would not want to change anything in this test as of now. Kindly re-review.

@xichen01
Copy link
Contributor

@devmadhuu Thanks for the update. LGTM +1.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adoroszlai adoroszlai merged commit 1bd950f into apache:master Jan 17, 2024
@adoroszlai
Copy link
Contributor

Thanks @devmadhuu for the patch, @xichen01 for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants