@xichen01 xichen01 commented Dec 29, 2023

What changes were proposed in this pull request?

Fix unstable integration tests.

Repeated test runs uncovered several issues that can cause TestBlockDeletion to fail:

  1. hdds.datanode.block.delete.queue.limit defaults to 5, so tasks that cannot be added to a full queue may be discarded, which causes the test to time out. (Fixed by #5845, "HDDS-9976. Ozone StateContext Memory leak for DeleteBlocksCommand when queue is full".)

  2. The restartHddsDatanode call in TestBlockDeletion.testBlockDeletion can sometimes restart the DN before it has sent the DeleteBlockTransactionResult to the SCM.

    cluster.restartHddsDatanode(0, true);
    matchContainerTransactionIds();
    // Verify transactions committed
    GenericTestUtils.waitFor(() -> {
      try {
        verifyTransactionsCommitted();
        return true;
      } catch (Throwable t) {
        LOG.warn("Container closing failed", t);
        return false;
      }
    // ... (the remaining waitFor arguments are not shown in this excerpt)

  3. SCMBlockDeletingService#notifyStatusChanged may be executed several times, which changes serviceStatus from RUNNING to PAUSING and stops SCMBlockDeletingService from working. In the log below, SCM exits safe mode twice and notifies the service each time:

2023-12-28 10:42:20,496 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO  safemode.SCMSafeModeManager (SCMSafeModeManager.java:exitSafeMode(244)) - SCM exiting safe mode.  <<--- First
2023-12-28 10:42:20,496 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] INFO  ha.SCMContext (SCMContext.java:updateSafeModeStatus(230)) - Update SafeModeStatus from SafeModeStatus{safeModeStatus=true, preCheckPassed=true} to SafeModeStatus{safeModeStatus=false, preCheckPassed=true}.
//...
2023-12-28 10:42:20,497 [EventQueue-OpenPipelineForHealthyPipelineSafeModeRule] DEBUG ha.SCMServiceManager (SCMServiceManager.java:notifyStatusChanged(51)) - Notify service:SCMBlockDeletingService.
//...
2023-12-28 10:42:20,499 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO  safemode.SCMSafeModeManager (SCMSafeModeManager.java:validateSafeModeExitRules(215)) - ScmSafeModeManager, all rules are successfully validated
2023-12-28 10:42:20,499 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO  safemode.SCMSafeModeManager (SCMSafeModeManager.java:exitSafeMode(244)) - SCM exiting safe mode.   <<--- Second
2023-12-28 10:42:20,499 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] INFO  ha.SCMContext (SCMContext.java:updateSafeModeStatus(230)) - Update SafeModeStatus from SafeModeStatus{safeModeStatus=false, preCheckPassed=true} to SafeModeStatus{safeModeStatus=false, preCheckPassed=true}.
//...
2023-12-28 10:42:20,500 [EventQueue-PipelineReportForOneReplicaPipelineSafeModeRule] DEBUG ha.SCMServiceManager (SCMServiceManager.java:notifyStatusChanged(51)) - Notify service:SCMBlockDeletingService.
//...
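
To illustrate how a repeated notification could end up pausing the service, here is a minimal, self-contained sketch. It is not the actual SCMBlockDeletingService code: the class name, the method names, and the condition in `buggyNotify` are hypothetical and only model one way a second call with unchanged inputs could flip the status from RUNNING to PAUSING, together with an idempotent alternative.

```java
public class StatusToggleSketch {
    enum ServiceStatus { RUNNING, PAUSING }

    // Hypothetical non-idempotent handler (an assumption, not the real code):
    // it switches to RUNNING only when the service is not already running,
    // so a repeated call outside safe mode falls through to PAUSING.
    static ServiceStatus buggyNotify(ServiceStatus current, boolean inSafeMode) {
        if (!inSafeMode && current != ServiceStatus.RUNNING) {
            return ServiceStatus.RUNNING;
        }
        return ServiceStatus.PAUSING;
    }

    // Idempotent variant: the result depends only on the safe mode flag,
    // so calling it any number of times gives the same status.
    static ServiceStatus idempotentNotify(ServiceStatus current, boolean inSafeMode) {
        return inSafeMode ? ServiceStatus.PAUSING : ServiceStatus.RUNNING;
    }

    public static void main(String[] args) {
        ServiceStatus buggy = ServiceStatus.PAUSING;
        ServiceStatus fixed = ServiceStatus.PAUSING;
        // Two notifications after leaving safe mode, as in the log above.
        for (int i = 1; i <= 2; i++) {
            buggy = buggyNotify(buggy, false);
            fixed = idempotentNotify(fixed, false);
            System.out.println("call " + i + ": buggy=" + buggy + ", idempotent=" + fixed);
        }
    }
}
```

In this model the first notification moves both variants to RUNNING, and the second one moves only the non-idempotent variant back to PAUSING, which matches the symptom described in item 3.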

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9962

How was this patch tested?

Existing tests.
Two rounds of 25 * 15 test runs all passed:
https://github.com/xichen01/ozone/actions/runs/7353980620/attempts/1
https://github.com/xichen01/ozone/actions/runs/7353980620

@adoroszlai adoroszlai left a comment

Thanks @xichen01 for the fix, LGTM.

@adoroszlai adoroszlai merged commit 2aee970 into apache:master Dec 29, 2023
vtutrinov pushed a commit to vtutrinov/ozone that referenced this pull request May 6, 2024
swamirishi pushed a commit to swamirishi/ozone that referenced this pull request Dec 3, 2025