HDDS-10875. XceiverRatisServer#getRaftPeersInPipeline should be called before XceiverRatisServer#removeGroup #6696
szetszwo
left a comment
@ivandika3, thanks for working on this! How about catching GroupMismatchException in the existing try-catch?
+++ b/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/commandhandler/ClosePipelineCommandHandler.java
@@ -105,7 +105,6 @@ public void handle(SCMCommand command, OzoneContainer ozoneContainer,
try {
XceiverServerSpi server = ozoneContainer.getWriteChannel();
if (server.isExist(pipelineIdProto)) {
- server.removeGroup(pipelineIdProto);
if (server instanceof XceiverServerRatis) {
// TODO: Refactor Ratis logic to XceiverServerRatis
// Propagate the group remove to the other Raft peers in the pipeline
@@ -127,12 +126,18 @@ public void handle(SCMCommand command, OzoneContainer ozoneContainer,
}
});
}
+ server.removeGroup(pipelineIdProto);
LOG.info("Close Pipeline {} command on datanode {}.", pipelineID,
dn.getUuidString());
} else {
LOG.debug("Ignoring close pipeline command for pipeline {} " +
"as it does not exist", pipelineID);
}
+ } catch (GroupMismatchException gme) {
+ // ignore silently since this means that the group has been closed by earlier close pipeline
+ // command in another datanode
+ LOG.debug("The Ratis group for the pipeline {} has been removed by earlier close pipeline command from " +
+ "other datanodes", pipelineID.getId());
} catch (IOException e) {
LOG.error("Can't close pipeline {}", pipelineID, e);
} finally {
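The catch ordering suggested in the diff above can be illustrated in isolation. This is a minimal sketch, not the handler's actual code: the nested GroupMismatchException class is a hypothetical stand-in for the Ratis exception, assumed here to be a subtype of IOException (as the catch order in the diff implies), and the Action interface and handle method are illustrative names only.

```java
import java.io.IOException;

public class CatchOrderSketch {
  // Hypothetical stand-in for the Ratis GroupMismatchException,
  // assumed to be a subtype of IOException.
  static class GroupMismatchException extends IOException {
    GroupMismatchException(String msg) {
      super(msg);
    }
  }

  // Illustrative functional interface for an action that may fail with an IOException.
  interface Action {
    void run() throws IOException;
  }

  // The more specific catch must come first; otherwise the IOException
  // clause would make the GroupMismatchException clause unreachable.
  static String handle(Action closePipeline) {
    try {
      closePipeline.run();
      return "closed";
    } catch (GroupMismatchException gme) {
      // Group was already removed elsewhere: treat as benign.
      return "already removed";
    } catch (IOException e) {
      // Any other I/O failure is a real error.
      return "error";
    }
  }

  public static void main(String[] args) {
    System.out.println(handle(() -> { }));
    System.out.println(handle(() -> {
      throw new GroupMismatchException("group not found");
    }));
    System.out.println(handle(() -> {
      throw new IOException("disk failure");
    }));
  }
}
```

A single try block with two catch clauses keeps the benign case at debug level while still reporting other failures, which is what makes the nested try-catch unnecessary.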
Thank you for the review @szetszwo.
Sure, updated. Was using a nested try-catch since
szetszwo
left a comment
+1 the change looks good.
In
RaftClientReply reply;
try {
  reply = server.groupManagement(request);
} catch (Exception e) {
  throw new IOException(e.getMessage(), e);
}
Currently, I'm using
Found Added a new
@szetszwo I have made some changes to the patch. Could you help review this again? Thank you.
} catch (GroupMismatchException gme) {
  // ignore silently since this means that the group has been closed by earlier close pipeline
  // command in another datanode
  LOG.debug("The group for pipeline {} on datanode {} has been removed by earlier close " +
      "pipeline command handled in another datanode", pipelineID, dn.getUuidString());
@ivandika3, since HddsClientUtils.containsException below will cover this case, let's remove catch (GroupMismatchException gme) {...}?
Thank you for the review. Updated.
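For readers following the thread: when an exception is rethrown as new IOException(e.getMessage(), e), a plain catch clause for the original type no longer matches, so the cause chain has to be inspected instead. The sketch below shows that idea only. The containsCause and wrap helpers and the nested GroupMismatchException class are hypothetical names for illustration; this is not the implementation of HddsClientUtils.containsException.

```java
import java.io.IOException;

public class CauseChainSketch {
  // Hypothetical stand-in for the Ratis GroupMismatchException.
  static class GroupMismatchException extends IOException {
    GroupMismatchException(String msg) {
      super(msg);
    }
  }

  // Hypothetical helper: walks the cause chain and reports whether any
  // link is an instance of the given type.
  static boolean containsCause(Throwable t, Class<? extends Throwable> type) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      if (type.isInstance(cur)) {
        return true;
      }
    }
    return false;
  }

  // Mirrors the wrapping pattern quoted in the thread.
  static IOException wrap(Exception e) {
    return new IOException(e.getMessage(), e);
  }

  public static void main(String[] args) {
    IOException wrapped = wrap(new GroupMismatchException("group not found"));
    // The wrapper itself is a plain IOException...
    System.out.println(wrapped instanceof GroupMismatchException);
    // ...but the original exception is still reachable through getCause().
    System.out.println(containsCause(wrapped, GroupMismatchException.class));
  }
}
```

Because a chain walk like this also matches an unwrapped exception, a check of this kind makes a separate catch clause for the same type redundant.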
+1 to the latest change.
Thank you for the reviews @szetszwo, merged.
…d before XceiverRatisServer#removeGroup (apache#6696)
…d before XceiverRatisServer#removeGroup (apache#6696) (cherry picked from commit 87c3945)
What changes were proposed in this pull request?
From the comment https://issues.apache.org/jira/browse/HDDS-10750?focusedCommentId=17847435&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17847435 in HDDS-10750, it was found that GroupMismatchException is thrown during the ClosePipelineCommandHandler.
This is because XceiverRatisServer#removeGroup is called before XceiverRatisServer#getRaftPeersInPipeline, which causes XceiverRatisServer#getRaftPeersInPipeline to throw GroupMismatchException when it tries to get the RaftServerProxy#getDivision, since the group has already been removed.
Therefore, we need to call XceiverRatisServer#getRaftPeersInPipeline before calling XceiverRatisServer#removeGroup.
This patch also catches the GroupMismatchException in case the group has already been removed by an earlier ClosePipelineCommandHandler on another datanode for the same pipeline. The datanode will also try to remove the Ratis group from the other datanodes (ignoring the GroupMismatchException) before removing its own Ratis group.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10875
How was this patch tested?
Clean CI: https://github.com/ivandika3/ozone/actions/runs/9145754999
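The ordering problem described under "What changes were proposed in this pull request?" can be reproduced with a toy model. This is a sketch under stated assumptions: ToyServer, its methods, the peer names, and the nested GroupMismatchException class are all hypothetical and only mimic the behaviour the description attributes to the real server, namely that looking up the peers of a group fails once the group has been removed.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RemoveOrderSketch {
  // Hypothetical stand-in for the Ratis GroupMismatchException.
  static class GroupMismatchException extends IOException {
    GroupMismatchException(String msg) {
      super(msg);
    }
  }

  // Toy model of a server that tracks the other peers of each group.
  static class ToyServer {
    private final Map<String, List<String>> groups = new HashMap<>();

    void addGroup(String id, List<String> peers) {
      groups.put(id, peers);
    }

    List<String> getPeers(String id) throws GroupMismatchException {
      List<String> peers = groups.get(id);
      if (peers == null) {
        throw new GroupMismatchException("group " + id + " not found");
      }
      return peers;
    }

    void removeGroup(String id) {
      groups.remove(id);
    }
  }

  // Runs the close sequence in the chosen order and reports the outcome.
  static String attempt(boolean peersFirst) {
    ToyServer server = new ToyServer();
    server.addGroup("pipeline-1", Arrays.asList("dn2", "dn3"));
    try {
      List<String> peers;
      if (peersFirst) {
        // Order after the patch: read the peers while the group still exists.
        peers = server.getPeers("pipeline-1");
        server.removeGroup("pipeline-1");
      } else {
        // Order before the patch: the lookup happens after the removal and fails.
        server.removeGroup("pipeline-1");
        peers = server.getPeers("pipeline-1");
      }
      return "notify " + peers;
    } catch (GroupMismatchException e) {
      return "failed: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    System.out.println(attempt(false)); // prints: failed: group pipeline-1 not found
    System.out.println(attempt(true));  // prints: notify [dn2, dn3]
  }
}
```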