-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-9959. Propagate group remove to other datanodes during pipeline close #5827
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@sumitagrawl @DaveTeng0 @szetszwo Could you help to review if you have time? |
szetszwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivandika3 , thanks for working on this! Please see the comments inlined.
... There seems to be an unused methods in RatisPipelineUtils related to pipeline destroy (RatisPipelineUtils#destroyPipeline) ...
Sure, let's remove the unused methods.
| final Collection<RaftPeer> peers = ratisServer.getRaftPeersInPipeline(pipelineID); | ||
| final boolean shouldDeleteRatisLogDirectory = ratisServer.getShouldDeleteRatisLogDirectory(); | ||
| peers.stream() | ||
| .filter(peer -> peer.getId() != ratisServer.getServer().getId()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use equals(..) instead of !=.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thank you for catching it.
| // ignore silently since this means that the group has been closed by earlier close pipeline | ||
| // command in another datanode | ||
| } catch (IOException ioe) { | ||
| LOG.warn("Remove group failed for peer {}", peer.getId(), ioe); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's include the group id.
LOG.warn("Failed to remove group {} for peer {}", raftGroupId, peer.getId(), ioe);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. PTAL.
szetszwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 the change looks good.
|
Test failure in |
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @ivandika3 for the patch.
| /** | ||
| * Test cases to verify ClosePipelineCommandHandler. | ||
| */ | ||
| @ExtendWith(MockitoExtension.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MockitoExtension is unnecessary, there are no variables annotated as @Mock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Updated.
We might also need to remove this for TestCreatePipelineCommandHandler.
| @BeforeEach | ||
| public void setup() throws Exception { | ||
| conf = new OzoneConfiguration(); | ||
| ozoneContainer = Mockito.mock(OzoneContainer.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add static import for mock(), when(), any(), etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review. Updated.
|
Thanks @ivandika3 for the patch, @szetszwo for the review. |
|
@ivandika3 , TestECContainerRecovery is failing with GroupMismatchException. The failure seems to be related to this change. See @duongkame 's comment. |
|
@szetszwo Thanks for the info. Let me take a look. |
…close (apache#5827) (cherry picked from commit ac68072)
…close (apache#5827) (cherry picked from commit ac68072)
…close (apache#5827) (cherry picked from commit ac68072)
What changes were proposed in this pull request?
In https://issues.apache.org/jira/browse/RATIS-1947, it was found that there might be cases where RaftServer for Datanodes in the same pipeline are closed hours apart.
Furthermore, Ratis group remove operation is local to the Raft server and is not propagated to the other Raft peers in the same group. Therefore, datanodes that have not received the group remove operation will keep operating (e.g. sending RequestVote / AppendEntries RPCs), although the pipeline (Raft group) is supposed to be closed. The might affect the client communicating with the Raft peer in datanodes that have not been closed yet.
Therefore, similar to CreatePipelineCommandHandler, the first datanode that receives the close pipeline command needs to propagate the group remove command to the other datanodes (Raft peers) in the same pipeline. This will close the pipeline immediately on all the datanodes. The subsequent close pipeline commands will be ignored silently as the pipeline has been successfully closed.
Note: There seems to be an unused methods in
RatisPipelineUtilsrelated to pipeline destroy (RatisPipelineUtils#destroyPipeline) which were re-introduced in #538 although they were not really used. However, it seems the intent of the methods are for the SCM to destroy the pipeline directly, without going through the command queue. We can delete in this patch if needed.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9959
How was this patch tested?
Unit test for
ClosePipelineCommandHandler.CI run: https://github.com/ivandika3/ozone/actions/runs/7258608368/job/19774884832