Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.hdds.client.ReplicationType;
import org.apache.hadoop.hdds.protocol.DatanodeDetails;
import org.apache.hadoop.hdds.protocol.proto.HddsProtos;
import org.apache.hadoop.hdds.ratis.RatisHelper;
import org.apache.hadoop.hdds.scm.container.ContainerID;
import org.apache.hadoop.hdds.scm.container.ContainerInfo;
Expand Down Expand Up @@ -234,12 +235,14 @@ public static void waitForPipelineClose(List<Pipeline> pipelineList,

// wait for the pipeline to get destroyed in the datanodes
for (Pipeline pipeline : pipelineList) {
HddsProtos.PipelineID pipelineId = pipeline.getId().getProtobuf();
for (DatanodeDetails dn : pipeline.getNodes()) {
XceiverServerSpi server =
cluster.getHddsDatanodes().get(cluster.getHddsDatanodeIndex(dn))
.getDatanodeStateMachine().getContainer().getWriteChannel();
Assert.assertTrue(server instanceof XceiverServerRatis);
server.removeGroup(pipeline.getId().getProtobuf());
GenericTestUtils.waitFor(() -> !server.isExist(pipelineId),
100, 30_000);
Comment on lines -242 to +245
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanx @adoroszlai for the fix, I am able to repro this, changes seems fair enough.
Does it makes sense to removeGroup explicitly in case server.isExist(pipelineId) is true, rather than waiting? Might speed up test? We can ignore the exception from removeGroup in the wait and retry to counter the race condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ayushtkn for taking a look.

I'd rather avoid such "manual intervention" and let the test exercise production code paths.

I don't think keeping removeGroup would result in much speedup. If SCM and datanodes take too much time to execute regular flow, we could probably tweak executor intervals, etc.

It might be possible to convert it to a unit test with mocks, resulting in much bigger speedup by avoiding the mini cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes better to stay near the prod code. Thanx

}
}
}
Expand Down