Skip to content

Conversation

@adoroszlai
Copy link
Contributor

@adoroszlai adoroszlai commented Dec 13, 2020

What changes were proposed in this pull request?

testWatchForCommitForGroupMismatchException may fail with Group ... not found in waitForPipelineClose. The method destroys pipelines via regular SCM -> datanodes path, but then also calls removeGroup for each member manually.

Only one of these can succeed, the group cannot be removed twice. If test code wins, we see this error in the output:

2020-12-11 11:59:14,841 [Command processor thread] ERROR commandhandler.ClosePipelineCommandHandler (ClosePipelineCommandHandler.java:handle(78)) - Can't close pipeline PipelineID=82ba59fa-dd35-4d4e-a61e-59b281828b58
java.io.IOException: 675d35ba-bb57-4ae2-b121-2cd47ec358f4: Group group-59B281828B58 not found.
	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.removeGroup(XceiverServerRatis.java:774)
	at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.handle(ClosePipelineCommandHandler.java:74)

But if ClosePipelineCommandHandler is first to remove, the test fails:

testWatchForCommitForGroupMismatchException(org.apache.hadoop.ozone.client.rpc.TestWatchForCommit)  Time elapsed: 39.399 s  <<< ERROR!
java.io.IOException: 14058016-3fcf-4920-991e-a5361899d5d7: Group group-59B281828B58 not found.
	at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.removeGroup(XceiverServerRatis.java:774)
	at org.apache.hadoop.ozone.container.TestHelper.waitForPipelineClose(TestHelper.java:242)
	at org.apache.hadoop.ozone.client.rpc.TestWatchForCommit.testWatchForCommitForGroupMismatchException(TestWatchForCommit.java:353)

https://issues.apache.org/jira/browse/HDDS-4013

How was this patch tested?

Ran TestWatchForCommit 50x - never failed, only timed out once:
https://github.com/adoroszlai/hadoop-ozone/runs/1544890660

Regular CI:
https://github.com/adoroszlai/hadoop-ozone/actions/runs/418502626

@adoroszlai adoroszlai self-assigned this Dec 13, 2020
Comment on lines -242 to +245
server.removeGroup(pipeline.getId().getProtobuf());
GenericTestUtils.waitFor(() -> !server.isExist(pipelineId),
100, 30_000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanx @adoroszlai for the fix, I am able to repro this, changes seems fair enough.
Does it makes sense to removeGroup explicitly in case server.isExist(pipelineId) is true, rather than waiting? Might speed up test? We can ignore the exception from removeGroup in the wait and retry to counter the race condition

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ayushtkn for taking a look.

I'd rather avoid such "manual intervention" and let the test exercise production code paths.

I don't think keeping removeGroup would result in much speedup. If SCM and datanodes take too much time to execute regular flow, we could probably tweak executor intervals, etc.

It might be possible to convert it to a unit test with mocks, resulting in much bigger speedup by avoiding the mini cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes better to stay near the prod code. Thanx

@adoroszlai adoroszlai merged commit 327c148 into apache:master Dec 14, 2020
@adoroszlai
Copy link
Contributor Author

Thanks @ayushtkn for the review.

@adoroszlai adoroszlai deleted the HDDS-4013 branch December 14, 2020 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants