Skip to content

Conversation

@bshashikant
Copy link
Contributor

@bshashikant bshashikant commented Mar 13, 2020

What changes were proposed in this pull request?

Ensure that create container transaction succeeds on all nodes of a ring before close container is issued

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-3154

How was this patch tested?

Was not able to reproduce the issue locally.

@bshashikant bshashikant self-assigned this Mar 13, 2020
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bshashikant, looks good so far: 20x passed. Since this test is pretty quick (around 1 minute), I launched another run with 50 repetitions, will update in about an hour.

@adoroszlai
Copy link
Contributor

There was only 1 failure out of 50 runs + 20x earlier. Seems to be better than previously, so I'm OK with it, although the root cause is probably still present.

@bshashikant
Copy link
Contributor Author

There was only 1 failure out of 50 runs + 20x earlier. Seems to be better than previously, so I'm OK with it, although the root cause is probably still present.

I think probably in an odd case, dns in the minicluster just doesn't seem to connect with each other running into grpc runtime exceptions thereby not able to process a client request.

@bshashikant bshashikant requested review from adoroszlai and removed request for aajisaka March 16, 2020 08:41
@adoroszlai adoroszlai merged commit d80eba3 into apache:master Mar 16, 2020
@xiaoyuyao
Copy link
Contributor

I also hit this intermittently even after HDDS-3154.

java.util.concurrent.ExecutionException: 
org.apache.ratis.protocol.RaftRetryFailureException: Failed RaftClientRequest:client-1578FC0265DA->d0082a59-cec4-4d0d-b5ec-17bb4baf1ccb@group-8EE1479002D7, cid=9, seq=3*, RW, cmdType: CloseContainer
traceID: ""
containerID: 1
datanodeUuid: "1c9bc0ef-9cc5-4deb-86e3-073d94f38b24"
closeContainer {
}
, data.size=0 for 15 attempts with RetryLimited(maxAttempts=15, sleepTime=1000ms)
	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
	at org.apache.hadoop.ozone.client.rpc.Test2WayCommitInRatis.test2WayCommitForRetryfailure(Test2WayCommitInRatis.java:178)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
Caused by: org.apache.ratis.protocol.RaftRetryFailureException: Failed RaftClientRequest:client-1578FC0265DA->d0082a59-cec4-4d0d-b5ec-17bb4baf1ccb@group-8EE1479002D7, cid=9, seq=3*, RW, cmdType: CloseContainer
traceID: ""
containerID: 1
datanodeUuid: "1c9bc0ef-9cc5-4deb-86e3-073d94f38b24"
closeContainer {
}
, data.size=0 for 15 attempts with RetryLimited(maxAttempts=15, sleepTime=1000ms)
	at org.apache.ratis.client.impl.RaftClientImpl.noMoreRetries(RaftClientImpl.java:330)
	at org.apache.ratis.client.impl.OrderedAsync.handleAsyncRetryFailure(OrderedAsync.java:151)
	at org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequest$10(OrderedAsync.java:250)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
	at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$timeoutCheck$5(GrpcClientProtocolClient.java:337)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:342)
	at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.timeoutCheck(GrpcClientProtocolClient.java:337)
	at org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$onNext$3(GrpcClientProtocolClient.java:330)
	at org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:141)
	at org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:155)
	at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:38)
	at org.apache.ratis.util.LogUtils$1.run(LogUtils.java:79)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.ratis.protocol.TimeoutIOException: Request #9 timeout 3s
	... 16 more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants