-
Notifications
You must be signed in to change notification settings - Fork 592
HDDS-6072. Fix increased integration test execution time #2900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| public void stop() { | ||
| if (isStarted) { | ||
| try { | ||
| server.shutdown(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically this implies triggering server shutdown and readExecutors shutdown (roughly) simutaneously, rather than awaiting readExecutors shutdown first then initiate server shutdown? If so I'm +1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @smengcl for the review. Initially I also suspected readExecutors, but it's not really the problem.
server.shutdown() triggers shutdown in the server's underlying Netty transport. The server's terminated flag is set via callbacks, server.awaitTermination() waits for this flag to be set. It seems that this is not happening if eventLoopGroup has already been shut down.
This can be reproduced easily (on current master) by changing server.awaitTermination() to few minutes instead of few seconds. Then we can create thread dump and see these threads being stuck at:
"ForkJoinPool.commonPool-worker-8" #358 daemon prio=5 os_prio=31 tid=0x00007fda0dc72800 nid=0x31d03 in Object.wait() [0x0000700026cc8000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:460)
at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)
at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.awaitTermination(ServerImpl.java:319)
- locked <0x00000007ac151340> (a java.lang.Object)
at org.apache.hadoop.ozone.container.common.transport.server.XceiverServerGrpc.stop(XceiverServerGrpc.java:207)
at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.stop(OzoneContainer.java:329)
smengcl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adoroszlai . With this latest change I see a ~30min reduction in total CI run duration.
|
Thanks @smengcl for the review. |
What changes were proposed in this pull request?
HDDS-5962 increased integration test execution time by 20-30 minutes for each split.
With the current order of shutdown,
server.awaitTerminationtakes 5 seconds in each case. No matter how much we increase timeout (e.g. to several minutes), it always takes all specified time. It seems the event thatawaitTerminationexpects does not happen ifeventLoopGrouphas already been shutdown.https://issues.apache.org/jira/browse/HDDS-6072
How was this patch tested?
Verified that
XceiverServerGrpc.stopno longer takes 5 seconds.Each split of the integration suite still takes ~10 minutes more than before HDDS-5962, but at least they are 10-20 minutes quicker than on current
master.https://github.com/adoroszlai/hadoop-ozone/actions/runs/1554131459
https://github.com/adoroszlai/hadoop-ozone/actions/runs/1554416354