
Conversation

Contributor

@adoroszlai adoroszlai commented Dec 8, 2021

What changes were proposed in this pull request?

HDDS-5962 increased integration test execution time by 20-30 minutes for each split.

With the current order of shutdown, server.awaitTermination takes 5 seconds in each case. No matter how much we increase the timeout (e.g. to several minutes), it always waits out the full specified time. It seems the event that awaitTermination expects does not happen if eventLoopGroup has already been shut down.

https://issues.apache.org/jira/browse/HDDS-6072
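
Roughly, the change moves the gRPC server shutdown ahead of the executor and event loop group teardown. A minimal sketch of the intended order (readExecutors and eventLoopGroup are simplified stand-ins here, not necessarily the exact XceiverServerGrpc fields):

  public void stop() {
    if (isStarted) {
      try {
        // Shut down the gRPC server while its Netty transport is still alive,
        // so the termination callbacks can fire and awaitTermination returns promptly.
        server.shutdown();
        server.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      // Only afterwards tear down the read executors and the Netty event loop group.
      readExecutors.forEach(ExecutorService::shutdown);
      eventLoopGroup.shutdownGracefully();
      isStarted = false;
    }
  }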

How was this patch tested?

Verified that XceiverServerGrpc.stop no longer takes 5 seconds.

Each split of the integration suite still takes ~10 minutes longer than before HDDS-5962, but at least it is 10-20 minutes quicker than on current master.

https://github.com/adoroszlai/hadoop-ozone/actions/runs/1554131459
https://github.com/adoroszlai/hadoop-ozone/actions/runs/1554416354

@adoroszlai adoroszlai self-assigned this Dec 8, 2021
  public void stop() {
    if (isStarted) {
      try {
        server.shutdown();
Contributor

@smengcl smengcl Dec 10, 2021


Basically this implies triggering server shutdown and readExecutors shutdown (roughly) simultaneously, rather than awaiting readExecutors shutdown first and then initiating server shutdown? If so I'm +1.

Contributor Author


Thanks @smengcl for the review. Initially I also suspected readExecutors, but it's not really the problem.

server.shutdown() triggers shutdown in the server's underlying Netty transport. The server's terminated flag is set via callbacks, and server.awaitTermination() waits for this flag to be set. It seems that this does not happen if eventLoopGroup has already been shut down.
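
In other words, the problematic ordering looks roughly like this (a simplified sketch of the pattern, not the exact code), and the last call always blocks for its full timeout because the callback that would set the terminated flag has nothing left to run on:

  // Sketch of the problematic order: the Netty transport's threads are already gone,
  // so the terminated flag is never set.
  eventLoopGroup.shutdownGracefully();
  server.shutdown();
  server.awaitTermination(5, TimeUnit.SECONDS);  // waits out the entire timeout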

This can be reproduced easily (on current master) by changing server.awaitTermination() to a few minutes instead of a few seconds. Then we can take a thread dump and see these threads stuck at:

"ForkJoinPool.commonPool-worker-8" #358 daemon prio=5 os_prio=31 tid=0x00007fda0dc72800 nid=0x31d03 in Object.wait() [0x0000700026cc8000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:460)
	at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)
	at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.awaitTermination(ServerImpl.java:319)
	- locked <0x00000007ac151340> (a java.lang.Object)
	at org.apache.hadoop.ozone.container.common.transport.server.XceiverServerGrpc.stop(XceiverServerGrpc.java:207)
	at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.stop(OzoneContainer.java:329)

@adoroszlai adoroszlai changed the title from "HDDS-6072. Increased integration test execution time" to "HDDS-6072. Fix increased integration test execution time" Dec 11, 2021
Contributor

@smengcl smengcl left a comment


Thanks @adoroszlai. With this latest change I see a ~30 minute reduction in total CI run duration.

@adoroszlai adoroszlai merged commit b75ec9d into apache:master Dec 15, 2021
@adoroszlai adoroszlai deleted the HDDS-6072 branch December 15, 2021 06:49
@adoroszlai
Contributor Author

Thanks @smengcl for the review.
