
Conversation

Contributor

@adoroszlai adoroszlai commented Dec 8, 2021

What changes were proposed in this pull request?

HDDS-5962 increased integration test execution time by 20-30 minutes for each split.

With the current order of shutdown, server.awaitTermination takes 5 seconds in each case. No matter how much we increase the timeout (e.g. to several minutes), it always waits out the full specified time. It seems the event that awaitTermination expects does not happen if eventLoopGroup has already been shut down.

https://issues.apache.org/jira/browse/HDDS-6072
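
Roughly, the change moves the gRPC server shutdown ahead of the executor and event loop group teardown. A minimal sketch of the intended order (readExecutors and eventLoopGroup are simplified stand-ins here, not necessarily the exact XceiverServerGrpc fields):

  public void stop() {
    if (isStarted) {
      try {
        // Shut down the gRPC server while its Netty transport is still alive,
        // so the termination callbacks can fire and awaitTermination returns promptly.
        server.shutdown();
        server.awaitTermination(5, TimeUnit.SECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      // Only afterwards tear down the read executors and the Netty event loop group.
      readExecutors.forEach(ExecutorService::shutdown);
      eventLoopGroup.shutdownGracefully();
      isStarted = false;
    }
  }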

How was this patch tested?

Verified that XceiverServerGrpc.stop no longer takes 5 seconds.

Each split of the integration suite still takes ~10 minutes longer than before HDDS-5962, but at least it is 10-20 minutes quicker than on current master.

https://github.com/adoroszlai/hadoop-ozone/actions/runs/1554131459
https://github.com/adoroszlai/hadoop-ozone/actions/runs/1554416354

@adoroszlai adoroszlai self-assigned this Dec 8, 2021
  public void stop() {
    if (isStarted) {
      try {
        server.shutdown();
Contributor

@smengcl smengcl Dec 10, 2021


Basically this implies triggering server shutdown and readExecutors shutdown (roughly) simultaneously, rather than awaiting readExecutors shutdown first and then initiating server shutdown? If so I'm +1.

Contributor Author


Thanks @smengcl for the review. Initially I also suspected readExecutors, but it's not really the problem.

server.shutdown() triggers shutdown in the server's underlying Netty transport. The server's terminated flag is set via callbacks, and server.awaitTermination() waits for this flag to be set. It seems that this does not happen if eventLoopGroup has already been shut down.
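
In other words, the problematic ordering looks roughly like this (a simplified sketch of the pattern, not the exact code), and the last call always blocks for its full timeout because the callback that would set the terminated flag has nothing left to run on:

  // Sketch of the problematic order: the Netty transport's threads are already gone,
  // so the terminated flag is never set.
  eventLoopGroup.shutdownGracefully();
  server.shutdown();
  server.awaitTermination(5, TimeUnit.SECONDS);  // waits out the entire timeout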

This can be reproduced easily (on current master) by changing server.awaitTermination() to a few minutes instead of a few seconds. Then we can take a thread dump and see these threads stuck at:

"ForkJoinPool.commonPool-worker-8" #358 daemon prio=5 os_prio=31 tid=0x00007fda0dc72800 nid=0x31d03 in Object.wait() [0x0000700026cc8000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:460)
	at java.util.concurrent.TimeUnit.timedWait(TimeUnit.java:348)
	at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.awaitTermination(ServerImpl.java:319)
	- locked <0x00000007ac151340> (a java.lang.Object)
	at org.apache.hadoop.ozone.container.common.transport.server.XceiverServerGrpc.stop(XceiverServerGrpc.java:207)
	at org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.stop(OzoneContainer.java:329)

@adoroszlai adoroszlai changed the title from "HDDS-6072. Increased integration test execution time" to "HDDS-6072. Fix increased integration test execution time" Dec 11, 2021
Contributor

@smengcl smengcl left a comment


Thanks @adoroszlai. With this latest change I see a ~30 minute reduction in total CI run duration.

@adoroszlai adoroszlai merged commit b75ec9d into apache:master Dec 15, 2021
@adoroszlai adoroszlai deleted the HDDS-6072 branch December 15, 2021 06:49
@adoroszlai
Contributor Author

Thanks @smengcl for the review.
