
Conversation

@redsanket commented Aug 21, 2018

What changes were proposed in this pull request?

Description:
Right now, the default number of server-side Netty handler threads is 2 * the number of cores, and it can be further configured with the parameter spark.shuffle.io.serverThreads.
Processing a client request requires one available server Netty handler thread.
However, when the server Netty handler threads start to process ChunkFetchRequests, they become blocked on disk I/O, mostly due to disk contention from the random read operations initiated by all the ChunkFetchRequests received from clients.
As a result, when the shuffle server is serving many concurrent ChunkFetchRequests, the server-side Netty handler threads can all be blocked on reading shuffle files, leaving no handler thread available to process the other types of requests, which should all be very quick to process.

This issue can be fixed by limiting the number of Netty handler threads that can get blocked when processing ChunkFetchRequest. We have a patch that does this by using a separate EventLoopGroup with a dedicated ChannelHandler to process ChunkFetchRequest. This enables the shuffle server to reserve Netty handler threads for non-ChunkFetchRequest messages, thus enabling consistent processing time for these requests, which are fast to process. After deploying the patch in our infrastructure, we no longer see timeout issues with either executor registration with the local shuffle server or the shuffle client establishing a connection with the remote shuffle server.

For the original PR, please refer to
#21402

How was this patch tested?

Unit tests and stress testing.


@redsanket (Author) commented Aug 21, 2018

@tgravescs @vanzin @Victsm please review thanks

@tgravescs (Contributor) commented:

ok to test

@vanzin (Contributor) commented Aug 21, 2018

@SparkQA commented Aug 21, 2018

Test build #95042 has finished for PR 22173 at commit 3bab74c.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.addLast("handler", channelHandler);
.addLast("handler", channelHandler)
// Use a separate EventLoopGroup to handle ChunkFetchRequest messages.
.addLast(chunkFetchWorkers, "chunkFetchHandler", chunkFetchHandler);
@vanzin (Contributor) commented Aug 22, 2018:

Hmm... I think there is some waste here. Not all channels actually need the chunk fetch handler. Basically only the shuffle server (external or not) does. So for all other cases - RpcEnv server and clients, shuffle clients - you'd have this new thread pool just sitting there.

It would be good to avoid that.

@redsanket (Author) replied:

Yes, I did notice that... makes sense.

@SparkQA commented Aug 27, 2018

Test build #95289 has finished for PR 22173 at commit cc40d9b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 27, 2018

Test build #95293 has finished for PR 22173 at commit 6580ff1.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

complete expansion of imports

rearrange imports
@SparkQA commented Aug 27, 2018

Test build #95299 has finished for PR 22173 at commit 50258f7.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 27, 2018

Test build #95300 has finished for PR 22173 at commit d86503c.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 28, 2018

Test build #95352 has finished for PR 22173 at commit 470e9a6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 29, 2018

Test build #95423 has finished for PR 22173 at commit dcc41f5.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 29, 2018

Test build #95434 has finished for PR 22173 at commit b1105bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@vanzin (Contributor) commented Aug 30, 2018

retest this please

@vanzin (Contributor) commented Aug 30, 2018

(I haven't forgotten about this, just haven't had the time to look at it.)

@SparkQA commented Aug 31, 2018

Test build #95499 has finished for PR 22173 at commit b1105bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@redsanket (Author) commented:

thanks @vanzin, also @tgravescs gentle ping...

@tgravescs (Contributor) left a review comment:

I assume we have the same issue here with processStreamRequest, since it's going to hold a thread. Same goes, I think, for the uploadStream request. We could move those to be under the same thread pool as the chunk fetch requests, or we could do those as a separate JIRA, since the majority of requests will be chunk fetch requests.

}
int chunkFetchHandlerThreadsPercent =
conf.getInt("spark.shuffle.server.chunkFetchHandlerThreadsPercent", 0);
return this.serverThreads() > 0? (this.serverThreads() * chunkFetchHandlerThreadsPercent)/100:
(Contributor) commented:

space between 0 and ?, and space between 100 and :

(Contributor) commented:

I assume we aren't documenting chunkFetchHandlerThreadsPercent since the serverThreads config isn't documented.

@redsanket (Author) replied:

I think it is a good idea to document both as this is an important config. Let me know your thoughts

* which equals 0.1 * 2*#cores or 0.1 * io.serverThreads.
*/
public int chunkFetchHandlerThreads() {
if(!this.getModuleName().equalsIgnoreCase("shuffle")) {
(Contributor) commented:

space after if before (

* higher number of shuffler server threads, we are able to reserve some threads for
* handling other RPC messages, thus making the Client less likely to experience timeout
* when sending RPC messages to the shuffle server. Default to 0, which is 2*#cores
* or io.serverThreads. 10 would mean 10% of 2*#cores or 10% of io.serverThreads
(Contributor) commented:

I realize these are just examples, but normally I would expect a user to have many more threads processing chunk requests than threads doing other things, so perhaps the example should be something like 90%.

Were any tests done to see how many threads were needed for the others? I would expect very few.

@redsanket (Author) replied:

No, I have not tested how many threads are required for the other RPC calls, but the whole point is to reduce the dependency on how much time the ChunkFetchRequests spend doing disk I/O.

int chunkFetchHandlerThreadsPercent =
conf.getInt("spark.shuffle.server.chunkFetchHandlerThreadsPercent", 0);
return this.serverThreads() > 0? (this.serverThreads() * chunkFetchHandlerThreadsPercent)/100:
(2* NettyRuntime.availableProcessors() * chunkFetchHandlerThreadsPercent)/100;
(Contributor) commented:

space between 2 and *

import io.netty.util.concurrent.Future;
import io.netty.util.concurrent.GenericFutureListener;


(Contributor) commented:

remove extra line.

private ChannelFuture respond(final Channel channel,
final Encodable result) throws InterruptedException {
final SocketAddress remoteAddress = channel.remoteAddress();
return channel.writeAndFlush(result).sync().addListener((ChannelFutureListener) future -> {
(Contributor) commented:

Do we actually want to use await() here instead of sync()? sync() says: "Waits for this future until it is done, and rethrows the cause of the failure if this future failed."

I'm not sure we want a rethrow here, since we have the addListener to handle the future failure.

@redsanket (Author) replied:

Yes, await() can be used as well... I will test it out and let you know its ramifications, if any. Thanks.

@redsanket (Author) commented Sep 17, 2018:

OK, I figured out the following... I think it is better to rethrow, i.e. use sync(), instead of quietly logging the exception via await(). The reason is as follows:

The respond call made here https://github.com/apache/spark/pull/22173/files#diff-a37b07d454be4d6cd26edb294661d4e3R106 waits for the failed or success response to be sent back; hence, it is a blocking call. If an exception is thrown and we use await(), we quietly log and do not rethrow, so the underlying RPC handler will not know about the request status here https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java#L183. Rethrowing the exception allows it to handle it and throw an RPC failure here https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java#L194; otherwise we have to wait for a timeout exception instead of an InterruptedException. Either case seems fine, but with sync() we will fail fast...

(Contributor) commented:

So it doesn't go through https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportRequestHandler.java#L183, since that is the rpcHandler. The TransportChannelHandler won't handle it at all; it will go to the pipeline, which calls the ChunkFetchRequestHandler. So whatever is calling that is what would get the exception propagated. The reason I said this is that previously, when this was handled in TransportChannelHandler, it wasn't throwing the exception up. TransportRequestHandler.respond calls writeAndFlush asynchronously and then just adds a listener, which would simply log an error if the future wasn't successful.
I want to make sure that if an exception is thrown here, it doesn't kill the entire external shuffle service, for instance.

@redsanket (Author) replied:

OK, I had chunkFetchHandler as an instance of rpcHandler in my mind...

@redsanket (Author) commented Sep 18, 2018:

It just logs the exception, but the behaviour is the same… in the first case an additional warning from the Netty side is passed on to the listener and the channel is closed… in the latter we quietly log an error and the channel is closed… the executors deal with the closed channel and retry or schedule the task on a different executor… and eventually the job seems to succeed in the example I ran… I expect the retry to go through in such scenarios… I will just use await(), since the ERROR is logged and we don't have to be more verbose with the warning from the Netty side...

@SparkQA commented Sep 10, 2018

Test build #95876 has finished for PR 22173 at commit 8153de5.

  • This patch fails from timeout after a configured wait of `400m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented:

test this please

@SparkQA commented Sep 17, 2018

Test build #96140 has finished for PR 22173 at commit 8153de5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

this.rpcHandler = rpcHandler;
this.closeIdleConnections = closeIdleConnections;

synchronized(this.getClass()) {
(Contributor) commented:

I think synchronized(this.getClass()) is not recommended due to it not handling inheritance and such. Use synchronized(TransportContext.class)


synchronized(this.getClass()) {
if (chunkFetchWorkers == null && conf.getModuleName() != null &&
conf.getModuleName().equalsIgnoreCase("shuffle")) {
(Contributor) commented:

fix spacing here, line up conf.getModuleName with the chunkFetchWorkers

chunkFetchWorkers = NettyUtils.createEventLoop(
IOMode.valueOf(conf.ioMode()),
conf.chunkFetchHandlerThreads(),
"chunk-fetch-handler");
(Contributor) commented:

for consistency perhaps name thread shuffle-chunk-fetch-handler

(Contributor) commented:

Like you mention, if we can avoid creating the event loop on the client side, that would be best.

TransportChannelHandler channelHandler = createChannelHandler(channel, channelRpcHandler);
channel.pipeline()
ChunkFetchRequestHandler chunkFetchHandler =
createChunkFetchHandler(channelHandler, channelRpcHandler);
(Contributor) commented:

fix spacing, only indented 2 spaces

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
logger.warn("Exception in connection from " + getRemoteAddress(ctx.channel()),
cause);
(Contributor) commented:

spacing 2, fix throughout the file

(Contributor) commented:

this can actually be unwrapped here

logger.error(String.format("Error opening block %s for request from %s",
msg.streamChunkId, getRemoteAddress(channel)), e);
respond(channel,
new ChunkFetchFailure(msg.streamChunkId,
(Contributor) commented:

fix wrapping and spacing

.addLast("decoder", DECODER)
.addLast("idleStateHandler", new IdleStateHandler(0, 0, conf.connectionTimeoutMs() / 1000))
.addLast("idleStateHandler",
new IdleStateHandler(0, 0, conf.connectionTimeoutMs() / 1000))
(Contributor) commented:

fix indentation

import org.apache.spark.network.buffer.NioManagedBuffer;
import org.apache.spark.network.client.*;
import org.apache.spark.network.protocol.*;
import org.apache.spark.network.client.RpcResponseCallback;
(Contributor) commented:

revert the imports to be .*


return 0;
}
int chunkFetchHandlerThreadsPercent =
conf.getInt("spark.shuffle.server.chunkFetchHandlerThreadsPercent", 0);
(Contributor) commented:

fix spacing

@redsanket (Author) replied:

Yes, it is documented above... if it is 0 or 100, it is 2 * #cores or io.serverThreads.

@SparkQA commented Sep 20, 2018

Test build #96307 has finished for PR 22173 at commit 0348ec8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) left a review comment:

+1. looks good.

I'm not super fond of the isClientOnly parameter to TransportContext (which I recommended instead of other approaches), but I can't think of a more elegant solution at this point.

@tgravescs (Contributor) commented:

merged into master (2.5.0)

@asfgit asfgit closed this in ff601cf Sep 21, 2018
@redsanket (Author) commented:

closes #21402

// Separate thread pool for handling ChunkFetchRequest. This helps to enable throttling
// max number of TransportServer worker threads that are blocked on writing response
// of ChunkFetchRequest message back to the client via the underlying channel.
private static EventLoopGroup chunkFetchWorkers;
(Member) commented:

Is there any special reason that this must be a global one? I have not yet looked at the details, but it looks like this may make ChunkFetchIntegrationSuite flaky, as there is no isolation between tests.

@redsanket (Author) replied:

I haven't been able to reproduce this, but the number of threads used for these tests is 2 * number of cores or spark.shuffle.io.serverThreads.

Willymontaz pushed a commit to Willymontaz/spark that referenced this pull request Feb 12, 2019
…dle block fetch requests.


Closes apache#22173 from redsanket/SPARK-24335.

Authored-by: Sanket Chintapalli <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
Willymontaz added a commit to criteo-forks/spark that referenced this pull request Feb 12, 2019
…dle block fetch requests. (#89)


Closes apache#22173 from redsanket/SPARK-24335.

Authored-by: Sanket Chintapalli <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
cfmcgrady pushed a commit to cfmcgrady/spark that referenced this pull request Jul 31, 2019
…dle block fetch requests.


Closes apache#22173 from redsanket/SPARK-24335.

Authored-by: Sanket Chintapalli <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
prakharjain09 pushed a commit to prakharjain09/spark that referenced this pull request Nov 29, 2019
…dle block fetch requests.


Closes apache#22173 from redsanket/SPARK-24335.

Authored-by: Sanket Chintapalli <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit ff601cf)
@cloud-fan (Contributor) commented:

We hit a significant performance regression in our internal workload caused by this commit. After this commit, the executor can handle at most N chunk fetch requests at the same time, where N is spark.shuffle.io.serverThreads * spark.shuffle.server.chunkFetchHandlerThreadsPercent / 100. Previously it was unlimited, and most of the time we could saturate the underlying channel.

This commit does fix a nasty problem, and I'm fine with it even if it may introduce a perf regression, but there should be a way to turn it off. Unfortunately, we can't turn off this feature. We can set spark.shuffle.server.chunkFetchHandlerThreadsPercent to a large value so that we can handle many chunk fetch requests at the same time, but it's hard to pick a good value that is not too large and still saturates the channel.

Looking back at this problem, I think we can either create a dedicated channel for non-chunk-fetch requests, or ask Netty to handle channel writes of non-chunk-fetch requests first. Both seem hard to implement. Shall we revert it first, and think of a good fix later?

@tgravescs (Contributor) commented:

Can you clarify? The default for spark.shuffle.server.chunkFetchHandlerThreadsPercent is 0 which should be the same number of chunk fetcher threads as previously. It wasn't previously unlimited as Netty would limit to 2*number of cores by default. (https://github.com/netty/netty/blob/9621a5b98120f9596b5d2a337330339dda199bde/transport/src/main/java/io/netty/channel/MultithreadEventLoopGroup.java#L40)

Were you configuring Netty or something to make it unlimited? Or perhaps the default for Netty changed or we missed something?

The intention was no perf regression and the same default as the previous behavior. Note there was a follow-on to this PR to fix the default calculation properly:
https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java#L340
It does introduce more overall threads, and the chunk fetcher ones do have to go back through the original event loop group.

Please explain more what you are seeing and what settings you are using.

@cloud-fan (Contributor) commented:

It's good to know that the underlying channel write thread pool has the same concurrency as the request-handling thread pool.

So there are 2 thread pools: one to handle the requests, one to write data to the channel.

Previously, fetch requests were handled by the request-handling thread pool, and the handler returned immediately after reading the shuffle blocks.

Now, fetch requests are handled by a new fetch-request thread pool, and the handler does not return until the channel write is completed. This effectively handles fetch requests in sync mode, whereas previously it was more likely to keep both thread pools busy, one reading shuffle blocks and one writing data to the channel.

Unfortunately I can't share our internal workload (we don't have special settings), I'll try to write a microbenchmark.

@tgravescs (Contributor) commented:

Sorry, I'm not quite understanding what you are saying; I think perhaps you mean fetch requests in "async" mode?
There are now 2 thread pools. The fetch requests go to the new thread pool, but the response does still have to go back through the original event loop group, so the fetches are somewhat async now, as the comment for the "flush" function mentions (see the comment: https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/ChunkFetchRequestHandler.java#L118). It does add an await() call there to help with some throttling, so it doesn't take up all the threads again, and perhaps that is causing some slowness if the event loop channel it's registered to is also processing some incoming events in between.
If you can get more information from your workload to see what the threads are doing, or have a test case to reproduce, that would be very helpful.

@cloud-fan (Contributor) commented:

When I say sync mode, I mean a thread that handles a fetch request has to finish reading the shuffle blocks and writing the response to the channel before handling the next request, although the channel writing is done by another thread pool. Previously it was fully async: the thread could handle the next request once it finished reading the shuffle blocks of the current request. That said, the throughput of handling fetch requests is reduced now.

Anyway let me come up with a microbenchmark first.

@cloud-fan (Contributor) commented:

Unfortunately, I'm not able to minimize our internal workload, so I switched to TPCDS to show the perf regression.

data: TPCDS table store_sales with scale factor 99. It's 3.5GB, 1233 files
query: sql("select count(distinct ss_list_price) from store_sales where ss_quantity == 5").show
spark: latest master, "local-cluster[2, 4, 19968]"
env: m4-4xlarge

Since reverting this commit would involve too many changes, I simply removed the await in ChunkFetchRequestHandler, which effectively reverts this feature.

With await removed, the query runs 4% faster, which is not much. But if you look at the web UI and check the task metrics, shuffle read time is significantly reduced if we remove await.

The master branch: [web UI screenshot: stage task metrics]
and the second stage: [screenshot]

With await removed: [web UI screenshot: stage task metrics]
and the second stage: [screenshot]

The shuffle read is about 3x faster with await removed.

@tgravescs (Contributor) commented:

The jobs we ran didn't see an overall performance impact, and it greatly helped when running on a busy multi-tenant cluster.

How about this: if the value of spark.shuffle.server.chunkFetchHandlerThreadsPercent isn't explicitly set by the user, we just don't add the extra event loop or thread pool (see TransportContext) - just like it already does when the module isn't shuffle. That should be a fairly easy change.

@cloud-fan (Contributor) commented:

Sounds good to me. We can explore different solutions after 3.0.

@otterc (Contributor) commented Jan 24, 2020

I think await doesn't provide any benefit and could be removed.
When the chunk fetch event loop runs

channel.writeAndFlush(result)

This adds a WriteAndFlushTask in the pendingQueue of the default server-IO thread registered with that channel.

The code in NioEventLoop.run() itself throttles the number of tasks that can be run at a time from its pending queue.
Here is the code:

                    final long ioStartTime = System.nanoTime();
                    try {
                        processSelectedKeys();
                    } finally {
                        // Ensure we always run tasks.
                        final long ioTime = System.nanoTime() - ioStartTime;
                        runAllTasks(ioTime * (100 - ioRatio) / ioRatio);
                    }
                }

Here it records how much time it took to perform the IO operations, that is, execute processSelectedKeys(). runAllTasks, which is the method that processes the tasks from pendingQueue, will be performed for the same amount of time.

runAllTasks() does process 64 tasks and then checks the time.

        // Check timeout every 64 tasks because nanoTime() is relatively expensive.
            // XXX: Hard-coded value - will make it configurable if it is really a problem.
            if ((runTasks & 0x3F) == 0) {
                lastExecutionTime = ScheduledFutureTask.nanoTime();
                if (lastExecutionTime >= deadline) {
                    break;
                }
            } 

This ensures that the default server-IO thread always gets time to process the ready channels. It's not always busy processing WriteAndFlushTasks.

@Victsm (Contributor) commented Jan 24, 2020

When we worked on the original fix in #21402, we were holding a wrong view of how Netty handles both event loop groups.
We thought both the chunk fetch requests and the control plane RPCs were going to become tasks placed in the task queues of threads inside the server I/O event loop group.
That was the reason we used sync or await in the patch: so that we could limit the number of WriteAndFlushTasks placed by chunk fetch requests in the task queues of threads in the server I/O event loop group, which would allow the control plane RPCs to be picked up by the I/O threads in time to avoid timeouts.
More recently, while working on SPARK-30512, we got a closer and much more correct view of how Netty handles these 2 event loop groups (the chunk-fetch dedicated one and the I/O event loop group).
As mentioned by @otterc, Netty should already process the data plane RPCs and control plane RPCs separately once we set up the dedicated event loop for chunk fetch requests.
Thus, we don't really need to throttle the number of tasks placed by chunk fetch requests in the task queues of threads in the I/O event loop group.

We have an internal stress testing framework that can help to validate whether the original timeout issue still occurs after removing await.
We will do the test to validate that things are working as expected.
If so, we can keep the benefit of this fix without the performance regression.

@otterc (Contributor) commented Jan 24, 2020

@Victsm @tgravescs
I removed the await and tested with our internal stress testing framework. I started seeing SASL requests timing out. In this test, I observed more than a 2-minute delay between channel registration and when the first bytes were read from the channel.

2020-01-24 22:53:34,019 DEBUG org.spark_project.io.netty.handler.logging.LoggingHandler: [id: 0xd475f5ff, L:/10.150.16.27:7337 - R:/10.150.16.44:11388] REGISTERED
2020-01-24 22:53:34,019 DEBUG org.spark_project.io.netty.handler.logging.LoggingHandler: [id: 0xd475f5ff, L:/10.150.16.27:7337 - R:/10.150.16.44:11388] ACTIVE

2020-01-24 22:55:05,207 DEBUG org.spark_project.io.netty.handler.logging.LoggingHandler: [id: 0xd475f5ff, L:/10.150.16.27:7337 - R:/10.150.16.44:11388] READ: 48B
2020-01-24 22:55:05,207 DEBUG org.spark_project.io.netty.handler.logging.LoggingHandler: [id: 0xd475f5ff, L:/10.150.16.27:7337 - R:/10.150.16.44:11388] WRITE: org.apache.spark.network.protocol.MessageWithHeader@27e59ee9
2020-01-24 22:55:05,207 DEBUG org.spark_project.io.netty.handler.logging.LoggingHandler: [id: 0xd475f5ff, L:/10.150.16.27:7337 - R:/10.150.16.44:11388] FLUSH
2020-01-24 22:55:05,207 INFO org.apache.spark.network.server.OutgoingChannelHandler: OUTPUT request 5929104419960968526 channel d475f5ff request_rec 1579906505207 transport_rec 1579906505207 flush 1579906505207  receive-transport 0 transport-flush 0 total 0

Since there is a delay in reading the channel, I suspect this is because of the hardcoded value in the Netty code:
SingleThreadEventExecutor.runAllTasks() checks the time only after 64 tasks, and WriteAndFlush tasks are bulky tasks. With await there will be just 1 WriteAndFlushTask per channel in the IO thread's pending queue, and the rest of the tasks will be smaller tasks.
However, without await there are more WriteAndFlush tasks per channel in the IO thread's queue. Since it processes 64 tasks and then checks the time, this time increases with more WriteAndFlush tasks.

            // Check timeout every 64 tasks because nanoTime() is relatively expensive.
            // XXX: Hard-coded value - will make it configurable if it is really a problem.
            if ((runTasks & 0x3F) == 0) {
                lastExecutionTime = ScheduledFutureTask.nanoTime();
                if (lastExecutionTime >= deadline) {
                    break;
                }
            }

I can test this theory by lowering this number in a fork of netty and building spark against it. However, for now we can't remove await().

Note: This test was with a dedicated boss event loop group which is why we don't see any delay in channel registration.

@Victsm (Contributor) commented Jan 25, 2020

@cloud-fan
What do you think of SPARK-30602 in the context of this perf regression you see?
We have also been operating our Spark infrastructure with this change for quite some time, and we do not in general notice performance regressions.
When doing shuffle in a large-scale multi-tenant cluster, the issues we mentioned in SPARK-30602's SPIP doc become much more dominant.
Without the change in SPARK-24355, before the underlying network is saturated, the disk is saturated first due to the small random reads, which then propagates further and starts timing out control plane RPCs.
SPARK-24355 is basically an attempt to stop the small random reads from impacting control plane RPCs, to improve the reliability of the shuffle service.
On top of these, SPARK-30602 will significantly improve the overall throughput and efficiency of Spark shuffle.

@cloud-fan (Contributor) commented:

SPARK-30602 looks too big for this particular issue.

What I am looking for is to disable this feature completely by default (no await). We can further improve it later so that it brings no perf regression, or replace it with SPARK-30602.

@xuanyuanking can you help to do it?

@xuanyuanking (Member) commented:

Sure, I will follow up on this.

cloud-fan pushed a commit that referenced this pull request Mar 26, 2020
…event loop group

### What changes were proposed in this pull request?
Fix the regression caused by #22173.
The original PR changed the logic of handling `ChunkFetchRequest` from async to sync, which caused the shuffle benchmark regression. This PR fixes the regression by moving back to the async mode, reusing the config `spark.shuffle.server.chunkFetchHandlerThreadsPercent`.
When the user sets the config, ChunkFetchRequest will be processed in a separate event loop group; otherwise, the code path is exactly the same as before.

### Why are the changes needed?
Fix the shuffle performance regression described in  #22173 (comment)

### Does this PR introduce any user-facing change?
Yes, this PR disables the separate event loop for FetchRequest by default.

### How was this patch tested?
Existing UT.

Closes #27665 from xuanyuanking/SPARK-24355-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 26, 2020
…event loop group


Closes #27665 from xuanyuanking/SPARK-24355-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0fe203e)
Signed-off-by: Wenchen Fan <[email protected]>
XUJiahua pushed a commit to XUJiahua/spark that referenced this pull request Apr 9, 2020
…dle block fetch requests.


Closes apache#22173 from redsanket/SPARK-24335.

Authored-by: Sanket Chintapalli <[email protected]>
Signed-off-by: Thomas Graves <[email protected]>
(cherry picked from commit ff601cf)
(cherry picked from commit 8094f59a8128166dd283d97c49322ec09fa2c93c)

Change-Id: Id91ae9d20ffe9aae09585b197c6a15f4e8044895
(cherry picked from commit 5261726732ef5a9cf21e02a25bba0d14a808c274)

Change-Id: Ic666004b8b3d17038d8f42f4154058c0d00e9a37
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…event loop group


Closes apache#27665 from xuanyuanking/SPARK-24355-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>