Conversation

@xuanyuanking (Member) commented Feb 21, 2020

What changes were proposed in this pull request?

Fix the regression caused by #22173.
The original PR changed the logic of handling `ChunkFetchRequest` from async to sync, which caused the shuffle benchmark regression. This PR fixes the regression by restoring the async mode, reusing the config `spark.shuffle.server.chunkFetchHandlerThreadsPercent`.
When the user sets the config, `ChunkFetchRequest` will be processed in a separate event loop group; otherwise, the code path is exactly the same as before.

As the creation of the separate event loop group is disabled by default, this PR also effectively reverts SPARK-25641.
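The config-gated behavior can be sketched as follows. This is a hypothetical standalone illustration, not Spark's actual `TransportConf` code: the class name, the `Map`-backed conf, and the `main` method are all mine. It shows the intended semantics that an unset `spark.shuffle.server.chunkFetchHandlerThreadsPercent` leaves the separate event loop group disabled:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the SPARK-30623 gating logic: the separate
// chunk-fetch event loop group is created only when the config key is
// explicitly set to a positive value; unset means "disabled".
public class ChunkFetchGateSketch {
  private final Map<String, String> conf;

  public ChunkFetchGateSketch(Map<String, String> conf) {
    this.conf = conf;
  }

  // Unset config => false (old single-event-loop code path);
  // set to a positive percent => true (separate event loop group).
  public boolean separateChunkFetchRequest() {
    String value = conf.get("spark.shuffle.server.chunkFetchHandlerThreadsPercent");
    return value != null && Integer.parseInt(value) > 0;
  }

  public static void main(String[] args) {
    Map<String, String> unset = new HashMap<>();
    Map<String, String> set = new HashMap<>();
    set.put("spark.shuffle.server.chunkFetchHandlerThreadsPercent", "100");

    System.out.println(new ChunkFetchGateSketch(unset).separateChunkFetchRequest()); // false
    System.out.println(new ChunkFetchGateSketch(set).separateChunkFetchRequest());   // true
  }
}
```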

Why are the changes needed?

Fix the shuffle performance regression described in #22173 (comment)

Does this PR introduce any user-facing change?

Yes, this PR disables the separate event loop for FetchRequest by default.

How was this patch tested?

Existing UT.

@SparkQA commented Feb 21, 2020

Test build #118772 has finished for PR 27665 at commit 743e566.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@otterc (Contributor) commented:
When we remove await, we see more SASL requests timing out. I mentioned this in the comment here:
#22173 (comment)

Wouldn't making async mode the default make this problem worse?
In async mode, how do you plan to tackle the increased number of SASL failures?

@xuanyuanking (Member Author) replied:

Thanks for the comment @otterc!
This PR aims to fix the performance regression brought by sync-mode fetching. As you mentioned, you found a problem with your stress test framework, but I think the default mode should be the guarantee for common use cases.

For the SASL request timeouts, maybe we need more context on your internal stress test framework to analyze the root cause. IMO, there should be other configs in spark.shuffle/netty that could help; it doesn't only depend on the async/sync mode here.

@xuanyuanking (Member Author) commented:

cc @cloud-fan @tgravescs

@tgravescs (Contributor) commented:

So I actually filed SPARK-30623 for this, please update the title and things. I don't think we need a separate feature config for this; as that jira mentioned, I think the idea was just: if the config isn't explicitly set, then use synchronous mode.

@Victsm (Contributor) commented Feb 24, 2020

@xuanyuanking we worked on the original fix of this issue. Having await in there is the key to the benefits provided in SPARK-24355, which improves the reliability of Spark shuffle in a reasonably scaled deployment. This issue seems common across companies like us (LinkedIn), Netflix, Uber, and Yahoo. As mentioned in #22173, what we observed is that in cases where HDD is used for shuffle storage, the disk is saturated before the network can be. So, for a reasonably scaled deployment, having this fix provides a boost in shuffle reliability without hurting much on the performance side. This is also validated by @tgravescs in the Yahoo deployment of this patch.

It's reasonable to introduce another config that disables this reliability improvement if it leads to performance regression in certain deployment modes. Just want to see whether we should leave this enabled by default or not. Also, as mentioned in #22173, we have discovered a potential fix for this perf regression that does not remove its reliability benefits. It will take some extra time on our side to evaluate that fix, which is a fix inside Netty. We want to make sure the broader community knows what we have been doing for this issue, so we do not take away a potential reliability improvement to Spark.

@cloud-fan (Contributor) commented:

The default value of spark.shuffle.server.chunkFetchHandlerThreadsPercent is 100, so this feature is expected to be disabled by default IIUC.

Policy-wise, we should not fix an issue while introducing a regression. I'm OK to have it since it's disabled by default and it does fix a common issue. What I'm asking for is to make sure the code path is exactly the same as before when this feature is disabled, so that there is no regression.

+1 with @tgravescs to reuse the existing config.

xuanyuanking changed the title from "[SPARK-24355][Core][FOLLOWUP] Add flag for fetching chunks in async mode" to "[SPARK-30623][Core] Spark external shuffle allow disable of separate event loop group" on Mar 23, 2020
@xuanyuanking (Member Author) commented:

Did some refactoring to reuse the logic of processing fetch requests, and reused the config spark.shuffle.server.chunkFetchHandlerThreadsPercent to control the separate event loop creation. Please take a look.
cc @otterc @Victsm @tgravescs @cloud-fan

  public boolean separateChunkFetchRequest() {
    try {
      conf.get("spark.shuffle.server.chunkFetchHandlerThreadsPercent");
Contributor:

there is no API to check conf existence?

xuanyuanking (Member Author):

Yes, no API in ConfigProvider.

Contributor:

how about conf.getInt("spark.shuffle.server.chunkFetchHandlerThreadsPercent", 0) > 0?

xuanyuanking (Member Author):

Thanks, done in 4f42083.

      return 0;
    }
    int chunkFetchHandlerThreadsPercent =
      conf.getInt("spark.shuffle.server.chunkFetchHandlerThreadsPercent", 100);
Contributor:

what's wrong with the previous code?

xuanyuanking (Member Author):

No need to give a default value here; when execution reaches this point, the config must be set.

@dongjoon-hyun (Member) commented Mar 26, 2020:

What do you mean by the config must be set, @xuanyuanking ? What value do you expect by default? Apparently, this seems to revert SPARK-25641 together without mentioning SPARK-25641. In the PR, only SPARK-24355 is mentioned.

> No need to give a default value here, when it comes to here, the config must be set.

Contributor:

because we only call this method if separateChunkFetchRequest returns true.

We will see an exception if the assumption is broken.

xuanyuanking (Member Author):

In this PR, we make the separate event loop group configurable by checking whether the config spark.shuffle.server.chunkFetchHandlerThreadsPercent is set or not.

> What do you mean by the config must be set

Here the function chunkFetchHandlerThreads is only called when the config is set:

if (conf.getModuleName() != null &&
    conf.getModuleName().equalsIgnoreCase("shuffle") &&
    !isClientOnly && conf.separateChunkFetchRequest()) {
  chunkFetchWorkers = NettyUtils.createEventLoop(
    IOMode.valueOf(conf.ioMode()),
    conf.chunkFetchHandlerThreads(),

> What value do you expect by default? Apparently, this seems to revert SPARK-25641 together without mentioning SPARK-25641.

Yes, this PR makes the feature disabled by default; let me also mention SPARK-25641 in the PR description.
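The percentage-to-thread-count math behind chunkFetchHandlerThreads can be illustrated with a small sketch. This is hypothetical standalone code (the class name and the standalone method are mine, not Spark's actual TransportConf implementation); it only shows the ceiling arithmetic that turns a percent of the server threads into a dedicated worker count:

```java
// Hypothetical sketch: with serverThreads = 8 and a configured
// chunkFetchHandlerThreadsPercent of 75, the separate event loop
// group gets ceil(8 * 0.75) = 6 threads.
public class ChunkFetchThreadsSketch {
  static int chunkFetchHandlerThreads(int serverThreads, int percent) {
    // Round up so a small positive percent still yields at least one thread.
    return (int) Math.ceil(serverThreads * (percent / 100.0));
  }

  public static void main(String[] args) {
    System.out.println(chunkFetchHandlerThreads(8, 75));  // 6
    System.out.println(chunkFetchHandlerThreads(8, 100)); // 8
  }
}
```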

Member:

Thank you, @xuanyuanking and @cloud-fan.

private final TransportRequestHandler requestHandler;
private final long requestTimeoutNs;
private final boolean closeIdleConnections;
private final boolean separateChunkFetchRequest;
Contributor:

maybe a more explicit name: skipChunkFetchRequest

xuanyuanking (Member Author):

Thanks

      TransportClient reverseClient,
      RpcHandler rpcHandler,
      Long maxChunksBeingTransferred) {
    super(reverseClient, rpcHandler.getStreamManager(), maxChunksBeingTransferred);
Contributor:

do we still need the maxChunksBeingTransferred variable in this class?

xuanyuanking (Member Author):

Yes, it is still in use in processStreamRequest.

@SparkQA commented Mar 23, 2020

Test build #120190 has finished for PR 27665 at commit 5015f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2020

Test build #120192 has finished for PR 27665 at commit 010fbec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 23, 2020

Test build #120196 has finished for PR 27665 at commit e655227.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 24, 2020

Test build #120229 has finished for PR 27665 at commit da8abd3.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 24, 2020

Test build #120274 has finished for PR 27665 at commit 34b52ef.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.


  @Test
- public void handleStreamRequest() {
+ public void handleStreamRequest() throws Exception {
Contributor:

is this change necessary?


@SparkQA commented Mar 24, 2020

Test build #120273 has finished for PR 27665 at commit d4e0352.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

retest this please

@SparkQA commented Mar 25, 2020

Test build #120311 has finished for PR 27665 at commit 4f42083.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 25, 2020

Test build #120310 has finished for PR 27665 at commit 34b52ef.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking (Member Author) commented:

retest this please

@SparkQA commented Mar 25, 2020

Test build #120326 has finished for PR 27665 at commit 4f42083.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 0fe203e Mar 26, 2020
@dongjoon-hyun (Member) commented:

Hi, @cloud-fan. This seems not to be in branch-3.0 yet.

cloud-fan pushed a commit that referenced this pull request Mar 26, 2020
…event loop group

Closes #27665 from xuanyuanking/SPARK-24355-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 0fe203e)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan (Contributor) commented:

It is now. The merge script runs slowly on my side...

@dongjoon-hyun (Member) commented:

Thank you!

@dongjoon-hyun (Member) commented:

BTW, @xuanyuanking .
Could you confirm the above question?

xuanyuanking deleted the SPARK-24355-follow branch March 26, 2020 06:23
@xuanyuanking (Member Author) commented:

Thanks for the review.

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…event loop group

Closes apache#27665 from xuanyuanking/SPARK-24355-follow.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>