
@012huang
Contributor

What changes were proposed in this pull request?

An app that finishes abnormally may sometimes cause a shuffle service memory leak. In one of our production cases, the app failed with Stage cancelled because SparkContext was shut down. The strange thing is that there were still requests to fetch shuffle data, which caused errors on the server side as below:

2019-12-08 22:23:33,375 ERROR server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(132)) - Error opening block StreamChunkId{streamId=1902064894814, chunkIndex=0} for request from /10.221.115.175:38582
java.lang.RuntimeException: Executor is not registered (appId=application_1574499669561_954327, execId=4514)

The client side also shows a corresponding log like this:

org.apache.spark.shuffle.FetchFailedException: Failure while fetching StreamChunkId{streamId=1902064894814, chunkIndex=0}: java.lang.RuntimeException: Executor is not registered (appId=application_1574499669561_954327, execId=4514)

In some cases, a request for OpenBlocks is still in flight. In ExternalShuffleBlockHandler#handleMessage, the handler registers a StreamState in OneForOneStreamManager#streams and then unconditionally replies with a success response. The client receives the response and fires a ChunkFetchRequest to fetch the chunk, but by this time the app has received the APPLICATION_STOP event and executed ExternalShuffleService#applicationRemoved to clean up the app's ExecutorShuffleInfo, which makes the Executor is not registered error happen. Even though TransportRequestHandler#channelInactive is called when the client channel closes, to clean up the StreamState associated with that channel, cleaning the StreamState buffer also looks up the ManagedBuffer by appId and execId, and those have already been removed from the executors map. We can also find the log StreamManager connectionTerminated() callback failed in the NM's (NodeManager's) log file.
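The race above can be illustrated with a minimal, hypothetical simulation in plain Java. The two maps and the ordering of steps only loosely mirror OneForOneStreamManager#streams and ExternalShuffleBlockResolver#executors; this is not the real API, just the sequence of events:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation of the race: the server registers a stream and
// replies success, the app's executor info is removed, and the stream entry
// is left behind because cleanup can no longer resolve its buffers.
public class LeakRaceDemo {
    public static void main(String[] args) {
        Map<String, Object> executors = new HashMap<>(); // appId -> executor info
        Map<Long, String> streams = new HashMap<>();     // streamId -> appId

        executors.put("application_1", new Object());

        // 1. OpenBlocks arrives: a StreamState is registered and a success
        //    response is sent to the client unconditionally.
        streams.put(1L, "application_1");

        // 2. APPLICATION_STOP fires: applicationRemoved drops the executor
        //    info, but not the stream registered in step 1.
        executors.remove("application_1");

        // 3. The client's ChunkFetchRequest now fails with
        //    "Executor is not registered", and channel-close cleanup also
        //    fails because it needs the removed executor info, so the stream
        //    entry stays in the map: a leak.
        boolean canResolveBuffer = executors.containsKey(streams.get(1L));
        System.out.println("can resolve buffer: " + canResolveBuffer); // false
        System.out.println("leaked streams: " + streams.size());      // 1
    }
}
```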

So when an OpenBlocks request comes in, we should look up ExternalShuffleBlockResolver#executors; if the related app has already exited, we should not register a StreamState and should just close the client connection (or reply with a special message that the client side handles). And when an app gets APPLICATION_STOP and applicationRemoved is called, we should clean up the related StreamStates before the ExecutorShuffleInfo is cleaned. This is what the PR changes, and it prevents the shuffle service memory leak.
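The proposed fix can be sketched as follows. This is a minimal, hypothetical simplification in plain Java: the names executors, streams, and StreamState only loosely stand in for ExternalShuffleBlockResolver#executors, OneForOneStreamManager#streams, and the real StreamState, and the two methods capture just the guard and the cleanup ordering:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed behavior: guard OpenBlocks against already-removed
// apps, and release the app's streams before dropping its executor info.
public class ShuffleServiceSketch {
    static class StreamState {
        final String appId;
        StreamState(String appId) { this.appId = appId; }
    }

    final Map<String, Object> executors = new ConcurrentHashMap<>();   // appId -> executor info
    final Map<Long, StreamState> streams = new ConcurrentHashMap<>();  // streamId -> state
    final AtomicLong nextStreamId = new AtomicLong();

    // On OpenBlocks: only register a stream if the app is still known;
    // otherwise return null so the caller can close the client connection
    // (or send a special error response) instead of promising a stream.
    Long handleOpenBlocks(String appId) {
        if (!executors.containsKey(appId)) {
            return null;
        }
        long id = nextStreamId.getAndIncrement();
        streams.put(id, new StreamState(appId));
        return id;
    }

    // On APPLICATION_STOP: release the app's streams BEFORE dropping the
    // executor info, so stream cleanup can still resolve its buffers.
    void applicationRemoved(String appId) {
        streams.values().removeIf(s -> s.appId.equals(appId));
        executors.remove(appId);
    }

    public static void main(String[] args) {
        ShuffleServiceSketch svc = new ShuffleServiceSketch();
        svc.executors.put("application_1", new Object());
        System.out.println("registered: " + svc.handleOpenBlocks("application_1")); // 0
        svc.applicationRemoved("application_1");
        System.out.println("streams left: " + svc.streams.size());                  // 0
        System.out.println("reopen: " + svc.handleOpenBlocks("application_1"));     // null
    }
}
```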

Why are the changes needed?

The external shuffle service memory leak has a big impact on clusters with dynamic allocation enabled and may cause the NodeManager to crash.

Does this PR introduce any user-facing change?

No

How was this patch tested?

With existing unit tests.

@AmplabJenkins

Can one of the admins verify this patch?

@012huang
Contributor Author

012huang commented Jan 6, 2020

cc @viirya @dongjoon-hyun, can you help review this? Thanks.

@vanzin
Contributor

vanzin commented Jan 6, 2020

Why is this against 2.4 and not master? If the problem does not exist in master, please explain why, and why you're not backporting whatever fixed the issue in master instead.

This sounds similar to SPARK-26604, which says it is fixed in 2.4.1, but the bug you filed says 2.4.3.

@012huang
Contributor Author

012huang commented Jan 7, 2020

The shuffle service memory leak still exists in 2.4.3, and I am informed that some other users are facing the problem.
The shuffle module has changed a lot since 3.0 and I haven't gone through it completely; I will take time to work against master. Thank you for your reply.

@012huang closed this Jan 17, 2020
@fbrams

fbrams commented Feb 15, 2020

Just for a better understanding, as we are currently dealing with this bug in our environment:

  • What is NM?
  • "dynamic on": are you referring to "spark.dynamicAllocation.enabled"?
