
[SPARK-22618][CORE] Catch exception in removeRDD to stop jobs from dying #19836

Closed
brad-kaiser wants to merge 2 commits into apache:master from brad-kaiser:catch-unpersist-exception

Conversation

@brad-kaiser

What changes were proposed in this pull request?

I propose that BlockManagerMasterEndpoint.removeRdd() should catch and log any IOExceptions it receives. As it is now, the exception can bubble up to the main thread and kill user applications when called from RDD.unpersist(). I think this change is a better experience for the end user.

I chose to catch the exception in BlockManagerMasterEndpoint.removeRdd() instead of RDD.unpersist() because this way the RDD.unpersist() blocking option will still work correctly. Otherwise, blocking will get short circuited by the first error.
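To make the blocking argument concrete, here is a minimal, self-contained sketch of the pattern (an illustration under assumed names such as askBlockManager, not the actual Spark source): because each per-block-manager future recovers on its own, Future.sequence still waits for every response instead of failing fast on the first IOException.

    import java.io.IOException
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global

    // askBlockManager stands in for bm.slaveEndpoint.ask[Int](removeMsg).
    def askBlockManager(id: Int): Future[Int] =
      if (id == 2) Future.failed(new IOException(s"block manager $id is gone"))
      else Future.successful(1) // one block removed

    val futures = (1 to 3).map { id =>
      askBlockManager(id).recover {
        case e: IOException =>
          println(s"Warning: error removing RDD blocks on block manager $id: $e")
          0 // zero blocks were removed
      }
    }

    // Recovering per future means this waits for all block managers
    // instead of aborting on the first failure.
    val totalBlocksRemoved = Await.result(Future.sequence(futures), 10.seconds).sum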

How was this patch tested?

This patch was tested with a job that shows the job killing behavior mentioned above.

@rxin, it looks like you originally wrote this method, so I would appreciate it if you took a look. Thanks.

This contribution is my original work and is licensed under the project's open source license.

Member

Warning, not error, I'd imagine. rdd -> RDD. Why this construction with a partial function rather than writing it inline below?

Author

Thanks for looking at my change. I changed the log to a warning and "rdd" to "RDD". I pulled the partial function out because I felt the expression was getting too deeply nested and hard to read. I certainly don't have to do that, though.
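For readers following along, the two shapes under discussion look roughly like this (a hedged reconstruction: handleRemoveRddException is the name from the diff below, but its body and the surrounding values are illustrative):

    import java.io.IOException
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val rddId = 42 // illustrative; in Spark this is the removeRdd() parameter

    // Factored-out form: a named partial function passed to recover(...)
    val handleRemoveRddException: PartialFunction[Throwable, Int] = {
      case e: IOException =>
        println(s"Error trying to remove RDD $rddId: $e") // logWarning in Spark
        0 // zero blocks were removed
    }
    val f1 = Future.failed[Int](new IOException("boom")).recover(handleRemoveRddException)

    // Inline form, as ultimately merged:
    val f2 = Future.failed[Int](new IOException("boom")).recover {
      case e: IOException =>
        println(s"Error trying to remove RDD $rddId: $e")
        0
    }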

@brad-kaiser brad-kaiser force-pushed the catch-unpersist-exception branch from 889993f to 1152145 on November 29, 2017 03:10
@brad-kaiser brad-kaiser force-pushed the catch-unpersist-exception branch from 1152145 to fbd2497 on November 29, 2017 03:16
@jiangxb1987
Contributor

This looks reasonable, cc @cloud-fan

    val futures = blockManagerInfo.values.map { bm =>
      bm.slaveEndpoint.ask[Int](removeMsg).recover(handleRemoveRddException)
    }
Contributor

Personally, I think

bm.slaveEndpoint.ask[Int](removeMsg).recover {
  case e: IOException =>
    logWarning(s"Error trying to remove RDD $rddId", e)
    0 // zero blocks were removed
}

is more readable

Author

Ok, I updated this. Thanks.

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

LGTM

Contributor

@jiangxb1987 left a comment

LGTM, only one minor issue.


    val futures = blockManagerInfo.values.map { bm =>
      bm.slaveEndpoint.ask[Int](removeMsg).recover {
        case e: IOException =>
Contributor

According to what is described in the JIRA, should we only ignore the IOException if dynamic allocation is enabled?

Author

I think the logic for catching the error still applies even without dynamic allocation. If one of your nodes goes down while you happen to be in .unpersist, you wouldn't want your whole job to fail.

Dynamic allocation just makes this scenario more likely.
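A minimal driver-side sketch of the failure mode being described (hypothetical user code; sc is an existing SparkContext):

    val cached = sc.parallelize(1 to 1000).cache()
    cached.count() // materialize the cached blocks

    // blocking = true waits for every block manager to confirm removal.
    // Before this patch, an IOException from a lost executor could surface
    // here and kill the application; afterwards it is caught and logged in
    // BlockManagerMasterEndpoint.removeRdd() instead.
    cached.unpersist(blocking = true)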

Contributor

Sounds good, thanks!

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84553 has finished for PR 19836 at commit fbd2497.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84562 has finished for PR 19836 at commit e2ad8c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in beb717f Dec 7, 2017
zzcclp added a commit to zzcclp/spark that referenced this pull request Dec 6, 2018
…ing apache#19836

