
[SPARK-22618][CORE] Catch exception in removeRDD to stop jobs from dying #19836

Closed
brad-kaiser wants to merge 2 commits into apache:master from brad-kaiser:catch-unpersist-exception

Conversation

@brad-kaiser

What changes were proposed in this pull request?

I propose that BlockManagerMasterEndpoint.removeRdd() should catch and log any IOExceptions it receives. As it is now, the exception can bubble up to the main thread and kill user applications when called from RDD.unpersist(). I think this change is a better experience for the end user.

I chose to catch the exception in BlockManagerMasterEndpoint.removeRdd() instead of RDD.unpersist() because this way the RDD.unpersist() blocking option will still work correctly. Otherwise, blocking will get short circuited by the first error.
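To make the blocking argument concrete, here is a minimal, self-contained sketch of the pattern (an illustration under assumed names such as askBlockManager, not the actual Spark source): because each per-block-manager future recovers on its own, Future.sequence still waits for every response instead of failing fast on the first IOException.

    import java.io.IOException
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global

    // askBlockManager stands in for bm.slaveEndpoint.ask[Int](removeMsg).
    def askBlockManager(id: Int): Future[Int] =
      if (id == 2) Future.failed(new IOException(s"block manager $id is gone"))
      else Future.successful(1) // one block removed

    val futures = (1 to 3).map { id =>
      askBlockManager(id).recover {
        case e: IOException =>
          println(s"Warning: error removing RDD blocks on block manager $id: $e")
          0 // zero blocks were removed
      }
    }

    // Recovering per future means this waits for all block managers
    // instead of aborting on the first failure.
    val totalBlocksRemoved = Await.result(Future.sequence(futures), 10.seconds).sum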

How was this patch tested?

This patch was tested with a job that shows the job killing behavior mentioned above.

@rxin, it looks like you originally wrote this method, so I would appreciate it if you took a look. Thanks.

This contribution is my original work and is licensed under the project's open source license.

Member

Warning, not error, I'd imagine. rdd -> RDD. Why this construction with a partial function rather than writing it inline below?

Author

Thanks for looking at my change. I changed the log to a warning and "rdd" to "RDD". I pulled the partial function out because I felt the expression was getting too deeply nested and hard to read. I certainly don't have to do that, though.
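For readers following along, the two shapes under discussion look roughly like this (a hedged reconstruction: handleRemoveRddException is the name from the diff below, but its body and the surrounding values are illustrative):

    import java.io.IOException
    import scala.concurrent.Future
    import scala.concurrent.ExecutionContext.Implicits.global

    val rddId = 42 // illustrative; in Spark this is the removeRdd() parameter

    // Factored-out form: a named partial function passed to recover(...)
    val handleRemoveRddException: PartialFunction[Throwable, Int] = {
      case e: IOException =>
        println(s"Error trying to remove RDD $rddId: $e") // logWarning in Spark
        0 // zero blocks were removed
    }
    val f1 = Future.failed[Int](new IOException("boom")).recover(handleRemoveRddException)

    // Inline form, as ultimately merged:
    val f2 = Future.failed[Int](new IOException("boom")).recover {
      case e: IOException =>
        println(s"Error trying to remove RDD $rddId: $e")
        0
    }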

@brad-kaiser brad-kaiser force-pushed the catch-unpersist-exception branch from 889993f to 1152145 on November 29, 2017 03:10
@brad-kaiser brad-kaiser force-pushed the catch-unpersist-exception branch from 1152145 to fbd2497 on November 29, 2017 03:16
@jiangxb1987
Contributor

This looks reasonable, cc @cloud-fan

    val futures = blockManagerInfo.values.map { bm =>
      bm.slaveEndpoint.ask[Int](removeMsg).recover(handleRemoveRddException)
    }
Contributor

Personally, I think

bm.slaveEndpoint.ask[Int](removeMsg).recover {
  case e: IOException =>
    logWarning(s"Error trying to remove RDD $rddId", e)
    0 // zero blocks were removed
}

is more readable

Author

Ok, I updated this. Thanks.

@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

LGTM

Contributor

@jiangxb1987 left a comment

LGTM, only one minor issue.


    val futures = blockManagerInfo.values.map { bm =>
      bm.slaveEndpoint.ask[Int](removeMsg).recover {
        case e: IOException =>
Contributor

According to what is described in the JIRA, should we only ignore the IOException if dynamic allocation is enabled?

Author

I think the logic for catching the error still applies even without dynamic allocation. If one of your nodes goes down while you happen to be in .unpersist, you wouldn't want your whole job to fail.

Dynamic allocation just makes this scenario more likely.
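A minimal driver-side sketch of the failure mode being described (hypothetical user code; sc is an existing SparkContext):

    val cached = sc.parallelize(1 to 1000).cache()
    cached.count() // materialize the cached blocks

    // blocking = true waits for every block manager to confirm removal.
    // Before this patch, an IOException from a lost executor could surface
    // here and kill the application; afterwards it is caught and logged in
    // BlockManagerMasterEndpoint.removeRdd() instead.
    cached.unpersist(blocking = true)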

Contributor

Sounds good, thanks!

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84553 has finished for PR 19836 at commit fbd2497.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84562 has finished for PR 19836 at commit e2ad8c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in beb717f Dec 7, 2017
zzcclp added a commit to zzcclp/spark that referenced this pull request Dec 6, 2018
…ing apache#19836

