[SPARK-6331] Load new master URL if present when recovering streaming context from checkpoint #5024

tdas · 2015-03-14T02:00:08Z

In streaming driver recovery, when the SparkConf is reconstructed from the checkpointed configuration, it recovers the old master URL. This is fine if the streaming application is relaunched on the same cluster it was running on before. But if the cluster changes, there is no way to inject the master URL of the new cluster. As a result, the restarted app tries to connect to the non-existent old cluster and fails.

The solution is to check whether a master URL is set in the System properties (by spark-submit) before recreating the SparkConf. If a new master URL is set in the properties, use it, since it is the most relevant one. Otherwise, load the old one (to maintain existing behavior).
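The precedence rule described above can be sketched without any Spark dependency. This is a minimal illustration, not the code in this patch: `recoverConf`, the plain `Map` arguments, and the host names are hypothetical stand-ins for the checkpointed SparkConf pairs and the properties set at relaunch.

```scala
object MasterRecoverySketch {
  // Hypothetical helper (not Spark's API): rebuild settings from checkpointed
  // key/value pairs, letting a master URL supplied at relaunch take precedence.
  def recoverConf(
      checkpointed: Map[String, String],
      launchProps: Map[String, String]): Map[String, String] = {
    // The driver host and port belong to the old driver process, so drop them.
    val base = checkpointed - "spark.driver.host" - "spark.driver.port"
    launchProps.get("spark.master") match {
      // A master URL set for this launch overrides the checkpointed one.
      case Some(newMaster) => base + ("spark.master" -> newMaster)
      // Otherwise keep the checkpointed master (the existing behavior).
      case None => base
    }
  }

  def main(args: Array[String]): Unit = {
    val checkpointed = Map(
      "spark.master" -> "spark://old-host:7077",
      "spark.driver.port" -> "51234")
    val relaunched = Map("spark.master" -> "spark://new-host:7077")

    println(recoverConf(checkpointed, relaunched)("spark.master"))
    println(recoverConf(checkpointed, Map.empty)("spark.master"))
  }
}
```

With a master URL present at relaunch, the recovered settings point at the new cluster; with none, the checkpointed URL is kept unchanged.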


tdas · 2015-03-14T02:00:37Z

@harishreedharan Can you take a look?

SparkQA · 2015-03-14T02:03:08Z

Test build #28604 has started for PR 5024 at commit 222485d.

This patch merges cleanly.

SparkQA · 2015-03-14T02:18:05Z

Test build #28605 has started for PR 5024 at commit 6a0857c.

This patch merges cleanly.

SparkQA · 2015-03-14T03:24:15Z

Test build #28604 has finished for PR 5024 at commit 222485d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-14T03:24:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28604/

SparkQA · 2015-03-14T03:38:55Z

Test build #28605 has finished for PR 5024 at commit 6a0857c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-14T03:38:59Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28605/

harishreedharan · 2015-03-14T07:04:35Z

streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala

case _ => should be enough, no? Is there a need to return None? (Since None is an object, I am not sure what the real cost is in this case for returning something - so this can be ignored if the cost here is ~zero)

tdas (reply by email):

Right, I can make it better by using foreach on the Option. That was stupid and too hurried.

On Mar 14, 2015 12:05 AM, "Hari Shreedharan" [email protected] wrote:

> In streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala #5024 (comment):
>
>     .remove("spark.driver.host")
>     .remove("spark.driver.port")
>     new SparkConf(loadDefaults = true).getOption("spark.master") match {
>       case Some(newMaster) => newSparkConf.setMaster(newMaster)
>       case _ => None
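The two shapes discussed in this review comment can be compared side by side: a `match` whose fallback branch does nothing, and `foreach` on the `Option`. The sketch below is an assumption-laden illustration rather than the patch itself; a `mutable.Map` stands in for the SparkConf, and the method names are hypothetical.

```scala
import scala.collection.mutable

object OptionForeachSketch {
  // Variant 1: pattern match with an empty fallback branch, so nothing
  // (not even None) is returned when no new master is present.
  def withMatch(conf: mutable.Map[String, String], newMaster: Option[String]): Unit = {
    newMaster match {
      case Some(m) => conf("spark.master") = m
      case _ => // nothing to do
    }
  }

  // Variant 2: foreach runs the body only when the Option holds a value.
  def withForeach(conf: mutable.Map[String, String], newMaster: Option[String]): Unit =
    newMaster.foreach { m => conf("spark.master") = m }

  def main(args: Array[String]): Unit = {
    val a = mutable.Map("spark.master" -> "spark://old-host:7077")
    val b = mutable.Map("spark.master" -> "spark://old-host:7077")
    withMatch(a, Some("spark://new-host:7077"))
    withForeach(b, Some("spark://new-host:7077"))
    println(a == b)
  }
}
```

Both variants leave the configuration untouched when the `Option` is empty; `foreach` simply says so in one line.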

harishreedharan · 2015-03-14T07:05:29Z

LGTM.

jerryshao · 2015-03-16T01:45:18Z

LGTM. A more general thought, maybe not relevant to this PR: if some configurations, such as memory size or number of cores, are changed when the application is resubmitted, how should this be handled? Should the new configuration be used, or the old one kept?

SparkQA · 2015-03-16T19:18:09Z

Test build #28666 has started for PR 5024 at commit c7c0b99.

This patch merges cleanly.

SparkQA · 2015-03-16T20:35:42Z

Test build #28666 has finished for PR 5024 at commit c7c0b99.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-16T20:35:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28666/

harishreedharan · 2015-03-16T20:41:45Z

streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala

Why is this being called oldMaster? Isn't this an option wrapping the new master?

SparkQA · 2015-03-16T21:53:06Z

Test build #28674 has started for PR 5024 at commit 392fd44.

This patch merges cleanly.

SparkQA · 2015-03-16T23:12:31Z

Test build #28674 has finished for PR 5024 at commit 392fd44.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-03-16T23:12:35Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28674/

… context from checkpoint

In streaming driver recovery, when the SparkConf is reconstructed based on the checkpointed configuration, it recovers the old master URL. This is okay if the cluster on which the streaming application is relaunched is the same cluster as it was running before. But if that cluster changes, there is no way to inject the new master URL of the new cluster. As a result, the restarted app tries to connect to the non-existent old cluster and fails.

The solution is to check whether a master URL is set in the System properties (by Spark submit) before recreating the SparkConf. If a new master URL is set in the properties, then use it as that is obviously the most relevant one. Otherwise load the old one (to maintain existing behavior).

Author: Tathagata Das <[email protected]>

Closes #5024 from tdas/SPARK-6331 and squashes the following commits:

392fd44 [Tathagata Das] Fixed naming issue.
c7c0b99 [Tathagata Das] Addressed comments.
6a0857c [Tathagata Das] Updated testsuites.
222485d [Tathagata Das] Load new master URL if present when recovering streaming context from checkpoint

(cherry picked from commit c928796)
Signed-off-by: Tathagata Das <[email protected]>

Load new master URL if present when recovering streaming context from…

222485d

… checkpoint

Updated testsuites.

6a0857c

harishreedharan reviewed Mar 14, 2015

Addressed comments.

c7c0b99

harishreedharan reviewed Mar 16, 2015

Fixed naming issue.

392fd44

asfgit closed this in c928796 Mar 17, 2015
