Skip to content

Conversation

@tdas
Copy link
Contributor

@tdas tdas commented Mar 14, 2015

In streaming driver recovery, when the SparkConf is reconstructed based on the checkpointed configuration, it recovers the old master URL. This okay if the cluster on which the streaming application is relaunched is the same cluster as it was running before. But if that cluster changes, there is no way to inject the new master URL of the new cluster. As a result, the restarted app tries to connect to the non-existent old cluster and fails.

The solution is to check whether a master URL is set in the System properties (by Spark submit) before recreating the SparkConf. If a new master url is set in the properties, then use it as that is obviously the most relevant one. Otherwise load the old one (to maintain existing behavior).

@tdas
Copy link
Contributor Author

tdas commented Mar 14, 2015

@harishreedharan Can you take a look?

@SparkQA
Copy link

SparkQA commented Mar 14, 2015

Test build #28604 has started for PR 5024 at commit 222485d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 14, 2015

Test build #28605 has started for PR 5024 at commit 6a0857c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 14, 2015

Test build #28604 has finished for PR 5024 at commit 222485d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28604/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Mar 14, 2015

Test build #28605 has finished for PR 5024 at commit 6a0857c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28605/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case _ => should be enough, no? Is there a need to return None? (Since None is an object, I am not sure what the real cost is in this case for returning something - so this can be ignored if the cost here is ~zero)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I can make it better by using foreach on option. That was stupid and
too hurried.
On Mar 14, 2015 12:05 AM, "Hari Shreedharan" [email protected]
wrote:

In streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
#5024 (comment):

   .remove("spark.driver.host")
   .remove("spark.driver.port")
  • new SparkConf(loadDefaults = true).getOption("spark.master") match {
  •  case Some(newMaster) => newSparkConf.setMaster(newMaster)
    
  •  case _ => None
    

case _ => should be enough, no? Is there a need to return None? (Since
None is an object, I am not sure what the real cost is in this case for
returning something - so this can be ignored if the cost here is ~zero)


Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/5024/files#r26436080.

@harishreedharan
Copy link
Contributor

LGTM.

@jerryshao
Copy link
Contributor

LGTM. A more general thinking maybe not relevant to this PR, if some configurations are changed after resubmitting the application, how to handle this, to choose the new configuration or still keep the old one, like memory size or core number.

@SparkQA
Copy link

SparkQA commented Mar 16, 2015

Test build #28666 has started for PR 5024 at commit c7c0b99.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 16, 2015

Test build #28666 has finished for PR 5024 at commit c7c0b99.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28666/
Test PASSed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this being called oldMaster? Isn't this an option wrapping the new master?

@SparkQA
Copy link

SparkQA commented Mar 16, 2015

Test build #28674 has started for PR 5024 at commit 392fd44.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Mar 16, 2015

Test build #28674 has finished for PR 5024 at commit 392fd44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28674/
Test PASSed.

asfgit pushed a commit that referenced this pull request Mar 17, 2015
… context from checkpoint

In streaming driver recovery, when the SparkConf is reconstructed based on the checkpointed configuration, it recovers the old master URL. This okay if the cluster on which the streaming application is relaunched is the same cluster as it was running before. But if that cluster changes, there is no way to inject the new master URL of the new cluster. As a result, the restarted app tries to connect to the non-existent old cluster and fails.

The solution is to check whether a master URL is set in the System properties (by Spark submit) before recreating the SparkConf. If a new master url is set in the properties, then use it as that is obviously the most relevant one. Otherwise load the old one (to maintain existing behavior).

Author: Tathagata Das <[email protected]>

Closes #5024 from tdas/SPARK-6331 and squashes the following commits:

392fd44 [Tathagata Das] Fixed naming issue.
c7c0b99 [Tathagata Das] Addressed comments.
6a0857c [Tathagata Das] Updated testsuites.
222485d [Tathagata Das] Load new master URL if present when recovering streaming context from checkpoint

(cherry picked from commit c928796)
Signed-off-by: Tathagata Das <[email protected]>
@asfgit asfgit closed this in c928796 Mar 17, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants