[SPARK-21502][Mesos] fix --supervise for mesos in cluster mode #18705

skonto · 2017-07-21T17:21:06Z

What changes were proposed in this pull request?

With supervise enabled for a driver, re-launching it was failing because the driver had the same framework Id. This patch creates a new driver framework id every time we re-launch a driver, but we keep the driver submission id the same since that is the same with the task id the driver was launched with on mesos and retry state and other info within Dispatcher's data structures uses that as a key.
We append a "-retry-%4d" string as a suffix to the framework id passed by the dispatcher to the driver and the same value to the app_id created by each driver, except the first time where we dont need the retry suffix.
The previous format for the frameworkId was 'DispactherFId-DriverSubmissionId'.

We also detect the case where we have multiple spark contexts started from within the same driver and we do set proper names to their corresponding app-ids. The old practice was to unset the framework id passed from the dispatcher after the driver framework was started for the first time and let mesos decide the framework ID for subsequent spark contexts. The decided fId was passed as an appID.
This patch affects heavily the history server. Btw we dont have the issues of the standalone case where driver id must be different since the dispatcher will re-launch a driver(mesos task) only if it gets an update that it is dead and this is verified by mesos implicitly. We also dont fix the fine grained mode which is deprecated and of no use.

How was this patch tested?

This task was manually tested on dc/os. Launched a driver, stoped its container and verified the expected behavior.

Initial retry of the driver, driver in pending state:

Driver re-launched:

Another re-try:

The resulted entries in history server at the bottom:

Regarding multiple spark contexts here is the end result regarding the spark history server, for the second spark context we add an increasing number as a suffix:

skonto · 2017-07-21T17:37:05Z

@susanxhuynh @ArtRand pls review.

SparkQA · 2017-07-21T17:50:38Z

Test build #79845 has finished for PR 18705 at commit 8ae5d6c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-21T20:26:44Z

Test build #79844 has finished for PR 18705 at commit b987c4b.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

ArtRand · 2017-07-21T20:28:16Z

LGTM

vanzin

Only small style suggestions.

vanzin · 2017-07-21T23:19:46Z

...rs/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala

nit: add spaces around the braces

vanzin · 2017-07-21T23:19:54Z

...main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala

"can be initialized"

vanzin · 2017-07-21T23:22:19Z

...rs/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala

nit: move the getOrElse() call out of the string for clarity?

val suffix = desc.retryState.map { }.getOrElse("")

susanxhuynh · 2017-07-22T00:06:09Z

@skonto LGTM

skonto · 2017-07-23T17:22:48Z

@vanzin thnx for the review. I updated the PR.

SparkQA · 2017-07-23T17:34:47Z

Test build #79891 has finished for PR 18705 at commit 37187e0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2017-07-24T18:11:14Z

Merging to master / 2.2.

vanzin · 2017-07-24T18:12:15Z

@skonto this doesn't merge to 2.2, please open a separate PR if you'd like a backport.

## What changes were proposed in this pull request? With supervise enabled for a driver, re-launching it was failing because the driver had the same framework Id. This patch creates a new driver framework id every time we re-launch a driver, but we keep the driver submission id the same since that is the same with the task id the driver was launched with on mesos and retry state and other info within Dispatcher's data structures uses that as a key. We append a "-retry-%4d" string as a suffix to the framework id passed by the dispatcher to the driver and the same value to the app_id created by each driver, except the first time where we dont need the retry suffix. The previous format for the frameworkId was 'DispactherFId-DriverSubmissionId'. We also detect the case where we have multiple spark contexts started from within the same driver and we do set proper names to their corresponding app-ids. The old practice was to unset the framework id passed from the dispatcher after the driver framework was started for the first time and let mesos decide the framework ID for subsequent spark contexts. The decided fId was passed as an appID. This patch affects heavily the history server. Btw we dont have the issues of the standalone case where driver id must be different since the dispatcher will re-launch a driver(mesos task) only if it gets an update that it is dead and this is verified by mesos implicitly. We also dont fix the fine grained mode which is deprecated and of no use. ## How was this patch tested? This task was manually tested on dc/os. Launched a driver, stoped its container and verified the expected behavior. Initial retry of the driver, driver in pending state: ![image](https://user-images.githubusercontent.com/7945591/28473862-1088b736-6e4f-11e7-8d7d-7b785b1da6a6.png) Driver re-launched: ![image](https://user-images.githubusercontent.com/7945591/28473885-26e02d16-6e4f-11e7-9eb8-6bf7bdb10cb8.png) Another re-try: ![image](https://user-images.githubusercontent.com/7945591/28473897-35702318-6e4f-11e7-9585-fd295ad7c6b6.png) The resulted entries in history server at the bottom: ![image](https://user-images.githubusercontent.com/7945591/28473910-4946dabc-6e4f-11e7-90a6-fa4f80893c61.png) Regarding multiple spark contexts here is the end result regarding the spark history server, for the second spark context we add an increasing number as a suffix: ![image](https://user-images.githubusercontent.com/7945591/28474432-69cf8b06-6e51-11e7-93c7-e6c0b04dec93.png) Author: Stavros Kontopoulos <[email protected]> Closes apache#18705 from skonto/fix_supervise_flag.

skonto force-pushed the fix_supervise_flag branch from b987c4b to 8ae5d6c Compare July 21, 2017 17:33

vanzin reviewed Jul 21, 2017

View reviewed changes

fix supervise for mesos in cluster mode

37187e0

skonto force-pushed the fix_supervise_flag branch from 8ae5d6c to 37187e0 Compare July 23, 2017 17:18

asfgit closed this in b09ec92 Jul 24, 2017

[SPARK-21502][Mesos] fix --supervise for mesos in cluster mode #18705

[SPARK-21502][Mesos] fix --supervise for mesos in cluster mode #18705

Uh oh!

Conversation

skonto commented Jul 21, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

skonto commented Jul 21, 2017

Uh oh!

SparkQA commented Jul 21, 2017

Uh oh!

SparkQA commented Jul 21, 2017

Uh oh!

ArtRand commented Jul 21, 2017

Uh oh!

vanzin left a comment

Choose a reason for hiding this comment

Uh oh!

vanzin Jul 21, 2017

Choose a reason for hiding this comment

Uh oh!

vanzin Jul 21, 2017

Choose a reason for hiding this comment

Uh oh!

vanzin Jul 21, 2017

Choose a reason for hiding this comment

Uh oh!

susanxhuynh commented Jul 22, 2017

Uh oh!

skonto commented Jul 23, 2017

Uh oh!

SparkQA commented Jul 23, 2017

Uh oh!

vanzin commented Jul 24, 2017

Uh oh!

vanzin commented Jul 24, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

skonto commented Jul 21, 2017 •

edited

Loading