
Conversation

@liancheng
Contributor

This PR is based on #1986, authored by @scwf.

The basic idea is to start Mesos executors directly with java (similar to what we do for standalone executors). This PR is different from #1986 in two major aspects:

  1. Environment variables in sc.executorEnvs are properly set for Mesos executors

  2. PYTHONPATH is properly set for fine grained Mesos executors by calling sbin/mesos-pyenv.sh

    sbin/spark-executor is renamed to sbin/mesos-pyenv.sh since it's only responsible for setting PYTHONPATH now, and the executor is started by MesosSchedulerBackend after the change.

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
@liancheng
Contributor Author

/cc @andrewor14

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2145. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19238/consoleFull

Contributor

It seems that we don't even need this file if we're doing this. We can just export PYTHONPATH inside the Mesos backend classes themselves.

Contributor Author

I tried to, but still left this file here because it seemed non-trivial to figure out $FWDIR from Scala code.

Contributor

We can use sparkHome... we need it anyway to find this script.

Contributor Author

But sparkHome is the driver-side Spark home directory, which can't be used on the driver side to assemble the Mesos executor-side command line, since the executor may unpack Spark somewhere else. On the other hand, sbin/mesos-pyenv.sh always executes on the Mesos executor side, so $FWDIR always points to the right location.
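For reference, the $FWDIR derivation being discussed can be sketched like this (a hypothetical reconstruction of what sbin/mesos-pyenv.sh does, not the actual script):

```shell
# Hypothetical sketch: derive the executor-side Spark home ($FWDIR) from the
# script's own location, so it is correct on the slave no matter where the
# executor tarball was unpacked.
FWDIR="$(cd "$(dirname "$0")/.." && pwd)"
export PYTHONPATH="$FWDIR/python:$PYTHONPATH"
echo "$PYTHONPATH"
```

Because the path is computed relative to the script itself at run time, it needs no knowledge of the driver-side layout.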

@SparkQA

SparkQA commented Aug 26, 2014

QA results for PR 2145:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19238/consoleFull

@andrewor14
Contributor

test this please

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2145. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19247/consoleFull

@liancheng
Contributor Author

The last build failure was caused by Spark Streaming and should be unrelated to this patch.

@SparkQA

SparkQA commented Aug 27, 2014

QA results for PR 2145:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19247/consoleFull

@tnachen
Contributor

tnachen commented Aug 27, 2014

I think we need to consolidate all our fixes now :) We all understand the problem and the general idea; the issue is that none of our fixes so far, even after reviewing each other's work, completely solves it.

What used to happen, as we now know, is that we were passing extra options (like Java settings) as a single parameter to spark-class, which is incorrect.

Now, although your fix here uses CommandUtils, the consolidated utility for generating commands, it still runs compute-classpath and puts the classpath information directly into the CommandInfo's value. I tested my fix #2103 earlier with the Mesos master and slave on the same host, and although it ran, that still isn't enough for a Mesos cluster with the master and slaves on separate hosts, because the classpath was wrong.

I think we should not assume the classpath of the framework launching tasks is the same as on the slave.

I think we should either 1) keep using spark-class and let it resolve the classpath, or 2) update CommandUtils to not run compute-classpath on the spot, but let the slave run compute-classpath itself and pass the result to /usr/bin/java.
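Option (2) can be sketched in a couple of lines; the command layout below is hypothetical, but it shows the key move: the `$(...)` substitution is quoted into the command string so it is evaluated by the slave's shell, not the driver's.

```shell
# Hypothetical sketch of option (2): instead of expanding the classpath on
# the driver, embed the compute-classpath.sh call in the command string, so
# the *slave* shell resolves it after unpacking the executor tarball.
CMD='cd spark-1*; exec java -cp "$(./bin/compute-classpath.sh)" -Xmx512M org.apache.spark.executor.MesosExecutorBackend'
printf '%s\n' "$CMD"
```

Note the single quotes: the driver never evaluates `$(./bin/compute-classpath.sh)`; it ships the literal text to Mesos, which runs it via `sh -c` on the slave.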

@tnachen
Contributor

tnachen commented Aug 27, 2014

Btw, have you guys (@liancheng, @scwf) tested with an actual Mesos and Spark cluster? I have one set up now to make sure things will run.

@andrewor14
Contributor

Hi @liancheng @tnachen @scwf. It seems there are many patches fixing this that are based on each other (#2103, #1986). Can we flatten out the differences and consolidate all the changes into one PR?

@liancheng
Contributor Author

Hey @tnachen and @scwf, the only reason I opened this PR separately is that we're already late for the Spark 1.1 RC release and need a fix quickly. We would definitely list both of you in the contributor list of this release :)

In this PR I used "." as executor side Spark home when executor URI is provided (see here and here). This should be valid because we first cd into the right Spark home directory (with the basename trick explained here).

As for testing, @tnachen, you really hit the point :) I only tested this PR with a local single-node Mesos cluster, which cannot simulate the situation where Mesos slave nodes don't have Spark installed at the same path as the driver side. It would be greatly appreciated if you could help test this PR on your real distributed Spark-over-Mesos cluster. Thanks in advance!
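The directory-name part of the "basename trick" mentioned above can be sketched as follows; the URI and archive name here are hypothetical:

```shell
# Hypothetical sketch of the basename trick: the tarball pointed to by the
# executor URI unpacks into a directory named after the archive, so the
# executor command first cds into it via a glob, after which "." is the
# executor-side Spark home.
SPARK_EXECUTOR_URI="http://example.com/spark-1.1.0-bin.tgz"
ARCHIVE="$(basename "$SPARK_EXECUTOR_URI")"   # spark-1.1.0-bin.tgz
DIR="${ARCHIVE%.tgz}"                         # spark-1.1.0-bin
echo "cd $DIR*; ..."
```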

@pwendell
Contributor

Hey @liancheng @tnachen @scwf - don't worry too much about consolidating the patches. I can just merge this one and give all three of you author credits on it... this is how we normally do it.

@tnachen
Contributor

tnachen commented Aug 27, 2014

@pwendell let's not merge this yet; I'll try to run it on a Mesos cluster, as I don't think it will work. I'll try to have something tonight or tomorrow morning.

@liancheng
Contributor Author

@tnachen Ah, I see where I'm wrong: although I pass "." as the Spark home to run compute-classpath.sh, it converts "." to an absolute path internally, so the resulting classpaths are all bound to the driver-side environment.
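The pitfall can be reproduced in a few lines of shell (the resulting path is just whatever directory the snippet happens to run in):

```shell
# Demonstration of the bug: resolving "." on the driver bakes the driver's
# current working directory into the classpath, which is meaningless on the
# slave, where the tarball unpacks somewhere else.
SPARK_HOME="."
ABS_SPARK_HOME="$(cd "$SPARK_HOME" && pwd)"  # the *driver's* cwd, not the slave's
echo "$ABS_SPARK_HOME"
```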

Contributor

This set of utilities is not intended to be used outside of the deploy code (i.e. Spark's standalone scheduler); that's why it's causing issues here.

@tnachen
Contributor

tnachen commented Aug 27, 2014

Ok I just tried it on a mesos cluster and it didn't work.

The classpath it put in the Mesos command points to where Spark lives on the host running the spark-shell, not to the Spark executor just pulled down from the tarball.

```
sh -c 'cd spark-1*; "/usr/bin/java" "-cp" "::/home/jclouds/src/spark/conf:/home/jclouds/src/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://[email protected]:47860/user/CoarseGrainedScheduler" "20140818-071808-3483423754-5050-2070-4" "10.151.50.130" "2"'
```

So we must still compute the classpath after the tarball is pulled down, not wherever the spark-shell is being executed, on the assumption that the executor will run from the extracted tar.

@liancheng
Contributor Author

@tnachen Thanks a lot! I'm currently working on another version that can figure out executor side classpath correctly. The basic idea is:

  1. we still start the executor with spark-class, and
  2. we pass extraJavaOpts and extraLibraryPath via SPARK_EXECUTOR_OPTS, which is recognized by spark-class and not used anywhere else.

You may find the WIP version here. I discussed this solution with @pwendell tonight, and it seems workable. It's also much simpler. For now, the only issue is that it cannot handle quoted strings with spaces correctly (i.e. -Dfoo="bar bar"). It might be buggy in other ways though; I'm still testing it.
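The hand-off described in the two steps above can be sketched like this (a hypothetical reconstruction; the concrete option values are made up):

```shell
# Hypothetical sketch of the SPARK_EXECUTOR_OPTS hand-off: the driver folds
# extraJavaOpts and extraLibraryPath into this one variable in the executor's
# environment...
EXTRA_JAVA_OPTS="-verbose:gc"
EXTRA_LIBRARY_PATH="/opt/native"
export SPARK_EXECUTOR_OPTS="$EXTRA_JAVA_OPTS -Djava.library.path=$EXTRA_LIBRARY_PATH"

# ...and on the executor host, spark-class would append it to the java
# options it builds locally, so no classpath or path ever crosses hosts.
JAVA_OPTS="-Xms512M -Xmx512M $SPARK_EXECUTOR_OPTS"
echo "$JAVA_OPTS"
```

Because the variable travels through the executor's environment rather than the command line, it sidesteps the argument-splitting problems of passing options to spark-class directly.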

@pwendell
Contributor

Hey guys, yeah, this is an issue with the approach of using the utilities from the standalone deploy mode for this - it makes assumptions that don't hold in Mesos mode. I spoke a bit offline with @liancheng and I think there is a much simpler, more surgical fix that will unblock the Spark 1.1 release. But we should have a nicer way of building up the command in Scala, like is done here. It might mean we slightly refactor things so that parts of the utility functions for standalone mode can be used here.

@tnachen
Contributor

tnachen commented Aug 27, 2014

I'm glad we're having these conversations :) This will really help folks who have had a bad experience using Mesos with Spark. I'm looking forward to the fix, and once it's updated I can verify it against our Mesos cluster. I'm chatting with Mesos committers about the different issues people are hitting, and I'll be addressing those in future patches.

@tnachen
Contributor

tnachen commented Aug 27, 2014

Also, I don't think I mentioned it explicitly: I've been testing with a Spark tarball available through an HTTP server and SPARK_EXECUTOR_URI set to it, with no Spark installed on the slaves. I know folks use both cases, where the executor URI is either set or unset (the latter defaults to spark_home).

@scwf
Contributor

scwf commented Aug 27, 2014

Hi @liancheng, is there a situation where we need to cover -Dfoo="bar bar"?

@andrewor14
Contributor

We should eventually make -Dfoo="bar bar" work, though the top priority now is just to fix the core functionality of the Mesos code. This is a nice-to-have addition, but the lack of it should not block the release.
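Why the space breaks things can be shown in two lines of shell: Mesos hands the whole executor command to `sh -c` as one flat string, so the inner shell word-splits an option containing a space.

```shell
# Demonstration of the quoting pitfall: an option with a space, flattened
# into a command string and re-parsed by `sh -c`, splits into two arguments.
OPTS='-Dfoo=bar bar'
sh -c "printf '[%s]\n' $OPTS"
# prints [-Dfoo=bar] and [bar]: the single option arrives as two arguments
```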

@andrewor14
Contributor

@liancheng Given that there is now a newer PR that supersedes this one, would you mind closing this?

@liancheng
Contributor Author

Sure.

@liancheng liancheng closed this Aug 27, 2014
asfgit pushed a commit that referenced this pull request Aug 27, 2014
… via SPARK_EXECUTOR_OPTS

This is another try after #2145 to fix [SPARK-2608](https://issues.apache.org/jira/browse/SPARK-2608).

### Basic Idea

The basic idea is to pass `extraJavaOpts` and `extraLibraryPath` together via the environment variable `SPARK_EXECUTOR_OPTS`. This variable is recognized by `spark-class` and not used anywhere else. In this way, we still launch Mesos executors with `spark-class`/`spark-executor`, but avoid the executor-side Spark home issue.

### Known Issue

Quoted string with spaces is not allowed in either `extraJavaOpts` or `extraLibraryPath` when using Spark over Mesos. The reason is that Mesos passes the whole command line as a single string argument to `sh -c` to start the executor, and this makes shell string escaping non-trivial to handle. This should be fixed in a later release.

### Background

Classes in package `org.apache.spark.deploy` shouldn't be used as they assume Spark is deployed in standalone mode, and give wrong executor side Spark home directory. Please refer to comments in #2145 for more details.

Author: Cheng Lian <[email protected]>

Closes #2161 from liancheng/mesos-fix-with-env-var and squashes the following commits:

ba59190 [Cheng Lian] Added fine grained Mesos executor support
1174076 [Cheng Lian] Draft fix for CoarseMesosSchedulerBackend
asfgit pushed a commit that referenced this pull request Aug 28, 2014
… via SPARK_EXECUTOR_OPTS

(cherry picked from commit 935bffe)
Signed-off-by: Reynold Xin <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
… via SPARK_EXECUTOR_OPTS

(cherry picked from commit 935bffe)
Signed-off-by: Reynold Xin <[email protected]>
@liancheng liancheng deleted the fix-mesos-opts branch September 24, 2014 00:06