
Conversation

@liancheng
Contributor

This PR is based on #1986, authored by @scwf.

The basic idea is to start Mesos executors directly with java (similar to what we do for standalone executors). This PR is different from #1986 in two major aspects:

  1. Environment variables in sc.executorEnvs are properly set for Mesos executors

  2. PYTHONPATH is properly set for fine grained Mesos executors by calling sbin/mesos-pyenv.sh

    sbin/spark-executor is renamed to sbin/mesos-pyenv.sh since it's only responsible for setting PYTHONPATH now, and the executor is started by MesosSchedulerBackend after the change.

Conflicts:
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
@liancheng
Contributor Author

/cc @andrewor14

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2145. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19238/consoleFull

Contributor

It seems that we don't even need this file if we're doing this. We can just export PYTHONPATH inside the Mesos backend classes themselves.

Contributor Author

I tried to, but still left this file here because it seemed non-trivial to figure out $FWDIR from Scala code.

Contributor

We can use sparkHome... we need it anyway to find this script.

Contributor Author

But sparkHome is the driver-side Spark home directory, which can't be used on the driver side to assemble the Mesos executor-side command line, since the executor may unpack Spark somewhere else. On the other hand, sbin/mesos-pyenv.sh always executes on the Mesos executor side, so $FWDIR always points to the right location.
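For reference, the $FWDIR derivation being discussed can be sketched like this (a hypothetical reconstruction of what sbin/mesos-pyenv.sh does, not the actual script):

```shell
# Hypothetical sketch: derive the executor-side Spark home ($FWDIR) from the
# script's own location, so it is correct on the slave no matter where the
# executor tarball was unpacked.
FWDIR="$(cd "$(dirname "$0")/.." && pwd)"
export PYTHONPATH="$FWDIR/python:$PYTHONPATH"
echo "$PYTHONPATH"
```

Because the path is computed relative to the script itself at run time, it needs no knowledge of the driver-side layout.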

@SparkQA

SparkQA commented Aug 26, 2014

QA results for PR 2145:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19238/consoleFull

@andrewor14
Contributor

test this please

@SparkQA

SparkQA commented Aug 26, 2014

QA tests have started for PR 2145. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19247/consoleFull

@liancheng
Contributor Author

The last build failure was caused by Spark Streaming and should be unrelated to this patch.

@SparkQA

SparkQA commented Aug 27, 2014

QA results for PR 2145:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19247/consoleFull

@tnachen
Contributor

tnachen commented Aug 27, 2014

I think we need to consolidate all our fixes now :) We all understand the problem and the general idea; the issue is that none of our fixes so far, even after reviewing each other's work, completely solves it.

What used to happen, as we now know, is that we were passing extra options (like Java settings) as a single parameter to spark-class, which is incorrect.

Now, although your fix here uses CommandUtils, the consolidated utility for generating commands, it still runs compute-classpath and puts the classpath information directly into the CommandInfo's value. I tested my fix #2103 earlier with the Mesos master and slave on the same host, and although it ran, that still isn't enough for a Mesos cluster with the master and slaves on separate hosts, because the classpath was wrong.

I think we should not assume the classpath of the framework launching tasks is the same as on the slave.

I think we should either 1) keep using spark-class and let it resolve the classpath, or 2) update CommandUtils to not run compute-classpath on the spot, but let the slave run compute-classpath itself and pass the result to /usr/bin/java.
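Option (2) can be sketched in a couple of lines; the command layout below is hypothetical, but it shows the key move: the `$(...)` substitution is quoted into the command string so it is evaluated by the slave's shell, not the driver's.

```shell
# Hypothetical sketch of option (2): instead of expanding the classpath on
# the driver, embed the compute-classpath.sh call in the command string, so
# the *slave* shell resolves it after unpacking the executor tarball.
CMD='cd spark-1*; exec java -cp "$(./bin/compute-classpath.sh)" -Xmx512M org.apache.spark.executor.MesosExecutorBackend'
printf '%s\n' "$CMD"
```

Note the single quotes: the driver never evaluates `$(./bin/compute-classpath.sh)`; it ships the literal text to Mesos, which runs it via `sh -c` on the slave.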

@tnachen
Contributor

tnachen commented Aug 27, 2014

Btw, have you guys (@liancheng, @scwf) tested with an actual Mesos and Spark cluster? I have one set up now to make sure things will run.

@andrewor14
Contributor

Hi @liancheng @tnachen @scwf. It seems there are many patches fixing this that are based on each other (#2103, #1986). Can we flatten out the differences and consolidate all the changes into one PR?

@liancheng
Contributor Author

Hey @tnachen and @scwf, the only reason I opened this PR separately is that we're already late for the Spark 1.1 RC release and need a fix quickly. We would definitely list both of you in the contributor list of this release :)

In this PR I used "." as executor side Spark home when executor URI is provided (see here and here). This should be valid because we first cd into the right Spark home directory (with the basename trick explained here).

As for testing, @tnachen, you really hit the point :) I only tested this PR with a local single-node Mesos cluster, which cannot simulate the situation where Mesos slave nodes don't have Spark installed at the same path as the driver side. It would be greatly appreciated if you could help test this PR on your real distributed Spark-over-Mesos cluster. Thanks in advance!
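The directory-name part of the "basename trick" mentioned above can be sketched as follows; the URI and archive name here are hypothetical:

```shell
# Hypothetical sketch of the basename trick: the tarball pointed to by the
# executor URI unpacks into a directory named after the archive, so the
# executor command first cds into it via a glob, after which "." is the
# executor-side Spark home.
SPARK_EXECUTOR_URI="http://example.com/spark-1.1.0-bin.tgz"
ARCHIVE="$(basename "$SPARK_EXECUTOR_URI")"   # spark-1.1.0-bin.tgz
DIR="${ARCHIVE%.tgz}"                         # spark-1.1.0-bin
echo "cd $DIR*; ..."
```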

@pwendell
Contributor

Hey @liancheng @tnachen @scwf - don't worry too much about consolidating the patches. I can just merge this one and give all three of you author credits on it... this is how we normally do it.

@tnachen
Contributor

tnachen commented Aug 27, 2014

@pwendell let's not merge this yet; I'll try to run it on a Mesos cluster, as I don't think it will work. I'll try to have something tonight or tomorrow morning.

@liancheng
Contributor Author

@tnachen Ah, I see where I'm wrong: although I pass "." as the Spark home to run compute-classpath.sh, it converts "." to an absolute path internally, so the resulting classpaths are all bound to the driver-side environment.
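The pitfall can be reproduced in a few lines of shell (the resulting path is just whatever directory the snippet happens to run in):

```shell
# Demonstration of the bug: resolving "." on the driver bakes the driver's
# current working directory into the classpath, which is meaningless on the
# slave, where the tarball unpacks somewhere else.
SPARK_HOME="."
ABS_SPARK_HOME="$(cd "$SPARK_HOME" && pwd)"  # the *driver's* cwd, not the slave's
echo "$ABS_SPARK_HOME"
```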

Contributor

This set of utilities is not intended to be used outside of the deploy code (i.e. Spark's standalone scheduler); that's why it's causing issues here.

@tnachen
Contributor

tnachen commented Aug 27, 2014

Ok I just tried it on a mesos cluster and it didn't work.

The classpath it put in the Mesos command points to where Spark lives on the host running the spark-shell, not to the Spark executor just pulled down from the tarball.

```
sh -c 'cd spark-1*; "/usr/bin/java" "-cp" "::/home/jclouds/src/spark/conf:/home/jclouds/src/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://[email protected]:47860/user/CoarseGrainedScheduler" "20140818-071808-3483423754-5050-2070-4" "10.151.50.130" "2"'
```

So we must still compute the classpath after the tarball is pulled down, not wherever the spark-shell is being executed, on the assumption that the executor will run from the extracted tar.

@liancheng
Contributor Author

@tnachen Thanks a lot! I'm currently working on another version that can figure out executor side classpath correctly. The basic idea is:

  1. we still start the executor with spark-class, and
  2. we pass extraJavaOpts and extraLibraryPath via SPARK_EXECUTOR_OPTS, which is recognized by spark-class and not used anywhere else.

You may find the WIP version here. I discussed this solution with @pwendell tonight, and it seems workable. It's also much simpler. For now, the only issue is that it cannot handle quoted strings with spaces correctly (i.e. -Dfoo="bar bar"). It might be buggy in other ways though; I'm still testing it.
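The hand-off described in the two steps above can be sketched like this (a hypothetical reconstruction; the concrete option values are made up):

```shell
# Hypothetical sketch of the SPARK_EXECUTOR_OPTS hand-off: the driver folds
# extraJavaOpts and extraLibraryPath into this one variable in the executor's
# environment...
EXTRA_JAVA_OPTS="-verbose:gc"
EXTRA_LIBRARY_PATH="/opt/native"
export SPARK_EXECUTOR_OPTS="$EXTRA_JAVA_OPTS -Djava.library.path=$EXTRA_LIBRARY_PATH"

# ...and on the executor host, spark-class would append it to the java
# options it builds locally, so no classpath or path ever crosses hosts.
JAVA_OPTS="-Xms512M -Xmx512M $SPARK_EXECUTOR_OPTS"
echo "$JAVA_OPTS"
```

Because the variable travels through the executor's environment rather than the command line, it sidesteps the argument-splitting problems of passing options to spark-class directly.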

@pwendell
Contributor

Hey guys, yeah, this is an issue with the approach of using the utilities from the standalone deploy mode for this - it makes assumptions that don't hold in Mesos mode. I spoke a bit offline with @liancheng and I think there is a much simpler, more surgical fix that will unblock the Spark 1.1 release. But we should have a nicer way of building up the command in Scala, like is done here. It might mean we slightly refactor things so that parts of the utility functions for standalone mode can be used here.

@tnachen
Contributor

tnachen commented Aug 27, 2014

I'm glad we're having these conversations :) This will really help folks who have had a bad experience using Mesos with Spark. I'm looking forward to the fix, and once it's updated I can verify it against our Mesos cluster. I'm chatting with Mesos committers about the different issues people are hitting, and I'll be addressing those in future patches.

@tnachen
Contributor

tnachen commented Aug 27, 2014

Also, I don't think I mentioned it explicitly: I've been testing with a Spark tarball available through an HTTP server and SPARK_EXECUTOR_URI set to it, with no Spark installed on the slaves. I know folks use both cases, where the executor URI is either set or unset (the latter defaults to spark_home).

@scwf
Contributor

scwf commented Aug 27, 2014

Hi @liancheng, is there a situation where we need to cover -Dfoo="bar bar"?

@andrewor14
Contributor

We should eventually make -Dfoo="bar bar" work, though the top priority now is just to fix the core functionality of the Mesos code. This is a nice-to-have addition, but the lack of it should not block the release.
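Why the space breaks things can be shown in two lines of shell: Mesos hands the whole executor command to `sh -c` as one flat string, so the inner shell word-splits an option containing a space.

```shell
# Demonstration of the quoting pitfall: an option with a space, flattened
# into a command string and re-parsed by `sh -c`, splits into two arguments.
OPTS='-Dfoo=bar bar'
sh -c "printf '[%s]\n' $OPTS"
# prints [-Dfoo=bar] and [bar]: the single option arrives as two arguments
```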

@andrewor14
Contributor

@liancheng Given that there is now a newer PR that supersedes this one, would you mind closing this?

@liancheng
Contributor Author

Sure.

@liancheng liancheng closed this Aug 27, 2014
asfgit pushed a commit that referenced this pull request Aug 27, 2014
… via SPARK_EXECUTOR_OPTS

This is another try after #2145 to fix [SPARK-2608](https://issues.apache.org/jira/browse/SPARK-2608).

### Basic Idea

The basic idea is to pass `extraJavaOpts` and `extraLibraryPath` together via the environment variable `SPARK_EXECUTOR_OPTS`. This variable is recognized by `spark-class` and not used anywhere else. In this way, we still launch Mesos executors with `spark-class`/`spark-executor`, but avoid the executor-side Spark home issue.

### Known Issue

Quoted string with spaces is not allowed in either `extraJavaOpts` or `extraLibraryPath` when using Spark over Mesos. The reason is that Mesos passes the whole command line as a single string argument to `sh -c` to start the executor, and this makes shell string escaping non-trivial to handle. This should be fixed in a later release.

### Background

Classes in package `org.apache.spark.deploy` shouldn't be used as they assume Spark is deployed in standalone mode, and give wrong executor side Spark home directory. Please refer to comments in #2145 for more details.

Author: Cheng Lian <[email protected]>

Closes #2161 from liancheng/mesos-fix-with-env-var and squashes the following commits:

ba59190 [Cheng Lian] Added fine grained Mesos executor support
1174076 [Cheng Lian] Draft fix for CoarseMesosSchedulerBackend
asfgit pushed a commit that referenced this pull request Aug 28, 2014
… via SPARK_EXECUTOR_OPTS

(cherry picked from commit 935bffe)
Signed-off-by: Reynold Xin <[email protected]>
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
… via SPARK_EXECUTOR_OPTS

(cherry picked from commit 935bffe)
Signed-off-by: Reynold Xin <[email protected]>
@liancheng liancheng deleted the fix-mesos-opts branch September 24, 2014 00:06