[SPARK-2608][Core] Fixed command line option passing issue over Mesos #2145
Conflicts: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
/cc @andrewor14

QA tests have started for PR 2145. This patch merges cleanly.
It seems that we don't even need this file if we're doing this. We can just export the PYTHONPATH inside the Mesos backend classes themselves.
I tried to, but still left this file here because it seemed non-trivial to figure out $FWDIR from Scala code.
We can use sparkHome... we need it anyways to find this script.
But sparkHome is the driver side Spark home directory, so it can't be used on the driver side to assemble the Mesos executor side command line. On the other hand, sbin/mesos-pyenv.sh is always executed on the Mesos executor side, so $FWDIR always points to the right location.
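To make the tradeoff concrete, here is a minimal sketch (not code from this PR) of what exporting PYTHONPATH directly from the backend might look like, using the Mesos protobuf builders; the problem discussed above is the `sparkHome` argument, which is a driver side path:

```scala
import org.apache.mesos.Protos.{CommandInfo, Environment}

// Hypothetical sketch of exporting PYTHONPATH from the backend itself.
// The catch: `sparkHome` here is the driver side path, which may not match
// where Spark lives on the Mesos slave. A script run on the slave (like
// sbin/mesos-pyenv.sh) can resolve its own $FWDIR at execution time instead.
def buildExecutorCommand(sparkHome: String): CommandInfo = {
  val env = Environment.newBuilder()
    .addVariables(Environment.Variable.newBuilder()
      .setName("PYTHONPATH")
      .setValue(s"$sparkHome/python"))  // wrong if the slave's layout differs
  CommandInfo.newBuilder()
    .setEnvironment(env)
    .setValue(s"$sparkHome/sbin/spark-executor")  // same caveat applies
    .build()
}
```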
QA results for PR 2145:

test this please

QA tests have started for PR 2145. This patch merges cleanly.
The last build failure was caused by Spark Streaming and should be irrelevant.
QA results for PR 2145:
I think we need to consolidate all our fixes now :) I think we all understand the problem and get the idea; the real issue is that the fixes we've reviewed from each other so far don't completely fix the problem. What used to happen, as we now know, is that we passed extra options like Java settings as a parameter to spark-class, which is incorrect. Your fix here, although it uses CommandUtils (the consolidated utility for generating commands), still runs compute-classpath and sets the classpath information directly in the CommandInfo's value. I tested my fix #2103 earlier with the Mesos master and slave on the same host, and although it ran, that's not enough: when running a Mesos cluster with the master and slaves on separate hosts, the classpath was wrong. We should not assume the classpath on the framework launching tasks is the same as on the slave. We should either 1) still use spark-class and let spark-class resolve classpaths, or 2) update CommandUtils to not run compute-classpath on the spot, but let the slave run compute-classpath to build the /usr/bin/java command line.
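For illustration, a minimal sketch of option 1), assuming `spark-class` exists at a known executor side location (the names here are illustrative, not from any of the patches):

```scala
// Option 1) sketched: launch through spark-class on the slave so the
// classpath is resolved where the executor actually runs, instead of
// embedding a classpath computed on the driver into CommandInfo's value.
def executorCommand(executorSparkHome: String, backendArgs: Seq[String]): String = {
  val quotedArgs = backendArgs.map(arg => "\"" + arg + "\"").mkString(" ")
  "\"" + executorSparkHome + "/bin/spark-class\" " +
    "org.apache.spark.executor.CoarseGrainedExecutorBackend " + quotedArgs
}
```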
Btw, have you guys (@liancheng, @scwf) tested with an actual Mesos and Spark cluster? I have one set up now to make sure things will run.
Hi @liancheng @tnachen @scwf. It seems that there are many patches fixing this that are based on each other (#2103, #1986). Can we flatten out the differences and consolidate all the changes within one PR?
Hey @tnachen and @scwf, the only reason I opened this PR separately is that we're already late for the Spark 1.1 RC release, and we need a fix quickly. We would definitely list both of you in the contributor list of this release :) In this PR I used CommandUtils to generate the executor command. As for testing, @tnachen you really hit the point :) I only tested this PR with a local single node Mesos cluster, which cannot simulate the situation where Mesos slave nodes don't have Spark installed in the same path as the driver side. It would be greatly appreciated if you could help test this PR with your real distributed Spark over Mesos cluster. Thanks in advance!
Hey @liancheng @tnachen @scwf - don't worry too much about consolidating the patches. I can just merge this one and give all three of you author credits on it... this is how we normally do it.
@pwendell let's not merge this yet; I'll try to run it on a Mesos cluster, as I don't think it will work. I'll try to have something tonight or tomorrow morning.
@andrewor14 A summary of differences between all 4 PRs:

@tnachen Ah, I see where I'm wrong, although I pass
This set of utilities is not intended to be used outside of the deploy code (i.e. Spark's standalone scheduler); that's why it's causing issues here.
Ok I just tried it on a Mesos cluster and it didn't work. The classpath it put in the Mesos command is where Spark lives on the host running the Spark shell, not in the spark-executor that was just pulled down from the tarball:

```
sh -c 'cd spark-1*; "/usr/bin/java" "-cp" "::/home/jclouds/src/spark/conf:/home/jclouds/src/spark/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar" "-XX:MaxPermSize=128m" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://[email protected]:47860/user/CoarseGrainedScheduler" "20140818-071808-3483423754-5050-2070-4" "10.151.50.130" "2"'
```

So we must still compute classpaths after the tarball is pulled down, not wherever spark-shell is being executed, assuming the executor will run from the tarred Spark.
@tnachen Thanks a lot! I'm currently working on another version that can figure out the executor side classpath correctly. The basic idea is to pass the extra options via the environment variable `SPARK_EXECUTOR_OPTS`, which is recognized by `spark-class`, so we still launch executors through `spark-class` and the classpath is resolved on the executor side.

You may find the WIP version here. Discussed with @pwendell about this solution tonight, and it seems workable. And it's also much simpler. For now, the only issue is that it cannot handle quoted strings with spaces correctly.
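A rough sketch of that WIP idea, assuming the Mesos protobuf `Environment` API and the standard `spark.executor.*` configuration keys (this is not the final patch):

```scala
import org.apache.mesos.Protos.Environment
import org.apache.spark.SparkConf

// Fold the extra options into SPARK_EXECUTOR_OPTS, an environment variable
// that spark-class reads, so executors are still launched via spark-class
// and the classpath is resolved on the executor side.
def executorOptsVariable(conf: SparkConf): Environment.Variable = {
  val extraJavaOpts = conf.get("spark.executor.extraJavaOptions", "")
  val extraLibraryPath = conf.getOption("spark.executor.extraLibraryPath")
    .map(path => s"-Djava.library.path=$path").getOrElse("")
  Environment.Variable.newBuilder()
    .setName("SPARK_EXECUTOR_OPTS")
    .setValue(s"$extraJavaOpts $extraLibraryPath".trim)
    .build()
}
```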
Hey guys, yeah this is an issue with the approach of using the utilities from the standalone deploy mode for this - it makes assumptions that don't hold in Mesos mode. I spoke a bit offline with @liancheng and I think there is a much simpler, more surgical fix that will unblock the Spark 1.1 release. But we should have a nicer way of building up the command in Scala, like is done here. It might mean we slightly refactor things so that parts of the utility functions for standalone mode can be used here.
I'm glad we're having these conversations :) They really help folks that have had a bad experience using Mesos with Spark. I'm looking forward to the fix, and once it's updated I can verify it with our Mesos cluster. I'm chatting with Mesos committers about different issues people are hitting, and I'll be addressing those in future patches.
Also, I don't think I mentioned it explicitly: I've been testing with a Spark tarball available through an HTTP server and SPARK_EXECUTOR_URI set to point at it, while the slaves have no Spark installed. I know folks are using both cases, where the executor URI is either set or unset; when unset, it defaults to the Spark home.
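For reference, a hedged sketch of how a backend might choose between those two cases (modeled on the `cd spark-1*` command shown earlier; the helper and exact strings are illustrative, not the actual backend code):

```scala
import org.apache.spark.SparkConf

// If SPARK_EXECUTOR_URI / spark.executor.uri is set, Mesos fetches and
// unpacks the tarball into the sandbox and the executor runs from there;
// otherwise we assume Spark is installed on the slave at `sparkHome`.
def launchPrefix(conf: SparkConf, sparkHome: String): String =
  conf.getOption("spark.executor.uri").orElse(sys.env.get("SPARK_EXECUTOR_URI")) match {
    case Some(_) => "cd spark-1*; ./bin/spark-class"  // run from unpacked tarball
    case None    => s"$sparkHome/bin/spark-class"     // assume slave-side install
  }
```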
hi @liancheng, is there a situation we should cover

We should eventually make
@liancheng Given that there is now a newer PR that supersedes this one, would you mind closing this? |
|
Sure. |
… via SPARK_EXECUTOR_OPTS

This is another try after #2145 to fix [SPARK-2608](https://issues.apache.org/jira/browse/SPARK-2608).

### Basic Idea

The basic idea is to pass `extraJavaOpts` and `extraLibraryPath` together via the environment variable `SPARK_EXECUTOR_OPTS`. This variable is recognized by `spark-class` and not used anywhere else. In this way, we still launch Mesos executors with `spark-class`/`spark-executor`, but avoid the executor side Spark home issue.

### Known Issue

Quoted strings with spaces are not allowed in either `extraJavaOpts` or `extraLibraryPath` when using Spark over Mesos (see the sketch after this commit message). The reason is that Mesos passes the whole command line as a single string argument to `sh -c` to start the executor, and this makes shell string escaping non-trivial to handle. This should be fixed in a later release.

### Background

Classes in the `org.apache.spark.deploy` package shouldn't be used here, as they assume Spark is deployed in standalone mode and give the wrong executor side Spark home directory. Please refer to the comments in #2145 for more details.

Author: Cheng Lian <[email protected]>

Closes #2161 from liancheng/mesos-fix-with-env-var and squashes the following commits:

ba59190 [Cheng Lian] Added fine grained Mesos executor support
1174076 [Cheng Lian] Draft fix for CoarseMesosSchedulerBackend
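To make the known issue concrete, a hypothetical illustration (not from the patch) of why a value with spaces breaks when the command is spliced together as one string for `sh -c`:

```scala
// A user-supplied option whose value contains a space.
val extraJavaOpts = "-Dfoo=bar baz"

// Naive splicing into the single command string Mesos hands to `sh -c`.
val cmd = s"/usr/bin/java $extraJavaOpts -cp ... Main"

// The slave's shell re-tokenizes the string, so the JVM sees "-Dfoo=bar"
// and a stray argument "baz"; avoiding this requires full shell escaping
// of every interpolated value, which is what makes the fix non-trivial.
```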
This PR is based on #1986, authored by @scwf.

The basic idea is to start Mesos executors directly with `java` (similar to what we do for standalone executors). This PR is different from #1986 in two major aspects:

1. Environment variables in `sc.executorEnvs` are properly set for Mesos executors
2. `PYTHONPATH` is properly set for fine grained Mesos executors by calling `sbin/mesos-pyenv.sh`
   - `sbin/spark-executor` is renamed to `sbin/mesos-pyenv.sh` since it's only responsible for setting `PYTHONPATH` now, and the executor is started by `MesosSchedulerBackend` after the change.