
Conversation

@jerryshao
Contributor

This PR is based on the work of @roji to support running Spark scripts from symlinks. Thanks for the great work, @roji. Would you mind taking a look at this PR? Thanks a lot.

Distributions like HDP normally expose the Spark executables as symlinks placed on the PATH, but Spark's current scripts do not recursively resolve a symlink to its real path, so Spark fails to execute when launched through a symlink. This PR tries to solve the issue by finding the absolute path behind the symlink.

Unlike what PR #2386 implemented, this does not use readlink -f, because -f is not supported on Mac; instead the path is resolved manually in a loop.

I've tested on Mac and Linux (CentOS); it looks fine.

This PR does not fix the scripts under the sbin folder; I'm not sure whether those need to be fixed as well.

Please help review; any comment is greatly appreciated.
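For reference, the manual loop the description mentions can be sketched like this (a hypothetical sketch, not the exact PR code; the function name `resolve_script` and the depth cap of 32 are assumptions):

```shell
# Hypothetical sketch of loop-based symlink resolution (not the exact PR code).
# Follows symlinks one hop at a time, handling relative link targets, with a
# depth cap to guard against cycles -- all without needing readlink -f.
resolve_script() {
  target="$1"
  depth=0
  while [ -L "$target" ] && [ "$depth" -lt 32 ]; do
    link="$(readlink "$target")"
    case "$link" in
      /*) target="$link" ;;                       # absolute link target
      *)  target="$(dirname "$target")/$link" ;;  # relative to the symlink's dir
    esac
    depth=$((depth + 1))
  done
  # Normalize the directory component to an absolute path
  echo "$(cd "$(dirname "$target")" && pwd)/$(basename "$target")"
}
```

The depth cap matters because a cyclic symlink chain would otherwise loop forever; `readlink -f` on GNU coreutils performs the same cycle detection internally.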

bin/beeline Outdated
Member

Hm, is there no way to refactor these and source the function once? It would be a good excuse to create such a common script here if one doesn't already exist.

Contributor Author

I think if we move this function to a new script, there is still the problem of how to find the real path of that new script.

Contributor Author

I will keep thinking about a way to remove this annoying duplication, and also do more corner-case testing of this function.

Member

I see, I would imagine it could be next to the symlink and symlinked as well... but is the point that even sourcing isn't going to follow the symlink? I thought the point was not that the symlinks don't work as links to files, but that they don't help you find where Spark is installed.

Contributor Author

Hi @srowen, yes, the actual problem is finding out where Spark is installed when it is launched through a symlink.

Sourcing can follow the symlink, but the problem is how to find the path of the sourced file/symlink.

Say we make spark-shell a symlink and add it to /usr/bin: when we execute spark-shell from any directory, how do we identify the path of the sourced file? If the sourced file is not linked into /usr/bin/xx, we still need to find out where Spark is installed.
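A small demonstration of the failure being described (a sketch with made-up paths; a temp directory stands in for /opt and /usr/bin):

```shell
# Shows why the classic "$(cd "$(dirname "$0")"/..; pwd)" idiom finds the
# wrong home when the script is invoked through a symlink.
tmp="$(mktemp -d)"
mkdir -p "$tmp/opt/spark/bin" "$tmp/usr/bin"

cat > "$tmp/opt/spark/bin/spark-shell" <<'EOF'
#!/bin/sh
# The classic idiom: assumes $0 is the real script location.
echo "$(cd "$(dirname "$0")"/..; pwd)"
EOF
chmod +x "$tmp/opt/spark/bin/spark-shell"
ln -s "$tmp/opt/spark/bin/spark-shell" "$tmp/usr/bin/spark-shell"

"$tmp/opt/spark/bin/spark-shell"   # prints the real Spark home (.../opt/spark)
"$tmp/usr/bin/spark-shell"         # prints the symlink's parent (.../usr) -- wrong
```

Invoked directly, `$0` is the real script path; invoked via the symlink, `$0` is the symlink path, so the derived home points at the symlink's parent directory instead of the installation.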

Member

I mean the file would be in /usr/bin too, yes. I understand it's a little ugly, yeah.
Hm, I wonder if this is really the right approach to finding where Spark is installed. Shouldn't this be covered by SPARK_HOME? That's how other software works. The title of the JIRA is actually misleading, since the symlinks themselves work fine.

Contributor Author

If we assume SPARK_HOME must be set, then this problem is easily solved. But we don't make that assumption, so we still need some script logic to find out where Spark is installed, and the problem comes back again.

BTW, projects like Hadoop have the same issue in locating hadoop-config.sh.

@srowen
Member

srowen commented Sep 9, 2015

Other than that if it doesn't change existing behavior and lets symlinks stand a chance of working, seems good.

@SparkQA

SparkQA commented Sep 9, 2015

Test build #42204 has finished for PR 8669 at commit 69bde5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@patrungel

Applying the same fix to sbin executables would be great. In fact, the whole fix would be incomplete without it.

@jerryshao
Contributor Author

Hi @patrungel, I've also fixed this issue for the sbin executables; mind taking a look?

@SparkQA

SparkQA commented Sep 10, 2015

Test build #42245 has finished for PR 8669 at commit d6183b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@patrungel

Hi, @jerryshao
I did a quick test with alternatives (against sbin/start-history-server.sh); the fix works for me as expected. Thanks a ton.

@aniro

aniro commented Oct 8, 2015

Can the following be fixed as well?

--- a/bin/load-spark-env.sh    2015-10-08 14:51:20.874478368 +0000
+++ b/bin/load-spark-env.sh    2015-10-08 15:04:02.109674611 +0000
@@ -47,10 +47,7 @@
 if [ -z "$SPARK_ENV_LOADED" ]; then
   export SPARK_ENV_LOADED=1

-  # Returns the parent of the directory this script lives in.
-  parent_dir="$(cd "`dirname "$0"`"/..; pwd)"
-
-  user_conf_dir="${SPARK_CONF_DIR:-"$parent_dir"/conf}"
+  user_conf_dir="${SPARK_CONF_DIR:-"$FWDIR"/conf}"

   if [ -f "${user_conf_dir}/spark-env.sh" ]; then
     # Promote all variable declarations to environment (exported) variables

@JoshRosen
Contributor

Woah, this PR is really big! @jerryshao, do you think that you could move this function into a common "helper library" file then source / include that file into the other scripts? That would help to reduce duplication here.

@srowen
Member

srowen commented Oct 17, 2015

@JoshRosen yes, agreed. See the discussion above about the small chicken-and-egg problem concerning the location of the helper script. It may still be a better solution than this much duplication; or else, declare that SPARK_HOME must be set.

Contributor

Can this function be replaced by:

DIR="$(dirname "$(readlink -f "$0")")"

I see you're recursively resolving the symlink and putting a limit in the depth of recursion, but it sounds like readlink -f does just that.

Contributor

Nevermind, I just saw the comment about Macs not supporting that. There's no alternative on Macs? (If no, that sucks.)

Contributor Author

Hi @vanzin, the behavior of readlink -f is different on Mac; there is a discussion of this on Stack Overflow. Another solution is to install greadlink through brew.
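A common portability shim for this (an illustration of the brew/greadlink workaround mentioned above, not something this PR adopts; the `resolve_dir` helper name is made up):

```shell
# Prefer GNU greadlink (installed on macOS via `brew install coreutils`)
# when present; fall back to the system readlink otherwise. Note that the
# BSD readlink shipped with macOS lacks the -f flag entirely.
if command -v greadlink >/dev/null 2>&1; then
  READLINK="greadlink"
else
  READLINK="readlink"
fi

# On systems where "$READLINK" -f works (GNU coreutils), this resolves a
# symlinked script to its real directory in one call:
resolve_dir() {
  dirname "$("$READLINK" -f "$1")"
}
```

The drawback, and presumably why the PR avoids it, is that it silently degrades on a stock Mac where neither greadlink nor `readlink -f` is available.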

@srowen
Member

srowen commented Oct 22, 2015

I think this is a whole lot of change, and the purpose really is just to avoid setting SPARK_HOME when Spark is symlinked. But that's pretty standard procedure, right? I think we should resolve this as won't-fix.

@patrungel

The change aims to make the Spark binaries launch properly when symlinked, not just to avoid setting SPARK_HOME.
As of current master, the SPARK_HOME value is overwritten in the launchers by

export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"

which is generically wrong and does not work for a symlinked set-up.

This behaviour prevents proper installations/packaging (i.e. with alternatives).

So, if 'resolve this as wont-fix' refers to the issue, I beg to differ (that discussion should rather be moved to the ticket, then).
If it refers just to the change suggested, I don't protest it.

@srowen
Member

srowen commented Oct 22, 2015

Gotcha, is the way forward really to not overwrite SPARK_HOME if it's set? Because most of the work here seems to be about discovering the home otherwise. The symlinks themselves are fine, just not quite suitable for discovering the real location of the target.

@jerryshao
Contributor Author

Hi @srowen, I admit the current solution is quite ugly, but as you mentioned, it is a chicken-and-egg problem and it is hard to remove the duplication. Another solution is to rely on SPARK_HOME to support symlinks, but that still requires changing the current executables, since we don't currently honor SPARK_HOME. Take beeline as an example: even if SPARK_HOME is set, we still could not execute it from a symlink.

So if setting SPARK_HOME is a compromise solution you all agree on, I can update the code.

What's your opinion?
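The compromise pattern being discussed can be sketched as follows (the derived path is exactly the idiom that goes wrong through a symlink, which is why a user-provided value wins):

```shell
# Honor a user-provided SPARK_HOME; only derive it from the script's own
# location when unset. A user launching via a symlink sets SPARK_HOME
# manually, bypassing the (wrong-through-a-symlink) derivation.
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
fi
echo "$SPARK_HOME"
```

This trades automatic symlink resolution for a simple, duplication-free convention.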

@jerryshao
Contributor Author

If SPARK_HOME is the standard way, we should update the scripts to honor it. It's not a good idea to close this issue as won't-fix, because lots of deployments and packages actually require symlinks.

roji and others added 5 commits October 26, 2015 15:34
The current scripts (e.g. pyspark) fail to run when they are
executed via symlinks. A common Linux scenario would be to have Spark
installed somewhere (e.g. /opt) and have a symlink to it in /usr/bin.

Fixed the scripts to traverse symlinks until reaching the actual binary.
@jerryshao
Contributor Author

Hi @srowen @patrungel @JoshRosen and @vanzin, I changed this patch to honor SPARK_HOME to solve the symlink issue: if users want to run the executables from a symlink, they should set SPARK_HOME manually. This removes lots of duplication while keeping the ability to run from symlinks. Please help review; any comment is greatly appreciated.

@jerryshao
Contributor Author

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 27, 2015

Test build #44400 has finished for PR 8669 at commit c796397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * "$
      * "$
      * exec "$
      * exec "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * "$

bin/beeline Outdated
Member

Yes, I very much like how this has standardized everything to use SPARK_HOME instead of FWDIR, and doesn't overwrite the value if already set. (Nit: all of the occurrences of this line have a 4-space indent instead of 2)

LGTM; does anyone see a reason this isn't a good idea? I suppose now SPARK_HOME, if set, has an effect everywhere, but it looks like the desired effect. Docs also make reference to SPARK_HOME as if it has this effect.

Contributor Author

Thanks @srowen for your comments, I will change to the 2-space indent.

@SparkQA

SparkQA commented Oct 27, 2015

Test build #44408 has finished for PR 8669 at commit 41511b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * "$
      * "$
      * exec "$
      * exec "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * "$

@SparkQA

SparkQA commented Oct 27, 2015

Test build #44423 has finished for PR 8669 at commit 28fd58d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * "$
      * "$
      * exec "$
      * exec "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * "$

@SparkQA

SparkQA commented Nov 4, 2015

Test build #44988 has finished for PR 8669 at commit 16ea884.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * "$
      * "$
      * exec "$
      * exec "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * nohup nice -n "$SPARK_NICENESS" "$
      * "$

Member

I have trouble parsing what this script was doing before, but the desired outcome seems clear. The only change here is that SPARK_PREFIX is no longer set and exported, but I believe you've correctly removed usages of it in favor of SPARK_HOME. Good cleanup and standardization IMHO.

Contributor Author

This chunk of code tried to handle symlinks and the relative path of this file; since we now honor SPARK_HOME to solve the issue, from my understanding the original code is no longer necessary.

@srowen
Member

srowen commented Nov 4, 2015

LGTM. There's a lot going on here but it's all really standardizing on one approach to defining the Spark base directory. I scanned all the lines and it looks correct and carefully applied, so I feel good about merging this to master.

@asfgit asfgit closed this in 8aff36e Nov 4, 2015
@marmbrus
Contributor

It has been pointed out that this commit might be the cause of a regression that is possibly holding up the 1.6 release (https://issues.apache.org/jira/browse/SPARK-12345). It would be great if those involved could take a look.

@jerryshao
Contributor Author

Thanks @marmbrus for the notice, I will take a look at the issue.

@srowen
Member

srowen commented Dec 16, 2015

It's a good lead; I'm looking too. The effect of this change should be to respect SPARK_HOME if it's set to find Spark scripts and files. It's not clear why this would cause a remote machine to use the home dir set on the driver, unless something is shipping the SPARK_HOME value to the remote machine. I'll comment on the JIRA.

@dragos
Contributor

dragos commented Dec 16, 2015

The MesosDispatchScheduler ships a number of environment variables.

@jerryshao
Contributor Author

@dragos Did you specify SPARK_HOME manually in your environment? By default, no one would have SPARK_HOME set.

@srowen
Member

srowen commented Dec 16, 2015

@jerryshao though the scripts will now set SPARK_HOME for you if not already set.
Should the Mesos code ship this variable across machines though? I'd have thought it's properly local to a machine only.

If this is right, one of these two things could be a way forward: don't export SPARK_HOME or don't send it.

@jerryshao
Contributor Author

Ohh, I see. That may well be the problem. One fix is to remove the export in all the scripts, using SPARK_HOME only as a shell variable.

@jerryshao
Contributor Author

@dragos, if spark.mesos.executor.home is set to the directory where Spark is installed on the slave node, does your problem still exist? Also, from my understanding, if it is due to a SPARK_HOME inconsistency between driver and executor, client mode should have the same problem, right?

@dragos
Contributor

dragos commented Dec 16, 2015

I'll check that. I don't have Spark installed on every slave node in my test setup.

@jerryshao
Contributor Author

One more question @dragos , did you specify spark.executor.uri or SPARK_EXECUTOR_URI?

@dragos
Contributor

dragos commented Dec 16, 2015

I believe I tried both.

@dragos
Contributor

dragos commented Dec 16, 2015

@jerryshao I observed the same failure using spark.mesos.executor.home.

EDIT: That's probably because spark.mesos.executor.home is not shipped by the Mesos dispatcher, so it falls back to SPARK_HOME

@jerryshao
Contributor Author

I see. Can you please confirm Spark is installed at the path given by spark.mesos.executor.home?

@dragos
Contributor

dragos commented Dec 16, 2015

@jerryshao Yes, it is.

@jerryshao
Contributor Author

It is a little weird. If spark.mesos.executor.home is correctly set, then from the code, Mesos should be able to find the right path to the executables:

    val executorSparkHome = conf.getOption("spark.mesos.executor.home")
      .orElse(sc.getSparkHome())
      .getOrElse {
        throw new SparkException("Executor Spark home `spark.mesos.executor.home` is not set!")
      }
    val runScript = new File(executorSparkHome, "./bin/spark-class").getCanonicalPath
    command.setValue(
      "%s \"%s\" org.apache.spark.executor.CoarseGrainedExecutorBackend"
        .format(prefixEnv, runScript) +
      s" --driver-url $driverURL" +
      s" --executor-id ${offer.getSlaveId.getValue}" +
      s" --hostname ${offer.getHostname}" +
      s" --cores $numCores" +
      s" --app-id $appId")

@jerryshao
Contributor Author

Anyway I think we should not export SPARK_HOME in all the scripts, what do you think @srowen ?

@dragos
Contributor

dragos commented Dec 16, 2015

I admit it is very surprising. Here's what I see:

Driver page on the Mesos dispatcher (note that spark.executor.uri is set, but also SPARK_HOME -- which is wrong, since it's an absolute path on my machine):

[screenshot, 2015-12-16 10:22:26]

And here is the log. Note that the executor is fetched correctly, but the submit script spark-class is calculated relative to SPARK_HOME (inside spark-submit!)

I1216 09:18:44.974624   464 fetcher.cpp:369] Fetching URI '/var/spark/spark-1.6.0-bin-hadoop2.6.tgz'
I1216 09:18:44.974628   464 fetcher.cpp:243] Fetching directly into the sandbox directory
I1216 09:18:44.974639   464 fetcher.cpp:180] Fetching URI '/var/spark/spark-1.6.0-bin-hadoop2.6.tgz'
I1216 09:18:44.974652   464 fetcher.cpp:160] Copying resource with command:cp '/var/spark/spark-1.6.0-bin-hadoop2.6.tgz' '/tmp/mesos/slaves/ac7d8645-030b-4c99-a47f-2e02379d8c41-S1/frameworks/ac7d8645-030b-4c99-a47f-2e02379d8c41-0002/executors/driver-20151216101839-0001/runs/ab128426-95fa-4a3e-82f6-9c5b4baba816/spark-1.6.0-bin-hadoop2.6.tgz'
I1216 09:18:46.351866   464 fetcher.cpp:76] Extracting with command: tar -C '/tmp/mesos/slaves/ac7d8645-030b-4c99-a47f-2e02379d8c41-S1/frameworks/ac7d8645-030b-4c99-a47f-2e02379d8c41-0002/executors/driver-20151216101839-0001/runs/ab128426-95fa-4a3e-82f6-9c5b4baba816' -xf '/tmp/mesos/slaves/ac7d8645-030b-4c99-a47f-2e02379d8c41-S1/frameworks/ac7d8645-030b-4c99-a47f-2e02379d8c41-0002/executors/driver-20151216101839-0001/runs/ab128426-95fa-4a3e-82f6-9c5b4baba816/spark-1.6.0-bin-hadoop2.6.tgz'
I1216 09:18:48.625231   464 fetcher.cpp:84] Extracted '/tmp/mesos/slaves/ac7d8645-030b-4c99-a47f-2e02379d8c41-S1/frameworks/ac7d8645-030b-4c99-a47f-2e02379d8c41-0002/executors/driver-20151216101839-0001/runs/ab128426-95fa-4a3e-82f6-9c5b4baba816/spark-1.6.0-bin-hadoop2.6.tgz' into '/tmp/mesos/slaves/ac7d8645-030b-4c99-a47f-2e02379d8c41-S1/frameworks/ac7d8645-030b-4c99-a47f-2e02379d8c41-0002/executors/driver-20151216101839-0001/runs/ab128426-95fa-4a3e-82f6-9c5b4baba816'
I1216 09:18:48.625273   464 fetcher.cpp:446] Fetched '/var/spark/spark-1.6.0-bin-hadoop2.6.tgz' to '/tmp/mesos/slaves/ac7d8645-030b-4c99-a47f-2e02379d8c41-S1/frameworks/ac7d8645-030b-4c99-a47f-2e02379d8c41-0002/executors/driver-20151216101839-0001/runs/ab128426-95fa-4a3e-82f6-9c5b4baba816/spark-1.6.0-bin-hadoop2.6.tgz'
I1216 09:18:48.666476   539 logging.cpp:172] INFO level logging started!
I1216 09:18:48.667392   539 exec.cpp:134] Version: 0.25.0
I1216 09:18:48.669401   545 exec.cpp:208] Executor registered on slave ac7d8645-030b-4c99-a47f-2e02379d8c41-S1
bin/spark-submit: line 27: /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class: No such file or directory

@skonto
Contributor

skonto commented Dec 16, 2015

If executorUri.isDefined, then this code won't be reached (this is what we have as well):

    val executorSparkHome = conf.getOption("spark.mesos.executor.home")
      .orElse(sc.getSparkHome())
      .getOrElse {
        throw new SparkException("Executor Spark home `spark.mesos.executor.home` is not set!")
      }

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala

@jerryshao
Contributor Author

Yes, I agree with @skonto. Why does it still honor SPARK_HOME? A little strange.

@skonto
Contributor

skonto commented Dec 16, 2015

It's because the Mesos Dispatcher passes that env variable to the driver at this line:
builder.setEnvironment(envBuilder.build())

In 1.6 the spark-submit script has changed...
https://github.com/apache/spark/blob/master/bin/spark-submit
These lines keep SPARK_HOME from being changed to the local path on the Mesos slave, because it has already been set:
if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi
And the driver is started with the spark-submit command anyway...

@jerryshao
Contributor Author

Ahhh, I see. The problem is that we export SPARK_HOME in all the shell scripts.

@srowen
Member

srowen commented Dec 16, 2015

I'm not sure it's that. It's that SPARK_HOME is copied by the Mesos infrastructure to other machines; I think this is inherently local. Still, not exporting SPARK_HOME would be a quick way to test whether this theory is true.
