Conversation

@Leemoonsoo
Member

https://issues.apache.org/jira/browse/ZEPPELIN-262

This patch makes Zeppelin use spark-submit to run the Spark interpreter process when SPARK_HOME is defined. This will potentially solve all the configuration problems related to the Spark interpreter.

How to use?

Define the SPARK_HOME env variable in conf/zeppelin-env.sh.
Zeppelin will then use SPARK_HOME/bin/spark-submit, so you will not need any additional configuration :-)
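
For example, a minimal conf/zeppelin-env.sh entry could look like the sketch below (the Spark path is illustrative, not something this patch prescribes):

# conf/zeppelin-env.sh — point this at your Spark installation
export SPARK_HOME=/usr/local/spark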

Backward compatibility

If you have not defined SPARK_HOME, you are still able to run the Spark interpreter in the old (current) way.
However, this is no longer encouraged.

@Leemoonsoo
Member Author

Ready to merge. Please review the changes.

Member

Is there a reason the last half is taken out?

Member Author

Brought back those lines.

@bzz
Member

bzz commented Sep 3, 2015

This is an awesome improvement, thank you @Leemoonsoo
Looks great to me.

@Leemoonsoo
Member Author

I have pushed more commits that handle pyspark. Please review them, too.

Contributor

CDH? Is there a Hadoop-distribution-specific path?

Member Author

It's part of a heuristic to search for and add Hadoop jar files.
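
A rough sketch of what such a heuristic can look like (the probed directories below are illustrative assumptions, not the actual code):

# probe common Hadoop install locations and append any that exist
for dir in /usr/lib/hadoop /opt/cloudera/parcels/CDH/lib/hadoop; do
  [ -d "${dir}" ] && ZEPPELIN_CLASSPATH+=":${dir}/*"
done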

@randerzander
Contributor

@Leemoonsoo how does z.load work with spark-submit? It seems those dependency jars should be added automatically to spark-submit's --jars argument.

@Leemoonsoo
Member Author

@randerzander Dependency jars downloaded via z.load() are loaded after the SparkContext is created, by calling sc.addJar(). So I think it will not be affected by this change.
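
For contrast, adding jars at submit time (what the question assumes) would look like the hypothetical invocation below; z.load() instead calls sc.addJar() on the already-running SparkContext, so it is independent of the submit command:

# hypothetical invocation for comparison only; the jar paths are placeholders
${SPARK_HOME}/bin/spark-submit --jars /path/to/dep1.jar,/path/to/dep2.jar <app-jar>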

@felixcheung
Member

looks good!

@Leemoonsoo
Member Author

Merging if there are no more discussions.

@asfgit asfgit closed this in b4b4f55 Sep 8, 2015
Leemoonsoo added a commit to Leemoonsoo/zeppelin that referenced this pull request Sep 17, 2015
https://issues.apache.org/jira/browse/ZEPPELIN-262

This patch makes Zeppelin use spark-submit to run the Spark interpreter process when SPARK_HOME is defined. This will potentially solve all the configuration problems related to the Spark interpreter.

#### How to use?

Define the SPARK_HOME env variable in conf/zeppelin-env.sh.
Zeppelin will then use SPARK_HOME/bin/spark-submit, so you will not need any additional configuration :-)

#### Backward compatibility

If you have not defined SPARK_HOME, you are still able to run the Spark interpreter in the old (current) way.
However, this is no longer encouraged.

Author: Lee moon soo <[email protected]>

Closes apache#270 from Leemoonsoo/spark_submit and squashes the following commits:

4eb0848 [Lee moon soo] export and check SPARK_SUBMIT
a8a3440 [Lee moon soo] handle spark.files correctly for pyspark when spark-submit is used
d4acd1b [Lee moon soo] Add PYTHONPATH
c9418c6 [Lee moon soo] Bring back some entries with more commments
cac2bb8 [Lee moon soo] Take care classpath of SparkIMain
5d3154e [Lee moon soo] Remove clean. otherwise mvn clean package will remove interpreter/spark/dep directory
2d27e9c [Lee moon soo] use spark-submit to run spark interpreter process when SPARK_HOME is defined

(cherry picked from commit b4b4f55)
Signed-off-by: Lee moon soo <[email protected]>
@smusevic

smusevic commented Mar 1, 2016

Hello,
I'm testing out zeppelin-0.5.6-incubating-bin-all.tgz.

I might be wrong, but it seems to me that this change causes:

SPARK_CLASSPATH was detected (set to ':/etc/hbase/conf').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --driver-class-path to augment the driver classpath
 - spark.executor.extraClassPath to augment the executor classpath

16/03/01 08:11:50 WARN spark.SparkConf: Setting 'spark.executor.extraClassPath' to ':/etc/hbase/conf' as a work-around.
16/03/01 08:11:50 ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Found both spark.driver.extraClassPath and SPARK_CLASSPATH. Use only the former.
        at org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$8.apply(SparkConf.scala:473)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$6$$anonfun$apply$8.apply(SparkConf.scala:471)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:471)
        at org.apache.spark.SparkConf$$anonfun$validateSettings$6.apply(SparkConf.scala:459)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:459)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:391)
        at org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:339)
        at org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:145)
        at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:465)
        at org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
        at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:92)
        at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:300)
        at org.apache.zeppelin.scheduler.Job.run(Job.java:169)
        at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:134)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

when conf/zeppelin-env.sh contains export SPARK_HOME=....
After removing the following text from the added line 138 in bin/interpreter.sh:

--driver-class-path "${ZEPPELIN_CLASSPATH_OVERRIDES}:${CLASSPATH}"

the issue is resolved, as suggested by this email, but something else happens:

| z
<console>:22: error: not found: value z
              z
              ^

which sadly blocks me from using z.load("path/to/jar"), which is what I really need to do.
Please note that I do not have access to change any of the files inside SPARK_HOME, including any conf files residing therein.

Is there a workaround for this? Am I doing something wrong?
Thanks in advance!
S.

@Leemoonsoo
Member Author

@smusevic It looks like you have export SPARK_CLASSPATH=/etc/hbase/conf in conf/zeppelin-env.sh.
Could you try export ZEPPELIN_CLASSPATH=/etc/hbase/conf instead?
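
That is, a sketch of the suggested change in conf/zeppelin-env.sh:

# replace the deprecated export SPARK_CLASSPATH=/etc/hbase/conf with:
export ZEPPELIN_CLASSPATH=/etc/hbase/conf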

@smusevic

smusevic commented Mar 2, 2016

@Leemoonsoo thanks for your reply. I most definitely do not have export SPARK_CLASSPATH=/etc/hbase/conf in my conf/zeppelin-env.sh; I double-checked just now.
However, is anyone aware of SPARK_CLASSPATH getting initialized during spark-submit?
SPARK_CLASSPATH does get initialized in bin/interpreter.sh, but investigation has revealed that the line in question does not execute if SPARK_HOME is set in conf/zeppelin-env.sh.
Anyway, my question now would be: if SPARK_CLASSPATH is not set when spark-submit is executed, should Zeppelin with SPARK_HOME set to some value work?
Thank you in advance!
Regards,
S.

@Leemoonsoo
Member Author

@smusevic
Right, that should work. Setting SPARK_HOME is the preferred way to configure Zeppelin with Spark.

@smusevic

smusevic commented Mar 3, 2016

Thanks, it turned out that SPARK_CLASSPATH was set in one of the shell files (bad practice...). It works fine now, thanks!

export SPARK_SUBMIT="${SPARK_HOME}/bin/spark-submit"
SPARK_APP_JAR="$(ls ${ZEPPELIN_HOME}/interpreter/spark/zeppelin-spark*.jar)"
# This will eventually pass SPARK_APP_JAR to the classpath of SparkIMain
ZEPPELIN_CLASSPATH=${SPARK_APP_JAR}
@weipuz

Hello, when I set SPARK_HOME to my external Spark, I found the zeppelin-interpreter-sparkxxx.log file was gone. I dug further and found that if I change line 79 in interpreter.sh from ZEPPELIN_CLASSPATH=${SPARK_APP_JAR} to ZEPPELIN_CLASSPATH+=${SPARK_APP_JAR}, I get all the Spark interpreter logs back. Is this a bug in the code, or did I misunderstand something?
Regards,
Weipu
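
A sketch of the change described above in bin/interpreter.sh, exactly as reported (whether an explicit ':' separator is also needed depends on the surrounding script):

# before: overwrites any classpath entries set earlier in the script
ZEPPELIN_CLASSPATH=${SPARK_APP_JAR}
# after: appends, so earlier entries (reportedly including the logging setup) survive
ZEPPELIN_CLASSPATH+=${SPARK_APP_JAR}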

Contributor

@weipuz Thanks for digging into it!
I noticed that issue too (I set SPARK_HOME and could not get the Spark log file).
I also changed ZEPPELIN_CLASSPATH=${SPARK_APP_JAR} to ZEPPELIN_CLASSPATH+=${SPARK_APP_JAR} as you said, and finally I got my zeppelin-interpreter-spark-***.log file back. As long as this is not the intended implementation, I think we need to fix it.

Member Author

I think it needs to be fixed 👍

Contributor

@weipuz @Leemoonsoo I pushed a patch for this issue with a HOT FIX tag at #769 :)
Thanks again @weipuz for reporting this.

lelou6666 pushed a commit to lelou6666/incubator-zeppelin that referenced this pull request Mar 25, 2016
Changing mailinglist address to apache one