
Conversation


@echarles echarles commented Jun 9, 2016

What is this PR for?

Spark interpreter for Spark 2.0.0 and Scala 2.11 (implemented in Scala)

What type of PR is it?

[Feature]

Todos

* [ ] - Clean code
* [ ] - Test SQL
* [ ] - Test Python
* [ ] - Test R

What is the Jira issue?

How should this be tested?

Build it with

```
mvn install -Pscala-2.11 -Dscala.binary.version=2.11 -Dscala.version=2.11.8 \
  -Pspark-2.0 -Dspark.version=2.0.0-SNAPSHOT \
  -Phadoop-2.6 -Dhadoop.version=2.7.2 -Pyarn \
  -Dmaven.findbugs.enable=false -Drat.skip=true \
  -Ppyspark -Psparkr \
  -Dcheckstyle.skip=true -Dcobertura.skip=true -Pbuild-distr \
  -pl '!alluxio,!cassandra,!elasticsearch,!file,!flink,!hbase,!hive,!ignite,!jdbc,!kylin,!lens,!phoenix,!postgresql,!tajo' \
  -DskipTests
```

Run it and test a Spark paragraph.

Screenshots (if appropriate)

Questions:

* Do the license files need an update? N
* Are there breaking changes for older versions? N
* Does this need documentation? Y


echarles commented Jun 9, 2016

This is WIP.

I'd like to get feedback on the approach (a Scala implementation in the current spark module), taking into consideration:

* Spark 2.0 will ship by default on Scala 2.11 (it makes sense to me to develop the interpreter on Scala 2.11).
* SQLContext and HiveContext are deprecated in favor of SparkSession (see the sketch after this list).
* HttpServer is removed, and some class methods now have access restricted to Spark-internal packages.
* SparkIMain and SparkJLineCompletion no longer exist.
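For context, a minimal sketch of the new SparkSession entry point (the master URL and app name here are illustrative):

```
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession replaces the separate SQLContext/HiveContext from 1.x;
    // enableHiveSupport() on the builder is the opt-in HiveContext replacement.
    val spark = SparkSession.builder()
      .master("local[*]")       // illustrative master URL
      .appName("spark2-sketch") // illustrative app name
      .getOrCreate()

    spark.range(5).show()       // quick sanity check
    spark.stop()
  }
}
```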

Building on the current Java classes with reflective method invocation may be possible, but it would make the code very difficult to read and develop.

This PR proposes separate Scala classes for these very specific Spark 2.0 breaking API changes.

WDYT?

Based on feedback, I will further validate the functionality (for now, a simple Spark 2.0 call works well in my local env).

@Leemoonsoo

@echarles Thanks for the contribution.

How about we divide the problem into:

* Scala 2.11 support
* Spark 2.0 support
* reimplementing the Spark interpreter in Scala

@lresende is working on Scala 2.11 support in #747. I'm also trying to help merge the code for Scala 2.10 and 2.11 into one via lresende#1.
One of the goals of #747 is to support Scala 2.11 and 2.10 from a single Zeppelin binary package.
Once #747 is done, this PR can take Scala 2.11 support from it.

Regarding Spark 2.0 and the reimplementation in Scala:
What do you think about supporting Spark 2.0 as well as Spark 1.x, given that many users will keep using Zeppelin with Spark 1.x for some time?


echarles commented Jun 9, 2016

Sure, we can wait for the #747 merge.

I see there is a spark/src/main/scala-2.11/org/apache/zeppelin/spark/SparkInterpreter.java: does this mean there will be two separate implementations, one for 2.10 and one for 2.11?

@Leemoonsoo

I'm trying to combine 2.10 and 2.11 into one implementation in lresende#1

A review comment on the pom properties:

```
<spark.download.url>http://archive.apache.org/dist/spark/${spark.archive}/${spark.archive}.tgz</spark.download.url>
<spark.bin.download.url>http://archive.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}-bin-without-hadoop.tgz</spark.bin.download.url>
<spark.dist.cache>${project.build.directory}/../../.spark-dist</spark.dist.cache>
<py4j.version>0.8.2.1</py4j.version>
```

there's a new version of py4j too
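For reference, a sketch of what that bump could look like; the exact version is an assumption here (Spark 2.0.0 bundled py4j 0.10.1):

```
<!-- hypothetical bump of the property above; exact version is an assumption -->
<py4j.version>0.10.1</py4j.version>
```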

The PR author replied:

thx, taking this into account in the next push.

@felixcheung

From what I see, we should still be able to use most of the existing code. Could you elaborate more on how bad it would be to support Spark 1.x and 2.x in the same interpreter code?

@echarles

Dealing with mixed Scala 2.10/2.11 and Spark 1.x/2.x in a single implementation is always possible, but it leads to code full of reflective method invocation (see e.g. https://github.com/lresende/incubator-zeppelin/pull/1/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR216).

In the end, you have an implementation that is a succession of if (...) then invokeMethod... calls, as in the sketch below.
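A hypothetical sketch of that style in Scala, for illustration (the helper and its name are invented, not the actual Zeppelin code; only the Spark class and method names being looked up are real):

```
// Illustrative only -- not the actual Zeppelin helper. A single code path
// that branches on the Spark version and builds the SQL entry point via
// reflection, so the file compiles against neither Spark 1.x nor 2.x.
def sqlEntryPoint(sc: AnyRef, sparkVersion: String): AnyRef =
  if (sparkVersion.startsWith("2.")) {
    // SparkSession.builder().getOrCreate() via reflection
    val builder = Class.forName("org.apache.spark.sql.SparkSession")
      .getMethod("builder").invoke(null)
    builder.getClass.getMethod("getOrCreate").invoke(builder)
  } else {
    // new SQLContext(sc) via reflection for Spark 1.x
    Class.forName("org.apache.spark.sql.SQLContext")
      .getConstructor(Class.forName("org.apache.spark.SparkContext"))
      .newInstance(sc)
      .asInstanceOf[AnyRef]
  }
```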

On the other hand, having multiple implementations leads to maintenance overhead and a risk of divergence.

In this particular case, I was thinking that having two lines was worth discussing:

1. Spark 1.x on Scala 2.10
2. Spark 2.x on Scala 2.11

I also don't see why we should still support the DepInterpreter in future developments (normally, deps should be configured via the interpreter settings). The two-lines approach would also make other evolutions easier, such as having a ZeppelinContext available for all interpreters.

Also, we already kind of have more than one Spark implementation: the Livy one has its own implementation and features.

@Leemoonsoo

Thanks for elaborating on the idea of the two-lines approach.

I think there are always pros and cons.

If we go with the two-lines approach, the code will be cleaner, simpler, and easier to read.

But this means Zeppelin would generally create a binary per interpreter dependency, because we'd have the same policy for all interpreters. In the end, I'm afraid we'd end up making a bunch of binary packages like:

* zeppelin with spark 1.x on scala 2.10
* zeppelin with spark 1.x on scala 2.11
* zeppelin with spark 2.0 on scala 2.10
* zeppelin with spark 2.0 on scala 2.11
* zeppelin with spark 2.1 on scala 2.10
* zeppelin with spark 2.1 on scala 2.11
* zeppelin with python3.x
* zeppelin with python2.x
* zeppelin with hive1
* zeppelin with hive2
* .....

From the user's perspective, that is not an improvement. A single Zeppelin binary used to work with everything, but now users would have to understand the differences between packages and be able to select the right one before use.

I think code complexity will not increase endlessly if we limit Spark support to the last x releases.

@Leemoonsoo

When @minahlee contributed dependency loading through the interpreter setting, we also thought DepInterpreter could be deprecated and removed. That's why the current Spark interpreter documentation marks %dep as deprecated.

But since then, there has been strong feedback from users that they like DepInterpreter, especially because it enables self-documentation, i.e. working code and its dependencies can live in the same notebook. And I think that really makes sense.

So I think we need to reconsider deprecating DepInterpreter.

@echarles

I agree with your argument. If we push the discussion a bit further, we cannot have a single binary that fits all users' needs. What if I want spark-2-scala-2.11 with flink-0.9-scala-2.12? Your single binary will not give me any solution...

I would rather see a zeppelin-kernel (nearly bare, without interpreters) with the ability for the user to enable any interpreter via a shell command, along the lines of the sketch below.
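A hypothetical sketch of what such a command could look like (the name and flags are invented for illustration; no such command existed at the time):

```
# Hypothetical command -- name and flags invented for illustration.
# Fetch an interpreter from a Maven repository and enable it in Zeppelin.
./bin/zeppelin-interpreter.sh --install spark \
  --spark-version 2.0.0 --scala-version 2.11
```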

Regarding DepInterpreter, I can understand what users say, but for some packages it simply does not work (example: spark-csv running on YARN, if I remember well). Not sure if it can be easily fixed.

Bottom line: I am no fan of the invoke coding style, but I will wait for the other PR and try to build on top of it.

@Leemoonsoo

Why wouldn't a single binary work for spark-2-scala-2.11 and flink-0.9-scala-2.12?
I'm also not a fan of reflection. But if I have to choose between complexity of code and complexity of usage, I would choose keeping usage simple, even though the code becomes more complex. I believe that is the whole point of making software.

And I 100% agree. Once we have https://issues.apache.org/jira/browse/ZEPPELIN-598 and a proper UI / command line to list and download interpreters from a Maven repository, I think things will become much easier and more flexible.

@felixcheung

+1 on all of that. I think we could work together on a proposal to see what is the best way to go forward.
I think we are way overdue on separating "programming languages" (and versions) and "frameworks" (and versions).

@echarles

@Leemoonsoo

@felixcheung Is there already a JIRA for the proposal you are thinking of? I feel we have nearly everything open already, but maybe we need an umbrella issue?

@echarles

@Leemoonsoo Oh, #908 is the one I was looking for.

@Leemoonsoo

@echarles

> I was just saying that a single binary will not fit all the various users' needs (multiple combinations of interpreters and versions).

If a single SparkInterpreter binary is compatible with various Spark and Scala versions, and a single FlinkInterpreter binary is compatible with various Flink and Scala versions, then the user can choose any combination without rebuilding, can't they?

> #629 is merged. What remains to get ZEPPELIN-598 operational?

After #908 is merged, we'll need a proper command line or GUI to access this feature. We'll also need to publish all interpreters to a Maven repository and remove the interpreter dependencies from the binary package.

@echarles

echarles commented Jun 14, 2016

* If you ship Scala libs from different versions in the same folder, it will fail at runtime. You don't have to rebuild, but you have to ensure the classpath is built with Scala libs of the same version.
* GUI, CLI... makes sense. Is that what @felixcheung was referring to?


rnirmal commented Jun 28, 2016

@echarles @Leemoonsoo do you know where the direction lies for Spark 2.0 with Scala 2.11 support, i.e. as separate items or as a continued effort on this PR? Since Spark 2.0 is going to be released in the next week or two, it would be great to see this made available in Zeppelin as a quick follow-on.

I can help with any testing if needed; I'm not familiar enough with the Zeppelin code yet to help with reviews.

@Leemoonsoo

@rnirmal The vote for 0.6.0-rc1 is about to start. We'll need a 0.6.1 release immediately after we have Scala 2.11 and Spark 2.0 support.

Currently, @lresende and I are trying to make Scala 2.11 support green.
@echarles did great work here: a Spark 2.0 interpreter implementation in Scala.

For the 0.6.1 release, I'd like to conservatively add Scala 2.11 and Spark 2.0 support while keeping the current style of implementation, so that previous Spark version support stays the same.

Meanwhile, I think we can continue discussing and working on a Scala implementation of the Spark interpreter, easier Spark interpreter maintenance, and so on, on the master branch, so they can be included in the 0.7.0 release.


rnirmal commented Jun 30, 2016

Sounds good, I'll keep a lookout for it to land.

@echarles

Closing this. Spark 2.0 is implemented with #1195.
Optionally, a new PR can be opened to port the Spark interpreter from Java to Scala if there is interest.

@asfgit asfgit closed this in ad91e8e Jul 22, 2016
asfgit pushed a commit that referenced this pull request Jul 24, 2016
### What is this PR for?
This PR implements Spark 2.0 support based on #747.
It takes the approach from #980, which reimplements the code in Scala.

You can try building this branch:

```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```

### What type of PR is it?
Improvements

### Todos
* [x] - Spark 2.0 support
* [x] - Rebase after #747 merge
* [x] - Update LICENSE file
* [x] - Update related document (build)

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-759

### How should this be tested?

Build and try
```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```

### Screenshots (if appropriate)
![spark2](https://cloud.githubusercontent.com/assets/1540981/16771611/fe804038-4805-11e6-8447-3fa4258bb51d.gif)

### Questions:
* Do the license files need an update? Yes
* Are there breaking changes for older versions? No
* Does this need documentation? Yes

Author: Lee moon soo <[email protected]>

Closes #1195 from Leemoonsoo/spark-20 and squashes the following commits:

d78b322 [Lee moon soo] trigger ci
8017e8b [Lee moon soo] Remove unnecessary spark.version property
e3141bd [Lee moon soo] restart sparkcluster before sparkr test
1493b2c [Lee moon soo] print spark standalone cluster log when ci test fails
a208cd0 [Lee moon soo] Debug sparkRTest
31369c6 [Lee moon soo] Update license
293896a [Lee moon soo] Update build instruction
862ff6c [Lee moon soo] Make ZeppelinSparkClusterTest.java work with spark 2
839912a [Lee moon soo] Update SPARK_HOME directory detection pattern for 2.0.0-preview in the test
3413707 [Lee moon soo] Update .travis.yml
02bcd5d [Lee moon soo] Update SparkSqlInterpreterTest
f06a2fa [Lee moon soo] Spark 2.0 support
asfgit pushed a commit that referenced this pull request Jul 24, 2016
(cherry picked from commit 8546666)
Signed-off-by: Lee moon soo <[email protected]>
PhilippGrulich pushed a commit to SWC-SENSE/zeppelin that referenced this pull request Aug 8, 2016