
Conversation


@echarles echarles commented Jun 9, 2016

What is this PR for?

Spark interpreter for Spark 2.0.0 and Scala 2.11 (implemented in Scala)

What type of PR is it?

[Feature]

Todos

* [ ] - Clean code
* [ ] - Test SQL
* [ ] - Test Python
* [ ] - Test R

What is the Jira issue?

How should this be tested?

Build it with

```
mvn install -Pscala-2.11 -Dscala.binary.version=2.11 -Dscala.version=2.11.8 \
  -Pspark-2.0 -Dspark.version=2.0.0-SNAPSHOT \
  -Phadoop-2.6 -Dhadoop.version=2.7.2 -Pyarn \
  -Dmaven.findbugs.enable=false -Drat.skip=true \
  -Ppyspark -Psparkr \
  -Dcheckstyle.skip=true -Dcobertura.skip=true -Pbuild-distr \
  -pl '!alluxio,!cassandra,!elasticsearch,!file,!flink,!hbase,!hive,!ignite,!jdbc,!kylin,!lens,!phoenix,!postgresql,!tajo' \
  -DskipTests
```

Run it and test a Spark paragraph.

Screenshots (if appropriate)

Questions:

* Do the license files need an update? N
* Are there breaking changes for older versions? N
* Does this need documentation? Y


echarles commented Jun 9, 2016

This is WIP.

I'd like to get feedback on the approach (a Scala implementation in the current spark module), taking into consideration:

* Spark 2.0 will ship by default on Scala 2.11 (it makes sense to me to develop the interpreter on Scala 2.11).
* SQLContext and HiveContext are deprecated in favor of SparkSession (see the sketch after this list).
* HttpServer is removed, and some class methods now have access restricted to Spark-internal packages.
* SparkIMain and SparkJLineCompletion no longer exist.
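For context, a minimal sketch of the new SparkSession entry point (the master URL and app name here are illustrative):

```
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession replaces the separate SQLContext/HiveContext from 1.x;
    // enableHiveSupport() on the builder is the opt-in HiveContext replacement.
    val spark = SparkSession.builder()
      .master("local[*]")       // illustrative master URL
      .appName("spark2-sketch") // illustrative app name
      .getOrCreate()

    spark.range(5).show()       // quick sanity check
    spark.stop()
  }
}
```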

Building on the current Java classes with reflective method invocation may be possible, but it would make the code very difficult to read and develop.

This PR proposes separate Scala classes for these very specific Spark 2.0 breaking API changes.

WDYT?

Based on feedback, I will further validate the functionality (for now, a simple Spark 2.0 call works well in my local env).

@Leemoonsoo

@echarles Thanks for the contribution.

How about we divide the problem into:

* Scala 2.11 support
* Spark 2.0 support
* reimplementing the Spark interpreter in Scala

@lresende is working on Scala 2.11 support in #747. I'm also trying to help merge the code for Scala 2.10 and 2.11 into one via lresende#1.
One of the goals of #747 is to support Scala 2.11 and 2.10 from a single Zeppelin binary package.
Once #747 is done, this PR can take Scala 2.11 support from it.

Regarding Spark 2.0 and the reimplementation in Scala:
What do you think about supporting Spark 2.0 as well as Spark 1.x, given that many users will keep using Zeppelin with Spark 1.x for some time?


echarles commented Jun 9, 2016

Sure, we can wait for the #747 merge.

I see there is a spark/src/main/scala-2.11/org/apache/zeppelin/spark/SparkInterpreter.java: does this mean there will be two separate implementations, one for 2.10 and one for 2.11?

@Leemoonsoo

I'm trying to combine 2.10 and 2.11 into one implementation in lresende#1

A review comment on the pom properties:

```
<spark.download.url>http://archive.apache.org/dist/spark/${spark.archive}/${spark.archive}.tgz</spark.download.url>
<spark.bin.download.url>http://archive.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}-bin-without-hadoop.tgz</spark.bin.download.url>
<spark.dist.cache>${project.build.directory}/../../.spark-dist</spark.dist.cache>
<py4j.version>0.8.2.1</py4j.version>
```

there's a new version of py4j too
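For reference, a sketch of what that bump could look like; the exact version is an assumption here (Spark 2.0.0 bundled py4j 0.10.1):

```
<!-- hypothetical bump of the property above; exact version is an assumption -->
<py4j.version>0.10.1</py4j.version>
```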

The PR author replied:

thx, taking this into account in the next push.

@felixcheung

From what I see, we should still be able to use most of the existing code. Could you elaborate more on how bad it would be to support Spark 1.x and 2.x in the same interpreter code?

@echarles

Dealing with mixed Scala 2.10/2.11 and Spark 1.x/2.x in a single implementation is always possible, but it leads to code full of reflective method invocation (see e.g. https://github.com/lresende/incubator-zeppelin/pull/1/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR216).

In the end, you have an implementation that is a succession of if (...) then invokeMethod... calls, as in the sketch below.
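A hypothetical sketch of that style in Scala, for illustration (the helper and its name are invented, not the actual Zeppelin code; only the Spark class and method names being looked up are real):

```
// Illustrative only -- not the actual Zeppelin helper. A single code path
// that branches on the Spark version and builds the SQL entry point via
// reflection, so the file compiles against neither Spark 1.x nor 2.x.
def sqlEntryPoint(sc: AnyRef, sparkVersion: String): AnyRef =
  if (sparkVersion.startsWith("2.")) {
    // SparkSession.builder().getOrCreate() via reflection
    val builder = Class.forName("org.apache.spark.sql.SparkSession")
      .getMethod("builder").invoke(null)
    builder.getClass.getMethod("getOrCreate").invoke(builder)
  } else {
    // new SQLContext(sc) via reflection for Spark 1.x
    Class.forName("org.apache.spark.sql.SQLContext")
      .getConstructor(Class.forName("org.apache.spark.SparkContext"))
      .newInstance(sc)
      .asInstanceOf[AnyRef]
  }
```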

On the other hand, having multiple implementations leads to maintenance overhead and a risk of divergence.

In this particular case, I was thinking that having two lines was worth discussing:

1. Spark 1.x on Scala 2.10
2. Spark 2.x on Scala 2.11

I also don't see why we should still support the DepInterpreter in future developments (normally, deps should be configured via the interpreter settings). The two-lines approach would also make other evolutions easier, such as having a ZeppelinContext available for all interpreters.

Also, we already kind of have more than one Spark implementation: the Livy one has its own implementation and features.

@Leemoonsoo

Thanks for elaborating on the idea of the two-lines approach.

I think there are always pros and cons.

If we go with the two-lines approach, the code will be cleaner, simpler, and easier to read.

But this means Zeppelin would generally create a binary per interpreter dependency, because we'd have the same policy for all interpreters. In the end, I'm afraid we'd end up making a bunch of binary packages like:

* zeppelin with spark 1.x on scala 2.10
* zeppelin with spark 1.x on scala 2.11
* zeppelin with spark 2.0 on scala 2.10
* zeppelin with spark 2.0 on scala 2.11
* zeppelin with spark 2.1 on scala 2.10
* zeppelin with spark 2.1 on scala 2.11
* zeppelin with python3.x
* zeppelin with python2.x
* zeppelin with hive1
* zeppelin with hive2
* .....

From the user's perspective, that is not an improvement. A single Zeppelin binary used to work with everything, but now users would have to understand the differences between packages and be able to select the right one before use.

I think code complexity will not increase endlessly if we limit Spark support to the last x releases.

@Leemoonsoo

When @minahlee contributed dependency loading through the interpreter setting, we also thought DepInterpreter could be deprecated and removed. That's why the current Spark interpreter documentation marks %dep as deprecated.

But since then, there has been strong feedback from users that they like DepInterpreter, especially because it enables self-documentation, i.e. working code and its dependencies can live in the same notebook. And I think that really makes sense.

So I think we need to reconsider deprecating DepInterpreter.

@echarles

I agree with your argument. If we push the discussion a bit further, we cannot have a single binary that fits all users' needs. What if I want spark-2-scala-2.11 with flink-0.9-scala-2.12? Your single binary will not give me any solution...

I would rather see a zeppelin-kernel (nearly bare, without interpreters) with the ability for the user to enable any interpreter via a shell command, along the lines of the sketch below.
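A hypothetical sketch of what such a command could look like (the name and flags are invented for illustration; no such command existed at the time):

```
# Hypothetical command -- name and flags invented for illustration.
# Fetch an interpreter from a Maven repository and enable it in Zeppelin.
./bin/zeppelin-interpreter.sh --install spark \
  --spark-version 2.0.0 --scala-version 2.11
```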

Regarding DepInterpreter, I can understand what users say, but for some packages it simply does not work (example: spark-csv running on YARN, if I remember well). Not sure if it can be easily fixed.

Bottom line: I am no fan of the invoke coding style, but I will wait for the other PR and try to build on top of it.

@Leemoonsoo

Why wouldn't a single binary work for spark-2-scala-2.11 and flink-0.9-scala-2.12?
I'm also not a fan of reflection. But if I have to choose between complexity of code and complexity of usage, I would choose keeping usage simple, even though the code becomes more complex. I believe that is the whole point of making software.

And I 100% agree. Once we have https://issues.apache.org/jira/browse/ZEPPELIN-598 and a proper UI / command line to list and download interpreters from a Maven repository, I think things will become much easier and more flexible.

@felixcheung

+1 on all of that. I think we could work together on a proposal to see what is the best way to go forward.
I think we are way overdue on separating "programming languages" (and versions) and "frameworks" (and versions).

@echarles

@Leemoonsoo

@felixcheung Is there already a JIRA for the proposal you are thinking of? I feel we have nearly everything open already, but maybe we need an umbrella issue?

@echarles

@Leemoonsoo Oh, #908 is the one I was looking for.

@Leemoonsoo

@echarles

> I was just saying that a single binary will not fit all the various users' needs (multiple combinations of interpreters and versions).

If a single SparkInterpreter binary is compatible with various Spark and Scala versions, and a single FlinkInterpreter binary is compatible with various Flink and Scala versions, then the user can choose any combination without rebuilding, can't they?

> #629 is merged. What remains to get ZEPPELIN-598 operational?

After #908 is merged, we'll need a proper command line or GUI to access this feature. We'll also need to publish all interpreters to a Maven repository and remove the interpreter dependencies from the binary package.

@echarles

echarles commented Jun 14, 2016

* If you ship Scala libs from different versions in the same folder, it will fail at runtime. You don't have to rebuild, but you have to ensure the classpath is built with Scala libs of the same version.
* GUI, CLI... makes sense. Is that what @felixcheung was referring to?


rnirmal commented Jun 28, 2016

@echarles @Leemoonsoo do you know where the direction lies for Spark 2.0 with Scala 2.11 support, i.e. as separate items or as a continued effort on this PR? Since Spark 2.0 is going to be released in the next week or two, it would be great to see this made available in Zeppelin as a quick follow-on.

I can help with any testing if needed; I'm not familiar enough with the Zeppelin code yet to help with reviews.

@Leemoonsoo

@rnirmal The vote for 0.6.0-rc1 is about to start. We'll need a 0.6.1 release immediately after we have Scala 2.11 and Spark 2.0 support.

Currently, @lresende and I are trying to make Scala 2.11 support green.
@echarles did great work here: a Spark 2.0 interpreter implementation in Scala.

For the 0.6.1 release, I'd like to conservatively add Scala 2.11 and Spark 2.0 support while keeping the current style of implementation, so that previous Spark version support stays the same.

Meanwhile, I think we can continue discussing and working on a Scala implementation of the Spark interpreter, easier Spark interpreter maintenance, and so on, on the master branch, so they can be included in the 0.7.0 release.


rnirmal commented Jun 30, 2016

Sounds good, I'll keep a lookout for it to land.

@echarles

Closing this. Spark 2.0 is implemented with #1195.
Optionally, a new PR can be opened to port the Spark interpreter from Java to Scala if there is interest.

@asfgit asfgit closed this in ad91e8e Jul 22, 2016
asfgit pushed a commit that referenced this pull request Jul 24, 2016
### What is this PR for?
This PR implements Spark 2.0 support based on #747.
It takes the approach from #980, which reimplements the code in Scala.

You can try building this branch:

```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```

### What type of PR is it?
Improvements

### Todos
* [x] - Spark 2.0 support
* [x] - Rebase after #747 merge
* [x] - Update LICENSE file
* [x] - Update related document (build)

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-759

### How should this be tested?

Build and try
```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```

### Screenshots (if appropriate)
![spark2](https://cloud.githubusercontent.com/assets/1540981/16771611/fe804038-4805-11e6-8447-3fa4258bb51d.gif)

### Questions:
* Do the license files need an update? Yes
* Are there breaking changes for older versions? No
* Does this need documentation? Yes

Author: Lee moon soo <[email protected]>

Closes #1195 from Leemoonsoo/spark-20 and squashes the following commits:

d78b322 [Lee moon soo] trigger ci
8017e8b [Lee moon soo] Remove unnecessary spark.version property
e3141bd [Lee moon soo] restart sparkcluster before sparkr test
1493b2c [Lee moon soo] print spark standalone cluster log when ci test fails
a208cd0 [Lee moon soo] Debug sparkRTest
31369c6 [Lee moon soo] Update license
293896a [Lee moon soo] Update build instruction
862ff6c [Lee moon soo] Make ZeppelinSparkClusterTest.java work with spark 2
839912a [Lee moon soo] Update SPARK_HOME directory detection pattern for 2.0.0-preview in the test
3413707 [Lee moon soo] Update .travis.yml
02bcd5d [Lee moon soo] Update SparkSqlInterpreterTest
f06a2fa [Lee moon soo] Spark 2.0 support
asfgit pushed a commit that referenced this pull request Jul 24, 2016
(cherry picked from commit 8546666)
Signed-off-by: Lee moon soo <[email protected]>
PhilippGrulich pushed a commit to SWC-SENSE/zeppelin that referenced this pull request Aug 8, 2016