[ZEPPELIN-871] [WIP] spark 2.0 interpreter on scala 2.11 #980
Conversation
This is WIP. I'd like to get feedback on the approach (a Scala implementation in the current spark module), taking into consideration:

Building on the current java classes with method invocation may be possible, but it would make the code difficult to read and develop. This PR proposes separate Scala classes for the Spark 2.0 API breaking changes. WDYT? Based on feedback, I will further validate the functionality (for now, a simple spark 2.0 call works well on my local env).
@echarles Thanks for the contribution. How about we divide the problem into two parts: scala 2.11 support, and spark 2.0 support?

@lresende is working on scala 2.11 support in #747. I'm also trying to help merge the code for scala 2.10 and 2.11 into one via lresende#1. Regarding spark 2.0 and the reimplementation in scala, …
Sure, we can wait for the #747 merge. I see there is a …
I'm trying to combine 2.10 and 2.11 into one implementation in lresende#1.
Review comment on the spark/pom.xml properties:

```xml
<spark.download.url>http://archive.apache.org/dist/spark/${spark.archive}/${spark.archive}.tgz</spark.download.url>
<spark.bin.download.url>http://archive.apache.org/dist/spark/spark-${spark.version}/spark-${spark.version}-bin-without-hadoop.tgz</spark.bin.download.url>
<spark.dist.cache>${project.build.directory}/../../.spark-dist</spark.dist.cache>
<py4j.version>0.8.2.1</py4j.version>
```
there's a new version of py4j too
thx, taking this into account in next push.
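Presumably the bump would look like the property below; Spark 2.0.0 bundles py4j 0.10.1, though the exact version for 2.0.0-preview is worth double-checking against the python/lib directory of its tarball:

```xml
<py4j.version>0.10.1</py4j.version>
```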
From what I see, we should still be able to use most of the existing code. Could you elaborate on how bad it would be to support Spark 1.x and 2.x in the same interpreter code?
Dealing with mixed scala 2.10/2.11 and spark 1.x/2.x in the same implementation is always possible, but it leads to code full of method invocations (see e.g. https://github.com/lresende/incubator-zeppelin/pull/1/files#diff-dbda0c4083ad9c59ff05f0273b5e760fR216). In the end, you have an implementation that is a succession of if (...) then invokeMethod calls (see the sketch below). On the other hand, having multiple implementations brings maintenance overhead and a risk of divergence. In this particular case, I was thinking that having two lines is worth discussing:

1. Spark 1.x on scala 2.10
2. Spark 2.x on scala 2.11

I also don't see why we should still support the DepInterpreter in future developments (normally, deps should be configured via the interpreter settings). The two-line approach would also make other evolutions easier, such as having a ZeppelinContext available for all interpreters. And we already kind of have more than one spark implementation: the Livy one has its own implementation and features.
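To make the trade-off concrete, here is a minimal sketch of that reflective style (the class names are real Spark entry points, but the code is illustrative, not the actual lresende#1 implementation):

```scala
// A minimal, illustrative sketch of the reflective dispatch that a single
// cross-version implementation ends up with.
object ReflectiveEntryPoint {

  // One version check plus Class.forName/getMethod/invoke; this pattern
  // repeats for every API that differs between Spark 1.x and 2.x.
  def createSqlEntryPoint(sc: AnyRef, sparkVersion: String): AnyRef =
    if (sparkVersion.startsWith("2.")) {
      // Spark 2.x: SparkSession.builder().getOrCreate(), via reflection.
      val sessionClass = Class.forName("org.apache.spark.sql.SparkSession")
      val builder = sessionClass.getMethod("builder").invoke(null)
      builder.getClass.getMethod("getOrCreate").invoke(builder)
    } else {
      // Spark 1.x: new SQLContext(sc), via reflection.
      val sqlContextClass = Class.forName("org.apache.spark.sql.SQLContext")
      sqlContextClass
        .getConstructor(Class.forName("org.apache.spark.SparkContext"))
        .newInstance(sc)
    }
}
```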
Thanks for elaborating on the two-line approach. There are always pros and cons. If we go with the two-line approach, the code will be cleaner, simpler, and easier to read. But it also means Zeppelin generally creates a binary per interpreter dependency, because we'd apply the same policy to all interpreters. In the end, I'm afraid we'd end up making a bunch of binary packages, one per version combination.

From the user's perspective, that is not an improvement. A single Zeppelin binary used to work with everything, but now the user has to understand the differences between packages and be able to select the right one before use. I think code complexity will not increase endlessly if we cut spark support to the last x releases.
When @minahlee contributed dependency loading through the interpreter setting, we also thought DepInterpreter could be deprecated and removed. That's why the current spark interpreter documentation marks %dep as deprecated. But since then, there has been some strong feedback from users that they like DepInterpreter, especially because it enables self-documenting notebooks, i.e. working code and its dependencies can live in the same notebook. And I think that really makes sense. So I think we need to reconsider deprecating DepInterpreter.
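For reference, the self-documenting pattern in question looks like this in a notebook paragraph (the spark-csv artifact is just an example, picked because it comes up later in this thread):

```
%dep
z.reset()
z.load("com.databricks:spark-csv_2.10:1.4.0")
```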
Agree with your argument. If we push the discussion a bit further, we cannot have a single binary that fits all users' needs. What if I want spark-2-scala-2.11 with flink-0.9-scala-2.12? Your single binary will not give me any solution... I would rather see a …

Regarding DepInterpreter, I can understand what users say, but for some packages it simply does not work (example: spark-csv running on yarn, if I remember well) - not sure if it can be easily fixed. Bottom line: I am no fan of the DepInterpreter.
Why wouldn't a single binary work for spark-2-scala-2.11 and flink-0.9-scala-2.12? And I 100% agree: once we have https://issues.apache.org/jira/browse/ZEPPELIN-598 and a proper UI / command line to list and download interpreters from a maven repository, things will become much easier and more flexible.
+1 on all of that. I think we could work together on a proposal to see what is the best way to go forward.

@felixcheung Is there already a jira for the proposal you are thinking of? I feel we have nearly everything open already, but maybe we need an umbrella issue?
@Leemoonsoo Oh, #908 is the one I was looking for |
If a single SparkInterpreter binary is compatible with various spark and scala versions, and a single FlinkInterpreter binary is compatible with various flink and scala versions, then the user can choose any combination without rebuilding, can't they?

After #908 is merged, we'll need a proper command line or GUI to access this feature.
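As a sketch of the command-line side of this idea (hypothetical at the time of this thread; later Zeppelin releases shipped an install-interpreter.sh along these lines, and the artifact coordinates below are illustrative):

```sh
# Fetch an interpreter build from a maven repository instead of
# bundling every version combination into one Zeppelin binary.
./bin/install-interpreter.sh --name spark \
  --artifact org.apache.zeppelin:zeppelin-spark_2.11:0.7.0
```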
@echarles @Leemoonsoo do you know where the direction lies for Spark 2.0 with scala 2.11 support, i.e. as separate items or a continued effort on this PR? Since Spark 2.0 is going to be released in the next week or two, it would be great to see this made available in Zeppelin as a quick follow-on. I can help with any testing if needed; I'm not familiar enough with the Zeppelin code yet to help with reviews.
@rnirmal The vote for 0.6.0-rc1 is about to start. We'll need a 0.6.1 release immediately after we have scala 2.11 and spark 2.0 support. Currently, @lresende and I are trying to make the scala 2.11 support green. For the 0.6.1 release, I'd like to add scala 2.11 and spark 2.0 support while keeping the current implementation approach, to conservatively keep support for previous spark versions unchanged. Meanwhile, I think we can continue to discuss and work on the scala implementation of the spark interpreter, easier spark interpreter maintenance, and so on, on the master branch, so they can be included in the 0.7.0 release.
Sounds good, I'll keep a lookout for it to land.
Closing this. Spark 2.0 is implemented in #1195.
The description of #1195, as merged:

### What is this PR for?
This PR implements spark 2.0 support based on #747, and takes the approach from #980 of reimplementing the code in scala. You can try building this branch:

```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```

### What type of PR is it?
Improvements

### Todos
* [x] Spark 2.0 support
* [x] Rebase after #747 merge
* [x] Update LICENSE file
* [x] Update related document (build)

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-759

### How should this be tested?
Build and try:

```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```

### Screenshots (if appropriate)
(screenshot image omitted)

### Questions:
* Do the license files need an update? yes
* Are there breaking changes for older versions? no
* Does this need documentation? yes

Author: Lee moon soo <[email protected]>

Closes #1195 from Leemoonsoo/spark-20 and squashes the following commits:

* d78b322 [Lee moon soo] trigger ci
* 8017e8b [Lee moon soo] Remove unnecessary spark.version property
* e3141bd [Lee moon soo] restart sparkcluster before sparkr test
* 1493b2c [Lee moon soo] print spark standalone cluster log when ci test fails
* a208cd0 [Lee moon soo] Debug sparkRTest
* 31369c6 [Lee moon soo] Update license
* 293896a [Lee moon soo] Update build instruction
* 862ff6c [Lee moon soo] Make ZeppelinSparkClusterTest.java work with spark 2
* 839912a [Lee moon soo] Update SPARK_HOME directory detection pattern for 2.0.0-preview in the test
* 3413707 [Lee moon soo] Update .travis.yml
* 02bcd5d [Lee moon soo] Update SparkSqlInterpreterTest
* f06a2fa [Lee moon soo] Spark 2.0 support
(cherry picked from commit 8546666; Signed-off-by: Lee moon soo <[email protected]>)
For reference, the original description of this PR (#980):

### What is this PR for?
Spark interpreter for spark version 2.0.0 and scala 2.11 (implemented in Scala).

### What type of PR is it?
[Feature]

### Todos

### What is the Jira issue?
https://issues.apache.org/jira/browse/ZEPPELIN-871

### How should this be tested?
Build it with:
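Presumably the same profiles as the follow-up #1195 uses above, e.g.:

```
mvn clean package -Dscala-2.11 -Pspark-2.0 -Dspark.version=2.0.0-preview -Ppyspark -Psparkr -Pyarn -Phadoop-2.6 -DskipTests
```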
Run and test the spark paragraph.

### Screenshots (if appropriate)

### Questions: