Add Scala utility functions for display #80
@doanduyhai thanks for a really great contribution with awesome design docs! The utility functions you have implemented are very useful indeed; the only question is: what is the benefit of having a separate sub-module for that, compared with just adding those to the spark interpreter itself?
On zeppelin-interpreter/pom.xml (outdated):
zeppelin-interpreter is a common dependency for all Interpreter implementations, so we need to minimize its dependencies. Isn't it better for 'spark' to have 'zeppelin-spark-utils' as a dependency?
The only reason I added a new Maven module is that the code of the display utility function is written in Scala and I did not want to pollute the existing Java code in the interpreter or spark module.
But anyway, it's very easy to change: thanks to the maven-scala-plugin, the Scala code will be taken into account in the build process.
So just tell me in which project I should put the Scala code and I'll update my pull request.
It is not because of adding a new module, but because zeppelin-interpreter would have the new module as a dependency.
Zeppelin wants to minimize dependency version conflicts between interpreters. So, if zeppelin-interpreter has zeppelin-spark-utils as a dependency, then spark-core and its transitive dependencies will be included in all other interpreters, too.
I guess there is a typo. Should be:
Thanks aseigneurin for spotting the typo, I updated the PR comment accordingly.
Thanks for the great contribution!
Ok, I re-pushed the commit, putting the Scala code directly into the spark module. I also removed the spark-utils module and it works (tested locally).
Ok, I have updated my commit with:

```scala
case class Person(login: String, name: String, age: Int)

List(
  Person("jdoe","John DOE",32),
  Person("hsue","Helen SUE",27),
  Person("rsmith","Richard SMITH",45)
).displayAsTable("Login","Name","Age")
```

It just works with plain Scala collections, no need to transform them to an RDD by calling `sc.parallelize`.
@doanduyhai There was already a simple display helper function, ZeppelinContext.show(Object o). Another thing is, it's very easy for a user to make the mistake of using a display function against a very large RDD; it's hard to .collect() a large RDD and transfer the data to the front-end side. ZeppelinContext.show(Object o) uses .take(n) to survive that user mistake. Could you share your experience and opinion about it?
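For context, here is a minimal sketch of that guard pattern. The helper name `safeRows` and the default cap of 1000 are illustrative assumptions, not Zeppelin's actual code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper, not Zeppelin's implementation: cap the rows pulled
// back to the driver so a display call on a huge RDD stays cheap.
def safeRows[T](rdd: RDD[T], maxRows: Int = 1000): Array[T] =
  rdd.take(maxRows) // take(n) materializes at most n elements, unlike collect()
```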
Thank you for your interesting questions. Indeed I see 2 main points here: the default limit for fetched data, and code normalization.

So, my answers:

1. **Default limit for fetched data**

For this we can take the same approach as the one taken by `ZeppelinContext.show()`:

```scala
val defaultMaxLimit = 10000

def displayAsTable(maxLimit: Long, columnsLabel: String*) {...}

def displayAsTable(columnsLabel: String*) {
  displayAsTable(defaultMaxLimit, columnsLabel: _*)
}
```

The above solution works, but I propose another approach in my last commit. Example:

```scala
rdd
  .map(...)
  .filter(...)
  .take(30) // User responsibility
  .displayAsTable(...)
```

This way, we put the responsibility of collecting data explicitly on the end-user, no surprise. What do you think?

2. **Code normalization**

I had a look into the ZeppelinContext code and basically it is dealing with Scala code in Java. What we can do is:

a. either put the display utility functions into the existing `ZeppelinContext` (Java)
b. or port the code of `ZeppelinContext` to Scala

I volunteer to port this class from Java to Scala (with unit tests of course). What do you think @Leemoonsoo ?
@doanduyhai To me, it still makes sense for the display function to support RDDs. And about normalizing, adding a display function in Scala is a great idea. And if you can convert the Java to Scala, it'll be awesome. One more thing: what do you think about making the name simpler, .displayAsTable() -> .display()?
Hello @Leemoonsoo

So, here is a screen capture showing the new default limit feature (`zeppelin.spark.maxResult`).

For code normalization, I would prefer we merge and close this pull request and open a new JIRA to prepare the port of the ZeppelinContext class to Scala, with appropriate unit tests. Baby steps.
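To illustrate, a hedged sketch of how such a cap might be applied. Reading the property from system properties and the fallback of 1000 are assumptions for the sketch; the actual PR wires the limit through the interpreter's configuration:

```scala
// Assumption: for this sketch the limit comes from a system property; the
// real code reads zeppelin.spark.maxResult from the interpreter config.
val maxResult = sys.props.getOrElse("zeppelin.spark.maxResult", "1000").toInt

// Only maxResult rows ever leave the cluster and reach the front-end table
// (rdd is assumed to hold 3-field tuples or case classes here).
rdd.take(maxResult).toList.display("Login", "Name", "Age")
```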
Great work! |
Until now, to display data as a table, there are 2 alternatives:
1. Either use **Spark DataFrame** and Zeppelin's built-in support
2. Or manually generate a `println("%table ...")`. As an example of displaying an `RDD[(String,String,Int)]` representing a collection of users:
```scala
val data = new java.lang.StringBuilder("%table Login\tName\tAge\n")
// collect() brings the rows to the driver; a bare rdd.foreach would run on
// the executors and never update this driver-side StringBuilder
rdd.collect().foreach {
  case (login, name, age) => data.append(s"$login\t$name\t$age\n")
}
println(data.toString())
```
My proposal is to add a new utility function to make creating tables easier than the code example above. Of course one can always use **Spark DataFrame** but I find it quite restrictive: people using Spark versions earlier than 1.3 cannot rely on DataFrame, and sometimes one does not want to transform an RDD into a DataFrame just for display.
How are the utility functions implemented?
1. I added a new module **spark-utils** which provides the Scala code for the display utility functions. This module uses the **maven-scala-plugin** to compile all the classes in package `org.apache.zeppelin.spark.utils`.
2. Right now the package `org.apache.zeppelin.spark.utils` only contains 1 object, `DisplayUtils`, which augments RDDs/**Scala Traversables** of tuples or case classes (all of them subclasses of trait `Product`) with the new method `displayAsTable(columnLabels: String*)` (a simplified sketch of the mechanism follows after this list).
3. The `DisplayUtils` object is imported automatically into the `SparkInterpreter` with `intp.interpret("import org.apache.zeppelin.spark.utils.DisplayUtils._");`
4. The Maven module **interpreter** will now have a **runtime** dependency on the module **spark-utils** so that the utility class will be loaded at runtime
5. Usage of the new display utility function is:
**Paragraph1**
```scala
case class Person(login: String, name: String, age: Int)
val rddTuples: RDD[(String,String,Int)] = sc.parallelize(List(("jdoe","John DOE",32),("hsue","Helen SUE",27)))
val rddCaseClass: RDD[Person] = sc.parallelize(List(Person("jdoe","John DOE",32),Person("hsue","Helen SUE",27)))
```
**Paragraph2**
```scala
rddTuples.displayAsTable("Login","Name","Age")
```
**Paragraph3**
```scala
rddCaseClass.displayAsTable("Login","Name","Age")
```
6. The `displayAsTable()` method is error-proof, meaning that if the user provides **more** column labels than the number of elements in the tuple/case class, the extra column labels will be ignored. If the user provides **fewer** column labels than expected, the method will pad the missing column headers with **Column2**, **Column3**, etc.
7. In addition to the `displayAsTable` methods, I added some other utility methods to make it easier to handle custom HTML and images:
a. calling `html()` will generate the string `"%html "`
b. calling `html("<p>This is a test</p>")` will generate the string `"%html <p>This is a test</p>"`
c. calling `img("http://www.google.com")` will generate the string `"<img src='http://www.google.com' />"`
d. calling `img64()` will generate the string `"%img "`
e. calling `img64("ABCDE123")` will generate the string `"%img ABCDE123"`
Of course the `DisplayUtils` object can be extended with other functions to support future advanced display features.
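To make the mechanism in points 2, 6 and 7 concrete, here is a hedged, simplified sketch of how such an augmentation can be written in Scala. It mirrors the behavior described above but is not the PR's verbatim implementation:

```scala
package org.apache.zeppelin.spark.utils

object DisplayUtils {

  // "Pimp my library": any Traversable of Product (tuples or case classes)
  // gains a displayAsTable method through this implicit class.
  implicit class DisplayTraversable[T <: Product](val data: Traversable[T]) {

    def displayAsTable(columnLabels: String*): Unit = {
      val arity = data.headOption.map(_.productArity).getOrElse(0)
      // Error-proofing from point 6: extra labels are dropped, missing
      // ones are padded with Column2, Column3, ...
      val labels = (0 until arity).map { i =>
        if (i < columnLabels.size) columnLabels(i) else s"Column${i + 1}"
      }
      val header = labels.mkString("%table ", "\t", "\n")
      val rows = data.map(_.productIterator.mkString("\t")).mkString("\n")
      println(header + rows)
    }
  }

  // HTML/image helpers from point 7 (simplified)
  def html(content: String = ""): String = s"%html $content"

  def img(url: String): String = s"<img src='$url' />"

  def img64(base64: String = ""): String = s"%img $base64"
}
```

With the automatic import from point 3, these implicits are in scope in every paragraph, which is what lets `rddCaseClass.displayAsTable(...)` work without any boilerplate; an analogous implicit class can wrap `RDD[T <: Product]` for the RDD case.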
Author: DuyHai DOAN <[email protected]>
Closes apache#80 from doanduyhai/DisplayUtils and squashes the following commits:
62a2311 [DuyHai DOAN] Add default limit to RDD using zeppelin.spark.maxResult property
47a1b1f [DuyHai DOAN] Rename displayAsTable() to display()
a15294e [DuyHai DOAN] Enhance display function utility to accept Scala Traversable in addition to RDD
c1ee8fe [DuyHai DOAN] Add Scala code in module spark to expose utility functions for display
(cherry picked from commit 23922ae)
Signed-off-by: Lee moon soo <[email protected]>
DisplayUtilsobject can be extended with new other functions to support future advanced displaying features