
Conversation

@doanduyhai
Contributor

Until now, to display data as a table, there have been two alternatives:

  1. Either use Spark DataFrame and Zeppelin's built-in support
  2. Or manually generate a `println("%table ...")`. As an example, here is how to display an `RDD[(String, String, Int)]` representing a collection of users:
```scala
val data = new java.lang.StringBuilder("%table Login\tName\tAge\n")
// collect() first: a bare rdd.foreach runs on the executors, so the
// driver-side StringBuilder would otherwise stay empty on a real cluster
rdd.collect().foreach {
  case (login, name, age) => data.append(s"$login\t$name\t$age\n")
}

println(data.toString())
```

My proposal is to add a new utility function to make creating tables easier than in the code example above. Of course, one can always use Spark DataFrame, but I find it quite restrictive. People using Spark versions earlier than 1.3 cannot rely on DataFrame, and sometimes one does not want to transform an RDD into a DataFrame just for display.

How are the utility functions implemented?

  1. I added a new module, spark-utils, which provides the Scala code for the display utility functions. This module uses the maven-scala-plugin to compile all the classes in the package `org.apache.zeppelin.spark.utils`.

  2. Right now, the package `org.apache.zeppelin.spark.utils` only contains a single object, `DisplayUtils`, which augments RDDs and Scala `Traversable`s of tuples or case classes (all of them subclasses of trait `Product`) with the new method `displayAsTable(columnLabels: String*)`. A sketch of this pattern follows after this list.

  3. The `DisplayUtils` object is imported automatically into the `SparkInterpreter` with `intp.interpret("import org.apache.zeppelin.spark.utils.DisplayUtils._");`

  4. The Maven module interpreter now has a runtime dependency on the spark-utils module so that the utility class is loaded at runtime.

  5. Usage of the new display utility function:

    Paragraph1

    ```scala
    case class Person(login: String, name: String, age: Int)
    val rddTuples: RDD[(String, String, Int)] = sc.parallelize(List(("jdoe", "John DOE", 32), ("hsue", "Helen SUE", 27)))
    val rddCaseClass: RDD[Person] = sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
    ```

    Paragraph2

    ```scala
    rddTuples.displayAsTable("Login", "Name", "Age")
    ```

    Paragraph3

    ```scala
    rddCaseClass.displayAsTable("Login", "Name", "Age")
    ```
  6. The `displayAsTable()` method is error-proof: if the user provides more column labels than the number of elements in the tuple/case class, the extra labels are ignored. If the user provides fewer column labels than expected, the method pads the missing column headers with Column2, Column3, etc.

  7. In addition to the `displayAsTable` methods, I added some other utility methods to make it easier to handle custom HTML and images:
    a. calling `html()` generates the string `"%html "`
    b. calling `html("<p>This is a test</p>")` generates the string `"%html <p>This is a test</p>"`
    c. calling `img("http://www.google.com")` generates the string `"<img src='http://www.google.com' />"`
    d. calling `img64()` generates the string `"%img "`
    e. calling `img64("ABCDE123")` generates the string `"%img ABCDE123"`

Of course, the `DisplayUtils` object can be extended with other functions to support future advanced display features.
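For concreteness, here is a minimal sketch of the enrichment pattern described above, covering the implicit conversion (point 2), the header padding (point 6), and the HTML/image helpers (point 7). The names and details are simplified assumptions for illustration, not the exact `DisplayUtils` code:

```scala
object DisplayUtilsSketch {

  // Enrich any collection of Product (tuples, case classes) with displayAsTable().
  // A similar implicit class could wrap RDD[T <: Product] for the Spark side.
  implicit class ProductCollectionDisplay[T <: Product](xs: Traversable[T]) {

    def displayAsTable(columnLabels: String*): Unit = {
      val arity = xs.headOption.map(_.productArity).getOrElse(columnLabels.size)
      // Extra labels are dropped; missing ones are padded with Column2, Column3, ...
      val headers = (0 until arity).map(i =>
        if (i < columnLabels.size) columnLabels(i) else s"Column${i + 1}")
      val rows = xs.map(_.productIterator.mkString("\t"))
      println("%table " + headers.mkString("\t") + "\n" + rows.mkString("\n"))
    }
  }

  // HTML and image helpers mirroring the behavior listed in point 7.
  def html(content: String = ""): String = s"%html $content"
  def img(url: String): String = s"<img src='$url' />"
  def img64(base64: String = ""): String = s"%img $base64"
}
```

For example, `List(("jdoe", "John DOE", 32)).displayAsTable("Login")` would render the headers as Login, Column2, Column3.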

@doanduyhai force-pushed the DisplayUtils branch 3 times, most recently from dc1f82d to ec24686 on May 25, 2015
@bzz
Member

bzz commented May 26, 2015

@doanduyhai thanks for a really great contribution with awesome design docs!

The utility functions you have implemented are indeed very useful. The only question is: what is the benefit of having a separate sub-module for this, compared with just adding the functions to the Spark interpreter itself?

Member

Zeppelin-interpreter is a common dependency for all Interpreter implementations and needs to minimize its dependencies. Isn't it better if 'spark' has 'zeppelin-spark-utils' as a dependency?

Contributor Author

The only reason I added a new Maven module is that the display utility functions are written in Scala, and I did not want to pollute the existing Java code in the interpreter or spark module.

But anyway, it's very easy to change: thanks to the maven-scala-plugin, the Scala code will be picked up by the build process.

So just tell me in which project I should put the Scala code and I'll update my pull request.

Member

It is not because of adding a new module, but because zeppelin-interpreter would have the new module as a dependency.

Zeppelin wants to minimize dependency version conflicts between interpreters. So, if zeppelin-interpreter has zeppelin-spark-utils as a dependency, then spark-core and its transitive dependencies will be included in all the other interpreters, too.

@aseigneurin

I guess there is a typo:

```scala
val rddCaseClass: RDD[(String, String, Int)] = sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
```

Should be:

```scala
val rddCaseClass: RDD[Person] = sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
```

@doanduyhai
Contributor Author

Thanks @aseigneurin for spotting the typo; I updated the PR comment accordingly.

@swkimme
Contributor

swkimme commented May 27, 2015

Thanks for the great contribution!
I'm +1 for adding the new utility module; it seems like a really great way to add utilities.

@doanduyhai
Contributor Author

OK, I re-pushed the commit, putting the Scala code directly into the spark module. I also removed the spark-utils module, and it works (tested locally).

@doanduyhai
Contributor Author

OK, I have updated my commit with:

  1. A rebase from master to be up-to-date
  2. An enhancement so the code accepts not only RDDs but also plain Scala collections of tuples/case classes. So now you'll be able to do:

```scala
List(
  Person("jdoe", "John DOE", 32),
  Person("hsue", "Helen SUE", 27),
  Person("rsmith", "Richard SMITH", 45)
).displayAsTable("Login", "Name", "Age")
```

It just works with plain Scala collections; there is no need to transform them into an RDD by calling `sc.parallelize(myScalaCollection)`.

@Leemoonsoo
Member

@doanduyhai
Thanks for great contribution.

There was already a simple display helper function, `ZeppelinContext.show(Object o)`.
How do you want to deal with it? I mean, utility functions for the same purpose would then live in two different places.

Another thing: it's very easy for a user to make the mistake of calling a display function on a very large RDD. It's hard to `.collect()` a large RDD and transfer the data to the front-end side. `ZeppelinContext.show(Object o)` uses `.take(n)` to survive such user mistakes. Could you share your experience and opinion about this?
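For illustration, here is a minimal sketch of the `.take(n)` guard being described; this is a hypothetical helper, not Zeppelin's actual code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical guard: fetch at most maxResult rows and report whether the
// RDD was truncated, instead of collect()-ing everything to the driver.
def takeForDisplay[T](rdd: RDD[T], maxResult: Int): (Array[T], Boolean) = {
  val taken = rdd.take(maxResult + 1) // one extra row to detect overflow
  (taken.take(maxResult), taken.length > maxResult)
}
```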

@doanduyhai
Contributor Author

@Leemoonsoo

Thank you for your interesting questions. Indeed, I see 2 main points here:

  1. How to guarantee that end-users do not fetch a massive amount of data from an RDD into the driver program, e.g. by adding a mechanism to limit the amount of data fetched for display
  2. Normalizing the `showRDD()` and `displayAsTable()` code so that it lives in one place instead of being spread around

So, my answers:

1. Default limit for fetched data

For this, we can take the same approach as `showRdd()`:

```scala
val defaultMaxLimit = 10000

def displayAsTable(maxLimit: Long, columnsLabel: String*) { ... }

def displayAsTable(columnsLabel: String*) {
  displayAsTable(defaultMaxLimit, columnsLabel: _*) // expand the varargs explicitly
}
```

The above solution works, but I propose another approach. In my last commit, the `displayAsTable()` method also works on plain Scala collections, so I suggest removing the RDD support. If end-users want to use `displayAsTable()`, they must first convert their RDD to a Scala collection. In a way, we force the end-user to call either `collect()` or `take(n)` explicitly.

Example:

```scala
rdd
  .map(...)
  .filter(...)
  .take(30) // user responsibility
  .displayAsTable(...)
```

This way, we put the responsibility for collecting data explicitly on the end-user, with no surprises. What do you think?

  2. Normalize the `showRDD()` and `displayAsTable()` code

I had a look at the `ZeppelinContext` code; basically, it is dealing with Scala code from Java. What we can do is:

a. either put the `displayAsTable()` method inside this `ZeppelinContext` class and code everything in Java, but then we lose the power of Scala implicit conversions. We'd need to do:

```scala
val collection = rdd.map(...).filter(...).take(10)
z.displayAsTable(collection, "Header1", "Header2", ...)
```

b. or we port the `ZeppelinContext` code to Scala directly so that both `displayAsTable()` and `showRDD()` are written in Scala. That would make sense because much of the code in this class already deals with Scala, and it would avoid converting back and forth between Java and Scala (a rough sketch of this option follows below).

I volunteer to port this class from Java to Scala (with unit tests, of course). What do you think, @Leemoonsoo?
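As a rough illustration of option (b), here is a hypothetical Scala skeleton; the names and signatures are assumptions for illustration, not the actual `ZeppelinContext` API:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: in a Scala ZeppelinContext, showRDD() and
// displayAsTable() can share one table-rendering helper.
class ZeppelinContextSketch(maxResult: Int) {

  def showRDD[T <: Product](rdd: RDD[T], labels: String*): Unit =
    printTable(rdd.take(maxResult), labels)

  def displayAsTable(xs: Traversable[Product], labels: String*): Unit =
    printTable(xs, labels)

  private def printTable(rows: Traversable[Product], labels: Seq[String]): Unit =
    println("%table " + labels.mkString("\t") + "\n" +
      rows.map(_.productIterator.mkString("\t")).mkString("\n"))
}
```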

@Leemoonsoo
Member

@doanduyhai
Thanks for sharing the idea.

To me, it still makes sense for the display function to support RDDs.
I think that gives users a feel similar to `DataFrame.show()`.
In that case, I believe it's only a matter of time before a user does `.collect().displayAsTable()` against a large dataset by mistake. It would help if the row limit could be made configurable through the interpreter settings.

And about normalization: adding a display function in Scala is a great idea. If you can convert the Java to Scala, it'll be awesome.

One more thing: what do you think about making the name simpler, `.displayAsTable()` -> `.display()`?

@Leemoonsoo
Member

@doanduyhai
Also, I'd like to know whether you're planning to add more commits here for the Java-to-Scala port and other improvements, or whether you'd prefer to merge this first and open a new pull request for further improvements.

@doanduyhai
Contributor Author

Hello @Leemoonsoo

So,

  1. I've renamed `displayAsTable()` to `display()` as suggested
  2. I've introduced a default limit for RDDs (`rdd.take()`) using the `zeppelin.spark.maxResult` property (a sketch follows below)
  3. Users can override the default limit by passing their own limit value to the `display()` method
  4. There is no default limit for Scala collections
  5. I've rebased from the master branch
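A minimal sketch of how this limit handling could look; the property name comes from this PR, but the lookup mechanism, the fallback value, and the method shapes are assumptions rather than the exact merged code:

```scala
import org.apache.spark.rdd.RDD

object RddDisplaySketch {
  // Hypothetical: default limit read from the zeppelin.spark.maxResult
  // property, with an assumed fallback value.
  private val defaultMaxResult: Int =
    sys.props.get("zeppelin.spark.maxResult").map(_.toInt).getOrElse(1000)

  implicit class RddDisplay[T <: Product](rdd: RDD[T]) {
    // The default limit applies when the user passes only column labels ...
    def display(columnLabels: String*): Unit =
      display(defaultMaxResult, columnLabels: _*)

    // ... unless an explicit limit is passed as the first argument.
    def display(limit: Int, columnLabels: String*): Unit =
      println("%table " + columnLabels.mkString("\t") + "\n" +
        rdd.take(limit).map(_.productIterator.mkString("\t")).mkString("\n"))
  }
}
```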

Here is a screen capture showing the new default-limit feature (`zeppelin.spark.maxResult` has been set to 100):
[screenshot]

For code normalization, I would prefer that we merge and close this pull request and open a new JIRA to prepare the port of the `ZeppelinContext` class to Scala with appropriate unit tests.

Baby steps.

@Leemoonsoo
Member

Great work!
Looks good to me. +1 for merge.

@asfgit closed this in 23922ae on Jun 19, 2015
Leemoonsoo pushed a commit to Leemoonsoo/zeppelin that referenced this pull request Sep 17, 2015

Author: DuyHai DOAN <[email protected]>

Closes apache#80 from doanduyhai/DisplayUtils and squashes the following commits:

62a2311 [DuyHai DOAN] Add default limit to RDD using zeppelin.spark.maxResult property
47a1b1f [DuyHai DOAN] Rename displayAsTable() to display()
a15294e [DuyHai DOAN] Enhance display function utility to accept Scala Traversable in addition to RDD
c1ee8fe [DuyHai DOAN] Add Scala code in module spark to expose utility functions for display

(cherry picked from commit 23922ae)
Signed-off-by: Lee moon soo <[email protected]>