
Conversation

@doanduyhai
Contributor

Until now, to display data as a table, there have been two alternatives:

  1. Either use Spark DataFrame and Zeppelin's built-in support
  2. Or manually generate a `println("%table ...")`. As an example, here is how to display an `RDD[(String, String, Int)]` representing a collection of users:
```scala
val data = new java.lang.StringBuilder("%table Login\tName\tAge\n")
// collect() first: a bare rdd.foreach runs on the executors, so the
// driver-side StringBuilder would otherwise stay empty on a real cluster
rdd.collect().foreach {
  case (login, name, age) => data.append(s"$login\t$name\t$age\n")
}

println(data.toString())
```

My proposal is to add a new utility function to make creating tables easier than in the code example above. Of course, one can always use Spark DataFrame, but I find it quite restrictive. People using Spark versions earlier than 1.3 cannot rely on DataFrame, and sometimes one does not want to transform an RDD into a DataFrame just for display.

How are the utility functions implemented?

  1. I added a new module, spark-utils, which provides the Scala code for the display utility functions. This module uses the maven-scala-plugin to compile all the classes in the package `org.apache.zeppelin.spark.utils`.

  2. Right now, the package `org.apache.zeppelin.spark.utils` only contains a single object, `DisplayUtils`, which augments RDDs and Scala `Traversable`s of tuples or case classes (all of them subclasses of trait `Product`) with the new method `displayAsTable(columnLabels: String*)`. A sketch of this pattern follows after this list.

  3. The `DisplayUtils` object is imported automatically into the `SparkInterpreter` with `intp.interpret("import org.apache.zeppelin.spark.utils.DisplayUtils._");`

  4. The Maven module interpreter now has a runtime dependency on the spark-utils module so that the utility class is loaded at runtime.

  5. Usage of the new display utility function:

    Paragraph1

    ```scala
    case class Person(login: String, name: String, age: Int)
    val rddTuples: RDD[(String, String, Int)] = sc.parallelize(List(("jdoe", "John DOE", 32), ("hsue", "Helen SUE", 27)))
    val rddCaseClass: RDD[Person] = sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
    ```

    Paragraph2

    ```scala
    rddTuples.displayAsTable("Login", "Name", "Age")
    ```

    Paragraph3

    ```scala
    rddCaseClass.displayAsTable("Login", "Name", "Age")
    ```
  6. The `displayAsTable()` method is error-proof: if the user provides more column labels than the number of elements in the tuple/case class, the extra labels are ignored. If the user provides fewer column labels than expected, the method pads the missing column headers with Column2, Column3, etc.

  7. In addition to the `displayAsTable` methods, I added some other utility methods to make it easier to handle custom HTML and images:
    a. calling `html()` generates the string `"%html "`
    b. calling `html("<p>This is a test</p>")` generates the string `"%html <p>This is a test</p>"`
    c. calling `img("http://www.google.com")` generates the string `"<img src='http://www.google.com' />"`
    d. calling `img64()` generates the string `"%img "`
    e. calling `img64("ABCDE123")` generates the string `"%img ABCDE123"`

Of course, the `DisplayUtils` object can be extended with other functions to support future advanced display features.
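For concreteness, here is a minimal sketch of the enrichment pattern described above, covering the implicit conversion (point 2), the header padding (point 6), and the HTML/image helpers (point 7). The names and details are simplified assumptions for illustration, not the exact `DisplayUtils` code:

```scala
object DisplayUtilsSketch {

  // Enrich any collection of Product (tuples, case classes) with displayAsTable().
  // A similar implicit class could wrap RDD[T <: Product] for the Spark side.
  implicit class ProductCollectionDisplay[T <: Product](xs: Traversable[T]) {

    def displayAsTable(columnLabels: String*): Unit = {
      val arity = xs.headOption.map(_.productArity).getOrElse(columnLabels.size)
      // Extra labels are dropped; missing ones are padded with Column2, Column3, ...
      val headers = (0 until arity).map(i =>
        if (i < columnLabels.size) columnLabels(i) else s"Column${i + 1}")
      val rows = xs.map(_.productIterator.mkString("\t"))
      println("%table " + headers.mkString("\t") + "\n" + rows.mkString("\n"))
    }
  }

  // HTML and image helpers mirroring the behavior listed in point 7.
  def html(content: String = ""): String = s"%html $content"
  def img(url: String): String = s"<img src='$url' />"
  def img64(base64: String = ""): String = s"%img $base64"
}
```

For example, `List(("jdoe", "John DOE", 32)).displayAsTable("Login")` would render the headers as Login, Column2, Column3.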

@doanduyhai force-pushed the DisplayUtils branch 3 times, most recently from dc1f82d to ec24686 on May 25, 2015
@bzz
Member

bzz commented May 26, 2015

@doanduyhai thanks for a really great contribution with awesome design docs!

The utility functions you have implemented are indeed very useful. The only question is: what is the benefit of having a separate sub-module for this, compared with just adding the functions to the Spark interpreter itself?

Member

Zeppelin-interpreter is a common dependency for all Interpreter implementations and needs to minimize its dependencies. Isn't it better if 'spark' has 'zeppelin-spark-utils' as a dependency?

Contributor Author

The only reason I added a new Maven module is that the display utility functions are written in Scala, and I did not want to pollute the existing Java code in the interpreter or spark module.

But anyway, it's very easy to change: thanks to the maven-scala-plugin, the Scala code will be picked up by the build process.

So just tell me in which project I should put the Scala code and I'll update my pull request.

Member

It is not because of adding a new module, but because zeppelin-interpreter would have the new module as a dependency.

Zeppelin wants to minimize dependency version conflicts between interpreters. So, if zeppelin-interpreter has zeppelin-spark-utils as a dependency, then spark-core and its transitive dependencies will be included in all the other interpreters, too.

@aseigneurin

I guess there is a typo:

```scala
val rddCaseClass: RDD[(String, String, Int)] = sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
```

Should be:

```scala
val rddCaseClass: RDD[Person] = sc.parallelize(List(Person("jdoe", "John DOE", 32), Person("hsue", "Helen SUE", 27)))
```

@doanduyhai
Contributor Author

Thanks @aseigneurin for spotting the typo; I updated the PR comment accordingly.

@swkimme
Contributor

swkimme commented May 27, 2015

Thanks for the great contribution!
I'm +1 for adding the new utility module; it seems like a really great way to add utilities.

@doanduyhai
Contributor Author

OK, I re-pushed the commit, putting the Scala code directly into the spark module. I also removed the spark-utils module, and it works (tested locally).

@doanduyhai
Contributor Author

OK, I have updated my commit with:

  1. A rebase from master to be up-to-date
  2. An enhancement so the code accepts not only RDDs but also plain Scala collections of tuples/case classes. So now you'll be able to do:

```scala
List(
  Person("jdoe", "John DOE", 32),
  Person("hsue", "Helen SUE", 27),
  Person("rsmith", "Richard SMITH", 45)
).displayAsTable("Login", "Name", "Age")
```

It just works with plain Scala collections; there is no need to transform them into an RDD by calling `sc.parallelize(myScalaCollection)`.

@Leemoonsoo
Member

@doanduyhai
Thanks for great contribution.

There was already a simple display helper function, `ZeppelinContext.show(Object o)`.
How do you want to deal with it? I mean, utility functions for the same purpose would then live in two different places.

Another thing: it's very easy for a user to make the mistake of calling a display function on a very large RDD. It's hard to `.collect()` a large RDD and transfer the data to the front-end side. `ZeppelinContext.show(Object o)` uses `.take(n)` to survive such user mistakes. Could you share your experience and opinion about this?
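For illustration, here is a minimal sketch of the `.take(n)` guard being described; this is a hypothetical helper, not Zeppelin's actual code:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical guard: fetch at most maxResult rows and report whether the
// RDD was truncated, instead of collect()-ing everything to the driver.
def takeForDisplay[T](rdd: RDD[T], maxResult: Int): (Array[T], Boolean) = {
  val taken = rdd.take(maxResult + 1) // one extra row to detect overflow
  (taken.take(maxResult), taken.length > maxResult)
}
```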

@doanduyhai
Contributor Author

@Leemoonsoo

Thank you for your interesting questions. Indeed, I see 2 main points here:

  1. How to guarantee that end-users do not fetch a massive amount of data from an RDD into the driver program, e.g. by adding a mechanism to limit the amount of data fetched for display
  2. Normalizing the `showRDD()` and `displayAsTable()` code so that it lives in one place instead of being spread around

So, my answers:

1. Default limit for fetched data

For this, we can take the same approach as `showRdd()`:

```scala
val defaultMaxLimit = 10000

def displayAsTable(maxLimit: Long, columnsLabel: String*) { ... }

def displayAsTable(columnsLabel: String*) {
  displayAsTable(defaultMaxLimit, columnsLabel: _*) // expand the varargs explicitly
}
```

The above solution works, but I propose another approach. In my last commit, the `displayAsTable()` method also works on plain Scala collections, so I suggest removing the RDD support. If end-users want to use `displayAsTable()`, they must first convert their RDD to a Scala collection. In a way, we force the end-user to call either `collect()` or `take(n)` explicitly.

Example:

```scala
rdd
  .map(...)
  .filter(...)
  .take(30) // user responsibility
  .displayAsTable(...)
```

This way, we put the responsibility for collecting data explicitly on the end-user, with no surprises. What do you think?

  2. Normalize the `showRDD()` and `displayAsTable()` code

I had a look at the `ZeppelinContext` code; basically, it is dealing with Scala code from Java. What we can do is:

a. either put the `displayAsTable()` method inside this `ZeppelinContext` class and code everything in Java, but then we lose the power of Scala implicit conversions. We'd need to do:

```scala
val collection = rdd.map(...).filter(...).take(10)
z.displayAsTable(collection, "Header1", "Header2", ...)
```

b. or we port the `ZeppelinContext` code to Scala directly so that both `displayAsTable()` and `showRDD()` are written in Scala. That would make sense because much of the code in this class already deals with Scala, and it would avoid converting back and forth between Java and Scala (a rough sketch of this option follows below).

I volunteer to port this class from Java to Scala (with unit tests, of course). What do you think, @Leemoonsoo?
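As a rough illustration of option (b), here is a hypothetical Scala skeleton; the names and signatures are assumptions for illustration, not the actual `ZeppelinContext` API:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical sketch: in a Scala ZeppelinContext, showRDD() and
// displayAsTable() can share one table-rendering helper.
class ZeppelinContextSketch(maxResult: Int) {

  def showRDD[T <: Product](rdd: RDD[T], labels: String*): Unit =
    printTable(rdd.take(maxResult), labels)

  def displayAsTable(xs: Traversable[Product], labels: String*): Unit =
    printTable(xs, labels)

  private def printTable(rows: Traversable[Product], labels: Seq[String]): Unit =
    println("%table " + labels.mkString("\t") + "\n" +
      rows.map(_.productIterator.mkString("\t")).mkString("\n"))
}
```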

@Leemoonsoo
Member

@doanduyhai
Thanks for sharing the idea.

To me, it still makes sense for the display function to support RDDs.
I think that gives users a feel similar to `DataFrame.show()`.
In that case, I believe it's only a matter of time before a user does `.collect().displayAsTable()` against a large dataset by mistake. It would help if the row limit could be made configurable through the interpreter settings.

And about normalization: adding a display function in Scala is a great idea. If you can convert the Java to Scala, it'll be awesome.

One more thing: what do you think about making the name simpler, `.displayAsTable()` -> `.display()`?

@Leemoonsoo
Member

@doanduyhai
Also, I'd like to know whether you're planning to add more commits here for the Java-to-Scala port and other improvements, or whether you'd prefer to merge this first and open a new pull request for further improvements.

@doanduyhai
Contributor Author

Hello @Leemoonsoo

So,

  1. I've renamed `displayAsTable()` to `display()` as suggested
  2. I've introduced a default limit for RDDs (`rdd.take()`) using the `zeppelin.spark.maxResult` property (a sketch follows below)
  3. Users can override the default limit by passing their own limit value to the `display()` method
  4. There is no default limit for Scala collections
  5. I've rebased from the master branch
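A minimal sketch of how this limit handling could look; the property name comes from this PR, but the lookup mechanism, the fallback value, and the method shapes are assumptions rather than the exact merged code:

```scala
import org.apache.spark.rdd.RDD

object RddDisplaySketch {
  // Hypothetical: default limit read from the zeppelin.spark.maxResult
  // property, with an assumed fallback value.
  private val defaultMaxResult: Int =
    sys.props.get("zeppelin.spark.maxResult").map(_.toInt).getOrElse(1000)

  implicit class RddDisplay[T <: Product](rdd: RDD[T]) {
    // The default limit applies when the user passes only column labels ...
    def display(columnLabels: String*): Unit =
      display(defaultMaxResult, columnLabels: _*)

    // ... unless an explicit limit is passed as the first argument.
    def display(limit: Int, columnLabels: String*): Unit =
      println("%table " + columnLabels.mkString("\t") + "\n" +
        rdd.take(limit).map(_.productIterator.mkString("\t")).mkString("\n"))
  }
}
```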

Here is a screen capture showing the new default-limit feature (`zeppelin.spark.maxResult` has been set to 100):
[screenshot]

For code normalization, I would prefer that we merge and close this pull request and open a new JIRA to prepare the port of the `ZeppelinContext` class to Scala with appropriate unit tests.

Baby steps.

@Leemoonsoo
Member

Great work!
Looks good to me. +1 for merge.

@asfgit closed this in 23922ae on Jun 19, 2015
Leemoonsoo pushed a commit to Leemoonsoo/zeppelin that referenced this pull request Sep 17, 2015

Author: DuyHai DOAN <[email protected]>

Closes apache#80 from doanduyhai/DisplayUtils and squashes the following commits:

62a2311 [DuyHai DOAN] Add default limit to RDD using zeppelin.spark.maxResult property
47a1b1f [DuyHai DOAN] Rename displayAsTable() to display()
a15294e [DuyHai DOAN] Enhance display function utility to accept Scala Traversable in addition to RDD
c1ee8fe [DuyHai DOAN] Add Scala code in module spark to expose utility functions for display

(cherry picked from commit 23922ae)
Signed-off-by: Lee moon soo <[email protected]>