ZEPPELIN-55 Make tutorial notebook independent from filesystem. #140

Leemoonsoo · 2015-07-03T20:57:50Z

The tutorial notebook downloads data using wget, unzips it, and loads the CSV file.
This works only in local mode and will not work with cluster deployments.

The solutions discussed in issue ZEPPELIN-55 are:

* Upload data to HDFS
* Upload data to S3

However, not all users will install HDFS, and accessing S3 via the HDFS client requires an accessKey and secretKey in the configuration.

This PR makes the tutorial notebook independent of any filesystem by reading the data from an http(s) address and parallelizing it directly.

Here's how this PR loads the data:

import org.apache.commons.io.IOUtils
import java.net.URL
import java.nio.charset.Charset

// load bank data: fetch the CSV over https on the driver, then distribute its lines
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line on ';', drop the header row, and strip quotes from the string fields
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")
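
The row-parsing step can also be tried without a SparkContext. The following is a minimal plain-Scala sketch of the same split, filter, and map logic, not part of this PR. The `BankParsing` object, the `unquote` helper, and the sample rows are hypothetical names and data introduced for illustration; in particular, labeling the skipped column at index 4 as `default` is an assumption.

```scala
// Plain-Scala sketch of the parsing logic used in the notebook paragraph.
// Sample rows are illustrative only; they are not fetched from bank.csv.
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

object BankParsing {
  // Strip the double quotes that wrap string fields.
  def unquote(field: String): String = field.replaceAll("\"", "")

  // Split on ';', skip the header row, and build a Bank from columns 0-3 and 5.
  def parse(lines: Seq[String]): Seq[Bank] =
    lines
      .map(_.split(";"))
      .filter(cols => cols(0) != "\"age\"")
      .map(cols => Bank(
        cols(0).toInt,
        unquote(cols(1)),
        unquote(cols(2)),
        unquote(cols(3)),
        unquote(cols(5)).toInt))

  // Hypothetical input: one header row followed by two data rows.
  val sample: Seq[String] = Seq(
    "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
    "30;\"unemployed\";\"married\";\"primary\";\"no\";1787",
    "33;\"services\";\"married\";\"secondary\";\"no\";4789"
  )

  val parsed: Seq[Bank] = parse(sample)
}
```

Calling `BankParsing.parsed.foreach(println)` prints one `Bank` per data row. In the notebook the same functions run inside `bankText.map(...)`, so a malformed row would surface as a task failure rather than a local exception.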

Leemoonsoo · 2015-07-03T20:57:59Z

Ready to merge

Leemoonsoo · 2015-07-04T18:39:00Z

Merging if there is no further discussion.

Author: Lee moon soo <[email protected]>

Closes #140 from Leemoonsoo/ZEPPELIN-55 and squashes the following commits:

653b1bc [Lee moon soo] Load data directly from http without using filesystem

(cherry picked from commit 4fa7019)
Signed-off-by: Lee moon soo <[email protected]>

Closes #6 (fixed by #1008)
Closes #89 (fixed by #1008)
Closes #46 (fixed by #140)
Closes #60 (fixed by #361)
Closes #211 (fixed by #361)
Closes #100 (fixed by #114)
Closes #190 (fixed by #796)
Closes #527

Load data directly from http without using filesystem

653b1bc

asfgit closed this in 4fa7019 Jul 5, 2015
