@Leemoonsoo
Member

The tutorial notebook downloads its data with wget and unzip, then loads the CSV file.
This works only in local mode and will not work with cluster deployments.

The solutions discussed in issue ZEPPELIN-55 are:

  • Upload data to HDFS
  • Upload data to S3

However, not all users will install HDFS, and accessing S3 via the HDFS client requires an accessKey and secretKey in the configuration.

This PR makes the tutorial notebook independent of any filesystem by reading the data from an http(s) address and parallelizing it directly.

Here's how this PR loads the data:

import java.net.URL
import java.nio.charset.Charset
import org.apache.commons.io.IOUtils

// load bank data over http(s) on the driver and distribute the lines as an RDD
val bankText = sc.parallelize(
    IOUtils.toString(
        new URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv"),
        Charset.forName("utf8")).split("\n"))

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line on ";", skip the header row, and strip quotes from the string fields
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
    s => Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
).toDF()
bank.registerTempTable("bank")
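
To check the parsing step without a Spark context, the same split/filter/map chain can be run on a plain in-memory string. This is only a minimal sketch: `BankRow`, `BankParseSketch`, `unquote`, `parse`, and the two sample rows are hypothetical names and made-up data for illustration, not part of this PR, and the sample assumes the same semicolon-separated, quoted layout the snippet above relies on.

```scala
// Hypothetical stand-in for the Bank case class above, using plain Int fields.
case class BankRow(age: Int, job: String, marital: String, education: String, balance: Int)

object BankParseSketch {
  // Strip the double quotes that wrap string fields in the CSV.
  def unquote(s: String): String = s.replaceAll("\"", "")

  // Mirror the map/filter/map chain on a local Seq instead of an RDD.
  def parse(text: String): Seq[BankRow] =
    text.split("\n").toSeq
      .map(_.trim)                               // drop a stray "\r" if present
      .filter(_.nonEmpty)                        // ignore blank lines
      .map(_.split(";"))
      .filter(cols => cols(0) != "\"age\"")      // skip the header row
      .map(cols => BankRow(
        cols(0).toInt,
        unquote(cols(1)),
        unquote(cols(2)),
        unquote(cols(3)),
        unquote(cols(5)).toInt))                 // column 5 is balance; column 4 is skipped

  // Made-up sample input: a header plus two data rows.
  val sample: String =
    "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"\n" +
    "30;\"unemployed\";\"married\";\"primary\";\"no\";1787\n" +
    "33;\"services\";\"married\";\"secondary\";\"no\";4789\n"
}
```

Calling `BankParseSketch.parse(BankParseSketch.sample)` yields one `BankRow` per data line, with the header dropped. In the notebook the same logic runs on the RDD returned by `sc.parallelize`.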

@Leemoonsoo
Member Author

Ready to merge

@Leemoonsoo
Member Author

Merging if there is no further discussion.

@asfgit asfgit closed this in 4fa7019 Jul 5, 2015
asfgit pushed a commit that referenced this pull request Jul 5, 2015
Author: Lee moon soo <[email protected]>

Closes #140 from Leemoonsoo/ZEPPELIN-55 and squashes the following commits:

653b1bc [Lee moon soo] Load data directly from http without using filesystem

(cherry picked from commit 4fa7019)
Signed-off-by: Lee moon soo <[email protected]>
asfgit pushed a commit that referenced this pull request Jun 22, 2016
Closes #6 (fixed by #1008)
Closes #89 (fixed by #1008)
Closes #46 (fixed by #140)
Closes #60 (fixed by #361)
Closes #211 (fixed by #361)
Closes #100 (fixed by #114)
Closes #190 (fixed by #796)
Closes #527