A debian:jessie-based Spark and Zeppelin Docker container.
This image contains:
- Spark 2.2.0 and Hadoop 2.7.3
- PySpark support with Python 3.4, NumPy, and PandaSQL (SciPy and Matplotlib are not included)
- A partial list of interpreters available out of the box:
  - spark
  - shell
  - angular
  - markdown
  - postgresql
  - jdbc
  - python
  - hbase
  - elasticsearch

If your favorite interpreter isn't included, consider adding it with Zeppelin's interpreter API.
To start Zeppelin with no configuration, run the container:
docker run --rm -p 8080:8080 ssalaues/zeppelin
Zeppelin will be available at http://${YOUR_DOCKER_HOST}:8080. The --rm flag removes the container on exit.
To build an image with custom JVM memory constraints, pass your values as build arguments: MEM sets the maximum heap size and MAX_PERM_SIZE sets the maximum permanent generation (PermGen) size. (Defaults: MEM=2gb and MAX_PERM_SIZE=1024m.)
docker build -t ssalaues/zeppelin --build-arg MEM=16g --build-arg MAX_PERM_SIZE=8g .
NOTE: Spark performs best when the entire data set fits in memory, so size these limits to your data.
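If a data set may be larger than the configured memory, one option is to persist it with an explicit disk fallback so partitions that don't fit are spilled rather than recomputed. A minimal Scala sketch, assuming Zeppelin's built-in spark session (the path and variable names are illustrative):

import org.apache.spark.storage.StorageLevel

// Cache a DataFrame so repeated queries avoid re-reading from the source.
// MEMORY_AND_DISK spills partitions that do not fit in memory to local disk
// instead of recomputing them, degrading gracefully when the data set is
// larger than the JVM heap.
val df = spark.read.json("s3a://foo/sample.json") // hypothetical path
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count() // materializes the cache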
The container can be run with arguments that limit the resources it can access. Adjust these limits to match your anticipated workload (by default, the container may use as many CPU resources as are available on the host).
docker run --cpus=4 -v "$(pwd)"/examples/notebooks:/usr/zeppelin/notebook -p 8080:8080 ssalaues/zeppelin
NOTE: This command also bind-mounts a host directory onto the container's default notebook directory, so notebooks persist across runs.
A pre-loaded "Demo" notebook (along with accessKey and secretKey values) shows how to connect to an S3-compatible endpoint. The demo uses the Yelp dataset as an example: simple code loads all JSON files in bucket foo, processes them, and registers the result as a SQL table for querying.
NOTE: Spark's default JSON reader expects each line to be a separate JSON object; it fails on "pretty" formatted JSON files.
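If you do need to load pretty-printed files, Spark 2.2's JSON reader has a multiLine option that parses each file as a single JSON document. A minimal sketch (the prefix is hypothetical):

// Parse each file as one multi-line JSON document instead of one object per line
val pretty = spark.read
  .option("multiLine", true)
  .json("s3a://foo/pretty/*.json") // hypothetical prefix

The demo paragraph itself sticks to line-delimited JSON: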
import spark.implicits._
import org.apache.spark.sql.functions._
// Point the s3a connector at the S3-compatible endpoint
sc.hadoopConfiguration.set("fs.s3a.endpoint", "http://endpoint:port")
// Register the S3A implementation so s3a:// paths are an accessible file system
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
// Credentials for the endpoint
sc.hadoopConfiguration.set("fs.s3a.access.key", "accessKey1")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "verySecretKey1")

// Load all JSON files in the bucket (glob patterns supported)
val allFiles = sc.textFile("s3a://foo/*.json")
val business = spark.read.json(allFiles)

// Explode the categories array into one row per category; businesses with a
// null categories column are kept as a single row with a null category
val b = business.withColumn("category", explode(
  when(col("categories").isNotNull, col("categories"))
    .otherwise(array(lit(null).cast("string")))
))
b.createOrReplaceTempView("business")
Running a SQL query on the data loaded above requires a new paragraph in Zeppelin. For example, the following query renders the parsed data as a visualization:
%sql
SELECT category, city, avg(stars) AS avg_stars
FROM business
GROUP BY category, city
ORDER BY category ASC, avg_stars DESC
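If you prefer the DataFrame API, the same aggregation can be written in a Scala paragraph against the DataFrame b from the example above; a sketch:

// Equivalent aggregation with the DataFrame API (b is the DataFrame registered above)
b.groupBy("category", "city")
  .agg(avg("stars").as("avg_stars"))
  .orderBy(asc("category"), desc("avg_stars"))
  .show()

Note that %sql paragraphs drive Zeppelin's built-in charts, so the SQL form is usually the more convenient one inside a notebook.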
Suggested reading for advanced JSON processing in Spark: http://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/