You can either specify them in `zeppelin-env.sh`, or in the interpreter setting page. Specifying them
in the interpreter setting page means you can use multiple versions of `spark` & `hadoop` in one Zeppelin instance.

### 4. New Version of SparkInterpreter
A new version of SparkInterpreter, with better Spark support and code completion, is available starting from Zeppelin 0.8.0. In 0.8.x it is enabled by default, but users can still switch back to the old SparkInterpreter by setting `zeppelin.spark.useNew` to `false` in the interpreter setting.
Starting from Zeppelin 0.9, the old Spark interpreter implementation has been removed entirely and the new Spark interpreter is the official Spark interpreter.

## SparkContext, SQLContext, SparkSession, ZeppelinContext
SparkContext, SQLContext and ZeppelinContext are automatically created and exposed as variable names `sc`, `sqlContext` and `z`, respectively, in Scala, Python and R environments.
There are two kinds of properties that are passed to `SparkConf`:
* Standard Spark properties (prefixed with `spark.`), e.g. `spark.executor.memory`, which are passed to `SparkConf` as-is.
* Non-standard Spark properties (prefixed with `zeppelin.spark.`), e.g. `zeppelin.spark.property_1`, which are passed to `SparkConf` with the prefix stripped, i.e. as `property_1` (as sketched below).
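
A minimal sketch of this mapping, using a hypothetical key `property_1` and placeholder values, written as key/value pairs the way they would appear in the interpreter setting or a conf paragraph:

```
# hypothetical custom key: arrives in SparkConf as property_1=value_1
zeppelin.spark.property_1    value_1
# standard spark.* keys are passed through to SparkConf unchanged
spark.executor.memory        4g
```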

## Dependency Management
For the Spark interpreter, you should not use Zeppelin's [Dependency Management](../usage/interpreter/dependency_management.html) for managing
third-party dependencies (`%spark.dep` is also no longer the recommended approach starting from Zeppelin 0.8). Instead, you should set the Spark properties `spark.jars`, `spark.files` and `spark.jars.packages`, in either of the two ways described below.
Once `SPARK_HOME` is set in `conf/zeppelin-env.sh`, Zeppelin uses `spark-submit` as the Spark interpreter runner, so these properties are picked up just as they would be for any `spark-submit` job. The Spark properties that distribute libraries are:

<table class="table-configuration">
  <tr>
    <th>spark-defaults.conf</th>
    <th>SPARK_SUBMIT_OPTIONS</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>spark.jars</td>
    <td>--jars</td>
    <td>Comma-separated list of local jars to include on the driver and executor classpaths.</td>
  </tr>
  <tr>
    <td>spark.jars.packages</td>
    <td>--packages</td>
    <td>Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. The format for the coordinates should be <code>groupId:artifactId:version</code>.</td>
  </tr>
  <tr>
    <td>spark.files</td>
    <td>--files</td>
    <td>Comma-separated list of files to be placed in the working directory of each executor.</td>
  </tr>
</table>

### 1. Set Spark properties on the Zeppelin side

On the Zeppelin side, you can set these properties either on the Spark interpreter setting page or via the [Generic ConfInterpreter](../usage/interpreter/overview.html).
Setting them through `SPARK_SUBMIT_OPTIONS` in `conf/zeppelin-env.sh` is not recommended: because `SPARK_SUBMIT_OPTIONS` is shared by all Spark interpreters, you cannot set different dependencies for different users. For reference, the discouraged `SPARK_SUBMIT_OPTIONS` style looks like this:

```bash
export SPARK_SUBMIT_OPTIONS="--packages com.databricks:spark-csv_2.10:1.2.0 --jars /path/mylib1.jar,/path/mylib2.jar --files /path/mylib1.py,/path/mylib2.zip,/path/mylib3.egg"
```
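
A minimal sketch of the recommended Zeppelin-side approach, assuming the Generic ConfInterpreter is available as `%spark.conf` for the Spark interpreter group (the paths and maven coordinates below are placeholders):

```
%spark.conf

spark.jars /path/mylib1.jar,/path/mylib2.jar
spark.jars.packages com.databricks:spark-csv_2.10:1.2.0
spark.files /path/mylib1.py,/path/mylib2.zip
```

Because the conf paragraph only takes effect before the interpreter process is launched, restart the Spark interpreter first if it is already running.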
### 2. Set Spark properties on the Spark side

On the Spark side, you can set them in `SPARK_HOME/conf/spark-defaults.conf`, e.g.

```
spark.jars /path/mylib1.jar,/path/mylib2.jar
spark.jars.packages com.databricks:spark-csv_2.10:1.2.0
spark.files /path/mylib1.py,/path/mylib2.egg,/path/mylib3.zip
```
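
A quick, optional way to check that the configured artifacts actually reached the Spark session (a sketch assuming Spark 2.x and the `%spark` Scala interpreter):

```scala
%spark
// print the maven coordinates configured via spark.jars.packages (empty string if none)
println(sc.getConf.get("spark.jars.packages", ""))
// list the jars and files that have been distributed to the cluster
sc.listJars().foreach(println)
sc.listFiles().foreach(println)
```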

### 3. Dynamic Dependency Loading via %spark.dep interpreter
> Note: The `%spark.dep` interpreter loads libraries into `%spark` and `%spark.pyspark` but not into `%spark.sql`, and it is no longer the recommended approach starting from Zeppelin 0.8. Prefer the Spark properties described above.

When your code requires an external library, instead of downloading it, copying it around and restarting Zeppelin, you can do the following with the `%spark.dep` interpreter.

* Load libraries recursively from a maven repository
* Load libraries from the local filesystem
* Add an additional maven repository
* Automatically add libraries to the Spark cluster (this can be turned off)

The dep interpreter leverages the Scala environment, so you can write any Scala code here.
Note that the `%spark.dep` paragraph should run before `%spark`, `%spark.pyspark` and `%spark.sql` paragraphs.

Usage:

```scala
%spark.dep
z.reset() // clean up previously added artifact and repository

// add maven repository
z.addRepo("RepoName").url("RepoURL")

// add maven snapshot repository
z.addRepo("RepoName").url("RepoURL").snapshot()

// add credentials for private maven repository
z.addRepo("RepoName").url("RepoURL").username("username").password("password")

// add artifact from filesystem
z.load("/path/to.jar")

// add artifact from maven repository, with no dependency
z.load("groupId:artifactId:version").excludeAll()

// add artifact recursively
z.load("groupId:artifactId:version")

// add artifact recursively except comma separated GroupID:ArtifactId list
z.load("groupId:artifactId:version").exclude("groupId:artifactId,groupId:artifactId, ...")

// exclude with pattern
z.load("groupId:artifactId:version").exclude(*)
z.load("groupId:artifactId:version").exclude("groupId:artifactId:*")
z.load("groupId:artifactId:version").exclude("groupId:*")

// local() skips adding artifact to spark clusters (skipping sc.addJar())
z.load("groupId:artifactId:version").local()
```
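
For example, a typical flow (the coordinates and path below are placeholders) is to load an artifact in a `%spark.dep` paragraph and then use it from a `%spark` paragraph:

```scala
%spark.dep
// must run before the %spark paragraph below (and before the Spark interpreter starts)
z.load("com.databricks:spark-csv_2.10:1.2.0")
```

```scala
%spark
// the loaded package is now on the interpreter's classpath
val df = sqlContext.read.format("com.databricks.spark.csv").load("/path/data.csv")
df.printSchema()
```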

## ZeppelinContext
Zeppelin automatically injects `ZeppelinContext` as variable `z` in your Scala/Python environment. `ZeppelinContext` provides some additional functions and utilities.
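
As a small sketch of what these utilities look like (assuming a DataFrame `df` already exists in the paragraph; `z.input` renders a dynamic form and `z.show` renders a table):

```scala
%spark
// render a text-input form in the paragraph, defaulting to "10"
val n = z.input("rows", "10").toString.toInt
// display the DataFrame with Zeppelin's built-in table visualization
z.show(df.limit(n))
```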