diff --git a/docs/quick-start.md b/docs/quick-start.md
index b88ae5f6bb313..cb5211af377e5 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -66,6 +66,11 @@ res3: Long = 15
./bin/pyspark
+
+Or if PySpark is installed with pip in your current environment:
+
+ pyspark
+
Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory:
{% highlight python %}
@@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm
# Self-Contained Applications
Suppose we wish to write a self-contained application using the Spark API. We will walk through a
-simple application in Scala (with sbt), Java (with Maven), and Python.
+simple application in Scala (with sbt), Java (with Maven), and Python (with pip).
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23
Now we will show how to write an application using the Python API (PySpark).
+
+If you are building a packaged PySpark application or library, you can add PySpark as a dependency in your `setup.py` file:
+
+{% highlight python %}
+ install_requires=[
+ 'pyspark=={{site.SPARK_VERSION}}'
+ ]
+{% endhighlight %}
+
+
As an example, we'll create a simple Spark application, `SimpleApp.py`:
{% highlight python %}
@@ -406,6 +421,16 @@ $ YOUR_SPARK_HOME/bin/spark-submit \
Lines with a: 46, Lines with b: 23
{% endhighlight %}
+If you have PySpark pip installed into your environment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided `spark-submit`, as you prefer.
+
+{% highlight bash %}
+# Use the Python interpreter to run your application
+$ python SimpleApp.py
+...
+Lines with a: 46, Lines with b: 23
+{% endhighlight %}
+
+
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md
index 0966d3870e8f8..c0215c8fb62f6 100644
--- a/docs/rdd-programming-guide.md
+++ b/docs/rdd-programming-guide.md
@@ -89,7 +89,18 @@ import org.apache.spark.SparkConf;
Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter,
so C libraries like NumPy can be used. It also works with PyPy 2.3+.
-To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory.
+Python 2.6 support was removed in Spark 2.2.0.
+
+Spark applications in Python can either be run with the `bin/spark-submit` script, which includes Spark at runtime, or by including PySpark in your `setup.py` as:
+
+{% highlight python %}
+ install_requires=[
+ 'pyspark=={{site.SPARK_VERSION}}'
+ ]
+{% endhighlight %}
+
+
+To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory.
This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster.
You can also use `bin/pyspark` to launch an interactive Python shell.
diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md
index 217c1a91a16f3..bab0be8ddeb9f 100644
--- a/docs/structured-streaming-kafka-integration.md
+++ b/docs/structured-streaming-kafka-integration.md
@@ -15,6 +15,8 @@ For Scala/Java applications using SBT/Maven project definitions, link your appli
For Python applications, you need to add this above library and its dependencies when deploying your
application. See the [Deploying](#deploying) subsection below.
+For experimenting on `spark-shell`, you also need to add the above library and its dependencies when invoking `spark-shell`. See the [Deploying](#deploying) subsection below.
+
## Reading Data from Kafka
### Creating a Kafka Source for Streaming Queries
@@ -607,5 +609,9 @@ and its dependencies can be directly added to `spark-submit` using `--packages`,
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+For experimenting on `spark-shell`, you can also use `--packages` to add `spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly:
+
+ ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ...
+
See [Application Submission Guide](submitting-applications.html) for more details about submitting
applications with external dependencies.
diff --git a/pom.xml b/pom.xml
index 1b812636e4f6e..c24334333d687 100644
--- a/pom.xml
+++ b/pom.xml
@@ -510,7 +510,7 @@
org.slf4j
jcl-over-slf4j
${slf4j.version}
- runtime
+
log4j
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
index 68e0abc11c39d..31dea6ad31b12 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java
@@ -240,7 +240,7 @@ public int getArrayOffset(int rowId) {
}
@Override
- public void loadBytes(Array array) {
+ public void loadBytes(ColumnVector.Array array) {
throw new UnsupportedOperationException();
}
@@ -304,7 +304,7 @@ public ArrowColumnVector(ValueVector vector) {
childColumns = new ColumnVector[1];
childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
- resultArray = new Array(childColumns[0]);
+ resultArray = new ColumnVector.Array(childColumns[0]);
} else if (vector instanceof MapVector) {
MapVector mapVector = (MapVector) vector;
accessor = new StructAccessor(mapVector);