From c57dfaef30dcf04d7c7911da1ac77679492d04c5 Mon Sep 17 00:00:00 2001 From: Liang-Chi Hsieh Date: Fri, 21 Jul 2017 08:43:38 +0100 Subject: [PATCH 1/4] [MINOR][SS][DOCS] Minor doc change for kafka integration ## What changes were proposed in this pull request? Minor change to kafka integration document for structured streaming. ## How was this patch tested? N/A, doc change only. Author: Liang-Chi Hsieh Closes #18550 from viirya/minor-ss-kafka-doc. --- docs/structured-streaming-kafka-integration.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/structured-streaming-kafka-integration.md b/docs/structured-streaming-kafka-integration.md index 217c1a91a16f3..bab0be8ddeb9f 100644 --- a/docs/structured-streaming-kafka-integration.md +++ b/docs/structured-streaming-kafka-integration.md @@ -15,6 +15,8 @@ For Scala/Java applications using SBT/Maven project definitions, link your appli For Python applications, you need to add this above library and its dependencies when deploying your application. See the [Deploying](#deploying) subsection below. +For experimenting on `spark-shell`, you need to add this above library and its dependencies too when invoking `spark-shell`. Also see the [Deploying](#deploying) subsection below. + ## Reading Data from Kafka ### Creating a Kafka Source for Streaming Queries @@ -607,5 +609,9 @@ and its dependencies can be directly added to `spark-submit` using `--packages`, ./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ... +For experimenting on `spark-shell`, you can also use `--packages` to add `spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}` and its dependencies directly, + + ./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_{{site.SCALA_BINARY_VERSION}}:{{site.SPARK_VERSION_SHORT}} ... + See [Application Submission Guide](submitting-applications.html) for more details about submitting applications with external dependencies. 
From 2f1468429ff3d4a3e577d708880a3e28186d2585 Mon Sep 17 00:00:00 2001 From: Takuya UESHIN Date: Fri, 21 Jul 2017 21:06:56 +0800 Subject: [PATCH 2/4] [SPARK-21472][SQL][FOLLOW-UP] Introduce ArrowColumnVector as a reader for Arrow vectors. ## What changes were proposed in this pull request? This is a follow-up of #18680. In some environment, a compile error happens saying: ``` .../sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java:243: error: not found: type Array public void loadBytes(Array array) { ^ ``` This pr fixes it. ## How was this patch tested? Existing tests. Author: Takuya UESHIN Closes #18701 from ueshin/issues/SPARK-21472_fup1. --- .../spark/sql/execution/vectorized/ArrowColumnVector.java | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java index 68e0abc11c39d..31dea6ad31b12 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java @@ -240,7 +240,7 @@ public int getArrayOffset(int rowId) { } @Override - public void loadBytes(Array array) { + public void loadBytes(ColumnVector.Array array) { throw new UnsupportedOperationException(); } @@ -304,7 +304,7 @@ public ArrowColumnVector(ValueVector vector) { childColumns = new ColumnVector[1]; childColumns[0] = new ArrowColumnVector(listVector.getDataVector()); - resultArray = new Array(childColumns[0]); + resultArray = new ColumnVector.Array(childColumns[0]); } else if (vector instanceof MapVector) { MapVector mapVector = (MapVector) vector; accessor = new StructAccessor(mapVector); From 113399b8b0efd8f5e64fc929aec9d2d1a6fc68f2 Mon Sep 17 00:00:00 2001 From: Sean Owen Date: Fri, 21 Jul 2017 22:42:37 +0800 Subject: [PATCH 3/4] 
[SPARK-19810][BUILD][FOLLOW-UP] jcl-over-slf4j dependency needs to be compile scope for SBT build ## What changes were proposed in this pull request? jcl-over-slf4j dependency needs to be compile scope for SBT build, to make it available for commons-logging dependents like Hadoop https://github.com/apache/spark/pull/17150#issuecomment-316950717 https://github.com/apache/spark/pull/17150/files#r128728089 ## How was this patch tested? Manual tests Author: Sean Owen Closes #18703 from srowen/SPARK-19810.2. --- pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pom.xml b/pom.xml index 1b812636e4f6e..c24334333d687 100644 --- a/pom.xml +++ b/pom.xml @@ -510,7 +510,7 @@ org.slf4j jcl-over-slf4j ${slf4j.version} - runtime + log4j From cc00e99d5396893b2d3d50960161080837cf950a Mon Sep 17 00:00:00 2001 From: Holden Karau Date: Fri, 21 Jul 2017 16:50:47 -0700 Subject: [PATCH 4/4] [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation. ## What changes were proposed in this pull request? Update the Quickstart and RDD programming guides to mention pip. ## How was this patch tested? Built docs locally. Author: Holden Karau Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation. --- docs/quick-start.md | 27 ++++++++++++++++++++++++++- docs/rdd-programming-guide.md | 13 ++++++++++++- 2 files changed, 38 insertions(+), 2 deletions(-) diff --git a/docs/quick-start.md b/docs/quick-start.md index b88ae5f6bb313..cb5211af377e5 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -66,6 +66,11 @@ res3: Long = 15 ./bin/pyspark + +Or if PySpark is installed with pip in your current environment: + + pyspark + Spark's primary abstraction is a distributed collection of items called a Dataset. Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets. Due to Python's dynamic nature, we don't need the Dataset to be strongly-typed in Python. 
As a result, all Datasets in Python are Dataset[Row], and we call it `DataFrame` to be consistent with the data frame concept in Pandas and R. Let's make a new DataFrame from the text of the README file in the Spark source directory: {% highlight python %} @@ -206,7 +211,7 @@ a cluster, as described in the [RDD programming guide](rdd-programming-guide.htm # Self-Contained Applications Suppose we wish to write a self-contained application using the Spark API. We will walk through a -simple application in Scala (with sbt), Java (with Maven), and Python. +simple application in Scala (with sbt), Java (with Maven), and Python (pip).
@@ -367,6 +372,16 @@ Lines with a: 46, Lines with b: 23 Now we will show how to write an application using the Python API (PySpark). + +If you are building a packaged PySpark application or library, you can add it to your setup.py file as: + +{% highlight python %} + install_requires=[ + 'pyspark=={{site.SPARK_VERSION}}' + ] +{% endhighlight %} + + As an example, we'll create a simple Spark application, `SimpleApp.py`: {% highlight python %} @@ -406,6 +421,16 @@ $ YOUR_SPARK_HOME/bin/spark-submit \ Lines with a: 46, Lines with b: 23 {% endhighlight %} +If you have PySpark pip installed into your environment (e.g., `pip install pyspark`), you can run your application with the regular Python interpreter or use the provided `spark-submit`, as you prefer. + +{% highlight bash %} +# Use the Python interpreter to run your application +$ python SimpleApp.py +... +Lines with a: 46, Lines with b: 23 +{% endhighlight %} + +
diff --git a/docs/rdd-programming-guide.md b/docs/rdd-programming-guide.md index 0966d3870e8f8..c0215c8fb62f6 100644 --- a/docs/rdd-programming-guide.md +++ b/docs/rdd-programming-guide.md @@ -89,7 +89,18 @@ import org.apache.spark.SparkConf; Spark {{site.SPARK_VERSION}} works with Python 2.7+ or Python 3.4+. It can use the standard CPython interpreter, so C libraries like NumPy can be used. It also works with PyPy 2.3+. -To run Spark applications in Python, use the `bin/spark-submit` script located in the Spark directory. +Python 2.6 support was removed in Spark 2.2.0. + +Spark applications in Python can either be run with the `bin/spark-submit` script, which includes Spark at runtime, or by including it in your setup.py as: + +{% highlight python %} + install_requires=[ + 'pyspark=={{site.SPARK_VERSION}}' + ] +{% endhighlight %} + + +To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use `bin/pyspark` to launch an interactive Python shell.
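Reviewer note, not part of the patch: the two ways of running a Python application described in PATCH 4/4 can be sketched as a small launcher. This is a minimal illustration under stated assumptions, not code from Spark. The helper name `launch_command` is hypothetical, `YOUR_SPARK_HOME` is the placeholder used in the quick start, and the check only tests whether a `pyspark` package is importable, which is true after `pip install pyspark` but may also be true if PySpark was put on the path some other way.

```python
import importlib.util
import sys


def launch_command(script, spark_home="YOUR_SPARK_HOME"):
    """Return the argv list for running a PySpark script (hypothetical helper)."""
    # find_spec returns None when no importable top-level `pyspark` package exists.
    if importlib.util.find_spec("pyspark") is not None:
        # PySpark is importable, so the regular Python interpreter is enough.
        return [sys.executable, script]
    # Otherwise fall back to spark-submit, which supplies Spark at runtime.
    return [spark_home + "/bin/spark-submit", script]


if __name__ == "__main__":
    print(" ".join(launch_command("SimpleApp.py")))
```

The function only builds the command and does not execute it, so the sketch can be tried without a Spark installation.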