
Conversation

@xuanyuanking (Member) commented Oct 16, 2018

What changes were proposed in this pull request?

  1. Split the main page of sql-programming-guide into 7 parts:
  • Getting Started
  • Data Sources
  • Performance Tuning
  • Distributed SQL Engine
  • PySpark Usage Guide for Pandas with Apache Arrow
  • Migration Guide
  • Reference
  2. Add a left menu for sql-programming-guide, keeping a first-level index for each part in the menu.
    [screenshot: the new left navigation menu]

How was this patch tested?

Local test with jekyll build/serve.

@xuanyuanking xuanyuanking changed the title [SPARK-24499][Doc] Split the page of sql-programming-guide [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages Oct 16, 2018
@SparkQA commented Oct 16, 2018

Test build #97453 has finished for PR 22746 at commit c2ad4a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking (Member Author):

@gatorsmile Sorry for the delay on this; please take a look when you have time.

<div class="codetabs">
<div data-lang="scala" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](#data-sources).
Member: The link [Spark data sources](#data-sources) does not work after this change. Could you fix all the similar cases? Thanks!

Member Author: Sorry for missing that; I will check all internal links by searching for `<a href="#` in the generated HTML.

Member Author: Done in 58115e5; also fixed links in ml-pipeline.md, sparkr.md, and structured-streaming-programming-guide.md.
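For context, the quoted passage covers three DataFrame creation paths. A minimal Scala sketch of all three follows; the app name, Hive table name, and JSON path are placeholders, not part of this PR:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() is only needed for the Hive-table path.
val spark = SparkSession.builder()
  .appName("CreateDataFrames")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// 1. From an existing RDD (here built from a local Seq)
val fromRdd = spark.sparkContext
  .parallelize(Seq((1, "a"), (2, "b")))
  .toDF("id", "value")

// 2. From a Hive table (assumes a table named `src` already exists)
val fromHive = spark.table("src")

// 3. From a Spark data source
val fromSource = spark.read.json("examples/src/main/resources/people.json")
```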

Spark SQL and DataFrames support the following data types:

* Numeric types
- `ByteType`: Represents 1-byte signed integer numbers.
Member: nit: use 2-space indent.

Member Author: Thanks, done in 58115e5.
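The list under review enumerates Spark SQL's type objects, which are used when defining schemas programmatically. A small illustration (the field names are made up):

```scala
import org.apache.spark.sql.types._

// A schema mixing numeric types from the guide's list;
// ByteType holds 1-byte signed integers, as documented above.
val schema = StructType(Seq(
  StructField("tiny", ByteType, nullable = false),
  StructField("count", LongType, nullable = true),
  StructField("price", DoubleType, nullable = true)
))
```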

url: sql-getting-started.html#starting-point-sparksession
- text: Creating DataFrames
url: sql-getting-started.html#creating-dataframes
- text: Untyped Dataset Operations
Contributor: How about Untyped Dataset Operations (DataFrame operations)?

Member Author: Makes sense; keeping it consistent with sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations. Done in b3fc39d.

@cloud-fan (Contributor):

This is very cool! Thanks!

@SparkQA commented Oct 17, 2018

Test build #97482 has finished for PR 22746 at commit 58115e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.

## Arithmetic operations
Member: The space indent here is wrong.

[screenshot: the misrendered section]

Member Author (@xuanyuanking, Oct 17, 2018): Ah, thanks! Fixed this and the left menu in b3fc39d.
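To make the quoted NaN semantics concrete, a short sketch assuming an existing SparkSession named `spark`:

```scala
import spark.implicits._

val df = Seq(1.0, Double.NaN, 3.0).toDF("v")

// Ascending sort puts NaN last: it compares larger than any other
// numeric value, per the semantics described in the guide.
df.orderBy("v").show()
// 1.0, 3.0, NaN
```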

@gengliangwang (Member):

This is cool, +1 👍

@xuanyuanking (Member Author):

My pleasure, thanks for reviewing this!

@SparkQA commented Oct 17, 2018

Test build #97500 has finished for PR 22746 at commit b3fc39d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- text: Avro Files
url: sql-data-sources-avro.html
- text: Troubleshooting
url: sql-data-sources-other.html#troubleshooting
Member: Hi, @xuanyuanking. Generally, it looks good.

Can we split sql-data-sources-other into three files? To me, troubleshooting looks odd at this level of information. Actually, sql-data-sources-other covers only two file formats plus troubleshooting for JDBC.

Maybe sql-data-sources-orc, sql-data-sources-json, and troubleshooting?

Member Author (@xuanyuanking, Oct 18, 2018): Makes sense; I will split it into sql-data-sources-orc, sql-data-sources-json, and sql-data-sources-troubleshooting (it still needs the sql-data-sources prefix because "sql-data-sources" is used as the nav-left tag; otherwise the nav menu will not show the subitems). Done in 27b066d, thanks!

</tr>
</table>

## JSON Datasets
Member: For consistency with the other data sources, Datasets -> Files?

Member Author: Maybe keep Datasets? Per the description below it: "Note that the file that is offered as a json file is not a typical JSON file." WDYT?

Member: We do support a typical JSON file, don't we?

"For a regular multi-line JSON file, set the multiLine option to true."

IMO, that notice means we provide more flexibility.

Member Author: Got it, will change it soon.

Member Author: Done in 17995f9. Thanks!
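For reference, the `multiLine` option discussed in this thread is set on read; the path below is a placeholder:

```scala
// Reads a file containing a regular, possibly multi-line JSON document
// (or array) instead of the default line-delimited JSON.
val people = spark.read
  .option("multiLine", true)
  .json("path/to/people.json")
```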

## Running the Thrift JDBC/ODBC server

The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
Member: nit. 1.2.1 You -> 1.2.1. You

Member Author: Thanks, done in 27b066d.

@SparkQA commented Oct 18, 2018

Test build #97519 has finished for PR 22746 at commit 27b066d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.



* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
* [Manually Sepcifying Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options)
Member: sepcifying -> specifying. In other places, too.
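The links under review point at the generic load/save section of the split guide; a quick illustration of that API (paths are placeholders):

```scala
// Manually specifying the source format instead of the parquet default.
val usersDF = spark.read.format("json").load("path/to/people.json")
usersDF.write.format("parquet").save("path/to/people.parquet")
```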

Below are couple of restrictions while using this option.<br>
<ol>
<li> It is not allowed to specify `dbtable` and `query` options at the same time. </li>
<li> It is not allowed to spcify `query` and `partitionColumn` options at the same time. When specifying
Member: spcify -> specify
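The restriction being documented concerns the JDBC source's `dbtable` and `query` options; a hedged sketch of the `query` form (connection details are placeholders):

```scala
// `query` and `dbtable` are mutually exclusive, per the quoted restriction.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("query", "SELECT id, name FROM people WHERE id > 10")
  .option("user", "username")
  .option("password", "password")
  .load()
```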

<td><code>"ignore"</code></td>
<td>
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
the save operation is expected to not save the contents of the DataFrame and to not
Member: nit: expected to not ... to not ... -> expected not to ... not to ...?
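For readers, the save mode described in this table cell is selected as below; `df` and the output path are placeholders:

```scala
import org.apache.spark.sql.SaveMode

// A no-op if data already exists at the target, similar to
// CREATE TABLE IF NOT EXISTS in SQL.
df.write.mode(SaveMode.Ignore).parquet("path/to/output")
```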


### Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
Member: ProtocolBuffer -> Protocol Buffers
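The section under review describes Parquet schema merging, which is enabled per read (a sketch; the path is a placeholder):

```scala
// Schema merging is off by default because it is relatively expensive;
// enable it explicitly when reading files with compatible, evolving schemas.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("path/to/table")
```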


Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
You may run `./bin/spark-sql --help` for a complete list of all available
options.
Member: super nit: this line can be concatenated with the previous line.


## Upgrading From Spark SQL 2.4 to 3.0

- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
Member: the builder come -> the builder comes?
cc @ueshin
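The practical consequence of this migration note (shown as a Scala sketch, since the Java/Scala API has behaved this way since 2.3; the master and config values are placeholders) is to set configurations before the first session is created:

```scala
import org.apache.spark.sql.SparkSession

// Set configurations up front; once a SparkContext exists, a later
// builder will reuse it and will NOT update its SparkConf.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()
```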

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
- In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
- Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
Member: an column -> a column
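The 2.4 defaults quoted above can also be set explicitly, e.g. to pin behavior when comparing against 2.3 (a sketch assuming an existing session `spark`):

```scala
// These match the 2.4 defaults described in the migration note.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.filterPushdown", "true")
```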

- In PySpark, `df.replace` does not allow to omit `value` when `to_replace` is not a dictionary. Previously, `value` could be omitted in the other cases and had `None` by default, which is counterintuitive and error-prone.
- Un-aliased subquery's semantic has not been well defined with confusing behaviors. Since Spark 2.3, we invalidate such confusing cases, for example: `SELECT v.i from (SELECT i FROM v)`, Spark will throw an analysis exception in this case because users should not be able to use the qualifier inside a subquery. See [SPARK-20690](https://issues.apache.org/jira/browse/SPARK-20690) and [SPARK-21335](https://issues.apache.org/jira/browse/SPARK-21335) for more details.

- When creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 2.3, the builder come to not update the configurations. If you want to update them, you need to update them prior to creating a `SparkSession`.
Member: the builder come -> the builder comes?

<td>4194304 (4 MB)</td>
<td>
The estimated cost to open a file, measured by the number of bytes could be scanned in the same
time. This is used when putting multiple files into a partition. It is better to over estimated,
Member: nit: It is better to over estimated -> It is better to over-estimate?
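The configuration under review can be set at runtime (a sketch assuming a session `spark`; the value shown is the documented default):

```scala
// Over-estimating is the safer direction: partitions with small files
// then schedule faster than partitions with large files.
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024L)
```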

<b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
</th>
<th>
<b>Users can use explict cast</b>
Member: explict -> explicit
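The table cell under review contrasts implicit promotion (which raises `AnalysisException`) with an explicit cast; the latter looks like this, with table and column names as placeholders:

```scala
// An explicit CAST sidesteps the AnalysisException raised when Spark
// refuses to promote between integer and string types implicitly.
val casted = spark.sql("SELECT CAST(id AS STRING) AS id_str FROM people")
```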

<b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
</th>
<th>
<b>Users can use explict cast</b>
Member: explict -> explicit

@xuanyuanking (Member Author) commented Oct 18, 2018

@kiszk Many thanks for all the detailed checks! Addressed in 17995f9. I also double-checked by grepping for each typo you found.

@SparkQA commented Oct 18, 2018

Test build #97535 has finished for PR 22746 at commit 17995f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

LGTM

@gatorsmile (Member):

We might have missed something in the code review. Let's try out the new docs and see whether anything slipped through.

@asfgit asfgit closed this in 987f386 Oct 18, 2018
asfgit pushed a commit that referenced this pull request Oct 18, 2018
…to multiple separate pages

1. Split the main page of sql-programming-guide into 7 parts:

- Getting Started
- Data Sources
- Performance Tuning
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference

2. Add a left menu for sql-programming-guide, keeping a first-level index for each part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 987f386)
Signed-off-by: gatorsmile <[email protected]>
@gatorsmile (Member):

Thanks! Merged to master/2.4. For 2.4 branch, I manually removed the migration guide from 2.4 to 3.0.

@xuanyuanking (Member Author):

Thanks, all reviewers! Sorry for the remaining mistakes in the new docs; I'll keep checking on this.

@xuanyuanking xuanyuanking deleted the SPARK-24499 branch October 19, 2018 03:50
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019