
Conversation

@xuanyuanking (Member) commented Oct 16, 2018

What changes were proposed in this pull request?

  1. Split the main page of sql-programming-guide into 7 parts:
  • Getting Started
  • Data Sources
  • Performance Tuning
  • Distributed SQL Engine
  • PySpark Usage Guide for Pandas with Apache Arrow
  • Migration Guide
  • Reference
  2. Add a left menu for sql-programming-guide, keeping a first-level index for each part in the menu.
    [screenshot: the new left navigation menu]

How was this patch tested?

Local test with jekyll build/serve.

@xuanyuanking xuanyuanking changed the title [SPARK-24499][Doc] Split the page of sql-programming-guide [SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages Oct 16, 2018
@SparkQA commented Oct 16, 2018

Test build #97453 has finished for PR 22746 at commit c2ad4a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking (Member Author):

@gatorsmile Sorry for the delay on this; please take a look when you have time.

<div class="codetabs">
<div data-lang="scala" markdown="1">
With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
from a Hive table, or from [Spark data sources](#data-sources).
Member: The link [Spark data sources](#data-sources) does not work after this change. Could you fix all the similar cases? Thanks!

Member Author: Sorry for missing that; I will check all internal links by searching for `<a href="#` in the generated HTML.

Member Author: Done in 58115e5; also fixed links in ml-pipeline.md, sparkr.md, and structured-streaming-programming-guide.md.
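For context, the quoted passage covers three DataFrame creation paths. A minimal Scala sketch of all three follows; the app name, Hive table name, and JSON path are placeholders, not part of this PR:

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() is only needed for the Hive-table path.
val spark = SparkSession.builder()
  .appName("CreateDataFrames")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// 1. From an existing RDD (here built from a local Seq)
val fromRdd = spark.sparkContext
  .parallelize(Seq((1, "a"), (2, "b")))
  .toDF("id", "value")

// 2. From a Hive table (assumes a table named `src` already exists)
val fromHive = spark.table("src")

// 3. From a Spark data source
val fromSource = spark.read.json("examples/src/main/resources/people.json")
```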

Spark SQL and DataFrames support the following data types:

* Numeric types
- `ByteType`: Represents 1-byte signed integer numbers.
Member: nit: use 2-space indent.

Member Author: Thanks, done in 58115e5.
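The list under review enumerates Spark SQL's type objects, which are used when defining schemas programmatically. A small illustration (the field names are made up):

```scala
import org.apache.spark.sql.types._

// A schema mixing numeric types from the guide's list;
// ByteType holds 1-byte signed integers, as documented above.
val schema = StructType(Seq(
  StructField("tiny", ByteType, nullable = false),
  StructField("count", LongType, nullable = true),
  StructField("price", DoubleType, nullable = true)
))
```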

url: sql-getting-started.html#starting-point-sparksession
- text: Creating DataFrames
url: sql-getting-started.html#creating-dataframes
- text: Untyped Dataset Operations
Contributor: How about Untyped Dataset Operations (DataFrame operations)?

Member Author: Makes sense; keeping it consistent with sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations. Done in b3fc39d.

@cloud-fan (Contributor):

This is very cool! Thanks!

@SparkQA commented Oct 17, 2018

Test build #97482 has finished for PR 22746 at commit 58115e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- NaN is treated as a normal value in join keys.
- NaN values go last when in ascending order, larger than any other numeric value.

## Arithmetic operations
Member: The space indent here is wrong.

[screenshot: the misrendered section]

Member Author (@xuanyuanking, Oct 17, 2018): Ah, thanks! Fixed this and the left menu in b3fc39d.
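To make the quoted NaN semantics concrete, a short sketch assuming an existing SparkSession named `spark`:

```scala
import spark.implicits._

val df = Seq(1.0, Double.NaN, 3.0).toDF("v")

// Ascending sort puts NaN last: it compares larger than any other
// numeric value, per the semantics described in the guide.
df.orderBy("v").show()
// 1.0, 3.0, NaN
```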

@gengliangwang (Member):

This is cool, +1 👍

@xuanyuanking (Member Author):

My pleasure, thanks for reviewing this!

@SparkQA commented Oct 17, 2018

Test build #97500 has finished for PR 22746 at commit b3fc39d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- text: Avro Files
url: sql-data-sources-avro.html
- text: Troubleshooting
url: sql-data-sources-other.html#troubleshooting
Member: Hi, @xuanyuanking. Generally, it looks good.

Can we split sql-data-sources-other into three files? To me, troubleshooting looks odd at this level of information. Actually, sql-data-sources-other covers only two file formats plus troubleshooting for JDBC.

Maybe sql-data-sources-orc, sql-data-sources-json, and troubleshooting?

Member Author (@xuanyuanking, Oct 18, 2018): Makes sense; I will split it into sql-data-sources-orc, sql-data-sources-json, and sql-data-sources-troubleshooting (it still needs the sql-data-sources prefix because "sql-data-sources" is used as the nav-left tag; otherwise the nav menu will not show the subitems). Done in 27b066d, thanks!

</tr>
</table>

## JSON Datasets
Member: For consistency with the other data sources, Datasets -> Files?

Member Author: Maybe keep Datasets? Per the description below it: "Note that the file that is offered as a json file is not a typical JSON file." WDYT?

Member: We do support a typical JSON file, don't we?

"For a regular multi-line JSON file, set the multiLine option to true."

IMO, that notice means we provide more flexibility.

Member Author: Got it, will change it soon.

Member Author: Done in 17995f9. Thanks!
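For reference, the `multiLine` option discussed in this thread is set on read; the path below is a placeholder:

```scala
// Reads a file containing a regular, possibly multi-line JSON document
// (or array) instead of the default line-delimited JSON.
val people = spark.read
  .option("multiLine", true)
  .json("path/to/people.json")
```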

## Running the Thrift JDBC/ODBC server

The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
Member: nit. 1.2.1 You -> 1.2.1. You

Member Author: Thanks, done in 27b066d.

@SparkQA commented Oct 18, 2018

Test build #97519 has finished for PR 22746 at commit 27b066d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.



* [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
* [Manually Sepcifying Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options)
Member: sepcifying -> specifying. In other places, too.
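The links under review point at the generic load/save section of the split guide; a quick illustration of that API (paths are placeholders):

```scala
// Manually specifying the source format instead of the parquet default.
val usersDF = spark.read.format("json").load("path/to/people.json")
usersDF.write.format("parquet").save("path/to/people.parquet")
```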

Below are couple of restrictions while using this option.<br>
<ol>
<li> It is not allowed to specify `dbtable` and `query` options at the same time. </li>
<li> It is not allowed to spcify `query` and `partitionColumn` options at the same time. When specifying
Member: spcify -> specify
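The restriction being documented concerns the JDBC source's `dbtable` and `query` options; a hedged sketch of the `query` form (connection details are placeholders):

```scala
// `query` and `dbtable` are mutually exclusive, per the quoted restriction.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("query", "SELECT id, name FROM people WHERE id > 10")
  .option("user", "username")
  .option("password", "password")
  .load()
```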

<td><code>"ignore"</code></td>
<td>
Ignore mode means that when saving a DataFrame to a data source, if data already exists,
the save operation is expected to not save the contents of the DataFrame and to not
Member: nit: expected to not ... to not ... -> expected not to ... not to ...?
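For readers, the save mode described in this table cell is selected as below; `df` and the output path are placeholders:

```scala
import org.apache.spark.sql.SaveMode

// A no-op if data already exists at the target, similar to
// CREATE TABLE IF NOT EXISTS in SQL.
df.write.mode(SaveMode.Ignore).parquet("path/to/output")
```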


### Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
Member: ProtocolBuffer -> Protocol Buffers
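The section under review describes Parquet schema merging, which is enabled per read (a sketch; the path is a placeholder):

```scala
// Schema merging is off by default because it is relatively expensive;
// enable it explicitly when reading files with compatible, evolving schemas.
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("path/to/table")
```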


Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
You may run `./bin/spark-sql --help` for a complete list of all available
options.
Member: super nit: this line can be concatenated with the previous line.


## Upgrading From Spark SQL 2.4 to 3.0

- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
Member: the builder come -> the builder comes?
cc @ueshin
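The practical consequence of this migration note (shown as a Scala sketch, since the Java/Scala API has behaved this way since 2.3; the master and config values are placeholders) is to set configurations before the first session is created:

```scala
import org.apache.spark.sql.SparkSession

// Set configurations up front; once a SparkContext exists, a later
// builder will reuse it and will NOT update its SparkConf.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()
```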

- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
- In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
- Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
- Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
Member: an column -> a column
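The 2.4 defaults quoted above can also be set explicitly, e.g. to pin behavior when comparing against 2.3 (a sketch assuming an existing session `spark`):

```scala
// These match the 2.4 defaults described in the migration note.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.filterPushdown", "true")
```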

- In PySpark, `df.replace` does not allow to omit `value` when `to_replace` is not a dictionary. Previously, `value` could be omitted in the other cases and had `None` by default, which is counterintuitive and error-prone.
- Un-aliased subquery's semantic has not been well defined with confusing behaviors. Since Spark 2.3, we invalidate such confusing cases, for example: `SELECT v.i from (SELECT i FROM v)`, Spark will throw an analysis exception in this case because users should not be able to use the qualifier inside a subquery. See [SPARK-20690](https://issues.apache.org/jira/browse/SPARK-20690) and [SPARK-21335](https://issues.apache.org/jira/browse/SPARK-21335) for more details.

- When creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 2.3, the builder come to not update the configurations. If you want to update them, you need to update them prior to creating a `SparkSession`.
Member: the builder come -> the builder comes?

<td>4194304 (4 MB)</td>
<td>
The estimated cost to open a file, measured by the number of bytes could be scanned in the same
time. This is used when putting multiple files into a partition. It is better to over estimated,
Member: nit: It is better to over estimated -> It is better to over-estimate?
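The configuration under review can be set at runtime (a sketch assuming a session `spark`; the value shown is the documented default):

```scala
// Over-estimating is the safer direction: partitions with small files
// then schedule faster than partitions with large files.
spark.conf.set("spark.sql.files.openCostInBytes", 4 * 1024 * 1024L)
```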

<b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
</th>
<th>
<b>Users can use explict cast</b>
Member: explict -> explicit
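The table cell under review contrasts implicit promotion (which raises `AnalysisException`) with an explicit cast; the latter looks like this, with table and column names as placeholders:

```scala
// An explicit CAST sidesteps the AnalysisException raised when Spark
// refuses to promote between integer and string types implicitly.
val casted = spark.sql("SELECT CAST(id AS STRING) AS id_str FROM people")
```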

<b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
</th>
<th>
<b>Users can use explict cast</b>
Member: explict -> explicit

@xuanyuanking (Member Author) commented Oct 18, 2018

@kiszk Many thanks for all the detailed checks! Addressed in 17995f9. I also double-checked by grepping for each typo you found.

@SparkQA commented Oct 18, 2018

Test build #97535 has finished for PR 22746 at commit 17995f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

LGTM

@gatorsmile (Member):

We might have missed something in the code review. Let's try out the new docs and see whether anything slipped through.

@asfgit asfgit closed this in 987f386 Oct 18, 2018
asfgit pushed a commit that referenced this pull request Oct 18, 2018
…to multiple separate pages

1. Split the main page of sql-programming-guide into 7 parts:

- Getting Started
- Data Sources
- Performance Tuning
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference

2. Add a left menu for sql-programming-guide, keeping a first-level index for each part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 987f386)
Signed-off-by: gatorsmile <[email protected]>
@gatorsmile (Member):

Thanks! Merged to master/2.4. For 2.4 branch, I manually removed the migration guide from 2.4 to 3.0.

@xuanyuanking (Member Author):

Thanks, all reviewers! Sorry for the remaining mistakes in the new docs; I'll keep checking on this.

@xuanyuanking xuanyuanking deleted the SPARK-24499 branch October 19, 2018 03:50
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019