[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages #22746
Conversation
Test build #97453 has finished for PR 22746 at commit
@gatorsmile Sorry for the delay on this; please have a look when you have time.
docs/sql-getting-started.md
Outdated
    <div class="codetabs">
    <div data-lang="scala" markdown="1">
    With a `SparkSession`, applications can create DataFrames from an [existing `RDD`](#interoperating-with-rdds),
    from a Hive table, or from [Spark data sources](#data-sources).
The link [Spark data sources](#data-sources) does not work after this change. Could you fix all the similar cases? Thanks!
Sorry for missing that; I'll check all inner links by searching for `<a href="#` in the generated HTML.
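For reference, a minimal sketch of that check (assuming the generated site lands in `_site/` after `jekyll build`; this script is illustrative, not part of the PR):

```python
import os
import re

SITE_DIR = "_site"  # assumed Jekyll output directory

for root, _, files in os.walk(SITE_DIR):
    for name in files:
        if not name.endswith(".html"):
            continue
        path = os.path.join(root, name)
        with open(path, encoding="utf-8") as f:
            html = f.read()
        # Anchors defined on this page (id= or name= attributes).
        defined = set(re.findall(r'(?:id|name)="([^"]+)"', html))
        # Same-page fragment links must point at one of those anchors.
        for frag in re.findall(r'href="#([^"]+)"', html):
            if frag not in defined:
                print("%s: broken fragment link #%s" % (path, frag))
```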
Done in 58115e5; also fixed links in ml-pipeline.md, sparkr.md and structured-streaming-programming-guide.md.
docs/sql-reference.md
Outdated
    Spark SQL and DataFrames support the following data types:

    * Numeric types
        - `ByteType`: Represents 1-byte signed integer numbers.
nit: use 2 space indent.
Thanks, done in 58115e5.
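As a quick illustration of the type being documented (a minimal PySpark sketch; the column name is made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ByteType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# ByteType stores 1-byte signed integers, i.e. values in [-128, 127].
schema = StructType([StructField("flag", ByteType(), nullable=True)])
df = spark.createDataFrame([(127,), (-128,)], schema)
df.printSchema()
```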
docs/_data/menu-sql.yaml
Outdated
      url: sql-getting-started.html#starting-point-sparksession
    - text: Creating DataFrames
      url: sql-getting-started.html#creating-dataframes
    - text: Untyped Dataset Operations
How about `Untyped Dataset Operations (DataFrame operations)`?
Makes sense; that keeps it consistent with sql-getting-started.html#untyped-dataset-operations-aka-dataframe-operations. Done in b3fc39d.
This is very cool! thanks!
Test build #97482 has finished for PR 22746 at commit
docs/sql-reference.md
Outdated
    - NaN is treated as a normal value in join keys.
    - NaN values go last when in ascending order, larger than any other numeric value.

    ## Arithmetic operations
Ah, thanks! Fixed this and the left menu in b3fc39d.
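For anyone skimming the hunk above, the documented NaN semantics are easy to observe (a minimal PySpark sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (float("nan"),), (2.0,)], ["v"])

# NaN goes last in ascending order, larger than any other numeric value.
df.orderBy("v").show()

# NaN is treated as a normal value in join keys, so the NaN rows match.
df.join(df, "v").show()
```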
This is cool +1 👍
My pleasure, thanks for reviewing this!
Test build #97500 has finished for PR 22746 at commit
docs/_data/menu-sql.yaml
Outdated
    - text: Avro Files
      url: sql-data-sources-avro.html
    - text: Troubleshooting
      url: sql-data-sources-other.html#troubleshooting
Hi, @xuanyuanking. Generally, it looks good.
Can we split sql-data-sources-other into three files? To me, Troubleshooting looks odd at this level of information; sql-data-sources-other actually covers only two file formats plus troubleshooting for JDBC.
Maybe sql-data-sources-orc, sql-data-sources-json, and troubleshooting?
docs/sql-data-sources-other.md
Outdated
    </tr>
    </table>

    ## JSON Datasets
For consistency with the other data sources, Datasets -> Files?
Maybe keep Datasets? Per the description below: `Note that the file that is offered as a json file is not a typical JSON file.` WDYT?
We support a typical JSON file, don't we?
For a regular multi-line JSON file, set the `multiLine` option to `true`.
IMO, that notice just means we provide more flexibility.
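To illustrate (a minimal PySpark sketch; the file paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default: each line must contain one complete, self-contained JSON object.
df = spark.read.json("examples/people.json")

# A regular, possibly pretty-printed multi-line JSON file also works
# once the multiLine option is enabled.
df_multi = spark.read.option("multiLine", "true").json("examples/people_pretty.json")
```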
Got it, will change it soon.
Done in 17995f9. Thanks!
docs/sql-distributed-sql-engine.md
Outdated
    ## Running the Thrift JDBC/ODBC server

    The Thrift JDBC/ODBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
    in Hive 1.2.1 You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.
nit. 1.2.1 You -> 1.2.1. You
Thanks, done in 27b066d.
Test build #97519 has finished for PR 22746 at commit
docs/sql-data-sources.md
Outdated
    * [Generic Load/Save Functions](sql-data-sources-load-save-functions.html)
    * [Manually Sepcifying Options](sql-data-sources-load-save-functions.html#manually-sepcifying-options)
sepcifying -> specifying. In other places, too.
docs/sql-data-sources-jdbc.md
Outdated
    Below are couple of restrictions while using this option.<br>
    <ol>
    <li> It is not allowed to specify `dbtable` and `query` options at the same time. </li>
    <li> It is not allowed to spcify `query` and `partitionColumn` options at the same time. When specifying
spcify -> specify
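For context, the `query` option under discussion is used like this (a minimal sketch; the URL and query are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `query` and `dbtable` are mutually exclusive -- specify exactly one.
df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host:5432/db")  # placeholder URL
      .option("query", "SELECT id, name FROM people WHERE id > 10")
      .load())
```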
    <td><code>"ignore"</code></td>
    <td>
      Ignore mode means that when saving a DataFrame to a data source, if data already exists,
      the save operation is expected to not save the contents of the DataFrame and to not
nit: expected to not ... to not ... -> expected not to ... not to ...?
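For reference, the save mode described in this hunk looks like this in practice (a minimal sketch; the output path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# SaveMode "ignore": if data already exists at the target, the write is a
# silent no-op -- the DataFrame contents are not saved and nothing changes.
df.write.mode("ignore").parquet("data/out")
```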
docs/sql-data-sources-parquet.md
Outdated
    ### Schema Merging

    Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
ProtocolBuffer -> Protocol Buffers
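A minimal sketch of the schema merging being described (paths and column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two writes with evolving schemas under the same base path.
spark.range(5).selectExpr("id", "id * 2 AS twice").write.parquet("data/test_table/key=1")
spark.range(5).selectExpr("id", "id * 3 AS thrice").write.parquet("data/test_table/key=2")

# Schema merging is off by default; enable it per read.
merged = spark.read.option("mergeSchema", "true").parquet("data/test_table")
merged.printSchema()  # id, twice, thrice, plus the partition column `key`
```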
docs/sql-distributed-sql-engine.md
Outdated
    Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` and `hdfs-site.xml` files in `conf/`.
    You may run `./bin/spark-sql --help` for a complete list of all available
    options.
super nit: this line can be concatenated with the previous line.
docs/sql-migration-guide-upgrade.md
Outdated
    ## Upgrading From Spark SQL 2.4 to 3.0

    - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.
the builder come -> the builder comes?
cc @ueshin
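A hedged sketch of the workaround the note calls out, i.e. configuring things before any session exists (the config key is just an example):

```python
from pyspark import SparkContext
from pyspark.sql import SparkSession

# The builder no longer updates the SparkConf of an existing SparkContext,
# so stop any pre-existing context before configuring the session.
sc = SparkContext.getOrCreate()
sc.stop()

spark = (SparkSession.builder
         .config("spark.executor.memory", "2g")  # example context-level setting
         .getOrCreate())
```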
docs/sql-migration-guide-upgrade.md
Outdated
    - Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
    - In PySpark, when Arrow optimization is enabled, previously `toPandas` just failed when Arrow optimization is unable to be used whereas `createDataFrame` from Pandas DataFrame allowed the fallback to non-optimization. Now, both `toPandas` and `createDataFrame` from Pandas DataFrame allow the fallback by default, which can be switched off by `spark.sql.execution.arrow.fallback.enabled`.
    - Since Spark 2.4, writing an empty dataframe to a directory launches at least one write task, even if physically the dataframe has no partition. This introduces a small behavior change that for self-describing file formats like Parquet and Orc, Spark creates a metadata-only file in the target directory when writing a 0-partition dataframe, so that schema inference can still work if users read that directory later. The new behavior is more reasonable and more consistent regarding writing empty dataframe.
    - Since Spark 2.4, expression IDs in UDF arguments do not appear in column names. For example, an column name in Spark 2.4 is not `UDF:f(col0 AS colA#28)` but ``UDF:f(col0 AS `colA`)``.
an column -> a column
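The settings mentioned in the first two bullets can be inspected or toggled at runtime (an illustrative sketch; setting the 2.4 defaults explicitly is normally unnecessary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 2.4 defaults for the vectorized ORC reader, shown explicitly.
spark.conf.set("spark.sql.orc.impl", "native")
spark.conf.set("spark.sql.orc.filterPushdown", "true")

# The Arrow fallback behavior can be switched off if desired.
spark.conf.set("spark.sql.execution.arrow.fallback.enabled", "false")
```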
docs/sql-migration-guide-upgrade.md
Outdated
    - In PySpark, `df.replace` does not allow to omit `value` when `to_replace` is not a dictionary. Previously, `value` could be omitted in the other cases and had `None` by default, which is counterintuitive and error-prone.
    - Un-aliased subquery's semantic has not been well defined with confusing behaviors. Since Spark 2.3, we invalidate such confusing cases, for example: `SELECT v.i from (SELECT i FROM v)`, Spark will throw an analysis exception in this case because users should not be able to use the qualifier inside a subquery. See [SPARK-20690](https://issues.apache.org/jira/browse/SPARK-20690) and [SPARK-21335](https://issues.apache.org/jira/browse/SPARK-21335) for more details.

    - When creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 2.3, the builder come to not update the configurations. If you want to update them, you need to update them prior to creating a `SparkSession`.
the builder come -> the builder comes?
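To make the first bullet in the hunk concrete (a minimal PySpark sketch with made-up data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alice", 10), ("Bob", 20)], ["name", "age"])

# Dictionary form: `value` may be omitted, replacements come from the dict.
df.replace({"Alice": "Carol"}).show()

# Non-dictionary form requires an explicit `value`:
df.replace("Alice", "Carol").show()
# df.replace("Alice")  # now raises an error instead of defaulting to None
```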
docs/sql-performance-turing.md
Outdated
    <td>4194304 (4 MB)</td>
    <td>
      The estimated cost to open a file, measured by the number of bytes could be scanned in the same
      time. This is used when putting multiple files into a partition. It is better to over estimated,
nit: It is better to over estimated -> It is better to over-estimate?
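For reference, the setting documented in this hunk can be adjusted at runtime (an illustrative value, not a recommendation):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Raise the estimated per-file open cost so that small files are packed
# into fewer, larger partitions (value is in bytes).
spark.conf.set("spark.sql.files.openCostInBytes", str(8 * 1024 * 1024))
```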
docs/sql-migration-guide-upgrade.md
Outdated
    <b>AnalysisException is thrown since integer type can not be promoted to string type in a loss-less manner.</b>
    </th>
    <th>
    <b>Users can use explict cast</b>
explict -> explicit
Test build #97535 has finished for PR 22746 at commit
LGTM
We might have missed something in the code review. Let's play with the new doc and see whether anything slipped through.
…to multiple separate pages

1. Split the main page of sql-programming-guide into 7 parts:
   - Getting Started
   - Data Sources
   - Performance Turing
   - Distributed SQL Engine
   - PySpark Usage Guide for Pandas with Apache Arrow
   - Migration Guide
   - Reference
2. Add left menu for sql-programming-guide, keep first level index for each part in the menu.

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 987f386)
Signed-off-by: gatorsmile <[email protected]>
Thanks! Merged to master/2.4. For the 2.4 branch, I manually removed the migration guide from 2.4 to 3.0.
Thanks, all reviewers! Sorry for still having some mistakes in the new doc; I'll keep checking on this.
…to multiple separate pages

## What changes were proposed in this pull request?

1. Split the main page of sql-programming-guide into 7 parts:
   - Getting Started
   - Data Sources
   - Performance Turing
   - Distributed SQL Engine
   - PySpark Usage Guide for Pandas with Apache Arrow
   - Migration Guide
   - Reference
2. Add left menu for sql-programming-guide, keep first level index for each part in the menu.

## How was this patch tested?

Local test with jekyll build/serve.

Closes apache#22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: gatorsmile <[email protected]>