[SPARK-31866][SQL][DOCS] Add COALESCE/REPARTITION/REPARTITION_BY_RANGE Hints to SQL Reference #28672
Conversation
…ion_By_Range Hints to SQL REF
|
Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference per @gatorsmile request. |
|
Test build #123263 has finished for PR 28672 at commit
|
|
This is for 3.0? Btw, could you assign a new jira ID to this PR? |
| /*+ hint [ , ... ] */ | ||
| ``` | ||
|
|
||
| ### Coalesce/Repartition/Repartition_By_Range Hints |
How about simply saying ### Partitioning Hints here?
|
|
||
| ### Examples | ||
| ```sql | ||
| SELECT /*+ COALESCE(3) */ * FROM t; |
How about showing a spark plan via explain?
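For illustration, an explain-based example along the lines suggested here might look like the sketch below. It assumes a table `t` with a column `c` (placeholder names, not taken from this PR). The plan output is deliberately left out because it depends on the Spark version and on how the table is stored.

```sql
-- Hypothetical table `t` with column `c`; the names are for illustration only.
-- EXPLAIN EXTENDED prints the parsed, analyzed, optimized, and physical plans,
-- so the effect of the hint can be checked in the plan output.
EXPLAIN EXTENDED SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
```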
|
|
||
| ### Coalesce/Repartition/Repartition_By_Range Hints | ||
|
|
||
| Coalesce/Repartition/Repartition_By_Range hints have functionalities equivalent to those of the |
Could you follow the same format with the Join hints? e.g., Coalesce -> `COALESCE`
|
Yes, it's for 3.0. I created jira SPARK-31866. @maropu |
|
Test build #123271 has finished for PR 28672 at commit
|
| Location: CatalogFileIndex[file:/spark/spark-warehouse/t], PartitionFilters: [], | ||
| PushedFilters: [], ReadSchema: struct<name:string> | ||
|
|
||
| SELECT /*+ REPARTITION(3) */ * FROM t; |
Do we still need these statements, which have no output, as examples?
One more comment; probably, the join hint section should have the same format for the examples.
I think I will only have one example for explain. Otherwise the example section will be too long.
I will leave the join hint example section as is for now. Don't want this section to be too long.
|
|
||
| ### Join Hints | ||
|
|
||
| Join Hints allow users to suggest the join strategy that Spark should use. Prior to Spark 3.0, only the `BROADCAST` Join Hint was supported. `MERGE`, `SHUFFLE_HASH` and `SHUFFLE_REPLICATE_NL` Join Hints support was added in 3.0. When different join strategy hints are specified on both sides of a join, Spark prioritizes hints in the following order: `BROADCAST` over `MERGE` over `SHUFFLE_HASH` over `SHUFFLE_REPLICATE_NL`. When both sides are specified with the `BROADCAST` hint or the `SHUFFLE_HASH` hint, Spark will pick the build side based on the join type and the sizes of the relations. Since a given strategy may not support all join types, Spark is not guaranteed to use the join strategy suggested by the hint. |
Hints -> hints?
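As a sketch of the syntax the Join Hints paragraph describes, the queries below assume two tables `t1` and `t2` joined on a column `key` (all hypothetical names, not taken from this PR):

```sql
-- Suggest broadcasting t1 to every executor.
SELECT /*+ BROADCAST(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Suggest a shuffle sort merge join.
SELECT /*+ MERGE(t1) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;

-- Different hints on the two sides of the join: per the priority order
-- quoted above, BROADCAST is expected to take precedence over MERGE.
SELECT /*+ BROADCAST(t1), MERGE(t2) */ * FROM t1 INNER JOIN t2 ON t1.key = t2.key;
```

Since a strategy may not support every join type, the hint remains a suggestion; running the query under `EXPLAIN` is one way to see which strategy was actually chosen.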
|
|
||
| ### Partitioning Hints | ||
|
|
||
| `COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities equivalent to those of the |
How about rephrasing it like this?
Partitioning hints allow users to suggest a partitioning strategy that Spark should follow. `COALESCE`, `REPARTITION`, and `REPARTITION_BY_RANGE` hints are supported and are equivalent to the `coalesce`, `repartition`, and `repartitionByRange` Dataset APIs, respectively.
Also, could you add links to the Dataset APIs if we describe them here? https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
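To make the correspondence with the Dataset APIs concrete, here is a minimal sketch of the three hints. It assumes a table `t` with a column `c` (placeholder names), and the partition counts are arbitrary.

```sql
-- COALESCE takes a partition number; corresponds to Dataset.coalesce(3).
SELECT /*+ COALESCE(3) */ * FROM t;

-- REPARTITION takes a partition number, column names, or both;
-- corresponds to Dataset.repartition(...).
SELECT /*+ REPARTITION(3) */ * FROM t;
SELECT /*+ REPARTITION(c) */ * FROM t;
SELECT /*+ REPARTITION(3, c) */ * FROM t;

-- REPARTITION_BY_RANGE takes column names and an optional partition number;
-- corresponds to Dataset.repartitionByRange(...).
SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t;
SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t;
```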
| ### Partitioning Hints | ||
|
|
||
| `COALESCE`/`REPARTITION`/`REPARTITION_BY_RANGE` hints have functionalities equivalent to those of the | ||
| `Dataset` `coalesce`/`repartition`/`repartitionByRange` APIs. The `COALESCE` hint can be used to reduce |
How about moving the explanations for each hint (e.g., The COALESCE hint can be used to reduce...) into a new section like ### Partitioning Hints Types?
|
Test build #123294 has finished for PR 28672 at commit
|
|
Test build #123299 has finished for PR 28672 at commit
|
| and `REPARTITION_BY_RANGE` hints are supported and are equivalent to `coalesce`, `repartition`, and | ||
| `repartitionByRange` [Dataset APIs](api/scala/org/apache/spark/sql/Dataset.html), respectively. These hints give users | ||
| a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are | ||
| specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer. |
Nit: The description of multiple hints is duplicated in https://github.com/apache/spark/pull/28672/files#diff-84ec3ee2cc31db6fd14e15058e35435cR69, maybe we just keep the one with the example.
Thanks for your comment. I will keep the one in the description.
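For reference, the multiple-hints behavior under discussion can be sketched as follows, assuming a table `t` with a column `c` (placeholder names). The plan output is omitted since it varies by environment.

```sql
-- Three partitioning hints in one statement. Per the description in the diff,
-- each hint adds a node to the logical plan, and the optimizer is expected to
-- pick the leftmost one, here REPARTITION(100).
EXPLAIN EXTENDED SELECT /*+ REPARTITION(100), COALESCE(500), REPARTITION_BY_RANGE(3, c) */ * FROM t;
```

Comparing the parsed and optimized plans in the output is one way to check which hint survived.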
| a way to tune performance and control the number of output files in Spark SQL. When multiple partitioning hints are | ||
| specified, multiple nodes are inserted into the logical plan, but the leftmost hint is picked by the optimizer. | ||
|
|
||
| ### Partitioning Hints Types |
#### Partitioning Hints Types?
| The `REPARTITION_BY_RANGE` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters. | ||
|
|
||
|
|
||
| ### Examples |
Ditto, #### Examples
|
|
||
| The `REPARTITION_BY_RANGE` hint can be used to repartition to the specified number of partitions using the specified partitioning expressions. It takes column names and an optional partition number as parameters. | ||
|
|
||
|
|
nit: remove the unnecessary blank line.
|
Test build #123305 has finished for PR 28672 at commit
|
maropu
left a comment
Looks okay
xuanyuanking
left a comment
LGTM
|
cc @srowen This is for 3.0. Thank you! |
…E Hints to SQL Reference
Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference
To make SQL reference complete
[Two screenshots of the changed pages]
Only the above pages are changed. The following two pages are the same as before.
[Two screenshots of the unchanged pages]
Manually build and check
Closes #28672 from huaxingao/coalesce_hint.
Authored-by: Huaxin Gao <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 1b780f3)
Signed-off-by: Sean Owen <[email protected]>
|
Merged to master/3.0. 3.0 had a very minor-looking merge conflict which I resolved directly. |
|
Thanks! @srowen @maropu @xuanyuanking |
What changes were proposed in this pull request?
Add Coalesce/Repartition/Repartition_By_Range Hints to SQL Reference
Why are the changes needed?
To make the SQL reference complete.
Does this PR introduce any user-facing change?
Only the above pages are changed. The following two pages are the same as before.
How was this patch tested?
Manually built and checked.