[SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide #17938
Changes from 16 commits
@@ -581,6 +581,113 @@ Starting from Spark 2.1, persistent datasource tables have per-partition metadata

Note that partition information is not gathered by default when creating external datasource tables (those with a `path` option). To sync the partition information in the metastore, you can invoke `MSCK REPAIR TABLE`.
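As a minimal Scala sketch of the repair call, assuming an active `SparkSession` named `spark` (the table name is hypothetical):

```scala
// Sync the metastore's partition information for an external datasource
// table after partition directories were added outside of Spark.
spark.sql("MSCK REPAIR TABLE my_external_table")
```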
### Bucketing, Sorting and Partitioning

For file-based data sources it is also possible to bucket and sort or partition the output.
Bucketing and sorting are applicable only to persistent tables:
<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example write_sorting_and_bucketing scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example write_sorting_and_bucketing java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example write_sorting_and_bucketing python/sql/datasource.py %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

CREATE TABLE users_bucketed_by_name(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
CLUSTERED BY(name) INTO 42 BUCKETS;
Member
To be consistent with the example in the other APIs, it is missing the
Member
Could you please use the same table names
Member
@zero323 Could you also resolve this? Thanks!
{% endhighlight %}

</div>

</div>
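The `write_sorting_and_bucketing` example is included by reference above rather than inlined. A minimal Scala sketch of the same idea, assuming the `users.parquet` sample data that ships with Spark (names and paths are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BucketingSketch").getOrCreate()

// The default data source is parquet, so load() needs no explicit format.
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")

// Bucketing and sorting require a persistent table: the bucketing
// metadata lives in the catalog, so saveAsTable is used, not save.
usersDF.write
  .bucketBy(42, "name")
  .sortBy("name")
  .saveAsTable("users_bucketed_by_name")
```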
while partitioning can be used with both `save` and `saveAsTable`:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example write_partitioning scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example write_partitioning java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example write_partitioning python/sql/datasource.py %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

CREATE TABLE users_by_favorite_color(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING csv PARTITIONED BY(favorite_color);

{% endhighlight %}

</div>

</div>
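Again as a hedged sketch (reusing `usersDF` from the previous snippet; the output path is illustrative), partitioning works with a plain `save`:

```scala
// partitionBy works with save(); no catalog table is needed.
usersDF.write
  .partitionBy("favorite_color")
  .format("parquet")
  .save("users_by_favorite_color.parquet")

// The output is laid out as one directory per partition value, e.g.:
//   users_by_favorite_color.parquet/favorite_color=red/part-...
//   users_by_favorite_color.parquet/favorite_color=green/part-...
```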
It is possible to use both partitioning and bucketing for a single table:

<div class="codetabs">

<div data-lang="scala" markdown="1">
{% include_example write_partition_and_bucket scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
</div>

<div data-lang="java" markdown="1">
{% include_example write_partition_and_bucket java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
</div>

<div data-lang="python" markdown="1">
{% include_example write_partition_and_bucket python/sql/datasource.py %}
</div>

<div data-lang="sql" markdown="1">

{% highlight sql %}

CREATE TABLE users_bucketed_and_partitioned(
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
PARTITIONED BY (favorite_color)
CLUSTERED BY(name) INTO 42 BUCKETS;

{% endhighlight %}

</div>

</div>
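A sketch of the combined form, under the same assumptions as the snippets above: partition by the low-cardinality column and bucket by the high-cardinality one.

```scala
// Combining the two. Bucketing still forces saveAsTable, since the
// bucket spec must be recorded in the catalog.
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_bucketed_and_partitioned")
```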
`partitionBy` creates a directory structure as described in the [Partition Discovery](#partition-discovery) section.
Because of that it has limited applicability to columns with high cardinality. In contrast, `bucketBy` distributes
data across a fixed number of buckets and can be used when the number of unique values is unbounded.
## Parquet Files

[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
I feel that the examples are missing writing to a partitioned + bucketed table. E.g. there could be multiple possible orderings of `partitionBy`, `bucketBy` and `sortBy` calls. Not all of them are supported, and not all of them would produce correct outputs. I have not done any exhaustive study of the same, but I think this should be documented to guide people while using these APIs.
Shall we emphasize partitioning? I think it's more widely used than bucketing.
@tejasapatil
Shouldn't the output be the same no matter the order? `sortBy` is not applicable for `partitionBy` and takes precedence over `bucketBy`, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. Is there any gotcha here?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.
Theoretically yes. Practically I don't know what happens. Since you are documenting, it will be worthwhile to check that and record if it works as expected (or if there is any weirdness).
Oh, I thought you were implying there are some known issues. This actually behaves sensibly - all supported options seem to work independent of the order, and unsupported ones (`partitionBy` + `sortBy` without `bucketBy`, or overlapping `bucketBy` and `partitionBy` columns) give enough feedback to diagnose the issue. I haven't tested this with large datasets though, so there can be hidden issues.
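To make that concrete, a quick Scala sketch of the two failing combinations (reusing `usersDF` from above; the error messages are paraphrased and may vary by Spark version):

```scala
import scala.util.Try

// sortBy without bucketBy is rejected, roughly:
// "sortBy must be used together with bucketBy".
Try {
  usersDF.write
    .partitionBy("favorite_color")
    .sortBy("name")
    .saveAsTable("users_sorted_without_buckets")
}.failed.foreach(e => println(e.getMessage))

// Overlapping partitionBy and bucketBy columns are rejected as well:
// a bucketing column may not also be a partition column.
Try {
  usersDF.write
    .partitionBy("name")
    .bucketBy(42, "name")
    .saveAsTable("users_overlapping_columns")
}.failed.foreach(e => println(e.getMessage))
```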