[SPARK-20694][DOCS][SQL] Document DataFrameWriter partitionBy, bucketBy and sortBy in SQL guide #17938
Closed
20 commits
4a74328 Add Scala examples (zero323)
573b0b9 Add Python examples (zero323)
90ad3f3 Add Java examples (zero323)
563a7e8 Add examples to sql guide (zero323)
f9621d9 Remove duplicated and (zero323)
01cbfad Add Python example for artitionBy + bucketBy (zero323)
72806f1 Add Java example for artitionBy + bucketBy (zero323)
0294e47 Add Scala example for artitionBy + bucketBy (zero323)
f76b113 Add partitionBy + bucketBy to SQL Guide (zero323)
7bf4bbc Add cardinality note (zero323)
cc1bfcf Fix scala style (zero323)
606f1e3 Missing drop (zero323)
a7aff81 Python style (zero323)
c4d7856 Add SQL partitionBy example (zero323)
b5babf6 Add SQL examples for CLUSTERED BY (zero323)
92fb3b3 Update PARTITION BY example to Spark syntax (zero323)
65ac310 Include changes requested by gatorsmile (zero323)
f7b6f43 Add SORTED BY (zero323)
3a8b6e9 Constitent case for favorite_numbers (zero323)
bea0676 Missing article (zero323)
I feel that the examples are missing a write to a partitioned + bucketed table.
There could be multiple possible orderings of `partitionBy`, `bucketBy` and `sortBy` calls. Not all of them are supported, and not all of them would produce correct output. I have not done an exhaustive study of this, but I think it should be documented to guide people using these APIs.
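For reference, here is a minimal PySpark sketch of the kind of partitioned + bucketed write this comment asks for. The column names `favorite_color` and `name`, the bucket count of 42, the table name, and the helper names `write_partitioned_bucketed` and `demo` are assumptions made for illustration; they are not taken from the thread.

```python
def write_partitioned_bucketed(df, table_name):
    """Save `df` as a table partitioned by one column and bucketed by another.

    The column names and bucket count are illustrative assumptions.
    """
    (
        df.write
        .partitionBy("favorite_color")  # one directory per distinct value
        .bucketBy(42, "name")           # hash rows on `name` into 42 buckets
        .saveAsTable(table_name)        # persist as a table in the catalog
    )


def demo():
    """Run the sketch against a tiny DataFrame; requires PySpark to be installed."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-bucket-sketch").getOrCreate()
    people = spark.createDataFrame(
        [("Alyssa", "red"), ("Ben", "blue")],
        ["name", "favorite_color"],
    )
    write_partitioned_bucketed(people, "people_partitioned_bucketed")
    spark.stop()
```

Calling `demo()` on a machine with PySpark available should create the table; the helper itself only chains `partitionBy`, `bucketBy` and `saveAsTable` on the DataFrame's writer.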
Shall we emphasize partitioning? I think it's more widely used than bucketing.
@tejasapatil
Shouldn't the output be the same no matter the order? `sortBy` is not applicable for `partitionBy` and takes precedence over `bucketBy`, if both are present. This is Hive's behaviour if I am not mistaken, and at first glance Spark is doing the same thing. Is there any gotcha here?
@cloud-fan I think we can redirect to partition discovery here. But explaining the difference and possible applications (low vs. high cardinality) could be a good idea.
Theoretically yes. Practically I don't know what happens. Since you are documenting, it will be worthwhile to check that and record if it works as expected (or if there is any weirdness).
Oh, I thought you were implying there are some known issues. This actually behaves sensibly: all supported options seem to work independent of the order, and unsupported ones (`partitionBy` + `sortBy` without `bucketBy`, or overlapping `bucketBy` and `partitionBy` columns) give enough feedback to diagnose the issue.
I haven't tested this with large datasets though, so there can be hidden issues.
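The two unsupported combinations reported above can be summarized in a small, plain-Python sketch. This is a hypothetical model of the findings in this thread, not Spark's own validation code; the function name `check_writer_spec`, its parameters, and the message strings are all assumptions. It also flags any `sortBy` without `bucketBy`, which is slightly broader than the `partitionBy` + `sortBy` case named in the comment.

```python
def check_writer_spec(partition_cols=(), bucket_cols=(), sort_cols=()):
    """Return a list of problems with a hypothetical writer configuration.

    An empty list means the combination is one the thread reports as working.
    Arguments are passed by keyword, so the order in which the three options
    are supplied cannot affect the result.
    """
    problems = []

    # Reported as unsupported: sortBy with no bucketBy to attach to.
    if sort_cols and not bucket_cols:
        problems.append("sortBy used without bucketBy")

    # Reported as unsupported: a column used for both partitioning and bucketing.
    overlap = sorted(set(partition_cols) & set(bucket_cols))
    if overlap:
        problems.append(
            "bucketBy columns overlap partitionBy columns: " + ", ".join(overlap)
        )

    return problems
```

For example, `check_writer_spec(partition_cols=["favorite_color"], bucket_cols=["name"], sort_cols=["name"])` returns an empty list, while dropping `bucket_cols` from that call yields the "sortBy used without bucketBy" message.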