[SPARK-17654] [SQL] Enable populating hive bucketed tables #18954
Jenkins test this please
Test build #80711 has finished for PR 18954 at commit
Jenkins retest this please
Test build #80733 has finished for PR 18954 at commit
4b009a9 to 4b2f1eb
Test build #80809 has finished for PR 18954 at commit
Jenkins retest this please
Test build #80814 has finished for PR 18954 at commit
This is going to create a partitioning that satisfies that distribution. According to the modified HashPartitioning, satisfies returns false if numPartitions isn't equal to numClusters. Isn't it a conflict to ask for a partitioning of numPartitions for a ClusteredDistribution of numClusters when the two are not equal?
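The concern can be illustrated with a toy model. This is only a sketch under stated assumptions: the classes below are simplified Python stand-ins, not Spark's Catalyst classes, and `create_partitioning` is a hypothetical helper that mimics "build a partitioning for this distribution with a caller-chosen partition count".

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClusteredDistribution:
    # Toy stand-in (not Spark's class): the columns rows must be clustered by,
    # plus an optional required number of clusters.
    clustering: Tuple[str, ...]
    num_clusters: Optional[int] = None


@dataclass(frozen=True)
class HashPartitioning:
    # Toy stand-in (not Spark's class).
    expressions: Tuple[str, ...]
    num_partitions: int

    def satisfies(self, required: ClusteredDistribution) -> bool:
        # Every partitioning column must be one of the clustering columns.
        columns_ok = set(self.expressions) <= set(required.clustering)
        # If the distribution pins a cluster count, the partition count must match it.
        count_ok = (required.num_clusters is None
                    or required.num_clusters == self.num_partitions)
        return columns_ok and count_ok


def create_partitioning(required: ClusteredDistribution,
                        num_partitions: int) -> HashPartitioning:
    # Hypothetical helper: uses the caller's partition count and ignores
    # required.num_clusters, which is where the conflict can arise.
    return HashPartitioning(required.clustering, num_partitions)


dist = ClusteredDistribution(("user_id",), num_clusters=8)
mismatched = create_partitioning(dist, num_partitions=200)
matched = create_partitioning(dist, num_partitions=8)

print(mismatched.satisfies(dist))  # the partitioning just created does not satisfy dist
print(matched.satisfies(dist))
```

In this toy model, a partitioning created for the distribution fails its own `satisfies` check whenever the two counts differ.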
Good point. I gave this more thought and have made changes.
4b2f1eb to 9b8f084
I have a new PR (#19001) which supersedes this one. It has everything this PR does (i.e. writer-side changes) plus reader-side changes.
Test build #80878 has finished for PR 18954 at commit
9b8f084 to d5cf3c9
Test build #81006 has finished for PR 18954 at commit
@tejasapatil Can you close this for now, since it has not been active for a long time?
What changes were proposed in this pull request?
Semantics:
Just to compare how sort ordering is expressed for Spark native bucketing:
Why is there a difference? With Hive, since bucketed insertions would need a shuffle, the sort ordering can be relaxed for both the non-partitioned and static-partition cases. Every RDD partition would get rows corresponding to a single bucket, so those can be written to the corresponding output file after sorting. In the case of dynamic partitions, the rows need to be routed to the appropriate partition, which makes it similar to Spark's constraints.
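The non-partitioned case described above can be sketched in plain Python. This is a minimal illustration, not Spark or Hive code: `bucket_id` uses a simple modulo as a placeholder for the real hashing function, and all function names are hypothetical. Dynamic partitions are not modeled.

```python
from collections import defaultdict


def bucket_id(key: int, num_buckets: int) -> int:
    # Placeholder hash (assumption): a plain modulo stands in for the
    # table's actual bucketing hash function.
    return key % num_buckets


def shuffle_by_bucket(rows, num_buckets, bucket_col):
    # Simulated shuffle: each "RDD partition" receives the rows of exactly one bucket.
    partitions = defaultdict(list)
    for row in rows:
        partitions[bucket_id(row[bucket_col], num_buckets)].append(row)
    return partitions


def write_buckets(rows, num_buckets, bucket_col, sort_col):
    # After the shuffle, a local sort within each partition is enough;
    # each sorted partition maps to one output "file" (here, a list).
    partitions = shuffle_by_bucket(rows, num_buckets, bucket_col)
    return {b: sorted(part, key=lambda r: r[sort_col])
            for b, part in partitions.items()}


rows = [{"id": 5}, {"id": 2}, {"id": 4}, {"id": 1}, {"id": 3}]
files = write_buckets(rows, num_buckets=2, bucket_col="id", sort_col="id")

for bucket, contents in sorted(files.items()):
    print(bucket, [r["id"] for r in contents])
```

Because the shuffle already isolates one bucket per partition, the sort only has to order rows by the sort column, with no need to also order by bucket id.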
Only Overwrite mode is allowed for Hive bucketed tables, as any other mode will break the bucketing guarantees of the table. This is a difference from how Spark bucketing works.
Summary of changes done:
- ClusteredDistribution and HashPartitioning are modified to store the hashing function used.
- RunnableCommands can now express the required distribution and ordering. This is used by ExecutedCommandExec, which runs these commands.
  - … FileFormatWriter, which felt out of place. Ideally, this kind of adding of physical nodes should be done within the planner, which is what happens with this PR.
- InsertIntoHiveTable enforces both distribution and sort ordering.
- InsertIntoHadoopFsRelationCommand enforces sort ordering ONLY (and not the distribution).
How was this patch tested?