[SPARK-17654] [SQL] Enable populating hive bucketed tables #18954

tejasapatil · 2017-08-16T04:43:06Z

What changes were proposed in this pull request?

Semantics:

If the Hive table is bucketed, the INSERT node expects the child distribution to be based on the hash of the bucket columns; otherwise the required distribution is empty. (For comparison, Spark native bucketing does not enforce a required distribution whether or not the table is bucketed, which saves a shuffle relative to Hive.)
The sort ordering required by an INSERT node over a Hive table is determined as follows:

Insert type	Normal table	Bucketed table
non-partitioned insert	Nil	sort columns
static partition	Nil	sort columns
dynamic partitions	partition columns	(partition columns + bucketId + sort columns)
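
The distribution and ordering rules above can be written down as a small lookup. This is only an illustrative Python model of the table, not code from the PR: `BucketSpec`, `hive_required_distribution`, `hive_required_ordering`, and the `"bucketId"` placeholder are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class BucketSpec:
    """Hypothetical stand-in for a table's bucketing metadata."""
    num_buckets: int
    bucket_columns: Tuple[str, ...]
    sort_columns: Tuple[str, ...] = ()


def hive_required_distribution(bucket_spec: Optional[BucketSpec]):
    """Bucketed Hive table: hash on the bucket columns. Otherwise: no requirement."""
    if bucket_spec is None:
        return None
    return ("hash", bucket_spec.bucket_columns, bucket_spec.num_buckets)


def hive_required_ordering(insert_kind: str,
                           partition_columns: Sequence[str],
                           bucket_spec: Optional[BucketSpec] = None) -> List[str]:
    """Return the sort ordering from the table for one kind of insert."""
    if insert_kind not in ("non-partitioned", "static", "dynamic"):
        raise ValueError(f"unknown insert kind: {insert_kind}")
    if insert_kind == "dynamic":
        if bucket_spec is None:
            return list(partition_columns)
        return list(partition_columns) + ["bucketId"] + list(bucket_spec.sort_columns)
    # Non-partitioned and static-partition inserts behave the same way.
    if bucket_spec is None:
        return []
    return list(bucket_spec.sort_columns)
```

For example, with a table bucketed on `user_id` and sorted on `ts`, a static-partition insert only needs `["ts"]`, while a dynamic-partition insert on `ds` needs `["ds", "bucketId", "ts"]`.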

For comparison, this is how sort ordering is expressed for Spark native bucketing:

Table type	Normal table	Bucketed table
sort ordering	partition columns	(partition columns + bucketId + sort columns)

Why is there a difference? With Hive, since bucketed insertions need a shuffle anyway, the sort ordering can be relaxed for both the non-partitioned and static partition cases: every RDD partition gets rows corresponding to a single bucket, so those rows can be written to the corresponding output file after the sort. With dynamic partitions, the rows need to be routed to the appropriate partition, which makes the constraints similar to Spark's.
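
To make the argument concrete, here is a minimal Python sketch of the non-partitioned case: rows are first shuffled by bucket id, so each simulated partition holds exactly one bucket and only needs a sort on the sort column before being written to a single file. The `bucket_id` function (a non-negative modulo of an integer key), the file naming, and all function names are assumptions made for the illustration, not taken from the PR.

```python
from collections import defaultdict
from typing import Dict, List


def bucket_id(key: int, num_buckets: int) -> int:
    # Assumption: a simple non-negative modulo stands in for the table's hash function.
    return (key & 0x7FFFFFFF) % num_buckets


def shuffle_by_bucket(rows: List[dict], bucket_col: str, num_buckets: int) -> Dict[int, List[dict]]:
    """Simulate the shuffle: every output partition holds rows of one bucket only."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[bucket_id(row[bucket_col], num_buckets)].append(row)
    return dict(partitions)


def write_buckets(rows: List[dict], bucket_col: str, sort_col: str, num_buckets: int) -> Dict[str, List[dict]]:
    """Sort each single-bucket partition and 'write' it to one file per bucket."""
    files = {}
    for bucket, part in shuffle_by_bucket(rows, bucket_col, num_buckets).items():
        # Hypothetical file name; a real writer would pick its own naming scheme.
        files[f"bucket_{bucket:05d}"] = sorted(part, key=lambda r: r[sort_col])
    return files
```

Because no partition ever mixes buckets, sorting on the sort columns alone is enough here; the `bucketId` prefix in the ordering only becomes necessary once a task may see rows for several output files.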

Only Overwrite mode is allowed for Hive bucketed tables, as any other mode would break the bucketing guarantees of the table. This differs from how Spark bucketing works.
With this PR, if no files are created for empty buckets, the query will fail. Creation of empty files will be supported in a coming iteration. This also differs from Spark bucketing, which does NOT need files for empty buckets.
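
The two restrictions above amount to simple checks. The following Python sketch only models them; the function names and error messages are hypothetical and do not reflect the actual checks in the PR.

```python
from typing import Iterable, List


def check_insert_mode(is_hive_bucketed: bool, mode: str) -> None:
    """Reject any save mode other than overwrite for a Hive bucketed table."""
    if is_hive_bucketed and mode.lower() != "overwrite":
        raise ValueError(
            f"mode '{mode}' would break bucketing guarantees; only overwrite is allowed"
        )


def missing_buckets(written_bucket_ids: Iterable[int], num_buckets: int) -> List[int]:
    """Return the bucket ids for which no output file was produced."""
    return sorted(set(range(num_buckets)) - set(written_bucket_ids))
```

With 4 buckets and files written only for buckets 0 and 2, `missing_buckets([0, 2], 4)` reports buckets 1 and 3, which is the situation in which the query would fail until empty files are created.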

Summary of changes done:

ClusteredDistribution and HashPartitioning are modified to store the hashing function used.
RunnableCommands can now express their required distribution and ordering. This is used by ExecutedCommandExec, which runs these commands.
- The good thing about this is that I could remove the logic for enforcing sort ordering inside FileFormatWriter, which felt out of place. Ideally, this kind of addition of physical nodes should be done within the planner, which is what happens with this PR.
InsertIntoHiveTable enforces both distribution and sort ordering
InsertIntoHadoopFsRelationCommand enforces sort ordering ONLY (and not the distribution)
Fixed a bug due to which any ALTER command on a bucketed table (e.g. updating stats) would wipe out the bucketing spec from the metastore. This made insertion into a bucketed table a non-idempotent operation.
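
As a rough illustration of the planner-side idea in the list above, the sketch below models a command that declares its requirements and a planning step that adds a shuffle and a sort only when the child does not already meet them. It is a simplified Python model under assumed names (`Command`, `plan_child`, and the step strings are hypothetical); the real logic lives in Spark's Scala planner and is more involved.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Command:
    """Hypothetical model of a command that declares what it needs from its child."""
    name: str
    required_distribution: Optional[Tuple[str, ...]] = None  # columns to hash on
    required_ordering: Tuple[str, ...] = ()


def plan_child(command: Command,
               child_distribution: Optional[Tuple[str, ...]],
               child_ordering: Tuple[str, ...]) -> List[str]:
    """Return the extra physical steps needed below the command."""
    steps = []
    ordering = tuple(child_ordering)
    required_dist = command.required_distribution
    if required_dist is not None and child_distribution != required_dist:
        steps.append(f"Exchange({', '.join(required_dist)})")
        ordering = ()  # a shuffle does not preserve the child's ordering
    required = command.required_ordering
    # The child's ordering is good enough if it starts with the required columns.
    if ordering[:len(required)] != required:
        steps.append(f"Sort({', '.join(required)})")
    return steps
```

In this model, a command that requires both a distribution on `id` and an ordering on `ts` gets `["Exchange(id)", "Sort(ts)"]` over an unorganized child, while a command that only requires an ordering gets no extra steps when the child is already sorted that way.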

How was this patch tested?

Added new unit tests

tejasapatil · 2017-08-16T04:43:20Z

Jenkins test this please

SparkQA · 2017-08-16T07:04:49Z

Test build #80711 has finished for PR 18954 at commit 4b009a9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-08-16T10:54:44Z

Jenkins retest this please

SparkQA · 2017-08-16T13:38:43Z

Test build #80733 has finished for PR 18954 at commit 4b009a9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-08-16T14:11:28Z

cc @cloud-fan @gatorsmile @sameeragarwal @rxin

SparkQA · 2017-08-18T00:03:59Z

Test build #80809 has finished for PR 18954 at commit 4b2f1eb.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

tejasapatil · 2017-08-18T00:12:29Z

Jenkins retest this please

SparkQA · 2017-08-18T03:02:39Z

Test build #80814 has finished for PR 18954 at commit 4b2f1eb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-08-18T03:36:23Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

This is going to create a partitioning that satisfies that distribution. According to the modified HashPartitioning, satisfies returns false if numPartitions isn't equal to numClusters. Isn't it a conflict if we ask to create a partitioning of numPartitions for a ClusteredDistribution of numClusters when the two are not equal?

tejasapatil · 2017-08-20

Good point. I gave this more thought and have made changes.
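
The concern raised in this exchange can be reproduced with a small model. The Python classes below mirror the names used in the review comment but are only a sketch: the `num_clusters` and `hashing_function` fields and the exact `satisfies` rule are assumptions based on the description in this thread, not the PR's Scala code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClusteredDistribution:
    clustering: Tuple[str, ...]
    num_clusters: Optional[int] = None       # assumed optional requirement
    hashing_function: Optional[str] = None   # assumed optional requirement


@dataclass(frozen=True)
class HashPartitioning:
    expressions: Tuple[str, ...]
    num_partitions: int
    hashing_function: str = "murmur3"

    def satisfies(self, required: ClusteredDistribution) -> bool:
        # Every partitioning expression must be one of the clustering columns.
        if not set(self.expressions) <= set(required.clustering):
            return False
        # Assumed rule from the review comment: partition count must match the cluster count.
        if required.num_clusters is not None and required.num_clusters != self.num_partitions:
            return False
        # Assumed rule: the hashing function must match when one is required.
        if required.hashing_function is not None and required.hashing_function != self.hashing_function:
            return False
        return True
```

If the planner builds `HashPartitioning(("id",), 200, "hive")` to meet `ClusteredDistribution(("id",), 8, "hive")`, `satisfies` returns `False` in this model, so the partitioning it just created would not meet the requirement it was created for. Building the partitioning with `num_partitions` equal to `num_clusters` avoids the mismatch.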

tejasapatil · 2017-08-20T00:15:23Z

I have a new PR (#19001) which supersedes this one. It has everything this PR does (ie. writer side changes) plus reader side changes.

SparkQA · 2017-08-20T02:37:44Z

Test build #80878 has finished for PR 18954 at commit 9b8f084.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
sealed trait Distribution


SparkQA · 2017-08-22T22:58:40Z

Test build #81006 has finished for PR 18954 at commit d5cf3c9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-07-23T15:14:21Z

@tejasapatil Can you close this for now, since it has not been active for a long time?

tejasapatil changed the title ~~[SPARK-17654] [SQL] Enable creating hive bucketed tables~~ [SPARK-17654] [SQL] Enable populating hive bucketed tables Aug 16, 2017

tejasapatil force-pushed the bucket_write branch from 4b009a9 to 4b2f1eb Compare August 17, 2017 21:32

viirya reviewed Aug 18, 2017


tejasapatil force-pushed the bucket_write branch from 4b2f1eb to 9b8f084 Compare August 19, 2017 23:51

tejasapatil mentioned this pull request Aug 20, 2017

[SPARK-19256][SQL] Hive bucketing support #19001

Closed

tejasapatil added 4 commits August 22, 2017 13:01

bucketed writer implementation

bf01497

Move requiredOrdering into RunnableCommand instead of `FileFormatWr…

d10dfa4

…iter`

print only the files names in error message instead of entire FileSta…

80442fa

…tus object

change to avoid NPE

d5cf3c9

tejasapatil force-pushed the bucket_write branch from 9b8f084 to d5cf3c9 Compare August 22, 2017 20:08

tejasapatil closed this Jul 27, 2018

4 participants

viirya Aug 18, 2017 •

edited

Loading