Skip to content

Conversation

@tejasapatil
Copy link
Contributor

What changes were proposed in this pull request?

Semantics:

  • If the Hive table is bucketed, then INSERT node expect the child distribution to be based on the hash of the bucket columns. Else it would be empty. (Just to compare with Spark native bucketing : the required distribution is not enforced even if the table is bucketed or not... this saves the shuffle in comparison with hive).
  • Sort ordering for INSERT node over Hive bucketed table is determined as follows:
Table type Normal table Bucketed table
non-partitioned insert Nil sort columns
static partition Nil sort columns
dynamic partitions partition columns (partition columns + bucketId + sort columns)

Just to compare how sort ordering is expressed for Spark native bucketing:

Table type Normal table Bucketed table
sort ordering partition columns (partition columns + bucketId + sort columns)

Why is there a difference ? With hive, since there bucketed insertions would need a shuffle, sort ordering can be relaxed for both non-partitioned and static partition cases. Every RDD partition would get rows corresponding to a single bucket so those can be written to corresponding output file after sort. In case of dynamic partitions, the rows need to be routed to appropriate partition which makes it similar to Spark's constraints.

  • Only Overwrite mode is allowed for hive bucketed tables as any other mode will break the bucketing guarantees of the table. This is a difference wrt how Spark bucketing works.
  • With the PR, if there are no files created for empty buckets, the query will fail. Will support creation of empty files in coming iteration. This is a difference wrt how Spark bucketing works as it does NOT need files for empty buckets.

Summary of changes done:

  • ClusteredDistribution and HashPartitioning are modified to store the hashing function used.
  • RunnableCommand's' can now express the required distribution and ordering. This is used by ExecutedCommandExec which run these commands
    • The good thing about this is that I could remove the logic for enforcing sort ordering inside FileFormatWriter which felt out of place. Ideally, this kinda adding of physical nodes should be done within the planner which is what happens with this PR.
  • InsertIntoHiveTable enforces both distribution and sort ordering
  • InsertIntoHadoopFsRelationCommand enforces sort ordering ONLY (and not the distribution)
  • Fixed a bug due to which any alter commands to bucketed table (eg. updating stats) would wipe out the bucketing spec from metastore. This made insertions to bucketed table non-idempotent operation.

How was this patch tested?

  • Added new unit tests

@tejasapatil
Copy link
Contributor Author

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80711 has finished for PR 18954 at commit 4b009a9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Copy link
Contributor Author

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Aug 16, 2017

Test build #80733 has finished for PR 18954 at commit 4b009a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Copy link
Contributor Author

@tejasapatil tejasapatil changed the title [SPARK-17654] [SQL] Enable creating hive bucketed tables [SPARK-17654] [SQL] Enable populating hive bucketed tables Aug 16, 2017
@SparkQA
Copy link

SparkQA commented Aug 18, 2017

Test build #80809 has finished for PR 18954 at commit 4b2f1eb.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Copy link
Contributor Author

Jenkins retest this please

@SparkQA
Copy link

SparkQA commented Aug 18, 2017

Test build #80814 has finished for PR 18954 at commit 4b2f1eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@viirya viirya Aug 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to create a partitioning that satisfies that distribution. According to modified HashPartitioning, if numPartitions isn't equal to numClusters, satisfies returns false. It seems a conflict if we ask to create a partitioning of numPartitions with a ClusteredDistribution of numClusters if they are not equal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I gave this more thought and have made changes

@tejasapatil
Copy link
Contributor Author

I have a new PR (#19001) which supersedes this one. It has everything this PR does (ie. writer side changes) plus reader side changes.

@SparkQA
Copy link

SparkQA commented Aug 20, 2017

Test build #80878 has finished for PR 18954 at commit 9b8f084.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • sealed trait Distribution

@SparkQA
Copy link

SparkQA commented Aug 22, 2017

Test build #81006 has finished for PR 18954 at commit d5cf3c9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Jul 23, 2018

@tejasapatil Can you close this for now because it's not active for a long time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants