[SPARK-17495] [SQL] Add Hash capability semantically equivalent to Hive's #15047
Test build #65214 has finished for PR 15047 at commit
@rxin: can you recommend someone to review this PR?
What is this? Can we assume that the sqlType of a UserDefinedType is an Array?
I mimicked exactly what happens in the case of Murmur3Hash. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L388
@cloud-fan I think you wrote the initial version. Could you tell us what is happening here?
The caller of hash guarantees that the value matches the data type. So in this branch, if the value is ArrayData, the data type must be ArrayType or a UDT over ArrayType.
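As a purely illustrative Java sketch (the names and types here are hypothetical, not Spark's actual API), the contract described above — the caller guarantees the runtime value matches the declared type, and a UDT is unwrapped to its underlying sqlType before dispatch — might look like:

```java
public class HashDispatch {
    public interface DataType {}
    public static class IntType implements DataType {}
    public static class ArrayType implements DataType {
        public final DataType element;
        public ArrayType(DataType element) { this.element = element; }
    }
    // A UDT wraps an underlying SQL type; hashing unwraps it before dispatching.
    public static class UserDefinedType implements DataType {
        public final DataType sqlType;
        public UserDefinedType(DataType sqlType) { this.sqlType = sqlType; }
    }

    public static int hash(Object value, DataType dataType) {
        if (dataType instanceof UserDefinedType) {
            // Unwrap the UDT: the value is stored in terms of the underlying sqlType.
            return hash(value, ((UserDefinedType) dataType).sqlType);
        } else if (value instanceof Integer) {
            return (Integer) value;
        } else if (value instanceof Object[]) {
            // By the caller's contract, an array value implies an array data type here
            // (possibly reached after unwrapping a UDT above), so this cast is safe.
            DataType elementType = ((ArrayType) dataType).element;
            int result = 0;
            for (Object e : (Object[]) value) {
                result = 31 * result + hash(e, elementType);
            }
            return result;
        }
        throw new IllegalArgumentException("unsupported value/type combination");
    }
}
```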
@tejasapatil this looks pretty good overall. I left a few comments.
Can we move this into catalyst.expressions in sql/catalyst?
@rxin: I could, but the test case depends on a few Hive classes for validation. I could either (keep the test case in sql/hive and move HiveHash to sql/catalyst) OR (move both to sql/catalyst and hard-code the expected output in the test case so that I need not depend on Hive classes).
Hard-coding the output seems like a good idea. Additionally, if you want to be super safe, you could also create a randomized test in sql/hive.
Force-pushed c898f5a to 8e42799
@hvanhovell Done with all changes. Ready for review.
Force-pushed 8e42799 to 4ae4856
Test build #65614 has finished for PR 15047 at commit
Test build #65615 has finished for PR 15047 at commit
Let's put this in catalyst: org.apache.spark.sql.catalyst.expressions.
XXH64 also lives there.
Done
@tejasapatil could you add this hash to
Is this the same as HashExpression.computeHash?
Yes. @tailrec only works with the private modifier, so I was unable to make the parent class' version accessible to child classes.
I am introducing a wrapper method to avoid code duplication while still keeping tailrec's benefits. This method is used for generating the codegen string, so it would have negligible impact on overall query perf.
If you have a better solution, let me know.
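For illustration only (in Scala, @tailrec requires a method that subclasses cannot override, hence the private modifier mentioned above), the wrapper shape being described — a private recursive core plus a thin non-private wrapper that child classes can reuse — might look roughly like this in Java; the method names and the toy recursion are hypothetical:

```java
public class HashCodegen {
    // Private recursive core: in the Scala original this is where @tailrec applies,
    // since @tailrec needs a method that cannot be overridden.
    private String computeHashRec(String expr, int depth) {
        if (depth == 0) return expr;
        return computeHashRec("hash(" + expr + ")", depth - 1);
    }

    // Thin non-private wrapper: child classes reuse the recursion without
    // duplicating it, and the core keeps its tail-recursion guarantee.
    public final String computeHash(String expr, int depth) {
        return computeHashRec(expr, depth);
    }
}
```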
Seed is never used, right?
Yes. Cleaned it up.
The hasher should be a static class, so we don't really need to pass it around. We could use hasherClassName instead.
Done
Why not do the 31 multiplication in the genHash* methods?
Never mind. ArrayType does something different.
This can be performance sensitive. Imperative while loops are better here.
Changed this everywhere.
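As a hedged sketch of that suggestion (the element hashing below is a stand-in, not Spark's actual code), an imperative while loop over the elements avoids the iterator, closure, and boxing overhead that a functional-style traversal can incur on a hot path:

```java
public class ArrayHash {
    // Imperative while loop, as suggested for performance-sensitive code:
    // no iterator allocation, no boxing, no closures.
    public static int hashIntArray(int[] values) {
        int result = 0;
        int i = 0;
        while (i < values.length) {
            result = 31 * result + values[i];  // Hive-style 31 * accumulator + element
            i++;
        }
        return result;
    }
}
```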
Force-pushed 4ae4856 to afc1d1b
Force-pushed afc1d1b to cf62891
tejasapatil left a comment
- Benchmark numbers for `HashByteArrayBenchmark` show that `HiveHasher` is way faster than the other two impls. However, I think it will be bad wrt. hash collisions.
- `HashBenchmark` values for the map datatype seem to differ a lot for the interpreted version. I think that might be due to #13847
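The collision concern is easy to demonstrate for 31-based polynomial string hashes (Java's String.hashCode uses the same scheme): two distinct two-character strings collide whenever the first character increases by 1 while the second decreases by 31:

```java
public class CollisionDemo {
    // "Aa" and "BB" collide under the 31-polynomial scheme:
    // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112.
    public static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }
}
```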
Test build #66026 has finished for PR 15047 at commit
Test build #66027 has finished for PR 15047 at commit
```
------------------------------------------------------------------------------------------------
Murmur3_x86_32                 11 / 12        198.9        5.0       1.0X
xxHash 64-bit                  16 / 19        130.1        7.7       0.7X
HiveHasher                      0 /  0     282254.6        0.0    1419.0X
```
This looks too good to be true :)... I think the JVM is eliminating dead code. We should do something with the sum variable and see what happens in that case.
@hvanhovell: Thanks for pointing this out. At first I printed the sum, but that produced a lot of noise on the console, so I moved the sum out of the loop (both showed similar results).
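A sketch of the benchmarking pitfall being discussed (names here are illustrative, not the actual benchmark code): if the hash results are never consumed, the JIT can eliminate the whole loop and report absurdly fast numbers; accumulating into a variable that is used after the loop keeps the work live:

```java
public class BenchmarkSketch {
    // Stand-in for the hasher under test.
    public static int hash(int x) { return 31 * x + 17; }

    public static long timedSum(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += hash(i);  // accumulate so the JIT cannot prove the calls are dead
        }
        return sum;  // consuming the sum after the loop defeats dead-code elimination
    }
}
```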
Test build #66058 has finished for PR 15047 at commit
jenkins retest this please
Test build #66069 has finished for PR 15047 at commit
jenkins retest this please
Test build #66108 has finished for PR 15047 at commit
LGTM. I'll let @cloud-fan sign off on this.
Retest this please
@tejasapatil I have triggered a new build. I'll merge this as soon as it completes successfully.
Test build #66279 has finished for PR 15047 at commit
Merging to master. Thanks!
Do we need a test suite for checking whether the generated hash value is identical to the value produced by Hive?
@gatorsmile: I have tests in
@tejasapatil It sounds like the test coverage is limited. It does not cover all the data types, right?
One testing technique we have used internally at Databricks (not for Spark) is to use a random data generator to generate a bunch of data, run it through the reference implementation to get the results, and then just pull the results in, rather than taking a dependency on the reference implementation.
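A minimal sketch of that golden-value technique (the hash function and the recorded values below are purely illustrative): capture the reference implementation's outputs once, then assert against the recorded values without linking the reference implementation in:

```java
public class GoldenValueTest {
    // Implementation under test (illustrative stand-in).
    public static int hash(int x) { return 31 * x + 17; }

    // Outputs previously captured by running INPUTS through the reference
    // implementation; the reference itself is not a dependency of this test.
    static final int[] INPUTS   = {0, 1, 2, 100};
    static final int[] EXPECTED = {17, 48, 79, 3117};

    public static boolean matchesGoldenValues() {
        for (int i = 0; i < INPUTS.length; i++) {
            if (hash(INPUTS[i]) != EXPECTED[i]) return false;
        }
        return true;
    }
}
```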
Yeah, my previous team also used a similar FVT tool for populating database tables. It is pretty useful.
## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-17495

Spark internally uses Murmur3Hash for partitioning. This is different from the one used by Hive. For queries which use bucketing, this leads to different results if one tries the same query on both engines. For us, we want users to have backward compatibility so that one can switch parts of applications across the engines without observing regressions.

This PR includes `HiveHash`, `HiveHashFunction`, and `HiveHasher`, which mimic Hive's hashing at https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638

I am intentionally not introducing any usages of this hash function in the rest of the code to keep this PR small. My eventual goal is to have Hive bucketing support in Spark. Once this PR gets in, I will make the hash function pluggable in relevant areas (e.g. `HashPartitioning`'s `partitionIdExpression` has Murmur3 hardcoded: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala#L265)

## How was this patch tested?

Added `HiveHashSuite`

Author: Tejas Patil <[email protected]>

Closes apache#15047 from tejasapatil/SPARK-17495_hive_hash.
@gatorsmile + @rxin : I had made a note of your comments but was not able to get to it that time because I had other time critical projects to be worked on. I have put out a PR which improves the unit test coverage : #17049 |
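As a heavily hedged sketch of the general shape of Hive's hashing referenced above (this is not the actual ObjectInspectorUtils code — consult the linked source for the real per-type rules): primitives hash roughly like their java.lang wrapper hashCodes, and lists fold element hashes with a 31 multiplier:

```java
import java.util.List;

public class HiveStyleHash {
    // An int hashes to itself; a long folds its high and low words,
    // like Long.hashCode does.
    public static int hashInt(int v)   { return v; }
    public static int hashLong(long v) { return (int) (v ^ (v >>> 32)); }

    // Lists fold element hashes with the familiar 31 * accumulator + element scheme.
    public static int hashIntList(List<Integer> values) {
        int result = 0;
        for (int v : values) {
            result = 31 * result + hashInt(v);
        }
        return result;
    }
}
```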