Conversation

@tejasapatil
Contributor

What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17495

Spark internally uses Murmur3Hash for partitioning. This is different from the hash function used by Hive. For queries which use bucketing, this leads to different results if one tries the same query on both engines. We want users to have backward compatibility so that one can switch parts of applications across the engines without observing regressions.

This PR includes HiveHash, HiveHashFunction, and HiveHasher, which mimic Hive's hashing at https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638

I am intentionally not introducing any usages of this hash function in the rest of the code, to keep this PR small. My eventual goal is to have Hive bucketing support in Spark. Once this PR gets in, I will make the hash function pluggable in the relevant areas (e.g. HashPartitioning's partitionIdExpression has Murmur3 hardcoded: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala#L265)
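As an illustrative sketch only (not the actual Spark or Hive code, and with hypothetical method names), Hive's hashing scheme can be summarized as: a primitive int hashes to its own value, a long is folded to 32 bits by XOR-ing its halves, and composite values combine element hashes with the classic h = 31 * h + x rule:

```java
// Hedged sketch of Hive-style hashing; class and method names are
// illustrative, not the real HiveHasher API.
public final class HiveStyleHash {
    // In Hive's scheme, an int hashes to its own value.
    public static int hashInt(int value) {
        return value;
    }

    // A long is folded to 32 bits by XOR-ing its upper and lower halves.
    public static int hashLong(long value) {
        return (int) (value ^ (value >>> 32));
    }

    // Composite values combine element hashes with the 31-multiplier rule,
    // using an imperative loop (as later suggested in the review comments).
    public static int hashIntArray(int[] values) {
        int result = 0;
        int i = 0;
        while (i < values.length) {
            result = 31 * result + hashInt(values[i]);
            i++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(hashInt(42));                       // 42
        System.out.println(hashIntArray(new int[] {1, 2, 3})); // 1026
    }
}
```

This differs from Spark's Murmur3-based hashing, which is why the same bucketed query can land rows in different buckets on the two engines.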

How was this patch tested?

Added HiveHashSuite

@SparkQA

SparkQA commented Sep 10, 2016

Test build #65214 has finished for PR 15047 at commit c898f5a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class HiveHash(children: Seq[Expression], seed: Int) extends HashExpression[Int]

@tejasapatil
Contributor Author

@rxin : can you recommend someone to review this PR?

Contributor

What is this? Can we assume that the sqlType of a UserDefinedType is an Array?


Contributor

@cloud-fan I think you wrote the initial version. Could you tell us what is happening here?

Contributor

The caller of hash guarantees that the value matches the data type. So in this branch, if the value is ArrayData, the data type must be ArrayType or a UDT of ArrayType.

@hvanhovell
Contributor

@tejasapatil this looks pretty good overall. I left a few comments.

@rxin
Contributor

rxin commented Sep 15, 2016

Can we move this into catalyst.expressions in sql/catalyst?

@tejasapatil
Contributor Author

@rxin : I could, but the test case depends on a few Hive classes for validation. I could either (keep the test case in sql/hive and move HiveHash to sql/catalyst) OR (move both to sql/catalyst and hard-code the expected output in the test case so that I do not have to depend on Hive classes).

@rxin
Contributor

rxin commented Sep 17, 2016

Hard-coding the output seems like a good idea.

Additionally, if you want to be super safe, you could also create a randomized test in sql/hive.

@tejasapatil
Contributor Author

@hvanhovell Done with all changes. Ready for review.

@SparkQA

SparkQA commented Sep 19, 2016

Test build #65614 has finished for PR 15047 at commit 8e42799.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 19, 2016

Test build #65615 has finished for PR 15047 at commit 4ae4856.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

Let's put this in catalyst: org.apache.spark.sql.catalyst.expressions.

XXH64 also lives there.

Contributor Author

Done

@hvanhovell
Contributor

@tejasapatil could you add this hash to HashByteArrayBenchmark and to HashBenchmark and update the results of these tests?

Contributor

Is this the same as the HashExpression.computeHash?

Contributor Author

@tejasapatil tejasapatil Sep 28, 2016

Yes. @tailrec only works with the private modifier, so I was unable to make the parent class' version accessible to child classes.

I am introducing a wrapper method to avoid code duplication while still keeping tailrec's benefits. This method is used for generating the codegen string, so it would have negligible impact on overall query perf.

If you have any better solution, let me know.

Contributor

Seed is never used, right?

Contributor Author

Yes. Cleaned it up.

Contributor

The hasher should be a static class, so we don't really need to pass it around. We could use hasherClassName instead.

Contributor Author

Done

Contributor

Why not do the 31 multiplication in the genHash* methods?

Contributor

@hvanhovell hvanhovell Sep 27, 2016

Never mind. ArrayType does something different.

Contributor

This can be performance-sensitive. Imperative while loops are better here.

Contributor Author

changed this everywhere.

Contributor Author

@tejasapatil tejasapatil left a comment

@hvanhovell :

  • benchmark numbers for HashByteArrayBenchmark show that HiveHasher is way faster than the other two impls. However, I think it will be bad w.r.t. hash collisions.
  • HashBenchmark values for the map datatype seem to differ a lot for the interpreted version. I think that might be due to #13847

Contributor Author

Done

Contributor Author

Yes. Cleaned it up.

Contributor Author

Done

Contributor Author

changed this everywhere.

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66026 has finished for PR 15047 at commit afc1d1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66027 has finished for PR 15047 at commit cf62891.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

                                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Murmur3_x86_32                                     11 / 12        198.9           5.0       1.0X
xxHash 64-bit                                      16 / 19        130.1           7.7       0.7X
HiveHasher                                          0 /  0     282254.6           0.0    1419.0X
Contributor

This looks too good to be true :) ... I think the JVM is eliminating dead code. We should do something with the sum variable, and see what happens in that case.

Contributor Author

@hvanhovell : Thanks for pointing this out. At first I printed the sum, but that had a lot of noise on the console, so I moved the sum out of the loop (both showed similar results).
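The dead-code-elimination problem discussed above is a classic microbenchmark pitfall: if the JIT can prove a loop's results are never consumed, it may delete the loop entirely, producing absurdly fast numbers like the HiveHasher row earlier. A minimal sketch of the guard (with a stand-in hash function, not the real benchmark code) is:

```java
// Hedged sketch of guarding a microbenchmark against JVM dead-code
// elimination; fakeHash is a stand-in for the hasher under test.
public final class DeadCodeGuard {
    static int fakeHash(int x) {
        return 31 * x + 17;  // stand-in for the hash under test
    }

    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // Accumulate every result so the JIT cannot prove the
            // loop body is unused and delete it entirely.
            sum += fakeHash(i);
        }
        // Consume the sum once, outside the loop, instead of printing
        // on every iteration (which would add noise to the measurement).
        System.out.println(sum);
    }
}
```

Frameworks like JMH formalize this with a Blackhole sink; summing into a variable consumed after the loop is the lightweight equivalent used here.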

@SparkQA

SparkQA commented Sep 28, 2016

Test build #66058 has finished for PR 15047 at commit 238dbb8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Sep 29, 2016

Test build #66069 has finished for PR 15047 at commit 238dbb8.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

jenkins retest this please

@SparkQA

SparkQA commented Sep 29, 2016

Test build #66108 has finished for PR 15047 at commit 238dbb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM. I'll let @cloud-fan sign off on this.

@hvanhovell
Contributor

Retest this please

@hvanhovell
Contributor

@tejasapatil I have triggered a new build. I'll merge this as soon as it completes successfully.

@SparkQA

SparkQA commented Oct 3, 2016

Test build #66279 has finished for PR 15047 at commit 238dbb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

Merging to master. Thanks!

@asfgit asfgit closed this in a99743d Oct 5, 2016
@tejasapatil tejasapatil deleted the SPARK-17495_hive_hash branch October 7, 2016 00:13
@gatorsmile
Member

gatorsmile commented Nov 7, 2016

Do we need a test suite that checks whether the generated hash value is identical to the value produced by Hive?

@tejasapatil
Contributor Author

@gatorsmile : I have tests in HiveHasherSuite to compare the values against expected ones. Initially I had thought about generating random input and calling the original Hive hash function to compare the results, but I later dropped that as it would have added a dependency on Hive. See #15047 (comment)

@gatorsmile
Member

@tejasapatil It sounds like the test case coverage is limited. It does not cover all the data types, right?

@rxin
Contributor

rxin commented Nov 8, 2016

One testing technique we have used internally at Databricks (not for Spark) is to use a random data generator to generate a bunch of data, run it through the reference implementation to get the results, and then just pull the results in, rather than depending on the reference implementation.
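The technique above can be sketched as a small one-off generator: run random inputs through the reference implementation once, print the (input, expected) pairs as source-code literals, and paste them into the test suite, so the suite no longer needs the reference library at test time. Everything below is illustrative; referenceHash is a stand-in for the real reference implementation (e.g. Hive's hash function):

```java
import java.util.Random;

// Hedged sketch of "golden value" generation; names are hypothetical.
public final class GoldenValueGenerator {
    static int referenceHash(int x) {
        return x;  // stand-in; in Hive's scheme an int hashes to itself
    }

    public static void main(String[] args) {
        Random rng = new Random(42);  // fixed seed for reproducibility
        StringBuilder out = new StringBuilder("int[][] golden = {");
        for (int i = 0; i < 5; i++) {
            int input = rng.nextInt(1000);
            // Record each input alongside the reference result.
            out.append(String.format("{%d, %d}, ", input, referenceHash(input)));
        }
        out.append("};");
        // The printed literal is pasted into the test suite, which then
        // asserts hashUnderTest(input) == expected for every pair.
        System.out.println(out);
    }
}
```

The fixed seed keeps the generated cases stable across runs, which makes the pasted golden values reproducible if they ever need to be regenerated.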

@gatorsmile
Member

Yeah, my previous team also used a similar FVT tool for populating database tables. It is pretty useful.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Author: Tejas Patil <[email protected]>

Closes apache#15047 from tejasapatil/SPARK-17495_hive_hash.
@tejasapatil
Contributor Author

@gatorsmile + @rxin : I had made a note of your comments but was not able to get to it at the time because I had other time-critical projects to work on. I have put out a PR which improves the unit test coverage: #17049
