[SPARK-17495] [SQL] Add Hash capability semantically equivalent to Hive's #15047
Test build #65214 has finished for PR 15047 at commit
@rxin: can you recommend someone to review this PR?
What is this? Can we assume that the sqlType of a UserDefinedType is an Array?
I mimicked exactly what happens in the case of Murmur3Hash. See https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/misc.scala#L388
@cloud-fan I think you wrote the initial version. Could you tell us what is happening here?
The caller of hash guarantees that the value matches the data type. So in this branch, if the value is ArrayData, the data type must be ArrayType or a UDT over ArrayType.
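As a purely illustrative Java sketch (the names and types here are hypothetical, not Spark's actual API), the contract described above — the caller guarantees the runtime value matches the declared type, and a UDT is unwrapped to its underlying sqlType before dispatch — might look like:

```java
public class HashDispatch {
    public interface DataType {}
    public static class IntType implements DataType {}
    public static class ArrayType implements DataType {
        public final DataType element;
        public ArrayType(DataType element) { this.element = element; }
    }
    // A UDT wraps an underlying SQL type; hashing unwraps it before dispatching.
    public static class UserDefinedType implements DataType {
        public final DataType sqlType;
        public UserDefinedType(DataType sqlType) { this.sqlType = sqlType; }
    }

    public static int hash(Object value, DataType dataType) {
        if (dataType instanceof UserDefinedType) {
            // Unwrap the UDT: the value is stored in terms of the underlying sqlType.
            return hash(value, ((UserDefinedType) dataType).sqlType);
        } else if (value instanceof Integer) {
            return (Integer) value;
        } else if (value instanceof Object[]) {
            // By the caller's contract, an array value implies an array data type here
            // (possibly reached after unwrapping a UDT above), so this cast is safe.
            DataType elementType = ((ArrayType) dataType).element;
            int result = 0;
            for (Object e : (Object[]) value) {
                result = 31 * result + hash(e, elementType);
            }
            return result;
        }
        throw new IllegalArgumentException("unsupported value/type combination");
    }
}
```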
@tejasapatil this looks pretty good overall. I left a few comments.
Can we move this into catalyst.expressions in sql/catalyst?
@rxin: I could, but the test case depends on a few Hive classes for validation. I could either (keep the test case in sql/hive and move HiveHash to sql/catalyst) OR (move both to sql/catalyst and hard-code the expected output in the test case so that I need not depend on Hive classes).
Hard-coding the output seems like a good idea. Additionally, if you want to be super safe, you could also create a randomized test in sql/hive.
Force-pushed c898f5a to 8e42799
@hvanhovell Done with all changes. Ready for review.
Force-pushed 8e42799 to 4ae4856
Test build #65614 has finished for PR 15047 at commit
Test build #65615 has finished for PR 15047 at commit
Let's put this in catalyst: org.apache.spark.sql.catalyst.expressions.
XXH64 also lives there.
Done
@tejasapatil could you add this hash to
Is this the same as HashExpression.computeHash?
Yes. @tailrec only works with the private modifier, so I was unable to make the parent class' version accessible to child classes.
I am introducing a wrapper method to avoid code duplication while still keeping tailrec's benefits. This method is used for generating the codegen string, so it would have negligible impact on overall query perf.
If you have a better solution, let me know.
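For illustration only (in Scala, @tailrec requires a method that subclasses cannot override, hence the private modifier mentioned above), the wrapper shape being described — a private recursive core plus a thin non-private wrapper that child classes can reuse — might look roughly like this in Java; the method names and the toy recursion are hypothetical:

```java
public class HashCodegen {
    // Private recursive core: in the Scala original this is where @tailrec applies,
    // since @tailrec needs a method that cannot be overridden.
    private String computeHashRec(String expr, int depth) {
        if (depth == 0) return expr;
        return computeHashRec("hash(" + expr + ")", depth - 1);
    }

    // Thin non-private wrapper: child classes reuse the recursion without
    // duplicating it, and the core keeps its tail-recursion guarantee.
    public final String computeHash(String expr, int depth) {
        return computeHashRec(expr, depth);
    }
}
```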
Seed is never used, right?
Yes. Cleaned it up.
The hasher should be a static class, so we don't really need to pass it around. We could use hasherClassName instead.
Done
Why not do the 31 multiplication in the genHash* methods?
Never mind. ArrayType does something different.
This can be performance sensitive. Imperative while loops are better here.
Changed this everywhere.
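As a hedged sketch of that suggestion (the element hashing below is a stand-in, not Spark's actual code), an imperative while loop over the elements avoids the iterator, closure, and boxing overhead that a functional-style traversal can incur on a hot path:

```java
public class ArrayHash {
    // Imperative while loop, as suggested for performance-sensitive code:
    // no iterator allocation, no boxing, no closures.
    public static int hashIntArray(int[] values) {
        int result = 0;
        int i = 0;
        while (i < values.length) {
            result = 31 * result + values[i];  // Hive-style 31 * accumulator + element
            i++;
        }
        return result;
    }
}
```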
Force-pushed 4ae4856 to afc1d1b
Force-pushed afc1d1b to cf62891
tejasapatil left a comment
- Benchmark numbers for `HashByteArrayBenchmark` show that `HiveHasher` is way faster than the other two impls. However, I think it will be bad wrt. hash collisions.
- `HashBenchmark` values for the map datatype seem to differ a lot for the interpreted version. I think that might be due to #13847
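The collision concern is easy to demonstrate for 31-based polynomial string hashes (Java's String.hashCode uses the same scheme): two distinct two-character strings collide whenever the first character increases by 1 while the second decreases by 31:

```java
public class CollisionDemo {
    // "Aa" and "BB" collide under the 31-polynomial scheme:
    // 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112.
    public static boolean collide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }
}
```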
Test build #66026 has finished for PR 15047 at commit
Test build #66027 has finished for PR 15047 at commit
```
------------------------------------------------------------------------------------------------
Murmur3_x86_32                 11 / 12        198.9        5.0       1.0X
xxHash 64-bit                  16 / 19        130.1        7.7       0.7X
HiveHasher                      0 /  0     282254.6        0.0    1419.0X
```
This looks too good to be true :)... I think the JVM is eliminating dead code. We should do something with the sum variable and see what happens in that case.
@hvanhovell: Thanks for pointing this out. At first I printed the sum, but that produced a lot of noise on the console, so I moved the sum out of the loop (both showed similar results).
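A sketch of the benchmarking pitfall being discussed (names here are illustrative, not the actual benchmark code): if the hash results are never consumed, the JIT can eliminate the whole loop and report absurdly fast numbers; accumulating into a variable that is used after the loop keeps the work live:

```java
public class BenchmarkSketch {
    // Stand-in for the hasher under test.
    public static int hash(int x) { return 31 * x + 17; }

    public static long timedSum(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += hash(i);  // accumulate so the JIT cannot prove the calls are dead
        }
        return sum;  // consuming the sum after the loop defeats dead-code elimination
    }
}
```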
Test build #66058 has finished for PR 15047 at commit
jenkins retest this please
Test build #66069 has finished for PR 15047 at commit
jenkins retest this please
Test build #66108 has finished for PR 15047 at commit
LGTM. I'll let @cloud-fan sign off on this.
Retest this please
@tejasapatil I have triggered a new build. I'll merge this as soon as it completes successfully.
Test build #66279 has finished for PR 15047 at commit
Merging to master. Thanks!
Do we need a test suite for checking whether the generated hash value is identical to the value produced by Hive?
@gatorsmile: I have tests in
@tejasapatil It sounds like the test coverage is limited. It does not cover all the data types, right?
One testing technique we have used internally at Databricks (not for Spark) is to use a random data generator to generate a bunch of data, run it through the reference implementation to get the results, and then just pull the results in, rather than taking a dependency on the reference implementation.
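A minimal sketch of that golden-value technique (the hash function and the recorded values below are purely illustrative): capture the reference implementation's outputs once, then assert against the recorded values without linking the reference implementation in:

```java
public class GoldenValueTest {
    // Implementation under test (illustrative stand-in).
    public static int hash(int x) { return 31 * x + 17; }

    // Outputs previously captured by running INPUTS through the reference
    // implementation; the reference itself is not a dependency of this test.
    static final int[] INPUTS   = {0, 1, 2, 100};
    static final int[] EXPECTED = {17, 48, 79, 3117};

    public static boolean matchesGoldenValues() {
        for (int i = 0; i < INPUTS.length; i++) {
            if (hash(INPUTS[i]) != EXPECTED[i]) return false;
        }
        return true;
    }
}
```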
Yeah, my previous team also used a similar FVT tool for populating database tables. It is pretty useful.
## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-17495

Spark internally uses Murmur3Hash for partitioning. This is different from the one used by Hive. For queries which use bucketing, this leads to different results if one tries the same query on both engines. For us, we want users to have backward compatibility so that one can switch parts of applications across the engines without observing regressions.

This PR includes `HiveHash`, `HiveHashFunction`, and `HiveHasher`, which mimic Hive's hashing at https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638

I am intentionally not introducing any usages of this hash function in the rest of the code to keep this PR small. My eventual goal is to have Hive bucketing support in Spark. Once this PR gets in, I will make the hash function pluggable in relevant areas (e.g. `HashPartitioning`'s `partitionIdExpression` has Murmur3 hardcoded: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala#L265)

## How was this patch tested?

Added `HiveHashSuite`

Author: Tejas Patil <[email protected]>

Closes apache#15047 from tejasapatil/SPARK-17495_hive_hash.
@gatorsmile + @rxin : I had made a note of your comments but was not able to get to it that time because I had other time critical projects to be worked on. I have put out a PR which improves the unit test coverage : #17049 |
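As a heavily hedged sketch of the general shape of Hive's hashing referenced above (this is not the actual ObjectInspectorUtils code — consult the linked source for the real per-type rules): primitives hash roughly like their java.lang wrapper hashCodes, and lists fold element hashes with a 31 multiplier:

```java
import java.util.List;

public class HiveStyleHash {
    // An int hashes to itself; a long folds its high and low words,
    // like Long.hashCode does.
    public static int hashInt(int v)   { return v; }
    public static int hashLong(long v) { return (int) (v ^ (v >>> 32)); }

    // Lists fold element hashes with the familiar 31 * accumulator + element scheme.
    public static int hashIntList(List<Integer> values) {
        int result = 0;
        for (int v : values) {
            result = 31 * result + hashInt(v);
        }
        return result;
    }
}
```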