[SPARK-17495] [SQL] Support Decimal type in Hive-hash #17056
Conversation
ok to test

Test build #73410 has started for PR 17056 at commit
## What changes were proposed in this pull request?

This PR adds tests for hive-hash by comparing the outputs generated against Hive 1.2.1. The following datatypes are covered by this PR:

- null
- boolean
- byte
- short
- int
- long
- float
- double
- string
- array
- map
- struct

Datatypes that I have _NOT_ covered but will work on separately are:

- Decimal (handled separately in #17056)
- TimestampType
- DateType
- CalendarIntervalType

## How was this patch tested?

NA

Author: Tejas Patil <[email protected]>

Closes #17049 from tejasapatil/SPARK-17495_remaining_types.
Force-pushed from a378b3e to 8595305
retest this please

Test build #73460 has finished for PR 17056 at commit

@cloud-fan @gatorsmile: can you please review this PR?
A quick question: are these expected values obtained from Hive?
I did a quick check. Most are right, but some of them do not match.
These were generated over Hive 1.2.1.
hive> select HASH(CAST("-18446744073709001000" AS DECIMAL(38,19)));
OK
0
Time taken: 0.035 seconds, Fetched: 1 row(s)
I figured out the problem: the test case was not looking at the result of decimal.changePrecision. Fixed.
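The `0` in the Hive session above comes from the cast overflowing, not from the hash itself. A minimal Java sketch of the idea (this is an illustration, not Hive's code; `enforcePrecisionScale` is a hypothetical helper): the literal -18446744073709001000 needs 20 integer digits, but DECIMAL(38,19) leaves only 38 - 19 = 19, so the CAST yields NULL, and Hive hashes NULL to 0.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

public class DecimalOverflowSketch {
    static final int MAX_PRECISION = 38;

    // Hypothetical helper: returns null when the value cannot fit
    // DECIMAL(precision, scale), mirroring how an overflowing CAST
    // becomes NULL in Hive.
    static BigDecimal enforcePrecisionScale(BigDecimal d, int precision, int scale) {
        BigDecimal scaled = d.setScale(scale, RoundingMode.HALF_UP);
        if (scaled.precision() > precision) {
            return null; // overflow: 20 integer digits + 19 fraction digits > 38
        }
        return scaled;
    }

    public static void main(String[] args) {
        BigDecimal v = new BigDecimal("-18446744073709001000");
        BigDecimal r = enforcePrecisionScale(v, MAX_PRECISION, 19);
        // NULL hashes to 0, matching the Hive session output above.
        System.out.println(r == null ? 0 : r.hashCode());
    }
}
```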
Jenkins test this please

Test build #73595 has finished for PR 17056 at commit

Test build #73596 has finished for PR 17056 at commit
where do we use `ctx` and `d`?
Neither is used, but both are part of the method signature since the default impl in the abstract class needs them:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala, line 321 (at 3e40f6c):

    protected def genHashDecimal(
${classOf[HiveHashFunction].getName}.normalizeDecimal
HiveHashFunction is an object, so classOf[] cannot be used. I tried HiveHashFunction.getClass.getName, as done in other places in the codebase.
Hmm, it's hard to guarantee that we can produce the same hash value as Hive. Can we run Hive in the test and compare the result with Spark?
The expected values are generated using Hive 1.2.1. My original approach was to depend on Hive for generating expected values, but per discussion in a related PR, I was advised to hardcode the expected values. The main point was to reduce the dependency on Hive.
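The hardcoded-golden-value approach can be sketched like this. This is an illustration, not Spark's actual test code (the class and method names are made up), but the primitive rules shown — identity for int, `(int)(v ^ (v >>> 32))` for long, a 31-based rolling hash over UTF-8 bytes for string — match how Spark's HiveHasher mirrors Hive's hashing.

```java
import java.nio.charset.StandardCharsets;

public class HiveHashGolden {
    // Hive hashes an int to itself.
    static int hashInt(int i) { return i; }

    // Longs fold the high word into the low word, Java-hashCode style.
    static int hashLong(long l) { return (int) (l ^ (l >>> 32)); }

    // Strings use a 31-based rolling hash over the UTF-8 bytes, which for
    // ASCII input coincides with Java's String.hashCode.
    static int hashString(String s) {
        int h = 0;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return h;
    }

    public static void main(String[] args) {
        // Expected values are hardcoded once, generated from a reference
        // Hive run, so the test suite never needs a live Hive instance.
        assert hashInt(42) == 42;
        assert hashLong(1L << 32) == 1;
        assert hashString("ab") == 3105; // 31 * 97 + 98
        System.out.println("ok");
    }
}
```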
Test build #73610 has finished for PR 17056 at commit
} else {
changed
Nit: `if (input == null) return null`
changed
Let me manually check whether the results are consistent with Hive.
Force-pushed from 428a9a4 to c0c8390
@gatorsmile: I really appreciate your help in reviewing this PR, to the extent that you are manually checking the hashes over Hive. If you haven't already embarked on that, here is the set of Hive queries corresponding to the test cases in the PR, which you can easily copy-paste:
Thank you! I checked Hive 2.1. It has exactly the same hash values.

LGTM. cc @cloud-fan for final sign-off.

Test build #73670 has finished for PR 17056 at commit

@cloud-fan ping!!
    private val HiveDecimalMaxScale = 38

    // Mimics normalization done for decimals in Hive at HiveDecimalV1.normalize()
    def normalizeDecimal(input: BigDecimal, allowRounding: Boolean): BigDecimal = {
`allowRounding` will never be false?
removed that param
      HiveHasher.hashUnsafeBytes(base, offset, len)
    }

    private val HiveDecimalMaxPrecision = 38
nit: HIVE_DECIMAL_MAX_PRECISION
renamed
Test build #73945 has finished for PR 17056 at commit

LGTM (if tests pass)

Test build #74018 has finished for PR 17056 at commit

Thanks, merging to master!
## What changes were proposed in this pull request?

Hive hash to support the Decimal datatype. Hive internally normalises decimals, and I have ported that logic as-is to HiveHash.

## How was this patch tested?

Added unit tests.
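The normalization mentioned above can be sketched in Java as follows. This is an illustration of the idea, not Hive's actual HiveDecimalV1.normalize() code: trailing zeros are stripped so that numerically equal decimals such as 1.0 and 1.00 share one representation (and thus one hash), and values whose integer part needs more than 38 digits are rejected as null.

```java
import java.math.BigDecimal;

public class DecimalNormalizeSketch {
    static final int MAX_PRECISION = 38;

    // Hypothetical sketch of Hive-style decimal normalization:
    // trim trailing zeros (never keeping a negative scale) and reject
    // values whose integer part exceeds 38 digits.
    static BigDecimal normalize(BigDecimal d) {
        BigDecimal trimmed = d.stripTrailingZeros();
        if (trimmed.scale() < 0) {
            // e.g. 6E+2 -> 600, so the representation stays canonical
            trimmed = trimmed.setScale(0);
        }
        if (trimmed.precision() - trimmed.scale() > MAX_PRECISION) {
            return null; // too many integer digits to represent
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Numerically equal inputs normalize to one representation,
        // so their hash codes agree.
        assert normalize(new BigDecimal("1.0")).equals(normalize(new BigDecimal("1.000")));
        assert normalize(new BigDecimal("1.0")).hashCode()
            == normalize(new BigDecimal("1")).hashCode();
        System.out.println("ok");
    }
}
```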