Conversation

@tejasapatil
Contributor

@tejasapatil tejasapatil commented Feb 24, 2017

What changes were proposed in this pull request?

This adds support for the Decimal datatype to Hive hash. Hive internally normalizes decimals, and I have ported that logic as-is to HiveHash.

How was this patch tested?

Added unit tests
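
To make the normalization step concrete, here is a minimal Python sketch of the two behaviors discussed in this thread: trailing fractional zeros are stripped so numerically equal values reduce to one canonical form before hashing, and values whose integer part exceeds 38 digits are treated as null (for which Hive's HASH returns 0, as seen later in the thread). This is an illustrative stand-in, not the actual Scala code in the PR; the function name and the exact overflow rule are simplifications.

```python
from decimal import Decimal, getcontext

# Head-room well beyond the 38-digit Hive limit, so Python's Decimal
# context never rounds values on us during the demo.
getcontext().prec = 100

HIVE_DECIMAL_MAX_PRECISION = 38  # Hive's decimal precision limit

def normalize(d):
    """Illustrative sketch of Hive-style decimal normalization:
    strip trailing fractional zeros so equal values (1.0, 1.00)
    normalize to the same form and hash identically, and reject
    values whose integer part cannot fit in 38 digits."""
    d = d.normalize()                       # e.g. 1.500 -> 1.5, 1.00 -> 1
    _, digits, exp = d.as_tuple()
    integer_digits = max(len(digits) + exp, 0)
    if integer_digits > HIVE_DECIMAL_MAX_PRECISION:
        return None                         # overflow: treated as null
    return d
```

Under this sketch, `normalize(Decimal("1.0"))` and `normalize(Decimal("1.00"))` produce the same canonical value, which is exactly why normalization must happen before hashing.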

@tejasapatil
Contributor Author

ok to test

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73410 has started for PR 17056 at commit a378b3e.

asfgit pushed a commit that referenced this pull request Feb 24, 2017
## What changes were proposed in this pull request?

This PR adds tests for hive-hash by comparing the outputs generated against Hive 1.2.1. The following datatypes are covered by this PR:
- null
- boolean
- byte
- short
- int
- long
- float
- double
- string
- array
- map
- struct

Datatypes that I have _NOT_ covered but will work on separately are:
- Decimal (handled separately in #17056)
- TimestampType
- DateType
- CalendarIntervalType

## How was this patch tested?

NA

Author: Tejas Patil <[email protected]>

Closes #17049 from tejasapatil/SPARK-17495_remaining_types.
@tejasapatil
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 25, 2017

Test build #73460 has finished for PR 17056 at commit 8595305.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@cloud-fan @gatorsmile: can you please review this PR?

Member

A quick question: were these expected values obtained from Hive?

Member

I did a quick check. Most are right, but some of them do not match.

Contributor Author

These were generated over Hive 1.2.1.

Member

hive> select HASH(CAST("-18446744073709001000" AS DECIMAL(38,19)));
OK
0
Time taken: 0.035 seconds, Fetched: 1 row(s)

Contributor Author

I figured out the problem: the test case was not looking at the result of decimal.changePrecision. Fixed.
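
To illustrate the bug class (I believe Spark's Decimal.changePrecision returns a Boolean indicating whether the value still fits, rather than throwing), here is a hedged Python sketch with a hypothetical stand-in function; the name `change_precision` and its exact fitting rule are assumptions for illustration only:

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

getcontext().prec = 100  # avoid Python's context rounding interfering

def change_precision(d, precision, scale):
    """Hypothetical stand-in for a changePrecision-style call: round
    to `scale` fractional digits and report whether the result still
    fits in `precision` total digits. Returns (ok, rounded_value)."""
    rounded = d.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    ok = len(rounded.as_tuple().digits) <= precision
    return ok, rounded

# The bug pattern: using the rounded value without checking `ok`.
ok, v = change_precision(Decimal("-18446744073709001000"), 38, 19)
# 20 integer digits + 19 fractional digits = 39 > 38, so `ok` is
# False; a test that ignores `ok` would hash the overflowed value,
# while Hive's HASH returns 0 for it (as shown in the console output
# earlier in this thread).
```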

Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
@tejasapatil
Contributor Author

Jenkins test this please

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73595 has finished for PR 17056 at commit 2387515.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

cc @cloud-fan @gatorsmile

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73596 has finished for PR 17056 at commit 2387515.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

where do we use ctx and d?

Contributor Author

Neither is used, but they are part of the method signature since the default impl in the abstract class needs them:

Contributor

${classOf[HiveHashFunction].getName}.normalizeDecimal

Contributor Author

HiveHashFunction is an object, so classOf[] cannot be used. I tried HiveHashFunction.getClass.getName, as done in other places in the codebase.

Contributor

Hmm, it's hard to guarantee that we can produce the same hash values as Hive. Can we run Hive in the test and compare its results with Spark's?

Contributor Author

@tejasapatil tejasapatil Feb 28, 2017

The expected values are generated using Hive 1.2.1. My original approach was to depend on Hive for generating the expected values, but per discussion in a related PR it was suggested that I hardcode them, the main point being to reduce the dependency on Hive.

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73610 has finished for PR 17056 at commit 428a9a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

} else {

Contributor Author

changed

Member

Nit: if (input == null) return null

Contributor Author

changed

@gatorsmile
Member

Let me manually check whether the results are consistent with Hive.

@tejasapatil tejasapatil force-pushed the SPARK-17495_decimal branch from 428a9a4 to c0c8390 on March 1, 2017 06:32
@tejasapatil
Contributor Author

@gatorsmile: I really appreciate your help in reviewing this PR, to the extent that you are manually checking the hashes against Hive. If you haven't already embarked on that, here is the set of Hive queries corresponding to the test cases in the PR, which you can easily copy-paste:

SELECT HASH(CAST(18BD AS DECIMAL(38, 0)));
SELECT HASH(CAST(-18BD AS DECIMAL(38, 0)));
SELECT HASH(CAST(-18BD AS DECIMAL(38, 12)));
SELECT HASH(CAST(18446744073709001000BD AS DECIMAL(38, 19)));
SELECT HASH(CAST(-18446744073709001000BD AS DECIMAL(38, 22)));
SELECT HASH(CAST(-18446744073709001000BD AS DECIMAL(38, 3)));
SELECT HASH(CAST(18446744073709001000BD AS DECIMAL(38, 4)));
SELECT HASH(CAST(9223372036854775807BD AS DECIMAL(38, 4)));
SELECT HASH(CAST(-9223372036854775807BD AS DECIMAL(38, 5)));
SELECT HASH(CAST(00000.00000000000BD AS DECIMAL(38, 34)));
SELECT HASH(CAST(-00000.00000000000BD AS DECIMAL(38, 11)));
SELECT HASH(CAST(123456.1234567890BD AS DECIMAL(38, 2)));
SELECT HASH(CAST(123456.1234567890BD AS DECIMAL(38, 20)));
SELECT HASH(CAST(123456.1234567890BD AS DECIMAL(38, 10)));
SELECT HASH(CAST(-123456.1234567890BD AS DECIMAL(38, 10)));
SELECT HASH(CAST(123456.1234567890BD AS DECIMAL(38, 0)));
SELECT HASH(CAST(-123456.1234567890BD AS DECIMAL(38, 0)));
SELECT HASH(CAST(123456.1234567890BD AS DECIMAL(38, 20)));
SELECT HASH(CAST(-123456.1234567890BD AS DECIMAL(38, 20)));
SELECT HASH(CAST(123456.123456789012345678901234567890BD AS DECIMAL(38, 0)));
SELECT HASH(CAST(-123456.123456789012345678901234567890BD AS DECIMAL(38, 0)));
SELECT HASH(CAST(123456.123456789012345678901234567890BD AS DECIMAL(38, 10)));
SELECT HASH(CAST(-123456.123456789012345678901234567890BD AS DECIMAL(38, 10)));
SELECT HASH(CAST(123456.123456789012345678901234567890BD AS DECIMAL(38, 20)));
SELECT HASH(CAST(-123456.123456789012345678901234567890BD AS DECIMAL(38, 20)));
SELECT HASH(CAST(123456.123456789012345678901234567890BD AS DECIMAL(38, 30)));
SELECT HASH(CAST(-123456.123456789012345678901234567890BD AS DECIMAL(38, 30)));
SELECT HASH(CAST(123456.123456789012345678901234567890BD AS DECIMAL(38, 31)));

@gatorsmile
Member

Thank you! I checked Hive 2.1. It has exactly the same hash values.

@gatorsmile
Member

LGTM. cc @cloud-fan for final signing-off

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73670 has finished for PR 17056 at commit c0c8390.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tejasapatil
Contributor Author

@cloud-fan ping !!

private val HiveDecimalMaxScale = 38

// Mimics normalization done for decimals in Hive at HiveDecimalV1.normalize()
def normalizeDecimal(input: BigDecimal, allowRounding: Boolean): BigDecimal = {
Contributor

allowRounding will never be false?

Contributor Author

removed that param

HiveHasher.hashUnsafeBytes(base, offset, len)
}

private val HiveDecimalMaxPrecision = 38
Contributor

nit: HIVE_DECIMAL_MAX_PRECISION

Contributor Author

renamed

@SparkQA

SparkQA commented Mar 6, 2017

Test build #73945 has finished for PR 17056 at commit 65a09e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM(if tests pass)

@SparkQA

SparkQA commented Mar 6, 2017

Test build #74018 has finished for PR 17056 at commit 7c0b6c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 2a0bc86 Mar 6, 2017
@tejasapatil tejasapatil deleted the SPARK-17495_decimal branch March 6, 2017 19:54