
Fix hash calculation for Timestamp in HiveBucketing to be Hive Compatible #22980

Merged
kewang1024 merged 1 commit into prestodb:master from kewang1024:fix-timestamp
Jun 14, 2024

Conversation

@kewang1024
Collaborator

@kewang1024 kewang1024 commented Jun 11, 2024

Presto Java calculates the hash for the Timestamp type using the formula:
hash(seconds << 30 + milliseconds)
Hive:
hash(seconds << 30 + nanoseconds)
Spark:
hash(seconds << 30 + nanoseconds)
Velox:
hash(seconds << 30 + nanoseconds)

To keep behavior consistent with the other engines, this fixes the hash calculation formula for Timestamp.
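The difference between the two formulas can be sketched as follows. The method names here are illustrative, not the actual Presto or Hive APIs, and Math.floorDiv/floorMod are used so pre-epoch timestamps split cleanly into seconds and a non-negative fraction:

```java
// Sketch of the two hash inputs described above. The names are illustrative;
// they are not the actual Presto or Hive method names.
public class TimestampHashInput
{
    // Legacy Presto Java behavior: combine seconds with the *millisecond* fraction.
    static long legacy(long epochMillis)
    {
        long seconds = Math.floorDiv(epochMillis, 1000);
        long millis = Math.floorMod(epochMillis, 1000);
        return (seconds << 30) + millis;
    }

    // Hive/Spark/Velox behavior: combine seconds with the *nanosecond* fraction.
    static long hiveCompatible(long epochMillis)
    {
        long seconds = Math.floorDiv(epochMillis, 1000);
        long nanos = Math.floorMod(epochMillis, 1000) * 1_000_000;
        return (seconds << 30) + nanos;
    }

    public static void main(String[] args)
    {
        long withFraction = 1_718_064_000_123L;  // 123 ms fraction: the formulas disagree
        long wholeSecond = 1_718_064_000_000L;   // no fraction: the formulas agree
        System.out.println(legacy(withFraction) == hiveCompatible(withFraction)); // false
        System.out.println(legacy(wholeSecond) == hiveCompatible(wholeSecond));   // true
    }
}
```

Note that the two inputs agree exactly when the millisecond fraction is zero, which is why tests using only whole-second timestamps never caught the divergence.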

== RELEASE NOTES ==

Hive Connector Changes
* Fix hash calculation for Timestamp column to be Hive compatible when writing to a table bucketed by Timestamp. :pr:`22980`
* Add config ``hive.legacy-timestamp-bucketing`` and session property ``hive.legacy_timestamp_bucketing`` to use the original hash function for Timestamp column, which is not Hive compatible. :pr:`22980`

Contributor

@rschlussel rschlussel left a comment


This is a great find! Can you add a release note for this bug fix?

This could also cause incompatibility problems between old partitions and new partitions that use different bucketing algorithms, and possibly wrong results when reading old partitions: filtering out the wrong buckets, joining the wrong buckets, etc. I wonder if we need to add a configuration property for Hive (e.g. something like legacy-timestamp-bucketing) to help with the migration here.
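To illustrate the migration concern, here is a hypothetical sketch showing the same timestamp landing in different buckets under the two formulas. The bucket function below is an assumed hash-mod-bucket-count scheme, not Presto's exact bucketing code:

```java
// Hypothetical illustration of the mixed-partition problem: the same timestamp
// maps to different buckets under the legacy and Hive-compatible formulas.
// The bucket function is an assumed hash-mod-count scheme, not Presto's exact code.
public class BucketMismatch
{
    static int bucket(long hashInput, int bucketCount)
    {
        // floorMod keeps the result non-negative even for negative hash inputs.
        return (int) Math.floorMod(hashInput, bucketCount);
    }

    public static void main(String[] args)
    {
        long seconds = 1_718_323_200L; // 2024-06-14 00:00:00 UTC
        long millisFraction = 123;
        int bucketCount = 32;

        int legacyBucket = bucket((seconds << 30) + millisFraction, bucketCount);
        int fixedBucket = bucket((seconds << 30) + millisFraction * 1_000_000, bucketCount);

        // A reader pruning buckets with one formula would miss rows written with the other.
        System.out.println(legacyBucket + " vs " + fixedBucket); // 27 vs 0
    }
}
```

Because (seconds << 30) is divisible by any power-of-two bucket count up to 2^30, only the fraction decides the bucket here, so a nonzero millisecond fraction is enough to scatter old and new rows into different buckets.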

assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1969, 12, 31, 23, 59, 59, 999_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1950, 11, 19, 12, 34, 56, 789_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(2015, 11, 19, 7, 6, 5, 432_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(10 + 1000 * LocalDateTime.of(1970, 1, 1, 0, 0, 0, 0).toEpochSecond(ZoneOffset.UTC)));
Contributor


Can you split this into a separate test (e.g. testTimestampBucketWithNanoseconds)

Collaborator Author


Since all the data types are tested within testHashingCompare, I would keep them in the same place. Testing milliseconds doesn't sound like an edge case; the original test just failed to include those values. I added a comment to make it more straightforward.

Let me know if you feel strongly against it.

Contributor


My motivation is basically that when you have a zillion test cases in a single test, it gets annoying to debug when it fails, and you lose some signal (it fails on case x, so the next case doesn't run; then you fix x, run the test again, and discover there was also an issue with y). Ideally the original test would be split up too, and since you're adding new tests, it's better at least not to exacerbate the problem.

@kewang1024 kewang1024 requested a review from sdruzkin as a code owner June 11, 2024 20:59
@kewang1024 kewang1024 changed the title from "Fix hash calculation for Timestamp in HiveBucketing" to "Fix hash calculation for Timestamp in HiveBucketing to be Hive Compatible" Jun 11, 2024
@kewang1024 kewang1024 force-pushed the fix-timestamp branch 2 times, most recently from 1e35a2d to 246f01d on June 11, 2024 21:43
sdruzkin
sdruzkin previously approved these changes Jun 11, 2024
Collaborator

@sdruzkin sdruzkin left a comment


ORC stuff LGTM

@kewang1024 kewang1024 force-pushed the fix-timestamp branch 3 times, most recently from 601a799 to f75da1e on June 11, 2024 23:20
jaystarshot
jaystarshot previously approved these changes Jun 12, 2024
@kewang1024 kewang1024 requested a review from feilong-liu as a code owner June 12, 2024 05:08
@kewang1024 kewang1024 requested a review from a team as a code owner June 12, 2024 18:49
@kewang1024 kewang1024 force-pushed the fix-timestamp branch 3 times, most recently from b52bb0a to 7b8ba60 on June 13, 2024 00:00
@kewang1024 kewang1024 requested a review from rschlussel June 13, 2024 05:39
@kewang1024
Collaborator Author

Other PRs are also failing on the SingleStore tests, so it's unrelated to this change, @rschlussel.

Contributor

@vivek-bharathan vivek-bharathan left a comment


This is a pretty insidious bug. Thanks for fixing!
nit: the commit message is too long.

Contributor

@rschlussel rschlussel left a comment


Looks good. Thanks for the fix!

rschlussel
rschlussel previously approved these changes Jun 14, 2024
@steveburnett
Contributor

Suggest minor changes to the release note entry:

== RELEASE NOTES ==

Hive Connector Changes
* Fix hash calculation for Timestamp column to be Hive compatible when writing to a table bucketed by Timestamp. :pr:`22980`
* Add config ``hive.legacy-timestamp-bucketing`` and session property ``hive.legacy_timestamp_bucketing`` to use the original hash function for Timestamp column, which is not Hive compatible. :pr:`22980`
