Fix hash calculation for Timestamp in HiveBucketing to be Hive Compatible #22980
kewang1024 merged 1 commit into prestodb:master from
Conversation
This is a great find! Can you add a release note for this bug fix?
This could also cause incompatibility between old partitions and new partitions that were written with different bucketing algorithms, and possibly wrong results when reading old partitions: filtering out the wrong buckets, joining the wrong buckets, etc. I wonder if we need to add a configuration property for Hive (e.g. something like legacy-timestamp-bucketing) to help with the migration here.
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1969, 12, 31, 23, 59, 59, 999_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(1950, 11, 19, 12, 34, 56, 789_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(1000 * LocalDateTime.of(2015, 11, 19, 7, 6, 5, 432_000_000).toEpochSecond(ZoneOffset.UTC)));
assertBucketEquals("timestamp", new Timestamp(10 + 1000 * LocalDateTime.of(1970, 1, 1, 0, 0, 0, 0).toEpochSecond(ZoneOffset.UTC)));
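A side note on how these test literals behave: `LocalDateTime.toEpochSecond()` truncates to whole seconds, so the large nanosecond arguments above never reach the `Timestamp`; only the last case, which adds milliseconds directly, carries a sub-second component. A standalone sketch (the class name `TimestampTestValues` is hypothetical) illustrating this:

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimestampTestValues {
    public static void main(String[] args) {
        // toEpochSecond() drops the nano-of-second field, so 789_000_000 ns
        // in the LocalDateTime is lost before the Timestamp is constructed.
        Timestamp truncated = new Timestamp(
                1000 * LocalDateTime.of(1950, 11, 19, 12, 34, 56, 789_000_000).toEpochSecond(ZoneOffset.UTC));

        // Sub-second precision appears only when millis are added directly,
        // as in the "10 + 1000 * ..." case: 10 ms is stored as 10_000_000 ns.
        Timestamp withMillis = new Timestamp(
                10 + 1000 * LocalDateTime.of(1970, 1, 1, 0, 0, 0, 0).toEpochSecond(ZoneOffset.UTC));

        System.out.println(truncated.getNanos());  // 0
        System.out.println(withMillis.getNanos()); // 10000000
    }
}
```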
Can you split this into a separate test (e.g. testTimestampBucketWithNanoseconds)?
Since all the data types are tested within testHashingCompare, I would keep them in the same place. Testing milliseconds doesn't sound like an edge case; the original test just failed to include those values. I added a comment to make it more straightforward.
Let me know if you feel strongly against it.
My motivation is basically that when you have a zillion test cases in a single test, it gets annoying to debug when it fails, and you lose some signal: it fails on case x, so the next case doesn't run; then you fix x and run the test again to discover there was also an issue with y. Ideally the original test would be split up too, and since you're adding new tests, it's better at least not to exacerbate the problem.
Other PRs are also failing on singlestore tests, so it's unrelated to this change, @rschlussel
rschlussel left a comment:
Looks good, thanks for the fix!
Suggest minor changes to the release note entry.
Presto Java calculates the hash for the Timestamp type using the formula:
hash(seconds << 30 + milliseconds)
The other engines use nanoseconds instead:
Hive: hash(seconds << 30 + nanoseconds)
Spark: hash(seconds << 30 + nanoseconds)
Velox: hash(seconds << 30 + nanoseconds)
To keep behavior consistent with the other engines, fix the hash calculation formula for Timestamp.
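The difference can be sketched as follows. This is a minimal illustration, not the actual Presto or Hive source: the method names hiveHash and legacyHash are hypothetical, and the final 64-to-32-bit fold is an assumption about the general shape of such hash functions.

```java
import java.sql.Timestamp;

public class TimestampHashSketch {
    // Hive-style: combine epoch seconds with the full nanosecond component.
    static int hiveHash(Timestamp ts) {
        long seconds = ts.getTime() / 1000;
        int nanos = ts.getNanos(); // nanoseconds-of-second, e.g. 432 ms -> 432_000_000
        long bits = (seconds << 30) + nanos;
        return (int) (bits ^ (bits >>> 32)); // assumed fold of 64 bits into 32
    }

    // Pre-fix Presto behavior: uses only the millisecond remainder.
    static int legacyHash(Timestamp ts) {
        long seconds = ts.getTime() / 1000;
        long millis = ts.getTime() - seconds * 1000; // e.g. 432
        long bits = (seconds << 30) + millis;
        return (int) (bits ^ (bits >>> 32));
    }

    public static void main(String[] args) {
        Timestamp ts = Timestamp.valueOf("2015-11-19 07:06:05.432");
        // The two formulas agree only when the sub-second part is zero:
        // 432 ms enters one hash as 432 and the other as 432_000_000,
        // so they land in different buckets.
        System.out.println(hiveHash(ts) == legacyHash(ts)); // false
    }
}
```

This divergence is why the reviewer raised the migration concern above: a table bucketed with the old formula and read with the new one would probe the wrong buckets for any timestamp with a non-zero sub-second component.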