Fix textfile ambiguous timestamps and different storage timezones #23593
Merged
rschlussel merged 1 commit into prestodb:master on Sep 9, 2024
spershin
approved these changes
Sep 6, 2024
Contributor
Thank you for looking into this and fixing it @rschlussel !
Looks good, need a committer's approval too.
rschlussel
commented
Sep 6, 2024
      {
          Timestamp timestamp = getTimestamp(object, inspector);
    -     return timestamp.getTime();
    +     long parsedJvmMillis = timestamp.getTime();
Contributor
Author
This is the part with the change. Logic is copied from https://github.com/prestodb/presto/blob/master/presto-hive/src/main/java/com/facebook/presto/hive/GenericHiveRecordCursor.java#L320-L333
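As a rough illustration of the kind of adjustment referenced above, the sketch below takes a value that was parsed in the JVM time zone and re-expresses the same wall-clock time in a separate storage time zone. It uses only `java.time` from the JDK and is not the code from the linked file; `StorageZoneSketch` and `toStorageZoneMillis` are hypothetical names chosen for this example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical illustration only; not part of Presto or the Hive library.
final class StorageZoneSketch
{
    private StorageZoneSketch() {}

    // parsedJvmMillis: epoch millis produced by a parser that assumed jvmZone.
    // Returns epoch millis for the same wall-clock time in storageZone.
    static long toStorageZoneMillis(long parsedJvmMillis, ZoneId jvmZone, ZoneId storageZone)
    {
        // Recover the wall-clock date and time that was read from the file.
        LocalDateTime local = LocalDateTime.ofInstant(Instant.ofEpochMilli(parsedJvmMillis), jvmZone);
        // Resolve it in the storage zone. For a local time that falls in a DST
        // overlap, LocalDateTime.atZone uses the earlier of the two offsets.
        return local.atZone(storageZone).toInstant().toEpochMilli();
    }

    public static void main(String[] args)
    {
        ZoneId jvmZone = ZoneId.of("UTC");
        ZoneId storageZone = ZoneId.of("America/Los_Angeles");
        // The text "2024-01-15 12:00:00" as a parser running in jvmZone would read it.
        long parsed = LocalDateTime.of(2024, 1, 15, 12, 0).atZone(jvmZone).toInstant().toEpochMilli();
        long adjusted = toStorageZoneMillis(parsed, jvmZone, storageZone);
        System.out.println(Instant.ofEpochMilli(parsed) + " -> " + Instant.ofEpochMilli(adjusted));
    }
}
```

Because the wall-clock value is resolved again in the second step, the same round trip also moves an ambiguous timestamp to its earlier instant when both zones are identical.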
elharo
approved these changes
Sep 6, 2024
NikhilCollooru
approved these changes
Sep 6, 2024
Description
For TEXTFILE tables, when structural types (array, map, row) contain timestamp columns, we weren't converting to the hive storage time zone as we do for primitive types. Instead, the timestamps were interpreted in the JVM time zone.
The conversion to the hive storage time zone also has a secondary benefit, even when the two time zones are the same: it fixes the handling of ambiguous timestamps. Ambiguous timestamps are local times that have more than one unixtime representation. This commonly happens at the fall DST transition, where the hour from 1am to 2am repeats. Generally in Presto we use the earlier of the two possible times when the unixtime is ambiguous. However, the hive library we use for parsing textfiles uses the later time. The code that adjusts the time for the hive storage time zone corrects ambiguous timestamps to the earlier unixtime representation.
This change fixes those two issues for the structural type code path.
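To make the ambiguity concrete, here is a minimal sketch that computes both unixtime candidates for a local time inside a fall DST overlap. It uses `java.time` from the JDK, not the Hive or Presto parsing code, and the class and method names (`AmbiguousTimestampSketch`, `earlierEpochSecond`, `laterEpochSecond`) are hypothetical. The zone and date are assumptions picked for the example: America/New_York, where clocks went back on Nov 3, 2024.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical illustration only; not part of Presto or the Hive library.
final class AmbiguousTimestampSketch
{
    private AmbiguousTimestampSketch() {}

    // Earlier of the two possible instants (the choice described for Presto).
    static long earlierEpochSecond(LocalDateTime local, ZoneId zone)
    {
        return ZonedDateTime.of(local, zone).withEarlierOffsetAtOverlap().toEpochSecond();
    }

    // Later of the two possible instants (the choice described for the hive parser).
    static long laterEpochSecond(LocalDateTime local, ZoneId zone)
    {
        return ZonedDateTime.of(local, zone).withLaterOffsetAtOverlap().toEpochSecond();
    }

    public static void main(String[] args)
    {
        ZoneId zone = ZoneId.of("America/New_York");
        // 01:30 occurs twice on this date in this zone.
        LocalDateTime ambiguous = LocalDateTime.of(2024, 11, 3, 1, 30);
        long earlier = earlierEpochSecond(ambiguous, zone);
        long later = laterEpochSecond(ambiguous, zone);
        System.out.println("earlier=" + earlier + " later=" + later
                + " difference=" + (later - earlier) + "s");
    }
}
```

For a local time outside the overlap both methods return the same value, so only timestamps in the repeated hour are affected.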
Motivation and Context
The motivation for this change is to provide consistent results for ambiguous timestamps regardless of where they appear. We also want to fix the bug where the storage time zone was not used for timestamps inside structural types.
Impact
Ambiguous timestamps inside structural types in textfiles will now be resolved to the earliest possible unixtime.
The hive.time-zone property will now be respected for timestamps inside structural types in textfiles.
Test Plan
new tests