GH-34590: [C++][ORC] Fix timestamp type mapping between orc and arrow #34591
@westonpace @wjones127 @lidavidm Could you please take a look? Thanks!
cpp/src/arrow/adapters/orc/util.cc
Outdated
      // The timestamp values stored in ORC are in the writer timezone.
      return timestamp(TimeUnit::NANO);
Is this true? I would have expected the values to be in the "UTC" timezone, with the expectation that readers interpret them in the local time zone.
Yes, you're right.
For the orc::TIMESTAMP type:
- The ORC writer expects input data (i.e. in the orc::TimestampVectorBatch) to be in the "UTC" timezone and serializes it into the writer timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1717
- The ORC reader deserializes the data from the writer timezone and restores it into the reader timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L336
For the orc::TIMESTAMP_INSTANT type:
- The ORC writer expects input data to be in the "UTC" timezone and serializes it into the "UTC" timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1644
- The ORC reader deserializes the data from the "UTC" timezone, and no further conversion is needed because writerTimezone and readerTimezone are both "UTC": https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L282
We have seen many issues with the orc::TIMESTAMP type because of the writer-reader timezone conversion, especially when daylight saving rules differ. That is why the orc::TIMESTAMP_INSTANT type was added, and it is always preferred over the orc::TIMESTAMP type if the user can take care of the timezone.
Okay that makes sense. It does make me wonder if we should then be applying the local time zone to these types, instead of using naive timestamps:
-      // The timestamp values stored in ORC are in the writer timezone.
-      return timestamp(TimeUnit::NANO);
+      // The timestamp values stored in ORC are in the writer timezone.
+      return timestamp(TimeUnit::NANO, GetLocalTimeZone());
This would fail the equality check of timestamp types in the roundtrip tests. Additionally, there does not seem to be a utility like GetLocalTimeZone().
IMHO, if the writer uses naive timestamp types, the reader should follow the same pattern.
@@ -1111,7 +1121,20 @@ Result<std::shared_ptr<DataType>> GetArrowType(const liborc::Type* type) {
    case liborc::CHAR:
      return fixed_size_binary(static_cast<int>(type->getMaximumLength()));
    case liborc::TIMESTAMP:
      // Values of TIMESTAMP type are stored in the writer timezone in the Orc file.
I have added an extensive comment to explain the issue and the suggested usage. Please check again. Thanks @wjones127
Thank you for adding those comments!
Benchmark runs are scheduled for baseline = ba1f992 and contender = 1f8a335. 1f8a335 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
… arrow (apache#34591)

### Rationale for this change

Background: There was an effort to fix inconsistent timestamp types across different SQL-on-Hadoop engines: https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q

In Apache ORC, two timestamp types are provided:
- TIMESTAMP: timestamp type without timezone; the timestamp value is stored in the writer timezone.
- TIMESTAMP_INSTANT: timestamp type with local timezone; the timestamp value is stored in the UTC timezone.

arrow::TimestampType has an optional timezone field:
- If timezone is provided, values are normalized in UTC.
- If timezone is missing, values can be in any timezone.

### What changes are included in this PR?

The type mapping is fixed as below:
- orc::TIMESTAMP <=> arrow::TimestampType w/o timezone
- orc::TIMESTAMP_INSTANT <=> arrow::TimestampType w/ timezone

### Are these changes tested?

Make sure all tests pass.

### Are there any user-facing changes?

No.

* Closes: apache#34590

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Will Jones <[email protected]>
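The type mapping described in this PR (orc::TIMESTAMP to a timestamp without timezone, orc::TIMESTAMP_INSTANT to a timestamp with timezone) can be illustrated with a small standalone model. This is only a sketch: OrcKind, ArrowTimestampType, ToArrow, ToOrc and RoundTrips are hypothetical stand-ins, not the real liborc or Arrow APIs, and the use of "UTC" as the timezone string for the instant type is an assumption made for illustration.

```cpp
#include <string>

// Hypothetical stand-ins for the ORC and Arrow types; NOT the real APIs.
enum class OrcKind { TIMESTAMP, TIMESTAMP_INSTANT };

struct ArrowTimestampType {
  std::string timezone;  // empty string models a naive (timezone-less) timestamp

  bool operator==(const ArrowTimestampType& other) const {
    return timezone == other.timezone;
  }
};

// ORC -> Arrow: only TIMESTAMP_INSTANT carries a timezone.
ArrowTimestampType ToArrow(OrcKind kind) {
  if (kind == OrcKind::TIMESTAMP_INSTANT) {
    return ArrowTimestampType{"UTC"};  // assumption: instant type is tagged "UTC"
  }
  return ArrowTimestampType{""};  // naive timestamp, no timezone attached
}

// Arrow -> ORC: a timezone-aware type maps to TIMESTAMP_INSTANT,
// a naive type maps to TIMESTAMP.
OrcKind ToOrc(const ArrowTimestampType& type) {
  return type.timezone.empty() ? OrcKind::TIMESTAMP : OrcKind::TIMESTAMP_INSTANT;
}

// True when writing a type to ORC and reading it back yields an equal type.
bool RoundTrips(const ArrowTimestampType& type) {
  return ToArrow(ToOrc(type)) == type;
}
```

In this model a naive type survives the roundtrip, which is the property the roundtrip tests rely on. If the reader instead attached a local timezone to orc::TIMESTAMP, the type read back would no longer compare equal to the naive type that was written, which is the objection raised in the review. Note that in this sketch a timezone-aware type with a zone other than "UTC" would come back as "UTC"; that is a property of the model, not a claim about the adapter.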