GH-34590: [C++][ORC] Fix timestamp type mapping between orc and arrow #34591

Merged
wjones127 merged 2 commits into apache:main
Mar 21, 2023

Conversation

wgtmac
Member

@wgtmac wgtmac commented Mar 16, 2023

Rationale for this change

Background: There was an effort to fix inconsistent timestamp types across different SQL-on-Hadoop engines: https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q

Apache ORC provides two timestamp types:

  • TIMESTAMP: a timestamp type without a timezone; the value is stored in the writer's timezone.
  • TIMESTAMP_INSTANT: a timestamp type with the local timezone; the value is stored in UTC.

arrow::TimestampType has an optional timezone field:

  • If a timezone is provided, values are normalized to UTC.
  • If the timezone is missing, values can be in any timezone.
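
To make the distinction concrete, here is a minimal standalone sketch (not Arrow code). It assumes values are counted in seconds and that the reader applies a fixed UTC offset; the helper name NaiveToInstantSeconds is hypothetical. A timezone-aware value already is the UTC instant, while a timezone-naive value only becomes an instant once some timezone is assumed.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only: returns the UTC instant (in
// seconds since the epoch) denoted by a timezone-naive wall-clock value when
// the reader assumes a fixed UTC offset. Real timezones with daylight saving
// rules do not have a single fixed offset.
constexpr int64_t NaiveToInstantSeconds(int64_t wall_clock_seconds,
                                        int64_t utc_offset_seconds) {
  // local = utc + offset, hence utc = local - offset
  return wall_clock_seconds - utc_offset_seconds;
}
```

For example, the naive value 0 (wall clock 1970-01-01 00:00:00) denotes a different instant for a reader at UTC+8 than for a reader at UTC-5, which is why a naive timestamp cannot be compared across timezones without extra information.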

What changes are included in this PR?

The type mapping is corrected as follows:

  • orc::TIMESTAMP <=> arrow::TimestampType w/o timezone
  • orc::TIMESTAMP_INSTANT <=> arrow::TimestampType w/ timezone
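
The mapping above can be sketched as a pair of conversion functions. This is a standalone model, not the code in this PR: OrcKind and ArrowTimestamp are hypothetical stand-ins for the ORC type kind and arrow::TimestampType, and using "UTC" as the timezone string for TIMESTAMP_INSTANT is an assumption.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for the two ORC timestamp kinds.
enum class OrcKind { Timestamp, TimestampInstant };

// Hypothetical stand-in for an Arrow timestamp type with an optional timezone.
struct ArrowTimestamp {
  std::optional<std::string> timezone;  // empty means timezone-naive
};

// ORC -> Arrow: only TIMESTAMP_INSTANT carries a timezone.
ArrowTimestamp ToArrow(OrcKind kind) {
  if (kind == OrcKind::TimestampInstant) {
    return ArrowTimestamp{std::string("UTC")};  // assumed timezone string
  }
  return ArrowTimestamp{std::nullopt};
}

// Arrow -> ORC: the presence of a timezone selects TIMESTAMP_INSTANT.
OrcKind ToOrc(const ArrowTimestamp& type) {
  return type.timezone.has_value() ? OrcKind::TimestampInstant
                                   : OrcKind::Timestamp;
}
```

In this model the ORC kind survives a round trip in both directions, but an Arrow type with a non-UTC timezone string would come back as "UTC".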

Are these changes tested?

Make sure all tests pass.

Are there any user-facing changes?

No.

@github-actions

⚠️ GitHub issue #34590 has been automatically assigned in GitHub to PR creator.

@wgtmac
Member Author

wgtmac commented Mar 17, 2023

@westonpace @wjones127 @lidavidm Could you please take a look? Thanks!

@wjones127 wjones127 self-requested a review March 17, 2023 18:10
Comment on lines 1123 to 1124
// The timestamp values stored in ORC are in the writer timezone.
return timestamp(TimeUnit::NANO);
Member

Is this true? I would have expected the values to be in the "UTC" timezone, with the expectation that readers interpret them in the local time zone.

Member Author

Yes, you're right.

For orc::TIMESTAMP type:

For orc::TIMESTAMP_INSTANT type:

We have seen many issues with the orc::TIMESTAMP type caused by writer-reader timezone conversion, especially when daylight saving rules differ. That is why the orc::TIMESTAMP_INSTANT type was added; it is always preferred over orc::TIMESTAMP when the user can take care of the timezone.

Member

Okay that makes sense. It does make me wonder if we should then be applying the local time zone to these types, instead of using naive timestamps:

Suggested change
- // The timestamp values stored in ORC are in the writer timezone.
- return timestamp(TimeUnit::NANO);
+ // The timestamp values stored in ORC are in the writer timezone.
+ return timestamp(TimeUnit::NANO, GetLocalTimeZone());

Member Author

This would fail the equality check on timestamp types in the roundtrip tests. Additionally, there does not seem to be a utility like GetLocalTimeZone().

IMHO, if the writer uses naive timestamp types, the reader should follow the same pattern.

@github-actions github-actions bot added the "awaiting changes" and "awaiting change review" labels and removed the "awaiting review" and "awaiting changes" labels Mar 17, 2023
@@ -1111,7 +1121,20 @@ Result<std::shared_ptr<DataType>> GetArrowType(const liborc::Type* type) {
case liborc::CHAR:
return fixed_size_binary(static_cast<int>(type->getMaximumLength()));
case liborc::TIMESTAMP:
// Values of TIMESTAMP type are stored in the writer timezone in the Orc file.
Member Author

I have added an extensive comment explaining the issue and the suggested usage. Please check again. Thanks @wjones127

@wgtmac wgtmac requested a review from wjones127 March 21, 2023 03:43
Member

@wjones127 wjones127 left a comment

Thank you for adding those comments!

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Mar 21, 2023
@wjones127 wjones127 merged commit 1f8a335 into apache:main Mar 21, 2023
@ursabot

ursabot commented Mar 21, 2023

Benchmark runs are scheduled for baseline = ba1f992 and contender = 1f8a335. 1f8a335 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.82% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.51% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 1f8a335d ec2-t3-xlarge-us-east-2
[Finished] 1f8a335d test-mac-arm
[Finished] 1f8a335d ursa-i9-9960x
[Finished] 1f8a335d ursa-thinkcentre-m75q
[Finished] ba1f9924 ec2-t3-xlarge-us-east-2
[Failed] ba1f9924 test-mac-arm
[Finished] ba1f9924 ursa-i9-9960x
[Finished] ba1f9924 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

rtpsw pushed a commit to rtpsw/arrow that referenced this pull request Mar 27, 2023
… arrow (apache#34591)

* Closes: apache#34590

Authored-by: Gang Wu <[email protected]>
Signed-off-by: Will Jones <[email protected]>
Development

Successfully merging this pull request may close these issues.

[C++][ORC] Fix timestamp type mapping between orc and arrow
3 participants