Skip to content

Arrow Connector : Preserve TIME and TIMESTAMP as Wall-Clock (Align with JDBC Connector)#25901

Closed
lithinwxd wants to merge 6 commits intoprestodb:masterfrom
lithinwxd:feature_timestampFixForArrow
Closed

Arrow Connector : Preserve TIME and TIMESTAMP as Wall-Clock (Align with JDBC Connector)#25901
lithinwxd wants to merge 6 commits intoprestodb:masterfrom
lithinwxd:feature_timestampFixForArrow

Conversation

@lithinwxd
Copy link
Contributor

@lithinwxd lithinwxd commented Aug 27, 2025

Description

This PR fixes the handling of TIME and TIMESTAMP values in the Arrow connector.
Previously, Arrow interpreted raw millis as UTC instants, which caused unintended timezone shifts when queried from Presto. With this change, the Arrow connector preserves the values as wall-clock times, consistent with the behavior of the JDBC connector.

Example
A source value of 2025-08-01 10:00:00 (TIMESTAMP) was treated as 2025-08-01 10:00 UTC, which then shifted when displayed in Asia/Kolkata or America/Los_Angeles.

Similarly, TIME '10:00:00' was treated as 10:00 UTC instead of 10:00 wall-clock.

Motivation and Context

  • Align Arrow connector semantics with the JDBC connector for TIME and TIMESTAMP.
  • Ensure that values from data sources without timezone information are not shifted incorrectly.
  • Improves user experience by showing the same results across connectors.

Impact

  • User-facing change: TIME and TIMESTAMP values will now be preserved as stored (wall-clock).
  • This eliminates double timezone offset shifts.
  • No impact on other connectors or types.
  • No performance regression is expected, since the change only affects value interpretation.

Test Plan

  1. Updated unit tests in TestArrowBlockBuilder to validate correct preservation of TIME and TIMESTAMP.
  2. Verified results against JDBC connector outputs to confirm alignment.
  3. Ran queries in sessions with different time zones to confirm consistent wall-clock behavior.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Fix Arrow connector handling of TIME and TIMESTAMP types.
  These values are now preserved as wall-clock times (without implicit UTC interpretation),
  consistent with the JDBC connector.

@lithinwxd lithinwxd requested a review from a team as a code owner August 27, 2025 15:21
@lithinwxd lithinwxd marked this pull request as draft August 27, 2025 15:22
@lithinwxd lithinwxd changed the title Feature timestamp fix for arrow Arrow Connector : Preserve TIME and TIMESTAMP as Wall-Clock (Align with JDBC Connector) Aug 27, 2025
@lithinwxd lithinwxd marked this pull request as ready for review August 28, 2025 04:28
@tdcmeehan tdcmeehan self-assigned this Aug 28, 2025
@tdcmeehan
Copy link
Contributor

Do you see this problem when you toggle the legacy_timestamp property?

@lithinwxd
Copy link
Contributor Author

legacy_timestamp

@tdcmeehan I see the issue even if i set the property SET SESSION legacy_timestamp = true;

@tdcmeehan tdcmeehan requested a review from imjalpreet September 8, 2025 14:24
@imjalpreet
Copy link
Member

I see the issue even if i set the property SET SESSION legacy_timestamp = true;

@lithinwxd, true is the default value. Did you also try to set it to false?

@lithinwxd
Copy link
Contributor Author

SET SESSION legacy_timestamp = true;

@imjalpreet I tried with false as well , same problem exists without the fix . After the fix , I see consistent results with jdbc and arrow . I tried testing with client as (Moscow UTC +3) and the results got rendered for client according to their timezone.

@lithinwxd
Copy link
Contributor Author

lithinwxd commented Sep 24, 2025

@imjalpreet here is the analysis for a sample Query in oracle for timestamp type

query : SELECT customer_id , created_at FROM tm_lakehouse_gfs_user.customer_master cm;

value in datasource : 2023-01-01 10:15:30.000

⏱ Timestamp Behavior Comparison

Legacy Timestamp = false

Tool customer_id created_at
opensource presto 1 2023-01-01 04:45:30.000
Arrow Flight (before fix) 1 2023-01-01 04:45:30.000
Arrow Flight (after fix) 1 2023-01-01 04:45:30.000
DBeaver 1 2023-01-01 10:15:30.000
DBeaver( after setting custom timezone) 1 2023-01-01 10:15:30.000

Legacy Timestamp = true

Tool customer_id created_at
opensource presto 1 2023-01-01 10:15:30.000
Arrow Flight (before fix) 1 2023-01-01 15:45:30.000
Arrow Flight (after fix) 1 2023-01-01 10:15:30.000
DBeaver 1 2023-01-01 15:45:30.000
DBeaver( after setting custom timezone) 1 2023-01-01 10:15:30.000

note : All queries are executed wrt to IST timezone(UTC+5h30min)

Copy link
Contributor

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lithinwxd , but I don't think this is the right approach. An Arrow TimeMilliVector is not meant to be UTC and Arrow does not make any interpretation of values to be in UTC, see https://github.com/apache/arrow/blob/main/format/Schema.fbs#L307.

Presto has been interpreting these values as local time and we should continue to do this. If the data source is providing values in UTC, then it should be using a TimeStampMilliTZVector and set the time zone to UTC.

We should fix handling of explicit time zones and leave vectors with no time zone to be local time, not assume UTC. Then whatever data source is producing your values, they should either have a timezone or be in local time before writing to the Arrow vector.

@@ -432,11 +436,25 @@ public void assignBlockFromTimeStampMilliVector(TimeStampMilliVector vector, Typ
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about assignBlockFromTimeStampMicroVector ?

@@ -639,7 +659,9 @@ public void assignBlockFromTimeMilliTZVector(TimeStampMilliTZVector vector, Type
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TimeStampMilliTZVector has a specific TimeZone, you would need to use this instead of treating it as UTC. Actually, I think we should be converting to the presto data type TimestampWithTimeZoneType and use the TimeZone from the Arrow vector. Can you try this?

.atZone(ZoneOffset.UTC) // interpret Arrow millis as UTC
.toLocalDateTime();

value = localDateTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't reuse value, make a new variable expected

.atZone(ZoneOffset.UTC) // interpret Arrow millis as UTC
.toLocalDateTime();

expectedMillis = localDateTime
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: change the definition of long expectedMillis to just long value and use expectedMillis here

long millis = vector.get(i);
type.writeLong(builder, millis);
// Interpret Arrow millis as if they were "local wall-clock time" in datasource
long localMillis = getLocalMillisFromUTCMillis(millis);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TimeMilliVector has no associated timezone, but here you are forcing it to be a UTC timezone, we shouldn't do this. We have to assume that a timestamp without a timezone is local time.

try (ResultSet resultSet = stmt.executeQuery(query.toUpperCase())) {
JdbcToArrowConfig config = new JdbcToArrowConfigBuilder().setAllocator(allocator).setTargetBatchSize(2048)
.setCalendar(Calendar.getInstance(TimeZone.getDefault())).build();
.setCalendar(Calendar.getInstance(TimeZone.getTimeZone("UTC"))).build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, your assumption is correct. Timestamp vectors are represented as UTC instants in Arrow since arrow doesn't have timezone.

This is wrong, an Arrow TimeStampMilliVector does not assume UTC. It's interpretation is up to presto, and we have been using this as local time. You must use a TimeStampMilliTZVector and set the time zone to UTC, if you want a UTC values.

You shouldn't need to change this in the test, it is saying that our datasource is generating localtime data with a timezone, resulting in a TimeStampMilliTZVector with the system timezone.

@lithinwxd
Copy link
Contributor Author

lithinwxd commented Sep 30, 2025

@BryanCutler I did some testing comparing the jdbc values vs arrow values and here is the result

The value obtained from jdbc oracle connector is 1672548330000 ms

Screenshot 2025-09-30 at 5 00 55 PM

The value obtained from arrow based oracle connector without fix is 1672568130000 ms (TimeStampMilliVector)

Screenshot 2025-09-30 at 4 56 29 PM

I checked the unix timestamp calculator for both of these values and here is the info found

jdbc
Screenshot 2025-09-30 at 5 19 55 PM

Arrow
Screenshot 2025-09-30 at 5 20 09 PM

Note : this was tested for a timestamp column and not for Timestamp with timezone column

@BryanCutler
Copy link
Contributor

BryanCutler commented Sep 30, 2025

I did some testing comparing the jdbc values vs arrow values and here is the result

For an Arrow TimeStampMilliVector the value is just a long, there are no conversions. You need to make sure whatever values produced in your Arrow Flight server are correct, not adjust Presto to fit your server. Values without timezones should not assume to have a UTC timezone, they can only be interpreted as time since epoch relative to the local system time.

Please read the Arrow format spec, it is very clear https://github.com/apache/arrow/blob/main/format/Schema.fbs#L304

@lithinwxd
Copy link
Contributor Author

lithinwxd commented Oct 1, 2025

I did some testing comparing the jdbc values vs arrow values and here is the result

For an Arrow TimeStampMilliVector the value is just a long, there are no conversions. You need to make sure whatever values produced in your Arrow Flight server are correct, not adjust Presto to fit your server. Values without timezones should not assume to have a UTC timezone, they can only be interpreted as time since epoch relative to the local system time.

Please read the Arrow format spec, it is very clear https://github.com/apache/arrow/blob/main/format/Schema.fbs#L304

From arrow server side perspective , if we are not using a UTC instant , instead if we use local system time that would default to arrow server time which could be anywhere say here (UTC-7) its present in a vm, then there would be double shifts . Datasource can be in a cloud but vm may be in UTC-7 zone . If you must use an Arrow timestamp vector ( from server side) like TimeStampMilliVector, you have no choice but to convert the local datetime to epoch millis using some timezone (e.g., system default or UTC), and timezone-based shifts will happen.

@BryanCutler
Copy link
Contributor

local system time that would default to arrow server time which could be anywhere

That is the problem, you are changing the data in Presto irreversibly by converting it to the system timezone. You are free to change the data in your flight server, but Presto should not be manipulating the data this way.

It would be best if you can make a new unit test that will fail without your changes, but pass with your changes applied. Once you have this, it will be easier to discuss. Can you do that?

@lithinwxd
Copy link
Contributor Author

lithinwxd commented Oct 3, 2025

local system time that would default to arrow server time which could be anywhere

That is the problem, you are changing the data in Presto irreversibly by converting it to the system timezone. You are free to change the data in your flight server, but Presto should not be manipulating the data this way.

It would be best if you can make a new unit test that will fail without your changes, but pass with your changes applied. Once you have this, it will be easier to discuss. Can you do that?

@BryanCutler I tried creating a new unit test to replicate the issue experienced in Arrow server but it seems even the existing test cases in master fails when the Presto client and Arrow server are in different timezone. We are not trying to irreversibly change the data may be until legacy timestamp issue is fixed this fix might solve the issue, we need to make sure both jdbc and arrow replicate same values. One thing I observed is arrow is returning same millis but presto according to whether legacy timestamp is set or not is reintrepreting the result. Until a fix for legacy timestamp is in place this issue cannot be resolved.

Screenshot 2025-10-03 at 9 05 13 PM

@lithinwxd
Copy link
Contributor Author

#26449 - shall close once legacy timestamp issue is merged

@tdcmeehan
Copy link
Contributor

Closing as this was trying to match a bug in the JDBC connectors, which was fixed in #26449.

@tdcmeehan tdcmeehan closed this Nov 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants