-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Respect execution timezone in to_timestamp and related functions
#18025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implement timezone-aware handling for to_timestamp functions by adding the ConfiguredTimeZone utilities. Refactor shared helpers to ensure naïve strings are interpreted using the configured execution zone. Extend unit tests to cover naïve and formatted inputs respecting non-UTC execution timezones.
Deferred execution of timezone parsing ensures that the configured timezone string is only interpreted when dealing with UTF-8 inputs. This change keeps numeric arguments unaffected by any invalid session timezone values.
|
I ran a simple test, but it appears this PR doesn't resolve the issue: > SET TIMEZONE="+05";
0 row(s) fetched.
Elapsed 0.001 seconds.
> SELECT arrow_typeof(to_timestamp('2023-01-31T09:26:56.123456789'));
+-------------------------------------------------------------------+
| arrow_typeof(to_timestamp(Utf8("2023-01-31T09:26:56.123456789"))) |
+-------------------------------------------------------------------+
| Timestamp(Nanosecond, None) |
+-------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.006 seconds.The timezone is still |
|
Pausing this to continue after #18017 to see how we approach modifying return_field_from_args Modifying |
|
Thank you. I'm leaning towards handling the |
|
|
||
| fn parse_fixed_offset(tz: &str) -> Option<FixedOffset> { | ||
| let tz = tz.trim(); | ||
| if tz.eq_ignore_ascii_case("utc") || tz.eq_ignore_ascii_case("z") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tz::from_str(tz) doesn't account for these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I double checked to confirm that it does not handle lower case:
fn parse_fixed_offset_accepts_lowercase_and_z() -> Result<()> {
use std::str::FromStr;
assert!(!Tz::from_str("utc").is_err());
ConfiguredTimeZone::parse("utc")?; // succeeds via parse_fixed_offset fallback
ConfiguredTimeZone::parse("Z")?; // succeeds via parse_fixed_offset fallback
Ok(())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. While I expect it's to spec I can't see a reason why that impl shouldn't be case insensitive. I'll file a ticket for that as well.
| } | ||
| } | ||
|
|
||
| fn parse_fixed_offset(tz: &str) -> Option<FixedOffset> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure as for the need for this as I think it already exists in https://github.com/apache/arrow-rs/blob/751b0822a7f0b2647c1c662131131b35f268bfef/arrow-array/src/timezone.rs#L25. As well, the Tz::from_str I think may already handle named and offsets https://github.com/apache/arrow-rs/blob/751b0822a7f0b2647c1c662131131b35f268bfef/arrow-array/src/timezone.rs#L91
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it is a private fn
error[E0603]: function `parse_fixed_offset` is private
--> datafusion/functions/src/datetime/common.rs:18:29
|
18 | use arrow::array::timezone::parse_fixed_offset;
| ^^^^^^^^^^^^^^^^^^ private function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll file a ticket to make that pub.
|
|
||
| pub(crate) fn parse(tz: &str) -> Result<Self> { | ||
| if tz.trim().is_empty() { | ||
| return Ok(Self::utc()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we may want to change this to allow for None - see the start of discussion @ #17993 (comment) and #18017 (comment)
|
I hope to have another go at reviewing this now that #18017 has landed |
Introduce a TimezoneResolver wrapper to defer parsing of configured execution time zones until a naïve timestamp is encountered. This ensures that explicit-offset strings bypass any misconfigured settings. Update all to_timestamp UDF variants to utilize the resolver, and add regression tests to confirm that invalid execution time zones only affect naïve inputs while explicit offsets remain successful.
Ensure that explicit formatted inputs with offsets do not trigger configuration errors. Added a regression test to confirm that %+ and %z formats succeed despite invalid execution timezone settings, while naive formats still report the expected configuration error.
Enhance has_explicit_timezone to recognize offsets directly appended to values, including colon-separated and compact %z patterns while ignoring scientific exponents. Update the invalid execution timezone regression test matrix to include compact %z formats and derive expectations to validate successful handling of these inputs without a valid session timezone.
Enhance has_explicit_timezone to recognize trailing named timezone tokens like UTC, GMT, and America/New_York. Update to_timestamp_formats_invalid_execution_timezone_behavior to ensure %Z inputs are handled properly while maintaining bypass for numeric-offset formats with invalid timezones.
Update has_explicit_timezone to search entire input for timezone names or offsets, including embedded tokens. Add unit tests to cover new behavior. Also extend to_timestamp_formats_invalid_execution_timezone_behavior to check for timezone preceding the timestamp, ensuring explicit zones bypass session resolution.
Improve has_explicit_timezone to fully validate timezone tokens and offsets without short-circuiting on bare plus/minus characters. Add regression tests to confirm that formatted timestamps respect non-UTC execution timezones and that invalid session timezone strings raise appropriate errors.
Add new_with_config constructors and update with_updated_config to rebuild functions on session setting changes. Replace the old TimezoneResolver with ConfiguredTimeZone::from_config, tighten offset parsing, and implement explicit debug/equality traits. Switch datetime registration to use make_udf_function_with_config! and reuse the new seconds-based constructor in to_unixtime. Update tests for timezone handling to verify constructor paths and fallback-to-UTC behavior.
…ns, add comprehensive tests
…on in timestamp strings
…and reused the configured Arc across every benchmark case to ensure the session-aware implementation is exercised consistently.
…timestamp functions
Modify ConfiguredTimeZone::parse to return an optional value. Update from_config to default to UTC on parsing failure or when the configuration is unset. Add coverage for empty time-zone strings and ensure existing timezone tests unwrap the parsed option safely. Adjust to_timestamp tests to handle the optional result from the timezone parser.
| } | ||
| } | ||
|
|
||
| impl Eq for ConfiguredTimeZone {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just Derive it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nm, I see the Tz doesn't impl Eq, etc. That's rather annoying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self - check upstream to see difficulty to impl for Tz
| // Check for timezone offset (+/-HHMM or +/-HH:MM) | ||
| '+' | '-' => { | ||
| // Skip scientific notation (e.g., 1.5e+10) | ||
| if i > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how or why this could ever happen. It's not something I've ever seen when processing billions of timestamps with and without tz's.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a defensive exponent check guard but I am fine with removing it.
| return true; | ||
| } | ||
|
|
||
| // Handle timezone names with trailing offset (e.g., "PST+8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see a test below covering this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added detects_named_timezones_with_trailing_offsets
| /// [`chrono::format::strftime`]: https://docs.rs/chrono/latest/chrono/format/strftime/index.html | ||
| /// | ||
| #[inline] | ||
| #[allow(dead_code)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see 5 usages of this function .. not sure why dead_code would be required here. The /dev/rust_lint.sh script doesn't complain when I remove it either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
|
Overall this PR is quite thorough. Beyond a small test addition and perhaps the removal of a #[allow(dead_code] attribute I think this is good to merge. I still have two todo's to file tickets in arrow-rs for making parse_fixed_offset pub and Tz::from_str to be case insensitive which I hope to get to this week. |
|
🤖 |
|
🤖: Benchmark completed Details
|
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @kosiew
I didn't go through all the logic carefully but my high level concerns are:
- It seems to be parsing timestamp strings, rather than delegating to Chrono
- There are no "high level" tests (aka .slt tests) to show that this feature's "plumbing" hooked up correctly
I think at least we should add some slt tests showing:
- calling to_timestamp with default timezone
- calling to_timestamp with a configured session timezone
Showing the above with timestamps:
- That are explicitly UTC (
Z) - That have another explicit timezone
- That have no explicit timezone
| #[doc = $DOC] | ||
| pub fn $FUNC($arg: Vec<datafusion_expr::Expr>) -> datafusion_expr::Expr { | ||
| use datafusion_common::config::ConfigOptions; | ||
| super::$FUNC(&ConfigOptions::default()).call($arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can someone remind me why we are going through all the trouble of threading the ConfigOptions down into the functions, when the code to create the functions just passes in the default ConfigOptions?
Shouldn't the current ConfigOptions be passed in? If so perhaps we can do this as a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can someone remind me why we are going through all the trouble of threading the ConfigOptions down into the functions, when the code to create the functions just passes in the default ConfigOptions?
Shouldn't the current ConfigOptions be passed in? If so perhaps we can do this as a follow on PR
Config options as part of the constructor is only really required for functions that need config_options for simplify. Anything that needs it during invoke_with_args should just get it via the ScalarFunctionArgs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So then it seems inconsistent in this case -- if we pass the ConfigOptions to the constructor that means the function should be using it (in return type or simplify). If the options are only needed in invoke_with_args why are we passing it to the constructor in the first place 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here’s the reasoning behind the current setup:
-
Why the constructor takes
ConfigOptions: It’s primarily for performance and identity. Parsing timezones (or other config-dependent data) can be expensive, and since the timezone impacts the UDF’s behavior, it’s treated as part of its identity (Hash,Eq). This way, UDFs with different configs (e.g. timezones) are considered distinct instances. -
Why the macro uses defaults: The macro path is meant as a convenience layer for building expressions programmatically, not for actual runtime execution. It doesn’t have access to the session context, so it defaults to
ConfigOptions::default()just to simplify expression building. -
How runtime config is actually applied: When queries are executed, DataFusion calls
with_updated_config()to rebuild the UDF with the real session config. Example fromto_timestamp.rs:fn with_updated_config(&self, config: &ConfigOptions) -> Option<ScalarUDF> { Some(Self::new_with_config(config).into()) }
This ensures that during execution, the correct configuration (e.g. timezone) is used.
So while it looks inconsistent, it’s actually by design — the macro-created version isn’t the same one used at runtime. Still, I agree it’s a bit non-obvious, so adding a comment or doc note clarifying this distinction between the macro (Expr-building) and runtime (session-config) paths would definitely help avoid confusion.
| return FixedOffset::east_opt(0); | ||
| } | ||
|
|
||
| let (sign, rest) = if let Some(rest) = tz.strip_prefix('+') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parsing timestamps is a delicate and subtle task and has many strange corner cases. I don't understand why this code is parsing timestamps rather than delegating to chrono
For example, perhaps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that chrono is a better solution for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add slt tests.
Replace manual-only parsing with digit normalization for offset formats (+HH, +HHMM, +HHMMSS -> +HH:MM[:SS]). Attempt chrono parse with formats %::z, %:z, %z using Parsed's to_fixed_offset(), and fall back to original logic if chrono fails. Fix compile-time type mismatch in Parsed::to_fixed_offset() returning Result.
Trim input and enable case-insensitive parsing for utc and z as zero offsets. Normalize digit-only offsets and attempt to parse using chrono formats. Return None on parsing failure.
| if has_rfc3339_timezone(value) { | ||
| return true; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m afraid performance degrades when the fast path fails — the benchmark shows a significant slowdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe for Non-RFC3339 format, skip full parsing.
Add detailed comments to explain the two construction paths of the export_functions! macro. Describe how the macro-generated wrappers use ConfigOptions::default() for programmatic expression building and clarify that the session's actual ConfigOptions is applied later when DataFusion updates the UDF via with_updated_config().
a6d37bf to
d9b3d4b
Compare
f64ab68 to
b4bb002
Compare
Refactor has_explicit_timezone to perform lightweight string checks, such as checking for trailing 'Z', named timezones, common abbreviations, and numeric offset heuristics. This reduces the need for the heavier DateTime::parse_from_rfc3339, maintaining existing behavior by keeping RFC3339 parsing as a last-resort fallback.
Which issue does this PR close?
Closes #17998.
Rationale for this change
Previously, the
to_timestamp()family of functions (to_timestamp,to_timestamp_seconds,to_timestamp_millis,to_timestamp_micros,to_timestamp_nanos) always interpreted timezone-free (naïve) timestamps as UTC, ignoring thedatafusion.execution.time_zoneconfiguration option.This behavior caused inconsistencies when users configured a specific execution timezone and expected timestamp conversions to respect it.
This PR introduces full timezone awareness to these functions so that:
Z,+05:00, named zones) retain their own offsets and are not reinterpreted.What changes are included in this PR?
Added a new
ConfiguredTimeZonestruct indatetime/common.rsto encapsulate timezone configuration and parsing logic.Extended all
to_timestampfunction structs to include atimezonefield initialized fromConfigOptions.Introduced
make_udf_function_with_config!macro support for timezone-aware UDFs.Updated
to_unixtimeto propagate configuration into its nested timestamp conversion.Added robust timezone parsing (IANA names, offset strings, etc.) with comprehensive error handling.
Implemented explicit vs. implicit timezone detection via a fast heuristic (
has_explicit_timezone).Added unit tests and
sqllogictestcases validating timezone correctness across:Are these changes tested?
✅ Yes. This PR adds extensive unit and integration tests:
datetime/common.rsvalidate timezone parsing, explicit-offset detection, and behavior under invalid/ambiguous zones.datetime/to_timestamp.rsconfirm correct application of execution timezone across all precision variants.sqllogictestfileto_timestamp_timezone.sltcovers user-visible behavior for various timezone and configuration scenarios.Are there any user-facing changes?
Yes:
to_timestamp()and its precision variants now respectdatafusion.execution.time_zonewhen parsing timezone-free timestamps.SET datafusion.execution.time_zone = '<zone>'(e.g.'America/New_York','+05:30','UTC').These changes make timestamp functions consistent with session timezone semantics and improve correctness for global workloads.