Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Oct 12, 2025

Which issue does this PR close?

Closes #17998.

Rationale for this change

Previously, the to_timestamp() family of functions (to_timestamp, to_timestamp_seconds, to_timestamp_millis, to_timestamp_micros, to_timestamp_nanos) always interpreted timezone-free (naïve) timestamps as UTC, ignoring the datafusion.execution.time_zone configuration option.

This behavior caused inconsistencies when users configured a specific execution timezone and expected timestamp conversions to respect it.

This PR introduces full timezone awareness to these functions so that:

  • Naïve timestamp strings are interpreted relative to the configured execution timezone.
  • Timestamps with explicit timezones (e.g. Z, +05:00, named zones) retain their own offsets and are not reinterpreted.
  • Invalid or ambiguous local times (e.g. during DST transitions) raise errors, improving transparency and correctness.

What changes are included in this PR?

  • Added a new ConfiguredTimeZone struct in datetime/common.rs to encapsulate timezone configuration and parsing logic.

  • Extended all to_timestamp function structs to include a timezone field initialized from ConfigOptions.

  • Introduced make_udf_function_with_config! macro support for timezone-aware UDFs.

  • Updated to_unixtime to propagate configuration into its nested timestamp conversion.

  • Added robust timezone parsing (IANA names, offset strings, etc.) with comprehensive error handling.

  • Implemented explicit vs. implicit timezone detection via a fast heuristic (has_explicit_timezone).

  • Added unit tests and sqllogictest cases validating timezone correctness across:

    • UTC, offset-based, and IANA-named zones.
    • Explicit vs. implicit timezone strings.
    • DST transition and ambiguity cases.

Are these changes tested?

✅ Yes. This PR adds extensive unit and integration tests:

  • Unit tests in datetime/common.rs validate timezone parsing, explicit-offset detection, and behavior under invalid/ambiguous zones.
  • Integration tests in datetime/to_timestamp.rs confirm correct application of execution timezone across all precision variants.
  • New sqllogictest file to_timestamp_timezone.slt covers user-visible behavior for various timezone and configuration scenarios.

Are there any user-facing changes?

Yes:

  • to_timestamp() and its precision variants now respect datafusion.execution.time_zone when parsing timezone-free timestamps.
  • Users can configure the execution timezone via SET datafusion.execution.time_zone = '<zone>' (e.g. 'America/New_York', '+05:30', 'UTC').
  • Errors are now raised for ambiguous or invalid local times around DST transitions.

These changes make timestamp functions consistent with session timezone semantics and improve correctness for global workloads.

Implement timezone-aware handling for to_timestamp functions by
adding the ConfiguredTimeZone utilities. Refactor shared helpers
to ensure naïve strings are interpreted using the configured
execution zone. Extend unit tests to cover naïve and formatted
inputs respecting non-UTC execution timezones.
Deferred execution of timezone parsing ensures that the
configured timezone string is only interpreted when dealing
with UTF-8 inputs. This change keeps numeric arguments
unaffected by any invalid session timezone values.
@github-actions github-actions bot added the functions Changes to functions implementation label Oct 12, 2025
@kosiew kosiew marked this pull request as ready for review October 12, 2025 12:04
@Weijun-H
Copy link
Member

I ran a simple test, but it appears this PR doesn't resolve the issue:

> SET TIMEZONE="+05";
0 row(s) fetched. 
Elapsed 0.001 seconds.

> SELECT arrow_typeof(to_timestamp('2023-01-31T09:26:56.123456789'));
+-------------------------------------------------------------------+
| arrow_typeof(to_timestamp(Utf8("2023-01-31T09:26:56.123456789"))) |
+-------------------------------------------------------------------+
| Timestamp(Nanosecond, None)                                       |
+-------------------------------------------------------------------+
1 row(s) fetched. 
Elapsed 0.006 seconds.

The timezone is still None despite setting TIMEZONE='+05'.

@kosiew
Copy link
Contributor Author

kosiew commented Oct 14, 2025

Pausing this to continue after #18017 to see how we approach modifying return_field_from_args

Modifying return_field_from_args() to incorporate ConfigOptions would require extensive changes across multiple components with questionable return on investment. Alternatives such as special casing or schema deferral could be considered, but their effectiveness is uncertain.

@Omega359
Copy link
Contributor

Thank you. I'm leaning towards handling the return_field_from_args issue via a new constructor or some other mechanism rather than updating the args to that function but we'll have to fix that first I think before PR's like this can move forward.


fn parse_fixed_offset(tz: &str) -> Option<FixedOffset> {
let tz = tz.trim();
if tz.eq_ignore_ascii_case("utc") || tz.eq_ignore_ascii_case("z") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tz::from_str(tz) doesn't account for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked to confirm that it does not handle lower case:

    fn parse_fixed_offset_accepts_lowercase_and_z() -> Result<()> {
        use std::str::FromStr;

        assert!(!Tz::from_str("utc").is_err());
        ConfiguredTimeZone::parse("utc")?; // succeeds via parse_fixed_offset fallback
        ConfiguredTimeZone::parse("Z")?; // succeeds via parse_fixed_offset fallback
        Ok(())
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. While I expect it's to spec I can't see a reason why that impl shouldn't be case insensitive. I'll file a ticket for that as well.

}
}

fn parse_fixed_offset(tz: &str) -> Option<FixedOffset> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it is a private fn

error[E0603]: function `parse_fixed_offset` is private
  --> datafusion/functions/src/datetime/common.rs:18:29
   |
18 | use arrow::array::timezone::parse_fixed_offset;
   |                             ^^^^^^^^^^^^^^^^^^ private function

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll file a ticket to make that pub.


pub(crate) fn parse(tz: &str) -> Result<Self> {
if tz.trim().is_empty() {
return Ok(Self::utc());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we may want to change this to allow for None - see the start of discussion @ #17993 (comment) and #18017 (comment)

@Omega359
Copy link
Contributor

I hope to have another go at reviewing this now that #18017 has landed

kosiew added 16 commits October 30, 2025 20:31
Introduce a TimezoneResolver wrapper to defer parsing of configured
execution time zones until a naïve timestamp is encountered.
This ensures that explicit-offset strings bypass any
misconfigured settings. Update all to_timestamp UDF variants to
utilize the resolver, and add regression tests to confirm that
invalid execution time zones only affect naïve inputs while
explicit offsets remain successful.
Ensure that explicit formatted inputs with offsets do not
trigger configuration errors. Added a regression test to
confirm that %+ and %z formats succeed despite invalid
execution timezone settings, while naive formats still
report the expected configuration error.
Enhance has_explicit_timezone to recognize offsets directly
appended to values, including colon-separated and compact %z
patterns while ignoring scientific exponents. Update the
invalid execution timezone regression test matrix to include
compact %z formats and derive expectations to validate
successful handling of these inputs without a valid session
timezone.
Enhance has_explicit_timezone to recognize trailing named
timezone tokens like UTC, GMT, and America/New_York.
Update to_timestamp_formats_invalid_execution_timezone_behavior
to ensure %Z inputs are handled properly while maintaining
bypass for numeric-offset formats with invalid timezones.
Update has_explicit_timezone to search entire input for
timezone names or offsets, including embedded tokens. Add
unit tests to cover new behavior. Also extend
to_timestamp_formats_invalid_execution_timezone_behavior
to check for timezone preceding the timestamp, ensuring
explicit zones bypass session resolution.
Improve has_explicit_timezone to fully validate timezone
tokens and offsets without short-circuiting on bare plus/minus
characters.

Add regression tests to confirm that formatted timestamps
respect non-UTC execution timezones and that invalid session
timezone strings raise appropriate errors.
Add new_with_config constructors and update with_updated_config
to rebuild functions on session setting changes. Replace the old
TimezoneResolver with ConfiguredTimeZone::from_config, tighten
offset parsing, and implement explicit debug/equality traits.
Switch datetime registration to use make_udf_function_with_config!
and reuse the new seconds-based constructor in to_unixtime.
Update tests for timezone handling to verify constructor paths and
fallback-to-UTC behavior.
…and reused the configured Arc across every benchmark case to ensure the session-aware implementation is exercised consistently.
Modify ConfiguredTimeZone::parse to return an optional
value. Update from_config to default to UTC on parsing
failure or when the configuration is unset.

Add coverage for empty time-zone strings and ensure
existing timezone tests unwrap the parsed option safely.
Adjust to_timestamp tests to handle the optional result
from the timezone parser.
}
}

impl Eq for ConfiguredTimeZone {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just Derive it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nm, I see the Tz doesn't impl Eq, etc. That's rather annoying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self - check upstream to see difficulty to impl for Tz

// Check for timezone offset (+/-HHMM or +/-HH:MM)
'+' | '-' => {
// Skip scientific notation (e.g., 1.5e+10)
if i > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how or why this could ever happen. It's not something I've ever seen when processing billions of timestamps with and without tz's.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a defensive exponent check guard but I am fine with removing it.

return true;
}

// Handle timezone names with trailing offset (e.g., "PST+8")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see a test below covering this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added detects_named_timezones_with_trailing_offsets

/// [`chrono::format::strftime`]: https://docs.rs/chrono/latest/chrono/format/strftime/index.html
///
#[inline]
#[allow(dead_code)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see 5 usages of this function .. not sure why dead_code would be required here. The /dev/rust_lint.sh script doesn't complain when I remove it either

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@Omega359
Copy link
Contributor

Omega359 commented Nov 4, 2025

Overall this PR is quite thorough. Beyond a small test addition and perhaps the removal of a #[allow(dead_code] attribute I think this is good to merge.

I still have two todo's to file tickets in arrow-rs for making parse_fixed_offset pub and Tz::from_str to be case insensitive which I hope to get to this week.

@kosiew kosiew marked this pull request as draft November 6, 2025 06:31
@kosiew kosiew marked this pull request as ready for review November 6, 2025 07:08
@alamb
Copy link
Contributor

alamb commented Nov 6, 2025

🤖 ./gh_compare_branch_bench.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing timestamp-17998 (a1ba817) to f2437d1 diff
BENCH_NAME=to_timestamp
BENCH_COMMAND=cargo bench --bench to_timestamp
BENCH_FILTER=
BENCH_BRANCH_NAME=timestamp-17998
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Nov 6, 2025

🤖: Benchmark completed

Details

group                                  main                                   timestamp-17998
-----                                  ----                                   ---------------
to_timestamp_no_formats_largeutf8      1.00  1242.2±23.90ns        ? ?/sec    2.26      2.8±0.01µs        ? ?/sec
to_timestamp_no_formats_utf8           1.00  1160.3±10.56ns        ? ?/sec    2.32      2.7±0.01µs        ? ?/sec
to_timestamp_no_formats_utf8view       1.00  1215.7±28.10ns        ? ?/sec    2.31      2.8±0.02µs        ? ?/sec
to_timestamp_with_formats_largeutf8    1.00      7.6±0.04µs        ? ?/sec    1.74     13.3±0.08µs        ? ?/sec
to_timestamp_with_formats_utf8         1.00      7.3±0.02µs        ? ?/sec    1.81     13.2±0.05µs        ? ?/sec
to_timestamp_with_formats_utf8view     1.00      7.2±0.03µs        ? ?/sec    1.84     13.3±0.06µs        ? ?/sec

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @kosiew

I didn't go through all the logic carefully but my high level concerns are:

  1. It seems to be parsing timestamp strings, rather than delegating to Chrono
  2. There are no "high level" tests (aka .slt tests) to show that this feature's "plumbing" hooked up correctly

I think at least we should add some slt tests showing:

  1. calling to_timestamp with default timezone
  2. calling to_timestamp with a configured session timezone

Showing the above with timestamps:

  1. That are explicitly UTC (Z)
  2. That have another explicit timezone
  3. That have no explicit timezone

#[doc = $DOC]
pub fn $FUNC($arg: Vec<datafusion_expr::Expr>) -> datafusion_expr::Expr {
use datafusion_common::config::ConfigOptions;
super::$FUNC(&ConfigOptions::default()).call($arg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can someone remind me why we are going through all the trouble of threading the ConfigOptions down into the functions, when the code to create the functions just passes in the default ConfigOptions?

Shouldn't the current ConfigOptions be passed in? If so perhaps we can do this as a follow on PR

Copy link
Contributor

@Omega359 Omega359 Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can someone remind me why we are going through all the trouble of threading the ConfigOptions down into the functions, when the code to create the functions just passes in the default ConfigOptions?

Shouldn't the current ConfigOptions be passed in? If so perhaps we can do this as a follow on PR

Config options as part of the constructor is only really required for functions that need config_options for simplify. Anything that needs it during invoke_with_args should just get it via the ScalarFunctionArgs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So then it seems inconsistent in this case -- if we pass the ConfigOptions to the constructor that means the function should be using it (in return type or simplify). If the options are only needed in invoke_with_args why are we passing it to the constructor in the first place 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here’s the reasoning behind the current setup:

  • Why the constructor takes ConfigOptions: It’s primarily for performance and identity. Parsing timezones (or other config-dependent data) can be expensive, and since the timezone impacts the UDF’s behavior, it’s treated as part of its identity (Hash, Eq). This way, UDFs with different configs (e.g. timezones) are considered distinct instances.

  • Why the macro uses defaults: The macro path is meant as a convenience layer for building expressions programmatically, not for actual runtime execution. It doesn’t have access to the session context, so it defaults to ConfigOptions::default() just to simplify expression building.

  • How runtime config is actually applied: When queries are executed, DataFusion calls with_updated_config() to rebuild the UDF with the real session config. Example from to_timestamp.rs:

    fn with_updated_config(&self, config: &ConfigOptions) -> Option<ScalarUDF> {
        Some(Self::new_with_config(config).into())
    }

    This ensures that during execution, the correct configuration (e.g. timezone) is used.

So while it looks inconsistent, it’s actually by design — the macro-created version isn’t the same one used at runtime. Still, I agree it’s a bit non-obvious, so adding a comment or doc note clarifying this distinction between the macro (Expr-building) and runtime (session-config) paths would definitely help avoid confusion.

return FixedOffset::east_opt(0);
}

let (sign, rest) = if let Some(rest) = tz.strip_prefix('+') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parsing timestamps is a delicate and subtle task and has many strange corner cases. I don't understand why this code is parsing timestamps rather than delegating to chrono

For example, perhaps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that chrono is a better solution for this.

Copy link
Contributor Author

@kosiew kosiew Nov 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add slt tests.

Replace manual-only parsing with digit normalization for
offset formats (+HH, +HHMM, +HHMMSS -> +HH:MM[:SS]). Attempt
chrono parse with formats %::z, %:z, %z using Parsed's
to_fixed_offset(), and fall back to original logic if
chrono fails. Fix compile-time type mismatch in
Parsed::to_fixed_offset() returning Result.
Trim input and enable case-insensitive parsing for utc and z as
zero offsets. Normalize digit-only offsets and attempt to parse
using chrono formats. Return None on parsing failure.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 8, 2025
Comment on lines 260 to 262
if has_rfc3339_timezone(value) {
return true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m afraid performance degrades when the fast path fails — the benchmark shows a significant slowdown.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for Non-RFC3339 format, skip full parsing.

Add detailed comments to explain the two construction
paths of the export_functions! macro. Describe how
the macro-generated wrappers use ConfigOptions::default()
for programmatic expression building and clarify that the
session's actual ConfigOptions is applied later when
DataFusion updates the UDF via with_updated_config().
Refactor has_explicit_timezone to perform lightweight string checks,
such as checking for trailing 'Z', named timezones, common
abbreviations, and numeric offset heuristics. This reduces
the need for the heavier DateTime::parse_from_rfc3339,
maintaining existing behavior by keeping RFC3339 parsing as a
last-resort fallback.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make to_timestamp aware of execution timezone

4 participants