Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String to datetime conversion with custom format #17167

Open
2 tasks done
tritemio opened this issue Jun 24, 2024 · 5 comments
Open
2 tasks done

String to datetime conversion with custom format #17167

tritemio opened this issue Jun 24, 2024 · 5 comments
Labels
A-timeseries Area: date/time functionality bug Something isn't working P-low Priority: low python Related to Python Polars upstream issue

Comments

@tritemio
Copy link

tritemio commented Jun 24, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pandas as pd

df = pl.DataFrame({'dt': "2024-06-03 20:02:48.6800000"})
dt_format = "%Y-%m-%d %H:%M:%S%.6f0". # NOTE: this format has a trailing 0
df['dt'].str.to_datetime(dt_format, time_unit='ns')

Log output

Traceback (most recent call last):
  File "/Users/anto/src/poste/sda-poste-logistics/script/polars_bug_datetime.py", line 7, in <module>
    df["dt"].str.to_datetime(dt_format, time_unit="ns")
  File "/Users/anto/src/poste/sda-poste-logistics/venv/lib/python3.11/site-packages/polars/series/utils.py", line 107, in wrapper
    return s.to_frame().select_seq(f(*args, **kwargs)).to_series()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anto/src/poste/sda-poste-logistics/venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8524, in select_seq
    return self.lazy().select_seq(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/anto/src/poste/sda-poste-logistics/venv/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1909, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.InvalidOperationError: conversion from `str` to `datetime[ns]` failed in column 'dt' for 1 out of 1 values: ["2024-06-03 20:02:48.6800000"]

You might want to try:
- setting `strict=False` to set values that cannot be converted to `null`
- using `str.strptime`, `str.to_date`, or `str.to_datetime` and providing a format string

Issue description

Converting from string to datetime with a format string should allow to decode custom formats.

In this example the input has 7 digits for fractional seconds after the decimal dot. However the last digit is always zero and should be ignored because there is a trailing 0 in the format string.

Instead, polars gives the above error during conversion.

Stripping the extra zero from the string before attempting the conversion works correctly in polars:

dt_format1 = "%Y-%m-%d %H:%M:%S%.6f"  # NOTE: no trailing 0 in the format
df['dt'].str.strip_suffix('0').str.to_datetime(dt_format1, time_unit='ns')

Pandas accepts the original format and convert the string correctly, as does python's datetime

x = pd.to_datetime(
    df["dt"].to_pandas(use_pyarrow_extension_array=True),
    format="%Y-%m-%d %H:%M:%S%.6f0",
)
df = df.with_columns(pl.from_pandas(x))

Expected behavior

The column should be converted to datetime without error, as done in pandas and datetime from python standard lib.

Installed versions

--------Version info---------
Polars:               1.0.0-rc.2
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.11.3 (main, Sep  1 2023, 14:56:45) [Clang 14.0.3 (clang-1403.0.22.14.1)]

----Optional dependencies----
adbc_driver_manager:  1.0.0
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            <not installed>
fastexcel:            0.10.4
fsspec:               2023.12.2
gevent:               24.2.1
great_tables:         <not installed>
hvplot:               0.10.0
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.7.3
pyiceberg:            0.6.1
sqlalchemy:           2.0.30
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
@tritemio tritemio added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 24, 2024
@MarcoGorelli
Copy link
Collaborator

Thanks For the report, I'm surprised this didn't work, will check

@MarcoGorelli MarcoGorelli added A-timeseries Area: date/time functionality and removed needs triage Awaiting prioritization by a maintainer labels Jun 24, 2024
@Julian-J-S
Copy link
Contributor

Julian-J-S commented Jun 25, 2024

This might be a "bug" or design decision by chrono (the rust parser) BUT

  • "%Y-%m-%d %H:%M:%S%.6f0" 💥
    • this does NOT work in this case
    • this is usually supported.
  • "%Y-%m-%d %H:%M:%S.%6f0" 🍀
    • this WORKS
    • Note: moved the dot . left
    • alternative way to make a literal dot instead of telling chrono to handle it itself as part of the fractional seconds

@MarcoGorelli
Copy link
Collaborator

To me this looks like it might be a bug in Chrono:

use chrono::NaiveDateTime;
fn main() {

    let result = NaiveDateTime::parse_from_str(
        "2024-06-03 20:02:48.6800000",
        "%Y-%m-%d %H:%M:%S%.6f0",
    );
    println!("{:?}", result);
    let result = NaiveDateTime::parse_from_str(
        "2024-06-03 20:02:48.680000",
        "%Y-%m-%d %H:%M:%S%.6f",
    );
    println!("{:?}", result);
}

prints out

Err(ParseError(TooShort))
Ok(2024-06-03T20:02:48.680)

@tritemio tritemio changed the title String to dateime conversion with custom format String to datetime conversion with custom format Jun 25, 2024
@tritemio
Copy link
Author

For reference I have created a pytest file that tests several format strings on datetime.strptime, pandas.to_datetime and polars to_datetime. Results shows the valid format string is different across the 3, with datetime and pandas being similar, while polars requires a different format.

I found that polars converts the string even without the trailing 0 in the format when using either %.f or %.6f. This should be the right format for chrono (although different from the python conventions).

Full results:
Screenshot 2024-06-26 at 00 10 38

pytest file:

from datetime import datetime

import pandas as pd
import polars as pl
import pytest

dt_string = "2024-06-03 20:02:48.6800000"


dt_formats = [
    "%Y-%m-%d %H:%M:%S%.f",
    "%Y-%m-%d %H:%M:%S%.6f",
    "%Y-%m-%d %H:%M:%S.%f",
    "%Y-%m-%d %H:%M:%S.%6f",
]
dt_formats = dt_formats + [f + "0" for f in dt_formats]


@pytest.fixture
def df():
    return pl.DataFrame({"dt": dt_string})


@pytest.mark.parametrize("format", dt_formats)
def test_polars(df, format, capsys):
    with capsys.disabled():
        print(format)
        print(df)

    t_ref = pl.Series(name="dt", values=[datetime(2024, 6, 3, 20, 2, 48, 680000)])
    df2 = df.with_columns(dt=pl.col("dt").str.to_datetime(format, time_unit="ns"))

    with capsys.disabled():
        print(df2)
    assert (df2["dt"] == t_ref).all()


@pytest.mark.parametrize("format", dt_formats)
def test_pandas(df, format, capsys):
    with capsys.disabled():
        print(format)
        print(df)

    t_ref = pd.Series(name="dt", data=[datetime(2024, 6, 3, 20, 2, 48, 680000)])
    t = pd.to_datetime(
        df["dt"].to_pandas(use_pyarrow_extension_array=True), format=format
    )

    with capsys.disabled():
        print(t)

    assert (t == t_ref).all()


@pytest.mark.parametrize("format", dt_formats)
def test_datetime(df, format, capsys):
    dt_ref = datetime(2024, 6, 3, 20, 2, 48, 680000)
    dt = datetime.strptime(dt_string, format)

    assert dt == dt_ref

@Julian-J-S
Copy link
Contributor

@tritemio correct, python datetime and chrono have different behaviour for some details.

you can check out the documentation

python
image

chrono
image

image

Watch our for this (chrono)

date = "2020-01-01 10:00:00.1234"

  • "%Y-%m-%d %H:%M:%S%.f" -> 2020-01-01 10:00:00.123400
  • "%Y-%m-%d %H:%M:%S.%f" -> 2020-01-01 10:00:00.000001234

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-timeseries Area: date/time functionality bug Something isn't working P-low Priority: low python Related to Python Polars upstream issue
Projects
Status: Ready
Development

No branches or pull requests

3 participants