
[Python][Parquet] timestamp[s] does not round-trip parquet serialization. #41382

Open
randolf-scholz opened this issue Apr 25, 2024 · 5 comments


randolf-scholz commented Apr 25, 2024

Describe the bug, including details regarding any error messages, version, and platform.

Timestamps with second resolution are silently upcast to millisecond resolution when serialized to Parquet and read back. They should either round-trip, or there should be a warning/error when attempting to serialize them.

from datetime import datetime
import pyarrow as pa
from pyarrow import parquet

dates = [
    datetime(2021, 1, 1, 0, 0, 3),
    datetime(2021, 1, 1, 0, 0, 4),
    datetime(2021, 1, 1, 0, 0, 5),
]

table = pa.table({"time": pa.array(dates, type=pa.timestamp("s"))})
print(table.schema)  # timestamp[s]
parquet.write_table(table, "timestamp_roundtrip.parquet")
table2 = parquet.read_table("timestamp_roundtrip.parquet")
print(table2.schema)  # timestamp[ms]

Tested with pyarrow 16.0.0

Component(s)

Parquet

mapleFU commented Apr 26, 2024

randolf-scholz (Author) commented:

Of course one can perform a cast; the issue is that the time resolution of the data carries meaning, which is lost.

If Bob has a timeseries with second resolution and sends it to Alice as a .parquet file, then Alice sees timestamps with millisecond resolution. Alice might append some data that is not rounded to seconds and send the file back to Bob. Now suddenly Bob's pipeline breaks because he attempts to safe-cast to second resolution.

So, if pyarrow can't support round-tripping of timestamp[s], then imo there should be a warning when attempting to serialize timestamp[s] data, or even a straight up exception.

randolf-scholz (Author) commented:

ParquetWriter currently has the option coerce_timestamps, but coercion is enabled by default and cannot be disabled. As an end-user, I'd expect serialization/deserialization not to silently change the data types.

mapleFU commented Apr 26, 2024

Maybe you can apply a cast as a workaround first.

jorisvandenbossche (Member) commented:

> So, if pyarrow can't support round-tripping of timestamp[s], then imo there should be a warning when attempting to serialize timestamp[s] data, or even a straight up exception.

Personally, I think a warning might be annoying for most users, for whom this difference in timestamp unit is not critical.
(We could certainly document this limitation of the Parquet file format better in our Python docs.)

For example, until very recently we were also casting nanoseconds to microseconds by default (because only more recent Parquet format versions support nanoseconds, and for compatibility with other readers it was best not to use this feature yet). But warning about this (except when actual data would be truncated) would be rather noisy, especially given that pandas uses nanoseconds by default; that is of course not the case for seconds.

In theory we could actually restore the original seconds resolution upon reading the Parquet file, because we store the original Arrow schema in the Parquet metadata (e.g. to allow restoring the timezone). But that would mean we do an actual cast after reading, incurring a cost for everyone, not only for those who deliberately opt in.

We could also add a keyword to control this, although I am hesitant to do so given we already have so many keywords. And if the keyword preserved the current behaviour as the default, you would still need to do something manually to get the round-trip, at which point you could also just cast manually.
