Handling Unsupported Arrow Types in Parquet #1666
Comments
Great write-up, @tustvold! I agree with your assessment of least surprise on option 3. The way I think about options 1 and 3 is: option 1 is potentially lossy on data (some bytes may be lost) and lossless on metadata (metadata is preserved); option 3 is lossless on data and potentially lossy on metadata. In my experience metadata is easier to recover/preserve through other means (e.g. a catalog, parquet's custom metadata, even column naming conventions) than data, which is usually unrecoverable. From this perspective, option 1 has a real impact on data integrity. FWIW, we use option 3 in arrow2.
Thanks for summarizing the current status and the proposals, @tustvold. So it looks like we have already used option 2 and option 3 for Timestamp and Date64? I think we should settle on a consistent approach to handling these kinds of Arrow types, whether that is option 2 or option 3.
I think the above two options are the better choices, and they each have their own advantages. Regarding ecosystem compatibility, I feel it could also be confusing to users that Parquet files from one arrow implementation are not compatible with other implementations, especially if they aren't aware that the produced Parquet files might be incompatible with other tools. A config/feature that must be explicitly enabled to produce such possibly incompatible Parquet files from this crate might be helpful, as users would then know what they are doing.
100% agree with @viirya on this one
I agree with @jorgecarleitao and @tustvold that this (option 3) would be best for some users (such as IOx), but as @viirya mentions this may not be ideal from a "least surprising" point of view for people using a broader selection of ecosystem tools. Maybe we could add an option along these lines:

```rust
/// Converts data prior to writing to maximize compatibility with other
/// parquet implementations. This may result in losing precision on some data
/// types that can not be natively represented, such as `Timestamp(Nanosecond)`
/// as parquet only supports millisecond precision timestamps natively
maximize_ecosystem_compatibility: bool
```
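For illustration, here is a minimal standalone sketch (not the crate's actual write path) of the kind of coercion such a flag could perform, expressed as an explicit cast with arrow's `cast` kernel:

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

fn main() -> arrow::error::Result<()> {
    // A nanosecond-precision timestamp column, which only parquet
    // format >= 2.6 can label natively.
    let nanos: ArrayRef = Arc::new(TimestampNanosecondArray::from(vec![
        1_654_041_600_123_456_789_i64,
    ]));

    // Truncate to microseconds before writing, trading precision for
    // compatibility with readers that only understand the ConvertedType.
    let micros = cast(&nanos, &DataType::Timestamp(TimeUnit::Microsecond, None))?;
    println!("{micros:?}");
    Ok(())
}
```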
Related PRs:

- Don't treat embedded arrow schema as authoritative (apache#1663)
- Fix projection of nested parquet files (apache#1652) (apache#1654)
- Fix schema inference for repeated fields (apache#1681)
- Support reading alternative list representations from parquet (apache#1680)
- Consistent handling of unsupported arrow types in parquet (apache#1666)
Thanks for the discussion. I too think option 3 looks like the best choice, with a flag to coerce data types for interop with non-arrow Parquet readers.
I've created #1938 as a follow-up. Thank you all for your feedback.
Problem
Data in parquet is stored as one of a limited number of physical types. There are then three mechanisms an arrow reader can use to infer the type of the data:

1. The ConvertedType in the parquet schema (deprecated)
2. The LogicalType in the parquet schema
3. The arrow schema embedded in the parquet key-value metadata

All parquet readers support 1, all v2 readers support 2, and only arrow readers support 3.
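As a rough illustration of how these mechanisms layer, here is a self-contained sketch of the precedence a reader could apply; every type and name below is an illustrative stand-in, not this crate's actual API:

```rust
// All types here are illustrative stand-ins, not the parquet crate's API.
#[derive(Debug, Clone, Copy)]
enum PhysicalType { Int64 }
#[derive(Debug, Clone, Copy)]
enum ConvertedType { TimestampMicros }
#[derive(Debug, Clone, Copy)]
enum LogicalType { TimestampNanos }
#[derive(Debug, Clone, Copy)]
enum ArrowType { Int64, TimestampMicros, TimestampNanos, TimestampSecond }

fn infer_arrow_type(
    embedded: Option<ArrowType>,      // (3) ARROW:schema key-value metadata
    logical: Option<LogicalType>,     // (2) understood by v2 readers
    converted: Option<ConvertedType>, // (1) understood by all readers
    physical: PhysicalType,
) -> ArrowType {
    if let Some(t) = embedded {
        return t; // only arrow readers consult this
    }
    if let Some(LogicalType::TimestampNanos) = logical {
        return ArrowType::TimestampNanos;
    }
    if let Some(ConvertedType::TimestampMicros) = converted {
        return ArrowType::TimestampMicros;
    }
    match physical {
        PhysicalType::Int64 => ArrowType::Int64, // fall back to physical type
    }
}

fn main() {
    // A second-precision column is only recoverable through mechanism (3):
    let t = infer_arrow_type(Some(ArrowType::TimestampSecond), None, None, PhysicalType::Int64);
    println!("{t:?}");
}
```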
In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in other cases it fundamentally alters the meaning of the data.
Timestamp
A nanosecond time unit is not supported in 1, and a second time unit is only natively supported in 3.
Currently this crate encodes second time units with a logical type of milliseconds; this is likely a bug. It also does not convert nanoseconds to microseconds when using writer version 1, despite nanosecond timestamps only being supported in format version 2.6 and later.
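To illustrate the first of these: a column labelled with a millisecond logical type must contain values actually scaled to milliseconds; relabelling alone silently changes their meaning. A trivial standalone sketch of the required scaling (not the crate's code):

```rust
// If second-precision values are to be written under a millisecond logical
// type, the stored values must be scaled, not just relabelled.
fn seconds_to_millis(seconds: &[i64]) -> Vec<i64> {
    seconds.iter().map(|s| s * 1000).collect()
}

fn main() {
    // 2022-06-01T00:00:00Z as seconds vs milliseconds since the epoch
    assert_eq!(seconds_to_millis(&[1_654_041_600]), vec![1_654_041_600_000]);
}
```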
The python implementation will, depending on the writer version, potentially cast nanoseconds to microseconds and seconds to milliseconds - see here.
There does not appear to be a way to round-trip timezones in 1 or 2. The C++ implementation appears to always normalize timezones to UTC and set `is_adjusted_to_utc` to true.
Currently this crate sets `is_adjusted_to_utc` to true whenever a timezone is set, despite the writer not actually performing the normalisation. I think this is a bug.
Date64
The arrow type Date64 is milliseconds since epoch and has no equivalent ConvertedType or LogicalType.
Currently this crate converts the type to Date32 on write, losing any sub-day precision. This is what the C++ implementation does - see here.
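Concretely, the conversion amounts to integer division of the epoch offset; a minimal standalone sketch (not the crate's code):

```rust
// The lossy Date64 -> Date32 conversion described above: milliseconds since
// the unix epoch become whole days since the epoch, so any time-of-day
// component is discarded.
fn date64_to_date32(ms_since_epoch: i64) -> i32 {
    const MS_PER_DAY: i64 = 86_400_000;
    // Euclidean division keeps pre-epoch dates correct (e.g. -1ms -> day -1).
    ms_since_epoch.div_euclid(MS_PER_DAY) as i32
}

fn main() {
    assert_eq!(date64_to_date32(86_400_000 + 123), 1); // the 123ms are lost
    assert_eq!(date64_to_date32(-1), -1);
}
```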
Interval
The interval data type has a ConvertedType but not a LogicalType; there is a PR to add LogicalType support (apache/parquet-format#165), but it appears to have stalled somewhat.
An interval of MonthDayNano cannot be represented by the ConvertedType. The C++ implementation does not appear to support Interval types.
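To make the mismatch concrete: the INTERVAL ConvertedType is a 12-byte fixed-length value holding three little-endian unsigned 32-bit integers (months, days, milliseconds), while arrow's MonthDayNano carries a 64-bit nanosecond component that generally cannot fit. A standalone sketch of the packing (function name is illustrative):

```rust
// Attempt to pack an interval into parquet's FIXED_LEN_BYTE_ARRAY(12)
// INTERVAL layout: three little-endian u32s (months, days, milliseconds).
fn pack_interval(months: u32, days: u32, nanos: i64) -> Option<[u8; 12]> {
    // Nanoseconds must collapse to whole milliseconds that fit in a u32;
    // anything else is unrepresentable without losing data.
    if nanos % 1_000_000 != 0 {
        return None; // sub-millisecond precision would be lost
    }
    let millis = u32::try_from(nanos / 1_000_000).ok()?;
    let mut out = [0u8; 12];
    out[..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..].copy_from_slice(&millis.to_le_bytes());
    Some(out)
}

fn main() {
    assert!(pack_interval(1, 2, 3_000_000).is_some()); // 3ms: fits
    assert!(pack_interval(1, 2, 1).is_none());         // 1ns: cannot fit
}
```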
Proposal
There are, broadly speaking, three ways to handle data types that cannot be represented in the parquet schema:

1. Cast them to a native parquet representation, potentially losing data (e.g. truncating timestamps)
2. Return an error
3. Write the data unmodified with the closest parquet type, relying on the embedded arrow schema to recover the original type on read
Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is in line with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX; just search StackOverflow for `allow_truncated_timestamps` to see the confusion this causes. The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility.
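For context, a minimal sketch of the round-trip option 3 relies on, written against the parquet crate's ArrowWriter (which embeds the arrow schema in the file's key-value metadata) and arrow reader; the file name is illustrative, and recovering `Timestamp(Second)` assumes the behaviour proposed in this issue:

```rust
use std::fs::File;
use std::sync::Arc;
use arrow::array::TimestampSecondArray;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A second-precision timestamp column, which parquet cannot label natively.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Second, None),
        false,
    )]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(TimestampSecondArray::from(vec![1_654_041_600]))],
    )?;

    // Write: the arrow schema is embedded in the parquet key-value metadata.
    let file = File::create("ts.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read: an arrow reader consults the embedded schema to recover the type.
    let reader = ParquetRecordBatchReaderBuilder::try_new(File::open("ts.parquet")?)?.build()?;
    for batch in reader {
        println!("{:?}", batch?.schema());
    }
    Ok(())
}
```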
Users who care about interoperability can explicitly cast data as appropriate, similar to python's `coerce_timestamps` functionality, or just restrict themselves to a supported subset of arrow types.

Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao
Additional Context