Add int96 from Apache Spark #73
Conversation
wgtmac left a comment
LGTM
@gszadovszky @pitrou WDYT?
Why not have one column per possible unit (nanosecond, microsecond, etc.)?

I guess the reason is that Apache Spark can only produce microsecond timestamps.

If confirmed, then that settles the issue.

Spark does not support writing int96 at different resolutions, since its internal timestamp type has only microsecond resolution. It does support writing the non-deprecated TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, but those should already be covered by other test files. Spark does nothing special with those types and still defaults to int96 when writing timestamps to Parquet.

@mbutrovich Thanks for confirming! Let me merge this.

Thanks for the quick responses!
We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support for arrow-rs to read int96 at resolutions other than nanosecond; previously it would generate nulls for non-null values. Next, we will add support to DataFusion to generate the schema that arrow-rs needs to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading of int96 values. We would like to test compatibility across all of these projects, and DataFusion and arrow-rs rely on this repo for Parquet files to test against.
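For context, here is a minimal sketch (not part of this PR) of how a reader might ask arrow-rs to interpret an int96 column at microsecond resolution by supplying a schema hint via `ArrowReaderOptions::with_schema`, assuming an arrow-rs version that includes the change from apache/arrow-rs#7220. The file name and column name below are hypothetical placeholders, not the contents of the test file added here.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical single-column file written by Spark with int96 timestamps.
    let file = File::open("int96_from_spark.parquet")?;

    // Schema hint: read the int96 column (here assumed to be named "ts")
    // as a microsecond timestamp instead of the default nanosecond unit.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Microsecond, None),
        true,
    )]));

    let options = ArrowReaderOptions::new().with_schema(schema);
    let reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?.build()?;

    for batch in reader {
        println!("{:?}", batch?);
    }
    Ok(())
}
```

Without the schema hint, arrow-rs reads int96 at nanosecond resolution, which is where the null values for out-of-range timestamps came from.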
Please see the included markdown file for the details of the test file. Please let me know if you think it would be helpful to mention that this type is now deprecated and that we are offering it only for systems that want to maintain compatibility with Spark (which still defaults to writing this type for timestamps).
Additional context (taken from apache/arrow-rs#7220)