Conversation

@mbutrovich (Contributor)

We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support in arrow-rs for reading int96 at resolutions other than nanosecond; previously it would produce nulls for non-null values. Next, we will add support to DataFusion to generate the schema arrow-rs needs to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading of int96 values. We would like to test compatibility across all of these projects, and DataFusion and arrow-rs rely on this repo for Parquet files to test against.

Please see the included markdown file for the details of the test file. Please let me know if you think it would be helpful to mention that this type is now deprecated, and that we are merely offering it for systems that want to maintain compatibility with Spark (which still defaults to writing this type for timestamps).
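For context on what "reading int96 at different resolutions" involves: an int96 timestamp is 12 bytes, laid out as an 8-byte little-endian count of nanoseconds within the day followed by a 4-byte little-endian Julian day number. A minimal decoding sketch (the function name and target resolution are illustrative, not the actual arrow-rs implementation):

```rust
// Illustrative sketch of int96 timestamp decoding; not the arrow-rs code.
// Layout: bytes 0..8  = nanoseconds within the day (little-endian i64),
//         bytes 8..12 = Julian day number (little-endian i32).

const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Decode an int96 value to microseconds since the Unix epoch,
/// the resolution Spark's internal timestamp type uses.
fn int96_to_micros(raw: [u8; 12]) -> i64 {
    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos_of_day / 1_000
}

fn main() {
    // Julian day 2_440_588 with zero nanoseconds-of-day is the Unix epoch.
    let epoch: [u8; 12] = [0, 0, 0, 0, 0, 0, 0, 0, 0x8C, 0x3D, 0x25, 0x00];
    assert_eq!(int96_to_micros(epoch), 0);
    println!("epoch decodes to {} micros", int96_to_micros(epoch));
}
```

A reader that instead targets nanosecond resolution would skip the final division by 1,000; generating the right one of these paths from the schema is the part being wired up in DataFusion.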

Additional context (taken from apache/arrow-rs#7220)

@wgtmac (Member) left a comment


LGTM

@gszadovszky @pitrou WDYT?

@pitrou (Member) commented Apr 3, 2025

Why not have one column per possible unit (nanosecond, microsecond, etc.)?

@wgtmac (Member) commented Apr 3, 2025

> Why not have one column per possible unit (nanosecond, microsecond, etc.)?

I guess the reason is that Apache Spark can only produce microsecond resolution.

@pitrou (Member) commented Apr 3, 2025

If confirmed, then that settles the issue.

@mbutrovich (Contributor, author) commented Apr 3, 2025

Spark does not support writing different resolutions to int96, since its internal timestamp type has only microsecond resolution. It does support writing the non-deprecated TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, but those should already be covered by other test files. Spark does nothing special with those types, and it still defaults to int96 when writing timestamps to Parquet.
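To illustrate why a single microsecond-resolution column suffices: a writer holding microsecond timestamps splits them into a Julian day and nanoseconds-of-day, so the nanosecond field is always a multiple of 1,000. A hedged sketch (hypothetical function, not Spark's actual writer code):

```rust
// Illustrative sketch of how a microsecond timestamp is encoded as int96;
// the function name is hypothetical, not Spark's actual writer code.

const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Encode microseconds since the Unix epoch as an int96 value
/// (nanoseconds-of-day, then Julian day, both little-endian).
fn micros_to_int96(micros: i64) -> [u8; 12] {
    let days = micros.div_euclid(MICROS_PER_DAY);
    let nanos_of_day = micros.rem_euclid(MICROS_PER_DAY) * 1_000;
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    raw[8..12].copy_from_slice(&((days + JULIAN_DAY_OF_EPOCH) as i32).to_le_bytes());
    raw
}

fn main() {
    // The nanoseconds-of-day field is always a multiple of 1_000 when the
    // source resolution is microseconds, so sub-microsecond int96 values
    // never appear in Spark-written files.
    let raw = micros_to_int96(1_000_000); // one second after the epoch
    let nanos = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    assert_eq!(nanos, 1_000_000_000);
    println!("nanos-of-day = {nanos}");
}
```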

@wgtmac (Member) commented Apr 3, 2025

@mbutrovich Thanks for confirming! Let me merge this.

@wgtmac wgtmac merged commit 6e851dd into apache:master Apr 3, 2025
@mbutrovich (Contributor, author)

Thanks for the quick responses!
