Conversation

@mbutrovich (Contributor)

We are adding Spark-compatible int96 support to DataFusion Comet when using arrow-rs's Parquet reader. To achieve this, we first added support in arrow-rs for reading int96 at resolutions other than nanosecond; previously it would produce nulls for non-null values. Next, we will add support to DataFusion to generate the schema arrow-rs needs to read int96 at the resolution Spark expects. Finally, we will connect everything together in DataFusion Comet for accelerated Parquet reading of int96 values. We would like to test compatibility across all of these projects, and DataFusion and arrow-rs rely on this repo for Parquet files to test against.

Please see the included markdown file for the details of the test file. Please let me know if you think it would be helpful to mention that this type is now deprecated, and that we are merely offering it for systems that want to maintain compatibility with Spark (which still defaults to writing this type for timestamps).
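For context on what "reading int96 at different resolutions" involves: an int96 timestamp is 12 bytes, laid out as an 8-byte little-endian count of nanoseconds within the day followed by a 4-byte little-endian Julian day number. A minimal decoding sketch (the function name and target resolution are illustrative, not the actual arrow-rs implementation):

```rust
// Illustrative sketch of int96 timestamp decoding; not the arrow-rs code.
// Layout: bytes 0..8  = nanoseconds within the day (little-endian i64),
//         bytes 8..12 = Julian day number (little-endian i32).

const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Decode an int96 value to microseconds since the Unix epoch,
/// the resolution Spark's internal timestamp type uses.
fn int96_to_micros(raw: [u8; 12]) -> i64 {
    let nanos_of_day = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    let julian_day = i32::from_le_bytes(raw[8..12].try_into().unwrap()) as i64;
    (julian_day - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanos_of_day / 1_000
}

fn main() {
    // Julian day 2_440_588 with zero nanoseconds-of-day is the Unix epoch.
    let epoch: [u8; 12] = [0, 0, 0, 0, 0, 0, 0, 0, 0x8C, 0x3D, 0x25, 0x00];
    assert_eq!(int96_to_micros(epoch), 0);
    println!("epoch decodes to {} micros", int96_to_micros(epoch));
}
```

A reader that instead targets nanosecond resolution would skip the final division by 1,000; generating the right one of these paths from the schema is the part being wired up in DataFusion.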

Additional context (taken from apache/arrow-rs#7220)

@wgtmac (Member) left a comment


LGTM

@gszadovszky @pitrou WDYT?

@pitrou (Member) commented Apr 3, 2025

Why not have one column per possible unit (nanosecond, microsecond, etc.)?

@wgtmac (Member) commented Apr 3, 2025

> Why not have one column per possible unit (nanosecond, microsecond, etc.)?

I guess the reason is that Apache Spark can only produce microsecond resolution.

@pitrou (Member) commented Apr 3, 2025

If confirmed, then that settles the issue.

@mbutrovich (Contributor, author) commented Apr 3, 2025

Spark does not support writing different resolutions to int96, since its internal timestamp type has only microsecond resolution. It does support writing the non-deprecated TIMESTAMP_MICROS and TIMESTAMP_MILLIS types, but those should already be covered by other test files. Spark does nothing special with those types, and it still defaults to int96 when writing timestamps to Parquet.
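To illustrate why a single microsecond-resolution column suffices: a writer holding microsecond timestamps splits them into a Julian day and nanoseconds-of-day, so the nanosecond field is always a multiple of 1,000. A hedged sketch (hypothetical function, not Spark's actual writer code):

```rust
// Illustrative sketch of how a microsecond timestamp is encoded as int96;
// the function name is hypothetical, not Spark's actual writer code.

const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Encode microseconds since the Unix epoch as an int96 value
/// (nanoseconds-of-day, then Julian day, both little-endian).
fn micros_to_int96(micros: i64) -> [u8; 12] {
    let days = micros.div_euclid(MICROS_PER_DAY);
    let nanos_of_day = micros.rem_euclid(MICROS_PER_DAY) * 1_000;
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&nanos_of_day.to_le_bytes());
    raw[8..12].copy_from_slice(&((days + JULIAN_DAY_OF_EPOCH) as i32).to_le_bytes());
    raw
}

fn main() {
    // The nanoseconds-of-day field is always a multiple of 1_000 when the
    // source resolution is microseconds, so sub-microsecond int96 values
    // never appear in Spark-written files.
    let raw = micros_to_int96(1_000_000); // one second after the epoch
    let nanos = i64::from_le_bytes(raw[0..8].try_into().unwrap());
    assert_eq!(nanos, 1_000_000_000);
    println!("nanos-of-day = {nanos}");
}
```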

@wgtmac (Member) commented Apr 3, 2025

@mbutrovich Thanks for confirming! Let me merge this.

@wgtmac wgtmac merged commit 6e851dd into apache:master Apr 3, 2025
@mbutrovich (Contributor, author)

Thanks for the quick responses!
