@mbutrovich mbutrovich commented Mar 13, 2025

Which issue does this PR close?

Rationale for this change

See issue.

What changes are included in this PR?

  • IntoBuffer for Vec<T> where T is a Parquet type now takes an ArrowType arg so we know what to materialize INT96 into.
  • ArrowType metadata in Parquet can now specify the resolution. Alternatively, a supplied_schema to ArrowReaderOptions can achieve the same effect, which is how DataFusion will pass schema hints for INT96 (similar to StringView optimizations).
  • Extended the INT96 column reader test to support ArrowType hints.
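To make "resolution" concrete, here is a minimal, std-only sketch of converting an INT96 timestamp into each of the four time units. It assumes the usual INT96 layout: nanoseconds within the day in the first two 32-bit words, and the Julian day number in the third. `Int96Sketch`, `Unit`, `from_parts`, and `to_unit` are hypothetical names used for illustration only; they are not the types or methods changed in this PR.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const SECONDS_IN_DAY: i64 = 86_400;
const NANOS_IN_SECOND: i64 = 1_000_000_000;

/// Target resolution (hypothetical stand-in for an Arrow time unit).
#[derive(Clone, Copy, Debug)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

impl Unit {
    /// Number of ticks of this unit in one second.
    fn ticks_per_second(self) -> i64 {
        match self {
            Unit::Second => 1,
            Unit::Millisecond => 1_000,
            Unit::Microsecond => 1_000_000,
            Unit::Nanosecond => NANOS_IN_SECOND,
        }
    }
}

/// Hypothetical INT96 holder: [nanos-of-day low word, high word, Julian day].
#[derive(Clone, Copy, Debug)]
struct Int96Sketch([u32; 3]);

impl Int96Sketch {
    /// Splits the 64-bit nanos-of-day value into two 32-bit words.
    fn from_parts(julian_day: u32, nanos_of_day: u64) -> Self {
        Int96Sketch([nanos_of_day as u32, (nanos_of_day >> 32) as u32, julian_day])
    }

    /// Ticks since the Unix epoch at the requested resolution.
    fn to_unit(self, unit: Unit) -> i64 {
        let nanos_of_day = ((self.0[1] as i64) << 32) | (self.0[0] as i64);
        // Widen to i64 before subtracting so days before 1970 come out negative.
        let days = self.0[2] as i64 - JULIAN_DAY_OF_EPOCH;
        let ticks = unit.ticks_per_second();
        // Plain arithmetic: out-of-range inputs overflow (a panic in debug builds).
        days * SECONDS_IN_DAY * ticks + nanos_of_day / (NANOS_IN_SECOND / ticks)
    }
}
```

For example, `Int96Sketch::from_parts(2_458_850, 43_200_123_456_789).to_unit(Unit::Millisecond)` yields `1_577_880_000_123`, which is 2020-01-01 12:00:00.123 UTC.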

Are there any user-facing changes?

#7250 (comment)

BTW, despite being marked pub, IntoBuffer is not actually public, because this whole module is private:

I don't believe so.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 13, 2025

@alamb alamb left a comment

Thanks @mbutrovich and @a10y -- this is looking pretty good to me

I think we need to make the API change non-breaking if we want to get this into the next non-breaking release (in the next few days).

Otherwise, all I think this PR needs is tests for timezones, and then it should be good to go.

 #[inline]
 pub fn convert_int96(_descr: &ColumnDescPtr, value: Int96) -> Self {
-    Field::TimestampMillis(value.to_i64())
+    Field::TimestampMillis(value.to_millis())

I think this should be to_nanos() to preserve the old behavior

But then again, it doesn't make sense to return a nanosecond timestamp for a value with millisecond precision 🤔

@mbutrovich mbutrovich Mar 17, 2025

> I think this should be to_nanos() to preserve the old behavior

The old behavior is actually to convert it to millis.

The current convert_int96 calls to_i64, which converts to millis, so I tried to keep the behavior the same.
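As a side note on why the unit matters beyond labeling: an i64 count of nanoseconds since the epoch only covers roughly the years 1677 to 2262, while coarser units cover a much wider range. The sketch below illustrates this with checked arithmetic. `checked_millis` and `checked_nanos` are hypothetical helpers written for this example, not functions from the parquet crate.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const MILLIS_IN_DAY: i64 = 86_400_000;
const NANOS_IN_DAY: i64 = 86_400_000_000_000;

/// Milliseconds since the epoch, or None if the result does not fit in i64.
fn checked_millis(julian_day: u32, nanos_of_day: u64) -> Option<i64> {
    let days = julian_day as i64 - JULIAN_DAY_OF_EPOCH;
    days.checked_mul(MILLIS_IN_DAY)?
        .checked_add((nanos_of_day / 1_000_000) as i64)
}

/// Nanoseconds since the epoch, or None if the result does not fit in i64.
fn checked_nanos(julian_day: u32, nanos_of_day: u64) -> Option<i64> {
    let days = julian_day as i64 - JULIAN_DAY_OF_EPOCH;
    days.checked_mul(NANOS_IN_DAY)?
        .checked_add(i64::try_from(nanos_of_day).ok()?)
}
```

With these helpers, midnight on 3000-01-01 (Julian day 2_816_788) is representable in milliseconds but returns `None` in nanoseconds, whereas a date such as 2020-01-01 (Julian day 2_458_850) fits in both.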

alamb commented Mar 14, 2025

BTW the msrv test failure is not related:

@mbutrovich mbutrovich requested a review from alamb March 18, 2025 15:52
@mbutrovich mbutrovich changed the title from "Support different TimeUnits when reading Timestamps from INT96" to "Support different TimeUnits and timezones when reading Timestamps from INT96" Mar 18, 2025

@alamb alamb left a comment

Thank you @mbutrovich -- this looks good to me.

Seems like there are some CI failures to address but then it should be good to merge

@alamb alamb merged commit 660a3ac into apache:main Mar 19, 2025
16 checks passed

alamb commented Mar 19, 2025

Thanks again @mbutrovich

PinkCrow007 pushed a commit to PinkCrow007/arrow-rs that referenced this pull request Mar 20, 2025
…m INT96 (apache#7285)

* Support different Timestamp TimeUnit resolutions for INT96.

* Use i64 for subtracting JULIAN_DAY_OF_EPOCH.

* docs.

* Add deprecation comment.

* Fix typo.

* Add timezone test.

* Fix clippy.

* Try to fix Clippy again.
@mbutrovich mbutrovich deleted the timestamp_resolution branch April 3, 2025 15:28