Allow reading INT96 timestamps generated by AWS wrangler#22854

Merged
raunaqmorarka merged 2 commits into trinodb:master from raunaqmorarka:int96-wrangler
Jul 30, 2024

@raunaqmorarka raunaqmorarka commented Jul 29, 2024

Description

Adjust for timeOfDayNanos values outside the range of [0, NANOSECONDS_PER_DAY]
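
One way to picture the adjustment is to carry whole days out of `timeOfDayNanos` into `julianDay`, so the remaining nanoseconds satisfy `0 <= nanos < NANOSECONDS_PER_DAY`. The following is a minimal sketch under that assumption, not necessarily the code merged in this PR; the class `Int96Normalizer` and the method `normalize` are hypothetical names used only for illustration.

```java
// Hypothetical helper, not part of Trino: illustrates carrying overflow
// (or underflow) of the nanos-of-day field into the Julian day field.
final class Int96Normalizer
{
    static final long NANOSECONDS_PER_DAY = 86_400_000_000_000L;

    private Int96Normalizer() {}

    // Returns a two-element array: {adjustedJulianDay, adjustedTimeOfDayNanos}.
    static long[] normalize(long timeOfDayNanos, int julianDay)
    {
        // floorDiv/floorMod round toward negative infinity, so negative
        // nanos borrow from the previous day instead of staying negative.
        long dayShift = Math.floorDiv(timeOfDayNanos, NANOSECONDS_PER_DAY);
        long nanos = Math.floorMod(timeOfDayNanos, NANOSECONDS_PER_DAY);
        // toIntExact throws ArithmeticException if the shifted day no longer fits in an int.
        int day = Math.toIntExact(julianDay + dayShift);
        return new long[] {day, nanos};
    }
}
```

With this sketch, a value of `-1` nanoseconds on Julian day `2440588` would map to the last nanosecond of day `2440587`, and a value equal to `NANOSECONDS_PER_DAY` would map to nanosecond `0` of the following day.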

Additional context and related issues

Fixes the problem identified in #19169 and aws/aws-sdk-pandas#592 (comment)

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Fix reading INT96 timestamps in parquet files generated by AWS wrangler. ({issue}`22854`)

Adjust for timeOfDayNanos values outside the range of [0, NANOSECONDS_PER_DAY]

public static DecodedTimestamp decodeInt96Timestamp(long timeOfDayNanos, int julianDay)
{
verify(timeOfDayNanos >= 0 && timeOfDayNanos < NANOSECONDS_PER_DAY, "Invalid timeOfDayNanos: %s", timeOfDayNanos);
Member

Has this been considered in #19169?

Member Author

That PR also removed validation from the io.trino.plugin.base.type.DecodedTimestamp#DecodedTimestamp constructor, which we want to avoid. I don't think that PR attempted anything other than just removing validation.


// int96_timestamps_nanos_outside_day_range.parquet file is prepared with timeOfDayNanos values which are
// outside the [0, NANOSECONDS_PER_DAY] range to simulate data generated by AWS wrangler.
// https://github.com/aws/aws-sdk-pandas/issues/592#issuecomment-920716270
Member

That comment suggests it might be a bug in pyarrow.
Will it be fixed there?

Member Author

I believe it's tackled by providing a flag to make awswrangler stop writing int96 timestamps
aws/aws-sdk-pandas#592 (comment)

wr.s3.to_parquet(..., pyarrow_additional_kwargs={"use_deprecated_int96_timestamps": False})

@raunaqmorarka raunaqmorarka merged commit 80b0e09 into trinodb:master Jul 30, 2024
@raunaqmorarka raunaqmorarka deleted the int96-wrangler branch July 30, 2024 05:02
@github-actions github-actions bot added this to the 454 milestone Jul 30, 2024