Skip to content

Use byte arrays for encoding short decimals in native hive parquet writer#12658

Merged
findepi merged 6 commits intotrinodb:masterfrom
raunaqmorarka:pq-writer-decimal
Jun 10, 2022
Merged

Use byte arrays for encoding short decimals in native hive parquet writer#12658
findepi merged 6 commits intotrinodb:masterfrom
raunaqmorarka:pq-writer-decimal

Conversation

@raunaqmorarka
Copy link
Copy Markdown
Member

@raunaqmorarka raunaqmorarka commented Jun 2, 2022

Description

Apache Hive fails to read short decimals encoded as INT32/INT64 by the optimized parquet hive writer in Trino.
This PR changes the native writer to use fixed length byte arrays for encoding short decimals when writing parquet files in hive connector. This makes the output readable by Apache Hive.

Is this change a fix, improvement, new feature, refactoring, or other?

Fix

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Parquet writer

How would you describe this change to a non-technical end user or system administrator?

Makes output of optimized parquet writer for short decimals compatible with Apache Hive.

Related issues, pull requests, and links

#6377
#10486

Documentation

(x) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
(x) Release notes entries required with the following suggested text:

# Section
* Fixes compatibility of files produced by optimized parquet writer for short decimal columns in Trino hive connnector with Apache Hive. ({issue}`12658`)

@cla-bot cla-bot bot added the cla-signed label Jun 2, 2022
@raunaqmorarka raunaqmorarka force-pushed the pq-writer-decimal branch 3 times, most recently from 86d68d6 to 756fc60 Compare June 2, 2022 16:51
Copy link
Copy Markdown
Member

@skrzypo987 skrzypo987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % comments

@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 6, 2022

I skipped over the actual writers, assuming @skrzypo987 took a look at them. Thanks @skrzypo987 .

@alexjo2144 can you PTAL too?

Comment on lines 58 to 61
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe unfold the loop with a switch on numBytes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@findepi what is the gain of doing this change?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i expect them to perform differently.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code snipped appears to be repeating (with slight modifications).
Consider creating a method for code reuse.

@findepi findepi force-pushed the pq-writer-decimal branch from 8b78956 to 62239be Compare June 9, 2022 20:31
@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 9, 2022

(rebased because of a conflict)

@raunaqmorarka raunaqmorarka force-pushed the pq-writer-decimal branch 2 times, most recently from 8aa5304 to 9ad449b Compare June 10, 2022 06:47
@raunaqmorarka raunaqmorarka requested a review from findepi June 10, 2022 06:49
@raunaqmorarka raunaqmorarka requested a review from skrzypo987 June 10, 2022 06:49
Avoids wasting space by always using 16 bytes even though
lower precision values can be stored using fewer bytes.
@raunaqmorarka raunaqmorarka requested a review from findepi June 10, 2022 08:38
@raunaqmorarka raunaqmorarka requested a review from findepi June 10, 2022 09:41
@findepi
Copy link
Copy Markdown
Member

findepi commented Jun 10, 2022

Delta Lake will continue to use integer encoding for short decimals.
This change fixes compatbility with Apache Hive which expects
decimals to be encoded as fixed length byte arrays.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants