[parquet] Support 64-bit RLE-encoded ShortDecimal by ZacBlanco · Pull Request #23584 · prestodb/presto

ZacBlanco · 2024-09-04T18:24:27Z

Description

Previously, in the parquet writer short decimals could be written as RLE-encoded with an Int64 primitive type. However we lacked support in the reader to decode this type properly back into a short decimal.

This commit adds support for the RLE-encoded 64-bit short decimals.

Motivation and Context

Decimals can be stored in three different formats: INT32 (P <= 9), INT64 (9 < P <= 18), and FIXED_LEN_BYTE_ARRAY (P > 18). We were missing read support for RLE-encoded short decimals with 9 < P <= 18. The following sequence of actions leads to failure.

SET SESSION iceberg.parquet_batch_read_optimization_enabled = true;
SET SESSION iceberg.parquet_writer_version = 'PARQUET_2_0';
CREATE table tmp AS SELECT * FROM (VALUES (1, CAST(15.2 as decimal(15, 2))), (2, 15.2), (3, 15.2), (4, 15.2)) as t(i, j);

presto:tpch> SELECT * FROM tmp;

Query 20240904_171754_00022_5rgwk, FAILED, 2 nodes
Splits: 5 total, 0 done (0.00%)
[Latency: client-side: 0:03, server-side: 0:03] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20240904_171754_00022_5rgwk failed: com.facebook.presto.parquet.batchreader.decoders.rle.Int64RLEDictionaryValuesDecoder cannot be cast to com.facebook.presto.parquet.batchreader.decoders.ValuesDecoder$ShortDecimalValuesDecoder

After these changes, this error no longer appears.

Impact

N/A

Test Plan

tests in parquet module

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== NO RELEASE NOTE ==

hantangwangd

Thanks for this supplement. Change looks good to me, just one little thing.

hantangwangd · 2024-09-06T10:50:22Z

presto-parquet/src/main/java/com/facebook/presto/parquet/batchreader/decoders/Decoders.java

                    if (isTimeStampMicrosType(columnDescriptor)) {
                        return new Int64TimestampMicrosRLEDictionaryValuesDecoder(bitWidth, inputStream, (LongDictionary) dictionary);
                    }
+                    if (isDecimalType(columnDescriptor) && isShortDecimalType(columnDescriptor)) {


Can we omit the first condition? Is there a scenario where the columnDescriptor is a ShortDecimalType but not a DecimalType?

Confusing, but these technically, these check for two separate things

isDecimalType checks if the primitive type is DECIMAL

isShortDecimal actually checks the logical type annotation is decimal and then gets the precision parameter from the logical type to check if it is a short decimal

From my understanding isShortDecimal should always be true if isDecimalType returns true. However, this is the same check as in the previous block for FLOAT. I would prefer to stay consistent. Also, it is my understanding that Presto may not properly write logical type annotations yet according to #23388 -- so maybe let's just keep this now for consistency's sake?

If we look deeper at isDecimalType, we will find that it ultimately also verifies that the logical type annotation is decimal. So as I understand, if a columnDescriptor is of type ShortDecimalType, it must first be of type DecimalType. Furthermore, if the primitive type's logical type annotation is null, it can neither be a DecimalType nor a ShortDecimalType, so it won't break the conclusion above (The current experiments in my local show the same behavior). And we also can see that in encoding == PLAIN clause, the checks in INT32 and INT64 are all isShortDecimalType only.

I'm OK for now to keep it as is, since I am not very sure if there are still some special scenarios present. But once it is completely checked and confirmed, I think it would be better to delete the first condition (so as to the previous block you mentioned) because it would cause a lot of confusion for future readers.

...ook/presto/parquet/batchreader/decoders/rle/Int64ShortDecimalRLEDictionaryValuesDecoder.java

ZacBlanco · 2024-09-09T19:37:36Z

I will look into improving the other bits in the batch reader to reduce allocations on the read path. It is probably also a good issue for new folks getting started with Presto.

yingsu00 · 2024-09-10T04:18:58Z

...ook/presto/parquet/batchreader/decoders/rle/Int64ShortDecimalRLEDictionaryValuesDecoder.java

+
+import java.io.InputStream;
+
+public class Int64ShortDecimalRLEDictionaryValuesDecoder


This class is not doing anything different than Int64RLEDictionaryValuesDecoder. Can we just use Int64RLEDictionaryValuesDecoder directly?

I've updated this to add ShortDecimalValuesDecoder to the implements list on Int64RLEDictionaryValuesDecoder instead. I personally prefer having the explicit reader class as it makes the code a bit easier to follow IMO

Previously, in the parquet writer short decimals could be written as RLE-encoded with an Int64 logical type. However, we lacked support in the reader to decode this type properly back into a short decimal. This commit adds support for the RLE-encoded 64-bit short decimals.

ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch from 304b592 to b917df9 Compare September 4, 2024 19:40

ZacBlanco marked this pull request as ready for review September 4, 2024 20:48

ZacBlanco requested review from a team and shangxinli as code owners September 4, 2024 20:48

ZacBlanco requested review from hantangwangd, presto-oss and yingsu00 September 4, 2024 20:48

elharo previously approved these changes Sep 5, 2024

View reviewed changes

hantangwangd reviewed Sep 6, 2024

View reviewed changes

hantangwangd previously approved these changes Sep 6, 2024

View reviewed changes

yingsu00 reviewed Sep 7, 2024

View reviewed changes

...ook/presto/parquet/batchreader/decoders/rle/Int64ShortDecimalRLEDictionaryValuesDecoder.java Outdated Show resolved Hide resolved

ZacBlanco dismissed stale reviews from hantangwangd and elharo via 42c70da September 9, 2024 19:30

ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch 2 times, most recently from 42c70da to 74591dd Compare September 9, 2024 19:33

ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch from 74591dd to 29bc991 Compare September 9, 2024 19:54

yingsu00 reviewed Sep 10, 2024

View reviewed changes

yingsu00 mentioned this pull request Sep 10, 2024

Unify the Parquet dictionary value decoders #23612

Open

ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch 2 times, most recently from 79f07d6 to 13df141 Compare September 10, 2024 05:08

ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch from 13df141 to b268229 Compare September 10, 2024 05:15

ScrapCodes approved these changes Sep 10, 2024

View reviewed changes

yingsu00 approved these changes Sep 10, 2024

View reviewed changes

yingsu00 merged commit 52c090e into prestodb:master Sep 10, 2024

hantangwangd mentioned this pull request Sep 24, 2024

Remove redundant check for short decimal type in parquet decoders #23715

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

[parquet] Support 64-bit RLE-encoded ShortDecimal#23584

[parquet] Support 64-bit RLE-encoded ShortDecimal#23584
yingsu00 merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-parquet-int64-short-dec-rle

ZacBlanco commented Sep 4, 2024 •

edited

Loading

Uh oh!

hantangwangd left a comment

Uh oh!

hantangwangd Sep 6, 2024

Uh oh!

ZacBlanco Sep 6, 2024

Uh oh!

hantangwangd Sep 6, 2024

Uh oh!

Uh oh!

ZacBlanco commented Sep 9, 2024

Uh oh!

yingsu00 Sep 10, 2024

Uh oh!

ZacBlanco Sep 10, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants


		import java.io.InputStream;

		public class Int64ShortDecimalRLEDictionaryValuesDecoder

Comments

Conversation

ZacBlanco commented Sep 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

Uh oh!

hantangwangd left a comment

Choose a reason for hiding this comment

Uh oh!

hantangwangd Sep 6, 2024

Choose a reason for hiding this comment

Uh oh!

ZacBlanco Sep 6, 2024

Choose a reason for hiding this comment

Uh oh!

hantangwangd Sep 6, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ZacBlanco commented Sep 9, 2024

Uh oh!

yingsu00 Sep 10, 2024

Choose a reason for hiding this comment

Uh oh!

ZacBlanco Sep 10, 2024

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ZacBlanco commented Sep 4, 2024 •

edited

Loading