Skip to content

Comments

[parquet] Support 64-bit RLE-encoded ShortDecimal#23584

Merged
yingsu00 merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-parquet-int64-short-dec-rle
Sep 10, 2024
Merged

[parquet] Support 64-bit RLE-encoded ShortDecimal#23584
yingsu00 merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-parquet-int64-short-dec-rle

Conversation

@ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Sep 4, 2024

Description

Previously, in the parquet writer short decimals could be written as RLE-encoded with an Int64 primitive type. However we lacked support in the reader to decode this type properly back into a short decimal.

This commit adds support for the RLE-encoded 64-bit short decimals.

Motivation and Context

Decimals can be stored in three different formats: INT32 (P <= 9), INT64 (9 < P <= 18), and FIXED_LEN_BYTE_ARRAY (P > 18). We were missing read support for RLE-encoded short decimals with 9 < P <= 18. The following sequence of actions leads to failure.

SET SESSION iceberg.parquet_batch_read_optimization_enabled = true;
SET SESSION iceberg.parquet_writer_version = 'PARQUET_2_0';
CREATE table tmp AS SELECT * FROM (VALUES (1, CAST(15.2 as decimal(15, 2))), (2, 15.2), (3, 15.2), (4, 15.2)) as t(i, j);

presto:tpch> SELECT * FROM tmp;

Query 20240904_171754_00022_5rgwk, FAILED, 2 nodes
Splits: 5 total, 0 done (0.00%)
[Latency: client-side: 0:03, server-side: 0:03] [0 rows, 0B] [0 rows/s, 0B/s]

Query 20240904_171754_00022_5rgwk failed: com.facebook.presto.parquet.batchreader.decoders.rle.Int64RLEDictionaryValuesDecoder cannot be cast to com.facebook.presto.parquet.batchreader.decoders.ValuesDecoder$ShortDecimalValuesDecoder

After these changes, this error no longer appears.

Impact

N/A

Test Plan

  • tests in parquet module

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch from 304b592 to b917df9 Compare September 4, 2024 19:40
@ZacBlanco ZacBlanco marked this pull request as ready for review September 4, 2024 20:48
@ZacBlanco ZacBlanco requested review from a team and shangxinli as code owners September 4, 2024 20:48
elharo
elharo previously approved these changes Sep 5, 2024
Copy link
Member

@hantangwangd hantangwangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this supplement. Change looks good to me, just one little thing.

if (isTimeStampMicrosType(columnDescriptor)) {
return new Int64TimestampMicrosRLEDictionaryValuesDecoder(bitWidth, inputStream, (LongDictionary) dictionary);
}
if (isDecimalType(columnDescriptor) && isShortDecimalType(columnDescriptor)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we omit the first condition? Is there a scenario where the columnDescriptor is a ShortDecimalType but not a DecimalType?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confusing, but these technically, these check for two separate things

  • isDecimalType checks if the primitive type is DECIMAL
  • isShortDecimal actually checks the logical type annotation is decimal and then gets the precision parameter from the logical type to check if it is a short decimal

From my understanding isShortDecimal should always be true if isDecimalType returns true. However, this is the same check as in the previous block for FLOAT. I would prefer to stay consistent. Also, it is my understanding that Presto may not properly write logical type annotations yet according to #23388 -- so maybe let's just keep this now for consistency's sake?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we look deeper at isDecimalType, we will find that it ultimately also verifies that the logical type annotation is decimal. So as I understand, if a columnDescriptor is of type ShortDecimalType, it must first be of type DecimalType. Furthermore, if the primitive type's logical type annotation is null, it can neither be a DecimalType nor a ShortDecimalType, so it won't break the conclusion above (The current experiments in my local show the same behavior). And we also can see that in encoding == PLAIN clause, the checks in INT32 and INT64 are all isShortDecimalType only.

I'm OK for now to keep it as is, since I am not very sure if there are still some special scenarios present. But once it is completely checked and confirmed, I think it would be better to delete the first condition (so as to the previous block you mentioned) because it would cause a lot of confusion for future readers.

hantangwangd
hantangwangd previously approved these changes Sep 6, 2024
@ZacBlanco ZacBlanco dismissed stale reviews from hantangwangd and elharo via 42c70da September 9, 2024 19:30
@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch 2 times, most recently from 42c70da to 74591dd Compare September 9, 2024 19:33
@ZacBlanco
Copy link
Contributor Author

I will look into improving the other bits in the batch reader to reduce allocations on the read path. It is probably also a good issue for new folks getting started with Presto.

@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch from 74591dd to 29bc991 Compare September 9, 2024 19:54

import java.io.InputStream;

public class Int64ShortDecimalRLEDictionaryValuesDecoder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is not doing anything different than Int64RLEDictionaryValuesDecoder. Can we just use Int64RLEDictionaryValuesDecoder directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated this to add ShortDecimalValuesDecoder to the implements list on Int64RLEDictionaryValuesDecoder instead. I personally prefer having the explicit reader class as it makes the code a bit easier to follow IMO

@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch 2 times, most recently from 79f07d6 to 13df141 Compare September 10, 2024 05:08
Previously, in the parquet writer short decimals could be written
as RLE-encoded with an Int64 logical type. However, we lacked support
in the reader to decode this type properly back into a short decimal.

This commit adds support for the RLE-encoded 64-bit short decimals.
@ZacBlanco ZacBlanco force-pushed the upstream-parquet-int64-short-dec-rle branch from 13df141 to b268229 Compare September 10, 2024 05:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants