[parquet] Support 64-bit RLE-encoded ShortDecimal#23584
yingsu00 merged 1 commit into prestodb:master
hantangwangd left a comment
Thanks for this supplement. Change looks good to me, just one little thing.
if (isTimeStampMicrosType(columnDescriptor)) {
    return new Int64TimestampMicrosRLEDictionaryValuesDecoder(bitWidth, inputStream, (LongDictionary) dictionary);
}
if (isDecimalType(columnDescriptor) && isShortDecimalType(columnDescriptor)) {
Can we omit the first condition? Is there a scenario where the columnDescriptor is a ShortDecimalType but not a DecimalType?
This is confusing, but technically these check two separate things:
- isDecimalType checks if the primitive type is DECIMAL.
- isShortDecimal actually checks that the logical type annotation is decimal, and then gets the precision parameter from the logical type to check whether it is a short decimal.
From my understanding isShortDecimal should always be true if isDecimalType returns true. However, this is the same check as in the previous block for FLOAT, and I would prefer to stay consistent. Also, it is my understanding that Presto may not properly write logical type annotations yet, according to #23388, so maybe let's just keep this for now for consistency's sake?
If we look deeper at isDecimalType, we find that it ultimately also verifies that the logical type annotation is decimal. So as I understand it, if a columnDescriptor is a ShortDecimalType, it must first be a DecimalType. Furthermore, if the primitive type's logical type annotation is null, it can be neither a DecimalType nor a ShortDecimalType, so that case does not break the conclusion above (my local experiments show the same behavior). We can also see that in the encoding == PLAIN clause, the checks for INT32 and INT64 use isShortDecimalType only.
I'm OK with keeping it as is for now, since I am not sure whether some special scenarios still exist. But once this has been fully checked and confirmed, I think it would be better to delete the first condition (here and in the previous block you mentioned), because it would cause a lot of confusion for future readers.
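The redundancy argument above can be sketched in a few lines. This is only an illustration under stated assumptions: `DecimalCheckSketch` and its two predicates are hypothetical stand-ins that model a column by the optional precision of its decimal annotation, not Presto's actual `isDecimalType` / `isShortDecimalType` implementations, and the short-decimal cutoff of 18 digits is taken from the precision ranges in the PR description.

```java
import java.util.Optional;

// Hypothetical model: a column is represented only by the precision of its
// DECIMAL annotation (empty when the column carries no decimal annotation).
final class DecimalCheckSketch
{
    // Assumed cutoff: precisions up to 18 fit in a 64-bit long.
    static final int MAX_SHORT_PRECISION = 18;

    private DecimalCheckSketch() {}

    // Stand-in for isDecimalType: true when a decimal annotation is present.
    static boolean isDecimalType(Optional<Integer> decimalPrecision)
    {
        return decimalPrecision.isPresent();
    }

    // Stand-in for isShortDecimalType: requires the annotation AND a small precision,
    // so it can never be true when isDecimalType is false.
    static boolean isShortDecimalType(Optional<Integer> decimalPrecision)
    {
        return decimalPrecision.isPresent() && decimalPrecision.get() <= MAX_SHORT_PRECISION;
    }
}
```

Under this model, `isDecimalType(c) && isShortDecimalType(c)` always equals `isShortDecimalType(c)`, which is the reviewer's point; whether the real helpers behave the same way in every scenario is exactly what the thread leaves open.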
...ook/presto/parquet/batchreader/decoders/rle/Int64ShortDecimalRLEDictionaryValuesDecoder.java
I will look into improving the other bits in the batch reader to reduce allocations on the read path. It is probably also a good issue for new folks getting started with Presto.
import java.io.InputStream;

public class Int64ShortDecimalRLEDictionaryValuesDecoder
This class is not doing anything different than Int64RLEDictionaryValuesDecoder. Can we just use Int64RLEDictionaryValuesDecoder directly?
I've updated this to add ShortDecimalValuesDecoder to the implements list on Int64RLEDictionaryValuesDecoder instead. I personally prefer having the explicit reader class, as it makes the code a bit easier to follow IMO.
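The approach described in this reply, one decoder class listed under two decoder interfaces, could look roughly like the sketch below. All names ending in `Sketch` are hypothetical, and the method signature is an assumption made for illustration; the real Presto interfaces may declare different or additional methods.

```java
// Hypothetical interface for readers that want raw INT64 values.
interface Int64ValuesDecoderSketch
{
    void readNext(long[] values, int offset, int length);
}

// Hypothetical interface for readers that want short decimals as unscaled longs.
interface ShortDecimalValuesDecoderSketch
{
    void readNext(long[] values, int offset, int length);
}

// One implementation satisfies both interfaces because the method signatures match,
// so no separate short-decimal subclass is needed.
final class Int64DictionaryDecoderSketch
        implements Int64ValuesDecoderSketch, ShortDecimalValuesDecoderSketch
{
    private final long[] dictionary;
    private final int[] dictionaryIds;
    private int position;

    Int64DictionaryDecoderSketch(long[] dictionary, int[] dictionaryIds)
    {
        this.dictionary = dictionary;
        this.dictionaryIds = dictionaryIds;
    }

    @Override
    public void readNext(long[] values, int offset, int length)
    {
        // Resolve each already-decoded dictionary id to its 64-bit value.
        for (int i = 0; i < length; i++) {
            values[offset + i] = dictionary[dictionaryIds[position++]];
        }
    }
}
```

The trade-off raised in the thread still applies: sharing one class avoids a near-empty subclass, while a dedicated class makes the short-decimal read path easier to spot.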
Previously, in the parquet writer, short decimals could be written as RLE-encoded with an Int64 primitive type. However, we lacked support in the reader to decode this type properly back into a short decimal. This commit adds support for the RLE-encoded 64-bit short decimals.
Description
Previously, in the parquet writer, short decimals could be written as RLE-encoded with an Int64 primitive type. However, we lacked support in the reader to decode this type properly back into a short decimal.
This commit adds support for the RLE-encoded 64-bit short decimals.
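As a rough illustration of what decoding such a column involves, the sketch below expands run-length-encoded dictionary ids into unscaled 64-bit values and then interprets one of them as a decimal. It is a simplification under stated assumptions: the `(dictionaryId, runLength)` pairs are a made-up in-memory form, not Parquet's actual RLE/bit-packing hybrid wire format, and `RleShortDecimalSketch` is a hypothetical name rather than a class from this PR.

```java
import java.math.BigDecimal;
import java.util.Arrays;

final class RleShortDecimalSketch
{
    private RleShortDecimalSketch() {}

    // Expands simplified runs, each given as {dictionaryId, runLength},
    // into the unscaled long values they stand for.
    static long[] decode(long[] dictionary, int[][] runs)
    {
        int total = 0;
        for (int[] run : runs) {
            total += run[1];
        }
        long[] unscaled = new long[total];
        int position = 0;
        for (int[] run : runs) {
            // Every entry in a run repeats the same dictionary value.
            Arrays.fill(unscaled, position, position + run[1], dictionary[run[0]]);
            position += run[1];
        }
        return unscaled;
    }

    // A short decimal is an unscaled long plus the scale from the column's type.
    static BigDecimal toDecimal(long unscaledValue, int scale)
    {
        return BigDecimal.valueOf(unscaledValue, scale);
    }
}
```

For example, with a dictionary of `{12345, -500}` and runs `{{0, 3}, {1, 2}}`, `decode` yields three copies of `12345` followed by two copies of `-500`, and `toDecimal(12345, 2)` represents `123.45`.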
Motivation and Context
Decimals can be stored in three different formats: INT32 (P <= 9), INT64 (9 < P <= 18), and FIXED_LEN_BYTE_ARRAY (P > 18). We were missing read support for RLE-encoded short decimals with 9 < P <= 18. The following sequence of actions leads to failure.
After these changes, this error no longer appears.
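The precision ranges listed under Motivation and Context can be written down as a small lookup. This is a minimal sketch that encodes only the three-way split as stated in this description; `DecimalStorageSketch` and `physicalTypeFor` are hypothetical names, not part of Presto or the Parquet library.

```java
final class DecimalStorageSketch
{
    enum PhysicalType { INT32, INT64, FIXED_LEN_BYTE_ARRAY }

    private DecimalStorageSketch() {}

    // Maps a decimal precision P to the storage format named in the description:
    // INT32 for P <= 9, INT64 for 9 < P <= 18, FIXED_LEN_BYTE_ARRAY for P > 18.
    static PhysicalType physicalTypeFor(int precision)
    {
        if (precision < 1) {
            throw new IllegalArgumentException("precision must be positive: " + precision);
        }
        if (precision <= 9) {
            return PhysicalType.INT32;
        }
        if (precision <= 18) {
            return PhysicalType.INT64;
        }
        return PhysicalType.FIXED_LEN_BYTE_ARRAY;
    }
}
```

The case this PR addresses is the middle branch: precisions 10 through 18 land on INT64, and it was the RLE dictionary read path for that branch that lacked a short-decimal decoder.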
Impact
N/A
Test Plan
Contributor checklist
Release Notes