feat(parquet): Add config for datapage version #11151
svm1 wants to merge 1 commit into facebookincubator:main
@yingsu00 @majetideepak Thank you for reviewing. I have made all the necessary changes, please take a look!
parquet_writer_version session property|
@svm1 Arrow has two versions a user can set: ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6) and the issue prestodb/presto#22595, I see the Presto ParquetWriterVersion maps to ParquetVersion. See https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258. @jkhaliqi Can you please evaluate these options with respect to RLE V2?
I understand the two different versions - I think we had a discussion about this very debate back when I first began working on this fix, and we determined that the Presto I also investigated the Presto Java source - The Presto docs also seem to point to this mapping:
The format version also appears far more granular than simply "1 vs 2" (https://github.com/apache/parquet-format/blob/master/CHANGES.md). Therefore I believe the mapping established in this PR would be correct - please correct me if I am mistaken. |
|
@svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but ParquetVersion. |
|
Also there are 4 ParquetVersions |
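To make the distinction under discussion concrete, here is a minimal sketch of the two settings as independent knobs. The enums are local stand-ins that only mirror the enumerator names in Arrow's `parquet/properties.h`; `FormatVersion`, `WriterOptions`, and `describe` are hypothetical names for illustration, and the defaults shown (2.6 and V1) are assumptions based on the defaults mentioned in this thread.

```cpp
#include <string>

// Local stand-ins mirroring Arrow's enumerator names; these are NOT the
// Arrow types themselves.
enum class FormatVersion { PARQUET_1_0, PARQUET_2_0, PARQUET_2_4, PARQUET_2_6 };
enum class DataPageVersion { V1, V2 };

// Hypothetical options struct: the format version and the data page version
// are stored separately, so changing one leaves the other untouched.
struct WriterOptions {
  FormatVersion formatVersion = FormatVersion::PARQUET_2_6;  // assumed default
  DataPageVersion dataPageVersion = DataPageVersion::V1;     // assumed default
};

// Renders both settings so the independence is easy to see.
inline std::string describe(const WriterOptions& options) {
  static const char* const kFormatNames[] = {"1.0", "2.0", "2.4", "2.6"};
  std::string out = "format=";
  out += kFormatNames[static_cast<int>(options.formatVersion)];
  out += (options.dataPageVersion == DataPageVersion::V1) ? ", datapage=V1"
                                                          : ", datapage=V2";
  return out;
}
```

Under this model, a "1 vs 2" session property could plausibly map to either field, which is why the mapping needs to be checked against observed writer behavior.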
After a thorough investigation, I believe there is a misunderstanding regarding the nature of the original issue, and that the issue might not be exactly as described. The issue is based on the following observation:
However, I don't think the state of I was able to validate this by stepping through the Presto Java Parquet writer code. I observed that modifying I observed that the @jkhaliqi and I reviewed this and conducted some experiments to understand the current behavior in Java. The following table shows the resulting value of the FormatVersion, with different values of the session property used: Presto Java:
In all cases, the resulting Parquet files (analyzed via parquet-tools) are consistently created with FormatVersion 1.0, regardless of the session property value. Additionally, we noticed that the encoding for boolean columns aligns with the expected DataPage types for each version: PARQUET_1_0/V1 file: PARQUET_2_0/V2 file: Therefore, I believe |
majetideepak
left a comment
Let's also document this at https://facebookincubator.github.io/velox/configs.html#parquet-file-format-configuration
|
The doc for the flag should go here: https://github.com/facebookincubator/velox/blob/main/velox/docs/configs.rst#parquet-file-format-configuration |
|
@majetideepak @czentgr If we're doing away with Presto naming here, I think it would be less ambiguous/more intuitive to have the values for this flag be either "V1" or "V2"? As opposed to "PARQUET_1_0"/"PARQUET_2_0". |
|
@svm1 yes, let's use |
Documentation change pushed. |
majetideepak
left a comment
Looks great! Thanks for all the revisions and investigation @svm1
czentgr
left a comment
Thanks. Let's add more tests.
@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Add config and session properties, hive.parquet.writer.datapage-version and hive.parquet.writer.datapage_version, to determine the Parquet writer datapage version (V1 or V2). Defaults to V1.
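As a rough illustration of the behavior described above, the following sketch resolves the datapage version from a plain key/value config map. It is not the code from this PR: `DataPageVersion`, `kDataPageVersionConfig`, and `parseDataPageVersion` are hypothetical names, the map stands in for whatever config object the writer actually receives, and treating the hyphenated spelling as the config key is an assumption. Only the key string, the accepted values (V1, V2), and the V1 default come from the description.

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the writer's data page version enum.
enum class DataPageVersion { V1, V2 };

// Key spelling taken from the PR description (assumed to be the config form).
inline constexpr const char* kDataPageVersionConfig =
    "hive.parquet.writer.datapage-version";

// Returns V1 when the key is absent, the parsed value when it is "V1" or
// "V2", and throws std::invalid_argument for anything else.
inline DataPageVersion parseDataPageVersion(
    const std::unordered_map<std::string, std::string>& config) {
  const auto it = config.find(kDataPageVersionConfig);
  if (it == config.end()) {
    return DataPageVersion::V1;  // documented default
  }
  if (it->second == "V1") {
    return DataPageVersion::V1;
  }
  if (it->second == "V2") {
    return DataPageVersion::V2;
  }
  throw std::invalid_argument(
      "Unsupported Parquet datapage version: " + it->second);
}
```

Rejecting unknown values rather than silently falling back to V1 is a design choice of this sketch; the PR may handle invalid input differently.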