Skip to content

feat(parquet): Add config for datapage version#11151

Closed
svm1 wants to merge 1 commit intofacebookincubator:mainfrom
svm1:parq_2
Closed

feat(parquet): Add config for datapage version#11151
svm1 wants to merge 1 commit intofacebookincubator:mainfrom
svm1:parq_2

Conversation

@svm1
Copy link
Copy Markdown
Collaborator

@svm1 svm1 commented Oct 2, 2024

Add config and session properties hive.parquet.writer.datapage-version, hive.parquet.writer.datapage_version, to determine the parquet writer datapage version (V1 or V2). Defaults to V1.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 2, 2024
@netlify
Copy link
Copy Markdown

netlify bot commented Oct 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit cb03f98
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67a56b75205d210008739b47

@svm1 svm1 force-pushed the parq_2 branch 2 times, most recently from 40a6c6c to 4982e1f Compare November 26, 2024 09:47
@svm1
Copy link
Copy Markdown
Collaborator Author

svm1 commented Nov 26, 2024

@yingsu00 @majetideepak Thank you for reviewing - made all necessary changes, please take a look!

@svm1 svm1 force-pushed the parq_2 branch 2 times, most recently from 2f489a5 to 1807de1 Compare January 25, 2025 02:54
@majetideepak majetideepak changed the title Add support for parquet_writer_version session property feat(parquet): Add support for parquet version config Jan 29, 2025
@majetideepak
Copy link
Copy Markdown
Collaborator

majetideepak commented Jan 29, 2025

@svm1 Arrow has two versions a user can set ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6), and the issue here prestodb/presto#22595, I see the Presto ParquetWriterVersion maps to ParquetVersion.

https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258

@jkhaliqi Can you please evaluate these options with respect to RLE V2?

@svm1
Copy link
Copy Markdown
Collaborator Author

svm1 commented Jan 29, 2025

@svm1 Arrow has two versions a user can set ParquetVersion and ParquetDataPageVersion. Based on the defaults (2.6), and the issue here prestodb/presto#22595, I see the Presto ParquetWriterVersion maps to ParquetVersion.

https://github.com/apache/arrow/blob/main/cpp/src/parquet/properties.h#L258

@jkhaliqi Can you please evaluate these options with respect to RLE V2?

@majetideepak

I understand the two different versions - ParquetVersion referring to the format version (2.4, 2.6, etc) and ParquetDataPageVersion referring to the datapage (V1 or V2).

I think we had a discussion about this very debate back when I first began working on this fix, and we determined that the Presto parquet_writer_version property does map to ParquetDataPageVersion - #9700 (comment)

I also investigated the Presto Java source - parquet_writer_version there maps to the org.apache.parquet.column.ParquetProperties.WriterVersion enum, which itself contains only PARQUET_1_0 and PARQUET_2_0, likely representing the V1/V2 datapage - https://www.javadoc.io/doc/org.apache.parquet/parquet-column/1.10.1/org/apache/parquet/column/ParquetProperties.WriterVersion.html#PARQUET_1_0

The Presto docs also seem to point to this mapping:

Presto now supports Parquet writer versions V1 and V2 for the Hive catalog. It can be toggled using the session property parquet_writer_version and the config property hive.parquet.writer.version. Valid values for these properties are PARQUET_1_0 and PARQUET_2_0. Default is PARQUET_1_0.

The format version also appears far more granular than simply "1 vs 2" (https://github.com/apache/parquet-format/blob/master/CHANGES.md).

Therefore I believe the mapping established in this PR would be correct - please correct me if I am mistaken.

@majetideepak
Copy link
Copy Markdown
Collaborator

@svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but ParquetVersion.
Can you verify this?

@majetideepak
Copy link
Copy Markdown
Collaborator

majetideepak commented Jan 29, 2025

@svm1
Copy link
Copy Markdown
Collaborator Author

svm1 commented Feb 4, 2025

@majetideepak @czentgr

@svm1 The issue reported here prestodb/presto#22595 hints that it is not the DataPageVersion but ParquetVersion. Can you verify this?

After a thorough investigation, I believe there is a misunderstanding regarding the nature of the original issue, and that the issue might not be exactly as described.

The issue is based on the following observation:

Even if when setting set session hive.parquet_writer_version='PARQUET_1_0'; Parquet data is written in format_version: 2.6

However, I don't think the state of hive.parquet_writer_version would ever affect the FormatVersion used, because the session property seems to actually refer to the DataPageVersion.

I was able to validate this by stepping through the Presto Java Parquet writer code. I observed that modifying hive.parquet_writer_version resulted in the code path diverging at the following point:

switch (parquetProperties.getWriterVersion()) {
    case PARQUET_1_0:
        return new PrimitiveColumnWriterV1(prestoType,
                columnDescriptor,
                getValueWriter(parquetProperties.newValuesWriter(columnDescriptor), prestoType, columnDescriptor.getPrimitiveType()),
                parquetProperties.newDefinitionLevelWriter(columnDescriptor),
                parquetProperties.newRepetitionLevelWriter(columnDescriptor),
                compressionCodecName,
                parquetProperties.getPageSizeThreshold());
    case PARQUET_2_0:
        return new PrimitiveColumnWriterV2(prestoType,
                columnDescriptor,
                getValueWriter(parquetProperties.newValuesWriter(columnDescriptor), prestoType, columnDescriptor.getPrimitiveType()),
                parquetProperties.newDefinitionLevelEncoder(columnDescriptor),
                parquetProperties.newRepetitionLevelEncoder(columnDescriptor),
                compressionCodecName,
                parquetProperties.getPageSizeThreshold());

I observed that the WriterVersion field of parquetProperties is being populated with the value of the session property, hive.parquet_writer_version - this value then dictates whether a V1 or V2 writer should be instantiated, as seen above.

@jkhaliqi and I reviewed this and conducted some experiments to understand the current behavior in Java. The following table shows the resulting value of the FormatVersion, with different values of the session property used:

Presto Java:

hive.parquet_writer_version value DataPage observed in code path FormatVersion observed via output file
PARQUET_1_0 V1 1.0
PARQUET_2_0 V2 1.0

In all cases, the resulting Parquet files (analyzed via parquet-tools) are consistently created with FormatVersion 1.0, regardless of the session property value.

Additionally, we noticed that the encoding for boolean columns aligns with the expected DataPage types for each version:

PARQUET_1_0/V1 file:

jacobkhaliqi@Jacobs-MBP-69 ~ % parquet-tools meta Downloads/20250204_005554_00028_8adci_19f082cd-df97-4545-bd07-94f058abd33e
{"NumRowGroups":1,"RowGroups":[{"NumRows":1,"TotalByteSize":44,"Columns":[{"PathInSchema":["Flag"],"Type":"BOOLEAN","Encodings":["BIT_PACKED","RLE","PLAIN"],"CompressedSize":44,"UncompressedSize":24,"NumValues":1,"NullCount":0,"MaxValue":true,"MinValue":true,"CompressionCodec":"GZIP"}]}]}

PARQUET_2_0/V2 file:

jacobkhaliqi@Jacobs-MBP-69 ~ % parquet-tools meta Downloads/20250204_005648_00031_8adci_ca1a5061-bad4-4b29-b8a5-49b04d9073c4
{"NumRowGroups":1,"RowGroups":[{"NumRows":1,"TotalByteSize":50,"Columns":[{"PathInSchema":["Flag"],"Type":"BOOLEAN","Encodings":["RLE"],"CompressedSize":50,"UncompressedSize":30,"NumValues":1,"NullCount":0,"MaxValue":true,"MinValue":true,"CompressionCodec":"GZIP"}]}]}

Therefore, I believe hive.parquet_writer_version refers to the DataPageVersion and not the FormatVersion. The key change here is ensuring that this session property toggles the DataPageVersion in Prestissimo, similar to the existing behavior in Presto Java as implemented here: prestodb/presto#20957.

@svm1 svm1 force-pushed the parq_2 branch 2 times, most recently from 54cc8f1 to 0e470ea Compare February 4, 2025 20:28
Copy link
Copy Markdown
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@czentgr
Copy link
Copy Markdown
Collaborator

czentgr commented Feb 5, 2025

The doc for the flag should go here: https://github.com/facebookincubator/velox/blob/main/velox/docs/configs.rst#parquet-file-format-configuration

@svm1
Copy link
Copy Markdown
Collaborator Author

svm1 commented Feb 5, 2025

@majetideepak @czentgr If we're doing away with Presto naming here, I think it would be less ambiguous/more intuitive to have the values for this flag be either "V1" or "V2"? As opposed to "PARQUET_1_0"/"PARQUET_2_0".

@majetideepak
Copy link
Copy Markdown
Collaborator

@svm1 yes, let's use "V1" or "V2" for Velox.

@svm1 svm1 force-pushed the parq_2 branch 2 times, most recently from dcb4a9b to a11bd66 Compare February 5, 2025 21:48
@svm1
Copy link
Copy Markdown
Collaborator Author

svm1 commented Feb 5, 2025

Documentation change pushed.

Copy link
Copy Markdown
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for all the revisions and investigation @svm1

@svm1 svm1 force-pushed the parq_2 branch 3 times, most recently from cf5fd27 to 0ff765b Compare February 5, 2025 23:55
Copy link
Copy Markdown
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Lets add more tests.

@svm1 svm1 force-pushed the parq_2 branch 2 times, most recently from b22c8c3 to 66295c9 Compare February 6, 2025 23:36
Copy link
Copy Markdown
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svm1 last comments. Thanks.

Copy link
Copy Markdown
Collaborator

@majetideepak majetideepak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @svm1

@majetideepak majetideepak added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Feb 7, 2025
@facebook-github-bot
Copy link
Copy Markdown
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@kgpai merged this pull request in e14fc1d.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. Merged ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants