feat(parquet): Add boolean rle decoder for Parquet #11282

Closed
jkhaliqi wants to merge 1 commit into facebookincubator:main from jkhaliqi:rle_encoding

@jkhaliqi
Collaborator

@jkhaliqi jkhaliqi commented Oct 16, 2024

RLE/BP (the run-length encoding / bit-packing hybrid) is an encoding for boolean values in Parquet version 2 files.
https://parquet.apache.org/docs/file-format/data-pages/encodings/
Fixes: #10943
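
For readers unfamiliar with the format, here is a minimal sketch of decoding booleans (bit width 1) from an RLE/bit-packing hybrid stream, following the layout described on the encodings page linked above. It is not the code from this PR: the function name `decodeBooleanRle` and the constant `kLengthPrefixBytes` are hypothetical, and the sketch assumes the input starts with the 4-byte little-endian length prefix.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical standalone decoder, not the Velox implementation.
// Decodes numValues booleans from an RLE/bit-packing hybrid stream with
// bit width 1. Assumes `data` begins with a 4-byte little-endian length.
std::vector<bool> decodeBooleanRle(
    const std::vector<uint8_t>& data,
    size_t numValues) {
  constexpr size_t kLengthPrefixBytes = 4;
  if (data.size() < kLengthPrefixBytes) {
    throw std::runtime_error("Truncated length prefix");
  }
  const uint32_t length = static_cast<uint32_t>(data[0]) |
      static_cast<uint32_t>(data[1]) << 8 |
      static_cast<uint32_t>(data[2]) << 16 |
      static_cast<uint32_t>(data[3]) << 24;
  size_t pos = kLengthPrefixBytes;
  const size_t end = pos + length;
  if (end > data.size()) {
    throw std::runtime_error("Invalid length (corrupt data page?)");
  }
  // Reads one byte and fails if the encoded data is exhausted.
  auto nextByte = [&]() -> uint8_t {
    if (pos >= end) {
      throw std::runtime_error("Unexpected end of encoded data");
    }
    return data[pos++];
  };

  std::vector<bool> out;
  out.reserve(numValues);
  while (out.size() < numValues) {
    // Each run starts with a varint header (7 payload bits per byte).
    uint64_t header = 0;
    for (int shift = 0;; shift += 7) {
      if (shift > 63) {
        throw std::runtime_error("Varint header too long");
      }
      const uint8_t byte = nextByte();
      header |= static_cast<uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) {
        break;
      }
    }
    const uint64_t count = header >> 1;
    if (header & 1) {
      // Bit-packed run: `count` groups of 8 values, one byte per group at
      // bit width 1, least significant bit first.
      for (uint64_t group = 0; group < count && out.size() < numValues;
           ++group) {
        const uint8_t bits = nextByte();
        for (int bit = 0; bit < 8 && out.size() < numValues; ++bit) {
          out.push_back(((bits >> bit) & 1) != 0);
        }
      }
    } else {
      // RLE run: one byte holds the value, repeated `count` times.
      const bool value = (nextByte() & 1) != 0;
      for (uint64_t i = 0; i < count && out.size() < numValues; ++i) {
        out.push_back(value);
      }
    }
  }
  return out;
}
```

For example, the bytes `04 00 00 00 0A 01 03 05` hold an RLE run of five `true` values followed by one bit-packed group; decoding 8 values consumes the run and the first three bits of the group.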

@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 16, 2024
@netlify

netlify bot commented Oct 16, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 573b8af
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67be325a8dc3010008e6bb83

@Yuhta Yuhta changed the title from "added in rle encoder for boolean" to "added in rle decoder for boolean" Oct 17, 2024
@Yuhta Yuhta self-requested a review October 17, 2024 21:04
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 358855b to f638683 Compare October 29, 2024 19:22
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 3 times, most recently from f5735ba to 163bdb3 Compare October 30, 2024 18:03
@jkhaliqi jkhaliqi marked this pull request as ready for review October 30, 2024 18:17
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from a8f69b9 to c25886f Compare October 30, 2024 19:24
@minhancao minhancao force-pushed the rle_encoding branch 2 times, most recently from e8e82c4 to c760214 Compare November 1, 2024 20:45
@ethanyzhang

@yingsu00 can you also take a look at this PR? Thank you!

@majetideepak majetideepak changed the title from "added in rle decoder for boolean" to "feat: Add boolean rle decoder for Parquet" Nov 12, 2024
@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 99e8783 to 4b8412d Compare November 12, 2024 22:32
Collaborator

The magic number 4 is used multiple times. Please make it a static const here and give it an appropriate name.

You don't need super:: here; there is no ambiguity.

Collaborator

Nit: Do we need this comment once the 4 is no longer a magic number?

Collaborator Author

Right, removing the comment since it is not necessary. Thank you!

Collaborator

This should be a constexpr.

Collaborator Author

Updated to constexpr, thank you!

Collaborator

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.

if constexpr (hasNulls)

That means that if the template argument is false, the code guarded by this if is not generated.
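
A minimal sketch of the pattern being suggested, assuming `hasNulls` is a `bool` template parameter as in the snippet above. The function `countNonNull` and its arguments are hypothetical and only illustrate where `if constexpr` goes; they are not taken from the PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not code from the PR. `nulls` holds one flag per
// row (true means null) and is only consulted when hasNulls is true.
template <bool hasNulls>
int32_t countNonNull(const std::vector<bool>& nulls, int32_t numRows) {
  // The function is an ordinary template; only the condition is constexpr.
  if constexpr (hasNulls) {
    int32_t count = 0;
    for (int32_t i = 0; i < numRows; ++i) {
      if (!nulls[i]) {
        ++count;
      }
    }
    return count;
  } else {
    // For hasNulls == false the loop above is discarded at compile time.
    return numRows;
  }
}
```

Instantiating `countNonNull<false>` never touches `nulls`, so callers without a null mask can pass an empty vector.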

Collaborator

Shouldn't this be RleBpDecoder::skip(numValues) to disambiguate the function from this->skip(numValues)?
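
To illustrate the point, a small sketch of why the qualified call matters when a derived class declares a function with the same name as the base. `BaseDecoder` and `BooleanDecoder` are hypothetical stand-ins; the real signatures in the PR may differ.

```cpp
#include <cstdint>

// Hypothetical stand-in for the base decoder.
class BaseDecoder {
 public:
  void skip(uint64_t numValues) {
    position_ += numValues;
  }
  uint64_t position() const {
    return position_;
  }

 private:
  uint64_t position_{0};
};

// Hypothetical stand-in for the derived boolean decoder.
class BooleanDecoder : public BaseDecoder {
 public:
  // This declaration hides BaseDecoder::skip.
  void skip(uint64_t numValues) {
    ++skipCalls_;
    // An unqualified skip(numValues) here would resolve to this function
    // and recurse; the qualified name selects the base implementation.
    BaseDecoder::skip(numValues);
  }
  uint64_t skipCalls() const {
    return skipCalls_;
  }

 private:
  uint64_t skipCalls_{0};
};
```

Naming the base class explicitly also makes the call site readable without knowing about any `super` alias.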

Collaborator

Let's also initialize it to 0.

Collaborator

Why is the size of the vector 20?

Collaborator Author

Sorry, I was using this output buffer for some other testing. Since it's not being used anymore, I will delete this line of code.

Collaborator

Is there no problem if toSkip > 0 but we have already advanced current by 1 on line 97? I suppose this has something to do with what the visitor represents? This might need a comment to explain why this is OK.

Or maybe some comment on how the algorithm works when the read occurs.

Collaborator

This needs to be named numBytes_.

Collaborator

Same comment above about using super vs the base class name to disambiguate.

Collaborator

Now that you've replaced super with the actual base class, we don't need this anymore. I did not see that you defined super here, which explains why it was working before. But someone not familiar with Java would be confused, so it is better to be explicit.

Collaborator Author

Thank you, removed this line!

Collaborator

Let's replace the std::to_string with the fmt-style formatting provided by VELOX_FAIL, like so:

VELOX_FAIL("Received invalid length : {} (corrupt data page?)", len);

for all occurrences.

Collaborator Author

Thank you, updated!

Collaborator

The function itself doesn't need to be constexpr. It is the if condition that should be constexpr.

if constexpr (hasNulls)

That means that if the template argument is false, the code guarded by this if is not generated.

Collaborator

len is not modified here and we don't need a reference.

Collaborator Author

updated, thank you!

Collaborator

What is not clear to me is why you modify the base class member here. You have your own bufferStart_, so why not process and modify it using the base class methods (which you do to some degree)?
The base class member (with the same name) is initialized in the constructor. But because you declared a new member of the same name on line 118 (which is never used), you need to explicitly refer to the base class here to reach the inherited member.
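
The shadowing described here can be reproduced in a few lines. This is a sketch with hypothetical types (`BaseDecoder`, `ShadowingDecoder`); only the member name `bufferStart_` comes from the discussion.

```cpp
// Hypothetical base class that owns and initializes bufferStart_.
struct BaseDecoder {
  explicit BaseDecoder(const char* start) : bufferStart_(start) {}
  const char* bufferStart_;
};

// Hypothetical derived class that redeclares a member of the same name.
struct ShadowingDecoder : BaseDecoder {
  explicit ShadowingDecoder(const char* start) : BaseDecoder(start) {}

  // Hides BaseDecoder::bufferStart_ and is never initialized from `start`.
  const char* bufferStart_{nullptr};

  // Unqualified lookup finds the derived member, which is still null.
  const char* unqualified() const {
    return bufferStart_;
  }

  // The qualified name is required to reach the inherited member.
  const char* qualified() const {
    return BaseDecoder::bufferStart_;
  }
};
```

Removing the redundant derived member makes the unqualified name refer to the inherited one again, which is the cleanup the comment asks for.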

Collaborator Author

Good catch. I forgot to remove my own bufferStart_, which was being used for something else. I cleaned up the code by removing it, thank you!

@jkhaliqi jkhaliqi force-pushed the rle_encoding branch 2 times, most recently from 6bbaab0 to 27ad3f1 Compare November 23, 2024 00:09
@ethanyzhang

@majetideepak should we have another pass?

@majetideepak
Collaborator

majetideepak commented Feb 22, 2025

@jkhaliqi can you clarify why we cannot use the existing velox::parquet::RleBpDataDecoder class? Thanks.

@jkhaliqi
Collaborator Author

@majetideepak There seem to be too many errors when using RleBpDataDecoder, for example with the processRle function, so I'm not sure whether its implementation is complete. @yingsu00 might know more about that, if possible?

@majetideepak
Collaborator

majetideepak commented Feb 23, 2025

@jkhaliqi if the fastPath is not supported for bool type, we can skip that via dwio::common::useFastPath().. && !std::is_same_v<typename Visitor::DataType, bool>?
The remaining code seems very similar. Can you check?
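
A rough sketch of the compile-time gate suggested above. The visitor types and `fastPathSupported` are hypothetical placeholders (the real `dwio::common::useFastPath` signature is not shown in this thread); only the `std::is_same_v<typename Visitor::DataType, bool>` check is taken from the comment.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical visitor types; only the DataType alias matters here.
struct Int32Visitor {
  using DataType = int32_t;
};
struct BoolVisitor {
  using DataType = bool;
};

// Hypothetical placeholder for the existing fast-path check.
template <typename Visitor>
constexpr bool fastPathSupported() {
  return true;
}

// Takes the fast path only when it is supported and the data type is
// not bool, so boolean columns fall through to the generic path.
template <typename Visitor>
constexpr bool shouldUseFastPath() {
  return fastPathSupported<Visitor>() &&
      !std::is_same_v<typename Visitor::DataType, bool>;
}
```

Because the result is a compile-time constant, it can be used inside `if constexpr` so the fast-path code is never instantiated for `bool`.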

Collaborator

@majetideepak majetideepak left a comment

@jkhaliqi Two minor comments. This looks good!

Co-authored-by: Minhan Cao <minhan.duc.cao@gmail.com>
@majetideepak majetideepak added the ready-to-merge label Feb 25, 2025
@facebook-github-bot
Contributor

@kevinwilfong has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@kevinwilfong merged this pull request in 9b2ad44.

@jkhaliqi jkhaliqi deleted the rle_encoding branch December 16, 2025 01:29