Blog post about efficient filter representation in Parquet filter pushdown #10
alamb
left a comment
First of all, this is amazing @hhhizzz -- I really like how you frame the post / decoder as a mini query engine and then proceed to teach the reader what late materialization is and how it works.
The figures are also quite amazing.
I suggest you incorporate any suggestions you would like, and then we move this content to the arrow-site repo for final polish and iteration.
Also, if you are willing, I am happy to help work on / propose changes directly to the blog, but I don't want to make a bunch of conflicts if you still have changes you plan to make.
Mechanics of posting to Arrow Blog
To publish this content on the arrow blog (https://arrow.apache.org/blog), we need to make a PR to https://github.com/apache/arrow-site. Once the PR to arrow-site is merged, it will automatically be published.
Here is an example of such a PR: apache/arrow-site#720 (I am also happy to help translate this content into a PR)
Added the query to the image.
I didn't see this image referenced anywhere in the file
Added to the doc.
I really like this image as a high level overview of the late materialization process -- but it didn't seem to be referenced anywhere in the text
Added.
parquet/docs/fig2.png
Outdated
I found this image a little confusing for two reasons:
- There is no Step 2
- The "Decoded B Values" didn't make sense -- there were 3 decoded values, but Row Mask 1 seems to have 5 that pass (so I think Decoded B Values should have 5 green boxes)
Fixed in the new image
parquet/docs/fig3.png
Outdated
This is (also) an amazing image. One minor suggestion is to use a term other than "Track"
Perhaps instead of "Track A (Selection)" you could use a term like "A > 10 Filter Results (Selection)"
And instead of "Track B" use a term like "B < 5 Filter Results"
And then instead of "Result Track" use a term "Final Filter Result"
Updated in the new image.
| Starting from [v57.1.0](https://github.com/apache/arrow-rs/tree/57.1.0), `RowSelection` can switch between RLE (selectors) and bitmask. Bitmasks are faster when gaps are tiny and sparsity is high; RLE is friendlier to page-level skips. Details show up again in 3.3.
|
| **Execution sketch** (`SELECT * FROM table WHERE A > 10 AND B < 5`):
| 1. **Initial**: `selection = None` (equivalent to "select all").
Eventually I suggest we add links to the code / docs for ArrayReader, ReadPlanBuilder, etc -- I suggest we do this after a few more passes on the content
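For readers following the execution sketch quoted above, here is a minimal illustration of how a selection could be narrowed predicate by predicate for `A > 10 AND B < 5`. It is only a sketch under stated assumptions: `refine` and `late_materialize` are hypothetical helpers, plain `Vec<bool>` masks stand in for `RowSelection`, and none of this is the arrow-rs API.

```rust
/// Hypothetical helper (not part of arrow-rs): test `pred` against `column`
/// and return the narrowed selection as one boolean per row.
fn refine(
    selection: Option<Vec<bool>>,
    column: &[i64],
    pred: impl Fn(i64) -> bool,
) -> Vec<bool> {
    match selection {
        // `None` is equivalent to "select all": every row is tested.
        None => column.iter().map(|&v| pred(v)).collect(),
        // Rows already filtered out are never tested again (`&&` short-circuits).
        // A real reader would also avoid decoding them.
        Some(mask) => mask
            .iter()
            .zip(column)
            .map(|(&keep, &v)| keep && pred(v))
            .collect(),
    }
}

/// Evaluate `A > 10 AND B < 5` one predicate at a time, then build output
/// rows only for the positions that survived both filters.
fn late_materialize(a: &[i64], b: &[i64]) -> Vec<(i64, i64)> {
    let selection: Option<Vec<bool>> = None; // initial state: select all
    let selection = refine(selection, a, |v| v > 10);
    let selection = refine(Some(selection), b, |v| v < 5);
    selection
        .iter()
        .zip(a.iter().zip(b))
        .filter_map(|(&keep, (&x, &y))| keep.then_some((x, y)))
        .collect()
}
```

With `a = [5, 11, 20, 12]` and `b = [1, 7, 3, 4]`, only the last two rows pass both predicates, so only those two rows are materialized.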
|
| ---
|
| ## 2. Key Mechanics
I suggest calling this section "Late Materialization in the Rust Parquet Reader" and stopping after section 2.2, where you go through the actual code implementation that matches the high level overview you gave in section 1
(in other words, I suggest starting a new major section at what is currently "### 2.3 Smart caching")
|
| 
|
| ### 2.3 Smart caching
I think this would be a great place to start the "## 3. Engineering challenges" section
It would help, I think, if the introduction to that section said something like:
While relatively straightforward in theory, actually implementing Late Materialization in a production-grade system such as arrow-rs requires significant engineering. Previously, the effort required meant that such technology was typically only available in proprietary engines, and it has been a struggle to implement in the open source community (see the DataFusion ticket about enabling filter pushdown). Now, after several years of effort, we are finally close to having Late Materialization as fast or faster in all cases. Getting to this point required several major implementation details, which are described below.
|
| 
|
| ### 3.3 Adaptive RowSelection policy (bitmask vs. RLE)
I recommend moving this section to the top of the engineering challenges section -- and title it "Row Selection Representation"
The idea being that different RowSelection representations have different tradeoffs for different filter patterns, and that in order to improve performance, we needed to provide different implementations
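To make the tradeoff between representations concrete, the sketch below models the same selection both as run-length runs and as a per-row bitmask, with a toy rule for choosing between them. All names here (`Run`, `runs_to_mask`, `mask_to_runs`, `prefer_mask`) and the threshold are assumptions for illustration; the actual arrow-rs types and policy may differ.

```rust
/// Hypothetical run-length entry: `row_count` consecutive rows that are
/// either all skipped or all selected.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Run {
    skip: bool,
    row_count: usize,
}

/// Expand runs into one boolean per row (the bitmask form).
fn runs_to_mask(runs: &[Run]) -> Vec<bool> {
    runs.iter()
        .flat_map(|r| std::iter::repeat(!r.skip).take(r.row_count))
        .collect()
}

/// Collapse a bitmask back into runs by merging equal neighbours.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        if let Some(last) = runs.last_mut() {
            // `skip != selected` means the last run has the same kind as this row.
            if last.skip != selected {
                last.row_count += 1;
                continue;
            }
        }
        runs.push(Run { skip: !selected, row_count: 1 });
    }
    runs
}

/// Toy policy: prefer the bitmask when the average run is short.
/// The threshold is an assumed parameter, not a value taken from arrow-rs.
fn prefer_mask(runs: &[Run], avg_run_threshold: usize) -> bool {
    let rows: usize = runs.iter().map(|r| r.row_count).sum();
    !runs.is_empty() && rows / runs.len() < avg_run_threshold
}
```

Long runs stay compact in the run-length form and make it easy to see that a whole range can be skipped, while many short alternating runs cost one entry each and are cheaper to carry as bits.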
| ```
|
| #### 3.3.2 The bitmask trap: missing pages
| The bitmask introduces a new failure mode. Consider a simple example:
If we want to include this section, I think we need to set it up a bit more carefully (give it some more context).
Maybe we can talk about it as "an example of why high performance engineering is so complicated" -- we can also link to the ticket for the follow-on work. I'll dig that up later.
@alamb Thanks for the review! I’ve updated the images using the new drawing method. It’s getting late here (UTC+8), so I’ll update the rest tomorrow.
Thank you! I will try to make a PR in the arrow-site repo tomorrow. Note that if I create the PR, then in order for you to edit it, you will need to make PRs to my fork (which is fine). If you would like to retain the ability to edit the post directly (which is probably good), you will need to open the PR. I can create a branch and then you could potentially open a PR 🤔
I just realized I also want to publish a Chinese translation at the same time. I think it’s better if I create the PR and update both the English and Chinese versions based on your comments or anyone else’s feedback.
@alamb I submitted the PR, let me know if there is anything I can improve. apache/arrow-site#740
…ads (#740) - closes apache/arrow-rs#8843 See preview URL: https://hhhizzz.github.io/arrow-site/blog/ cc @alamb Original blog and Chinese translation. See earlier draft here: hhhizzz/arrow-rs#10 --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

apache#8843