Blog post about efficient filter representation in Parquet filter pushdown #10
hhhizzz wants to merge 6 commits into
alamb left a comment
First of all, this is amazing @hhhizzz -- I really like how you frame the post / decoder as a mini query engine and then proceed to teach the reader what late materialization is and how it works.
The figures are also quite amazing.
I suggest you incorporate any suggestions you would like, and then we move this content to the arrow-site repo for final polish and iteration.
Also, if you are willing, I am happy to help work on / propose changes directly to the blog, but I don't want to make a bunch of conflicts if you still have changes you plan to make.
Mechanics of posting to Arrow Blog
To publish this content on the arrow blog (https://arrow.apache.org/blog), we need to make a PR to https://github.com/apache/arrow-site. Once the PR to arrow-site is merged, it will automatically be published.
Here is an example of such a PR: apache/arrow-site#720 (I am also happy to help translate this content into a PR as well).
I didn't see this image referenced anywhere in the file
I really like this image as a high level overview of the late materialization process -- but it didn't seem to be referenced anywhere in the text
I found this image a little confusing for two reasons:
- There is no Step 2
- The "Decoded B Values" didn't make sense -- there were 3 decoded values, but Row Mask 1 seems to have 5 that pass (so I think Decoded B Values should have 5 green boxes)
This is (also) an amazing image. One minor suggestion is to use a term other than "Track"
Perhaps instead of "Track A (Selection)" you could use a term like "A > 10 Filter Results (Selection)"
And instead of "Track B" use a term like "B < 5 Filter Results"
And then instead of "Result Track" use a term like "Final Filter Result"
Starting from [v57.1.0](https://github.com/apache/arrow-rs/tree/57.1.0), `RowSelection` can switch between RLE (selectors) and bitmask. Bitmasks are faster when gaps are tiny and sparsity is high; RLE is friendlier to page-level skips. Details show up again in 3.3.

**Execution sketch** (`SELECT * FROM table WHERE A > 10 AND B < 5`):
1. **Initial**: `selection = None` (equivalent to "select all").
Eventually I suggest we add links to the code / docs for ArrayReader, ReadPlanBuilder, etc -- I suggest we do this after a few more passes on the content
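To make the quoted execution sketch concrete, here is a minimal, std-only Rust sketch of how a selection could narrow as each predicate in `A > 10 AND B < 5` is applied. The `Selection` alias and the `apply_predicate` helper are hypothetical names for illustration, not arrow-rs APIs; a plain `Vec<bool>` stands in for the real `RowSelection`, and the columns are assumed to be already-decoded `i64` slices.

```rust
/// Hypothetical stand-in for a row selection: `None` means "select all".
type Selection = Option<Vec<bool>>;

/// Evaluate `pred` only on rows that are still selected.
/// Returns the narrowed selection and the number of rows actually evaluated.
fn apply_predicate(
    selection: &Selection,
    column: &[i64],
    pred: impl Fn(i64) -> bool,
) -> (Selection, usize) {
    let mut mask = Vec::with_capacity(column.len());
    let mut evaluated = 0;
    for (i, &value) in column.iter().enumerate() {
        // With no selection yet, every row is a candidate.
        let selected = match selection {
            Some(m) => m[i],
            None => true,
        };
        if selected {
            evaluated += 1;
            mask.push(pred(value));
        } else {
            // Rows rejected by an earlier predicate are never evaluated again.
            mask.push(false);
        }
    }
    (Some(mask), evaluated)
}
```

With `A = [5, 12, 20, 3, 15, 11]` and `B = [1, 9, 2, 0, 4, 7]`, applying `A > 10` evaluates all 6 rows, and applying `B < 5` afterwards evaluates only the 4 rows that passed the first predicate; rows 2 and 4 (zero-based) survive both.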
---

## 2. Key Mechanics
I suggest calling this section "Late Materialization in the Rust Parquet Reader" and stopping after section 2.2, where you go through the actual code implementation that matches the high level overview you gave in section 1.
(In other words, I suggest starting a new major section at what is currently "### 2.3 Smart caching".)


### 2.3 Smart caching
I think this would be a great place to start the "## 3. Engineering challenges" section.
I think it would help if the introduction to that section said something like:
While relatively straightforward in theory, actually implementing Late Materialization in a production-grade system such as arrow-rs requires significant engineering. Previously, the effort required meant that such technology was typically only available in proprietary engines and was a struggle to implement in the open source community (see the DataFusion ticket about enabling filter pushdown). Now, after several years of effort, we are finally close to having Late Materialization as fast or faster in all cases. Getting to this point required several major implementation details, which are described below.


### 3.3 Adaptive RowSelection policy (bitmask vs. RLE)
I recommend moving this section to the top of the engineering challenges section and titling it "Row Selection Representation".
The idea is that different RowSelection representations have different tradeoffs for different filter patterns, and that to improve performance we needed to provide different implementations.
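As a rough illustration of that tradeoff, the following std-only Rust sketch converts between a run-length (selector) form and a bitmask form and picks one with a simple heuristic. The `Selector` struct, the conversion helpers, and the average-run-length rule in `prefer_mask` (including its `threshold` parameter) are all hypothetical and chosen for illustration; this is not the policy or the API that arrow-rs implements.

```rust
/// Hypothetical run-length selector: select or skip `row_count` consecutive rows.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Compress a per-row mask into runs of select/skip.
fn mask_to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run when the row has the same select/skip state.
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        out.push(Selector { row_count: 1, skip });
    }
    out
}

/// Expand runs back into one bool per row.
fn selectors_to_mask(selectors: &[Selector]) -> Vec<bool> {
    let mut mask = Vec::new();
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}

/// Assumed heuristic: many short runs favor a bitmask, long runs favor RLE.
fn prefer_mask(selectors: &[Selector], threshold: usize) -> bool {
    if selectors.is_empty() {
        return false;
    }
    let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
    total_rows / selectors.len() < threshold
}
```

An alternating pattern such as select 1, skip 1 produces one selector per row, so the run-length form carries more overhead than a one-bit-per-row mask; a selection like select 100, skip 1000, select 50 is three selectors, and the long skip run is exactly what makes skipping whole pages straightforward.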
#### 3.3.2 The bitmask trap: missing pages
The bitmask introduces a new failure mode. Consider a simple example:
If we want to include this section, I think we need to set it up a bit more carefully (give it some more context).
Maybe we can talk about it as "an example of why high performance engineering is so complicated" -- we can also link to the ticket for the follow-on work. I'll dig that up later.
@alamb Thanks for the review! I’ve updated the images using the new drawing method. It’s getting late here (UTC+8), so I’ll update the rest tomorrow.
Thank you! I will try and make a PR in the arrow-site repo tomorrow. Note that if I create the PR, then in order for you to edit it, you will need to make PRs to my fork (which is fine). If you would like to retain the ability to edit the post directly (which is probably good), you will need to open the PR. I can create a branch and then you could potentially open a PR 🤔
I just realized I also want to publish a Chinese translation at the same time. I think it’s better if I create the PR and update both the English and Chinese versions based on your comments or anyone else’s feedback.
@alamb I submitted the PR, let me know if there is anything I can improve. apache/arrow-site#740
…ads (#740)
- closes apache/arrow-rs#8843
See preview URL: https://hhhizzz.github.io/arrow-site/blog/
cc @alamb
Original blog and Chinese translation. See earlier draft here: hhhizzz/arrow-rs#10
---------
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

apache#8843