Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads #740
Conversation
Preview URL: https://hhhizzz.github.io/arrow-site. If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
Thank you 🙏 I plan to work on this more today.
I ran out of time to review this carefully, but I will do so first thing tomorrow.
alamb left a comment
I read this blog post this morning, and I really enjoyed it -- very impressive @hhhizzz
For other reviewers, you can see the rendered preview here https://hhhizzz.github.io/arrow-site/blog/2025/12/03/parquet-late-materialization-deep-dive/
I think this post could be published as is, though I think the introductory image is confusing and it would be better to change it. I have a proposal to improve it.
I also have some ideas on how to improve the text which I will submit as other PRs for your consideration
> Borrowing Abadi's classification from his [paper](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf), the target architecture is **LM-pipelined**: interleaving predicates and data column access instead of reading all columns at once and trying to **stitch them back together** into rows.
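To make the LM-pipelined idea concrete, here is a minimal sketch in plain Rust (the column names and helper functions are made up for illustration, not the actual arrow-rs reader code): the predicate column is decoded first, and the surviving row indexes drive which rows of the remaining columns are ever materialized.

```rust
// Hypothetical sketch of LM-pipelined evaluation: decode the predicate
// column first, derive a row selection, then decode only the selected
// rows of the payload columns.
struct Selection {
    /// Absolute row indexes (within the row group) that passed the predicate.
    rows: Vec<usize>,
}

fn evaluate_predicate(price: &[f64]) -> Selection {
    // e.g. WHERE price > 100.0
    let rows = price
        .iter()
        .enumerate()
        .filter(|(_, p)| **p > 100.0)
        .map(|(i, _)| i)
        .collect();
    Selection { rows }
}

fn materialize_selected(quantity: &[i64], selection: &Selection) -> Vec<i64> {
    // Only the surviving rows of the other columns are ever materialized.
    selection.rows.iter().map(|&i| quantity[i]).collect()
}

fn main() {
    let price = vec![50.0, 150.0, 99.0, 300.0];
    let quantity = vec![1, 2, 3, 4];
    let sel = evaluate_predicate(&price);
    let qty_out = materialize_selected(&quantity, &sel);
    assert_eq!(qty_out, vec![2, 4]); // rows 1 and 3 passed the predicate
}
```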
> <figure style="text-align: center;">
I found this figure a little confusing as it refers to columns that don't appear in the blog post (e.g. Lineitem and Shipdate, which sound similar to, but are not quite the same as, the TPC-H column names).
I recommend we either add a caption explaining this graphic more, or change it to use a different image (hhhizzz#1).
Update: I now see it is a reworked example from the paper. I still think hhhizzz#1 is a better image for the intro
alamb left a comment
Thank you again @hhhizzz and @devanbenz
I took the liberty of pushing some small commits to this PR to
- add links to the author names
- fix broken links
I have a few small suggestions, but I think this blog is ready to go. Let's plan to publish it later this week (I'll also post in the arrow-rs discord channel to see if anyone else is interested in reviewing)
Also, if he has time, I suspect @XiangpengHao may be interested in this post, and I would value his feedback.
> Chained filtering is a **hair-pulling** exercise in coordinate systems. "Row 1" in filter N might actually be "Row 10,001" in the file due to prior filters.
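To illustrate the relative-vs-absolute bookkeeping this excerpt describes, here is a toy sketch using plain vectors and a made-up helper (not the real `RowSelection` type): filter 1 has already narrowed a 20,000-row group down to its last 10,000 rows, so filter 2's local "row 1" is actually absolute row 10,001.

```rust
// Toy illustration of the coordinate-translation problem: indexes that are
// relative to one filter's output must be mapped back to absolute row
// numbers in the file.
fn relative_to_absolute(survivors: &[usize], relative: usize) -> usize {
    // `survivors` holds the absolute indexes of rows that passed the
    // previous filters, in order; a later filter only sees positions
    // 0..survivors.len().
    survivors[relative]
}

fn main() {
    // Filter 1 kept only the second half of a 20,000-row group.
    let survivors: Vec<usize> = (10_000..20_000).collect();

    // Filter 2 reports a match at its local row 1 ...
    let absolute = relative_to_absolute(&survivors, 1);

    // ... which is really absolute row 10,001 in the file.
    assert_eq!(absolute, 10_001);
}
```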
> * **How do we keep the train on the rails?**: We [fuzz test] every `RowSelection` operation (`split_off`, `and_then`, `trim`). We need absolute certainty that our translation between relative and absolute offsets is pixel-perfect. This correctness is the bedrock that keeps the Reader stable under the triple threat of batch boundaries, sparse selections, and page pruning.
Too many mixed metaphors in this paragraph, but that might just be personal taste.
I think some metaphors make it easier to read.
That's fair, but 'pixel-perfect' in this context is a bit odd. I understand you can't make any mistakes here, but the description felt too hyperbolic to me for something that's relatively simple.
I'll accept that it's a matter of preference.
Yeah, I think this is a stylistic thing -- while I likely would not have used this style, I think it does get the point across and lends a different voice (@hhhizzz's) to the narrative.
It is a good point
Thanks! Appreciate the flexibility.
pepijnve left a comment
Nice blog post. Looks like a great improvement to the Parquet reader. I've added the notes I made while reading through the post.
This is an excellent blog post! From my read it seems good to ship as is.
Is there any thought on using runtime data to refine the approach taken?
I'm thinking something like if we have 32 files after processing 2-3 we probably have a good idea of how selective each filter is and how large the columns are on average. Could we use this to tune filter evaluation order, how we represent the mask, etc.? Or even switch back to eager materialization or drop specific filters if the filters are not selective at all?
> ### 3.3 Smart Caching
> Late materialization puts us in a bit of a **Catch-22**: arrow-rs evaluates predicates progressively on all rows in a row group. This approach uses a small number of large I/Os, which performs well for slow remote storage systems such as object storage. However, it means we may need to read the same column twice—first to filter it, and then again to produce the final rows necessary for the output projection. Without caching, you're **paying double** for the same data: decoding it once for the predicate, and again for the output. [`CachedArrayReader`], introduced in [#arrow-rs/7850], fixes this: **stash the batch the first time you see it, and reuse it later.**
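As a rough sketch of the caching idea (a toy stand-in, not the actual `CachedArrayReader` implementation): the first read of a batch, done for predicate evaluation, stores the decoded values so the later read for the output projection does not decode the Parquet pages again.

```rust
use std::collections::HashMap;

// Toy sketch of the caching idea: the first read of a batch decodes it and
// stores the result, so the second read (for the output projection) is
// served from the cache instead of decoding the Parquet pages again.
struct CachingReader {
    cache: HashMap<usize, Vec<i64>>, // batch index -> decoded values
    decode_count: usize,             // how many "expensive" decodes happened
}

impl CachingReader {
    fn new() -> Self {
        Self { cache: HashMap::new(), decode_count: 0 }
    }

    fn read_batch(&mut self, batch: usize) -> &[i64] {
        if !self.cache.contains_key(&batch) {
            // Pretend this is the expensive Parquet page decode.
            self.decode_count += 1;
            self.cache.insert(batch, vec![batch as i64; 4]);
        }
        &self.cache[&batch]
    }
}

fn main() {
    let mut reader = CachingReader::new();
    let _for_predicate = reader.read_batch(0).to_vec(); // decoded here
    let _for_output = reader.read_batch(0).to_vec();    // served from cache
    assert_eq!(reader.decode_count, 1);
}
```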
Does the cache also apply during the narrowing? I.e. if I have the filter (a = 5 AND b = 6) OR (b = 6 AND c = 8), it sounds like this will produce two steps; will b = 6, or just the decoded b, be cached between those steps?
That’s a really good question. The RowSelection for b = 6 will be cached, and the decoded values of b will be cached. However, the engine will still perform a full filter on b = 6 again using the cached values. Avoiding this redundant filtering should be handled by the SQL engine’s optimizer. A storage engine like arrow-rs is only responsible for executing the already optimized predicate it receives.
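A toy illustration of that answer in plain Rust (not the arrow-rs internals): across the two predicate steps, column `b` is decoded once and the cached array is reused, but the comparison `b = 6` is still evaluated in each step.

```rust
// Toy illustration: `b` is decoded once (and cached), but the predicate
// `b = 6` is re-evaluated against the cached array in the second step.
fn main() {
    // Pretend this decode happened once, during the first predicate step,
    // and was cached by the reader.
    let b: Vec<i64> = vec![6, 1, 6, 9];
    let decodes = 1;
    let mut comparisons = 0;

    // Step 1: (a = 5 AND b = 6) evaluates `b = 6`.
    let step1: Vec<bool> = b.iter().map(|v| { comparisons += 1; *v == 6 }).collect();

    // Step 2: (b = 6 AND c = 8) evaluates `b = 6` again, but on the cached
    // values of `b`; no second decode happens.
    let step2: Vec<bool> = b.iter().map(|v| { comparisons += 1; *v == 6 }).collect();

    assert_eq!(decodes, 1);               // column decoded once
    assert_eq!(comparisons, 2 * b.len()); // comparison ran in both steps
    assert_eq!(step1, step2);
}
```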
I took another read through this blog and I think it looks great -- thank you again @hhhizzz @XiangpengHao @pepijnve and @adriangb. I'll plan to update the date and publish the post tomorrow, unless anyone would like more time to review.
This is so great.
Thanks again @hhhizzz -- the blog is published here: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/
I plan to publicize this blog link around various social media sites. For LinkedIn in particular, is this the correct URL to use for your profile? https://www.linkedin.com/in/qiwei-huang-aa175811b (I can omit such a link if you prefer.)
Yes, that's me, thank you!
See preview URL: https://hhhizzz.github.io/arrow-site/blog/
cc @alamb
Original blog and Chinese translation.
See earlier draft here: hhhizzz/arrow-rs#10