Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads #740
Conversation
Preview URL: https://hhhizzz.github.io/arrow-site. If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
Thank you 🙏 I plan to work on this more today.
I ran out of time to review this carefully, but I will do so first thing tomorrow.
alamb left a comment
I read this blog post this morning, and I really enjoyed it -- very impressive @hhhizzz
For other reviewers, you can see the rendered preview here https://hhhizzz.github.io/arrow-site/blog/2025/12/03/parquet-late-materialization-deep-dive/
I think this post could be published as is, though I think the introductory image is confusing and it would be better to change it. I have a proposal to improve it.
I also have some ideas on how to improve the text which I will submit as other PRs for your consideration
> Borrowing Abadi's classification from his [paper](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf), the target architecture is **LM-pipelined**: interleaving predicates and data column access instead of reading all columns at once and trying to **stitch them back together** into rows.
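To make the LM-pipelined idea concrete, here is a minimal sketch in plain Rust (the column names and helper functions are made up for illustration, not the actual arrow-rs reader code): the predicate column is decoded first, and the surviving row indexes drive which rows of the remaining columns are ever materialized.

```rust
// Hypothetical sketch of LM-pipelined evaluation: decode the predicate
// column first, derive a row selection, then decode only the selected
// rows of the payload columns.
struct Selection {
    /// Absolute row indexes (within the row group) that passed the predicate.
    rows: Vec<usize>,
}

fn evaluate_predicate(price: &[f64]) -> Selection {
    // e.g. WHERE price > 100.0
    let rows = price
        .iter()
        .enumerate()
        .filter(|(_, p)| **p > 100.0)
        .map(|(i, _)| i)
        .collect();
    Selection { rows }
}

fn materialize_selected(quantity: &[i64], selection: &Selection) -> Vec<i64> {
    // Only the surviving rows of the other columns are ever materialized.
    selection.rows.iter().map(|&i| quantity[i]).collect()
}

fn main() {
    let price = vec![50.0, 150.0, 99.0, 300.0];
    let quantity = vec![1, 2, 3, 4];
    let sel = evaluate_predicate(&price);
    let qty_out = materialize_selected(&quantity, &sel);
    assert_eq!(qty_out, vec![2, 4]); // rows 1 and 3 passed the predicate
}
```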
> <figure style="text-align: center;">
I found this figure a little confusing as it refers to columns that don't appear in the blog post (e.g. Lineitem and Shipdate, which sound similar to, but are not quite the same as, the TPC-H column names).
I recommend we either add a caption explaining this graphic more, or change it to use a different image (hhhizzz#1).
Update: I now see it is a reworked example from the paper. I still think hhhizzz#1 is a better image for the intro
alamb left a comment
Thank you again @hhhizzz and @devanbenz
I took the liberty of pushing some small commits to this PR to
- add links to the author names
- fix broken links
I have a few small suggestions, but I think this blog is ready to go. Let's plan to publish it later this week (I'll also post in the arrow-rs discord channel to see if anyone else is interested in reviewing)
Also, if he has time, I suspect @XiangpengHao may be interested in this post, and I would value his feedback.
> Chained filtering is a **hair-pulling** exercise in coordinate systems. "Row 1" in filter N might actually be "Row 10,001" in the file due to prior filters.
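To illustrate the relative-vs-absolute bookkeeping this excerpt describes, here is a toy sketch using plain vectors and a made-up helper (not the real `RowSelection` type): filter 1 has already narrowed a 20,000-row group down to its last 10,000 rows, so filter 2's local "row 1" is actually absolute row 10,001.

```rust
// Toy illustration of the coordinate-translation problem: indexes that are
// relative to one filter's output must be mapped back to absolute row
// numbers in the file.
fn relative_to_absolute(survivors: &[usize], relative: usize) -> usize {
    // `survivors` holds the absolute indexes of rows that passed the
    // previous filters, in order; a later filter only sees positions
    // 0..survivors.len().
    survivors[relative]
}

fn main() {
    // Filter 1 kept only the second half of a 20,000-row group.
    let survivors: Vec<usize> = (10_000..20_000).collect();

    // Filter 2 reports a match at its local row 1 ...
    let absolute = relative_to_absolute(&survivors, 1);

    // ... which is really absolute row 10,001 in the file.
    assert_eq!(absolute, 10_001);
}
```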
> * **How do we keep the train on the rails?**: We [fuzz test] every `RowSelection` operation (`split_off`, `and_then`, `trim`). We need absolute certainty that our translation between relative and absolute offsets is pixel-perfect. This correctness is the bedrock that keeps the Reader stable under the triple threat of batch boundaries, sparse selections, and page pruning.
Too many mixed metaphors in this paragraph, but that might just be personal taste.
I think some metaphors make it easier to read.
That's fair, but 'pixel-perfect' in this context is a bit odd. I understand you can't make any mistakes here, but the description felt too hyperbolic to me for something that's relatively simple.
I'll accept that it's a matter of preference.
Yeah, I think this is a stylistic thing -- while I likely would not have used this style, I think it does get the point across and lends a different voice (@hhhizzz's) to the narrative.
It is a good point
Thanks! Appreciate the flexibility.
pepijnve left a comment
Nice blog post. Looks like a great improvement to the Parquet reader. I've added the notes I made while reading through the post.
This is an excellent blog post! From my read it seems good to ship as is.
Is there any thought on using runtime data to refine the approach taken?
I'm thinking something like if we have 32 files after processing 2-3 we probably have a good idea of how selective each filter is and how large the columns are on average. Could we use this to tune filter evaluation order, how we represent the mask, etc.? Or even switch back to eager materialization or drop specific filters if the filters are not selective at all?
> ### 3.3 Smart Caching
> Late materialization puts us in a bit of a **Catch-22**: arrow-rs evaluates predicates progressively on all rows in a row group. This approach uses a small number of large I/Os, which performs well for slow remote storage systems such as object storage. However, it means we may need to read the same column twice—first to filter it, and then again to produce the final rows necessary for the output projection. Without caching, you're **paying double** for the same data: decoding it once for the predicate, and again for the output. [`CachedArrayReader`], introduced in [#arrow-rs/7850], fixes this: **stash the batch the first time you see it, and reuse it later.**
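As a rough sketch of the caching idea (a toy stand-in, not the actual `CachedArrayReader` implementation): the first read of a batch, done for predicate evaluation, stores the decoded values so the later read for the output projection does not decode the Parquet pages again.

```rust
use std::collections::HashMap;

// Toy sketch of the caching idea: the first read of a batch decodes it and
// stores the result, so the second read (for the output projection) is
// served from the cache instead of decoding the Parquet pages again.
struct CachingReader {
    cache: HashMap<usize, Vec<i64>>, // batch index -> decoded values
    decode_count: usize,             // how many "expensive" decodes happened
}

impl CachingReader {
    fn new() -> Self {
        Self { cache: HashMap::new(), decode_count: 0 }
    }

    fn read_batch(&mut self, batch: usize) -> &[i64] {
        if !self.cache.contains_key(&batch) {
            // Pretend this is the expensive Parquet page decode.
            self.decode_count += 1;
            self.cache.insert(batch, vec![batch as i64; 4]);
        }
        &self.cache[&batch]
    }
}

fn main() {
    let mut reader = CachingReader::new();
    let _for_predicate = reader.read_batch(0).to_vec(); // decoded here
    let _for_output = reader.read_batch(0).to_vec();    // served from cache
    assert_eq!(reader.decode_count, 1);
}
```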
Does the cache also apply during the narrowing? I.e. if I have the filter (a = 5 AND b = 6) OR (b = 6 AND c = 8), it sounds like this will produce two steps; will b = 6, or just the decoded b, be cached between those steps?
That’s a really good question. The RowSelection for b = 6 will be cached, and the decoded values of b will be cached. However, the engine will still perform a full filter on b = 6 again using the cached values. Avoiding this redundant filtering should be handled by the SQL engine’s optimizer. A storage engine like arrow-rs is only responsible for executing the already optimized predicate it receives.
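A toy illustration of that answer in plain Rust (not the arrow-rs internals): across the two predicate steps, column `b` is decoded once and the cached array is reused, but the comparison `b = 6` is still evaluated in each step.

```rust
// Toy illustration: `b` is decoded once (and cached), but the predicate
// `b = 6` is re-evaluated against the cached array in the second step.
fn main() {
    // Pretend this decode happened once, during the first predicate step,
    // and was cached by the reader.
    let b: Vec<i64> = vec![6, 1, 6, 9];
    let decodes = 1;
    let mut comparisons = 0;

    // Step 1: (a = 5 AND b = 6) evaluates `b = 6`.
    let step1: Vec<bool> = b.iter().map(|v| { comparisons += 1; *v == 6 }).collect();

    // Step 2: (b = 6 AND c = 8) evaluates `b = 6` again, but on the cached
    // values of `b`; no second decode happens.
    let step2: Vec<bool> = b.iter().map(|v| { comparisons += 1; *v == 6 }).collect();

    assert_eq!(decodes, 1);               // column decoded once
    assert_eq!(comparisons, 2 * b.len()); // comparison ran in both steps
    assert_eq!(step1, step2);
}
```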
I took another read through this blog and I think it looks great -- thank you again @hhhizzz @XiangpengHao @pepijnve and @adriangb. I'll plan to update the date and publish the post tomorrow, unless anyone would like more time to review.
This is so great.
Thanks again @hhhizzz -- the blog is published here: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/
I plan to publicize this blog link around various social media sites. For LinkedIn in particular, is this the correct URL to use for your profile? https://www.linkedin.com/in/qiwei-huang-aa175811b (I can omit such a link if you prefer.)
Yes, that's me, thank you!
See preview URL: https://hhhizzz.github.io/arrow-site/blog/
cc @alamb
Original blog and Chinese translation.
See earlier draft here: hhhizzz/arrow-rs#10