
Conversation

Contributor

@hhhizzz hhhizzz commented Dec 3, 2025


github-actions bot commented Dec 3, 2025

Preview URL: https://hhhizzz.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for previews.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

Contributor

alamb commented Dec 4, 2025

Thank you 🙏

I plan to work on this more today

Contributor

alamb commented Dec 4, 2025

I ran out of time to review this carefully, but I will do so first thing tomorrow

@alamb alamb changed the title Blog post about efficient filter representation in Parquet filter pushdown Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads Dec 5, 2025
Contributor

@alamb alamb left a comment


I read this blog post this morning, and I really enjoyed it -- very impressive @hhhizzz

For other reviewers, you can see the rendered preview here https://hhhizzz.github.io/arrow-site/blog/2025/12/03/parquet-late-materialization-deep-dive/

I think this post could be published as is, though I think the introductory image is confusing and it would be better to change it. I have a proposal to improve it:

[Screenshot attached (2025-12-05): proposed introductory image]

I also have some ideas on how to improve the text which I will submit as other PRs for your consideration


Borrowing Abadi's classification from his [paper](https://www.cs.umd.edu/~abadi/papers/abadiicde2007.pdf), the target architecture is **LM-pipelined**: interleaving predicates and data column access instead of reading all columns at once and trying to **stitch them back together** into rows.
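To make the LM-pipelined idea in the quoted excerpt concrete, here is a minimal, self-contained sketch (illustrative only, not arrow-rs code; the function and column names are made up): the predicate column is scanned first, and only the surviving row positions are fetched from the payload column, instead of stitching full rows together up front.

```rust
// Illustrative sketch of LM-pipelined evaluation (not arrow-rs code):
// evaluate the predicate on its own column, then materialize only the
// selected positions of the payload column.
fn lm_pipelined(predicate_col: &[i64], payload_col: &[&str], threshold: i64) -> Vec<String> {
    // Step 1: evaluate the predicate on the predicate column alone,
    // producing the positions of surviving rows.
    let selection: Vec<usize> = predicate_col
        .iter()
        .enumerate()
        .filter(|(_, &v)| v > threshold)
        .map(|(i, _)| i)
        .collect();

    // Step 2: fetch only the selected positions from the payload column.
    selection.iter().map(|&i| payload_col[i].to_string()).collect()
}
```

The eager alternative would first assemble every `(predicate, payload)` row and then filter; the pipelined form never touches payload values for rows the predicate already rejected.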

<figure style="text-align: center;">
Contributor


I found this figure a little confusing, as it refers to columns that don't appear in the blog post (e.g. Lineitem and Shipdate, which sound similar to, but are not quite the same as, the TPC-H columns)

I recommend we either add a caption explaining this graphic more, or change to use a different image (hhhizzz#1)

Contributor

@alamb alamb Dec 5, 2025


Update: I now see it is a reworked example from the paper. I still think hhhizzz#1 is a better image for the intro

Contributor

@alamb alamb left a comment


Thank you again @hhhizzz and @devanbenz

I took the liberty of pushing some small commits to this PR to

  1. add links to the author names
  2. fix broken links

I have a few small suggestions, but I think this blog is ready to go. Let's plan to publish it later this week (I'll also post in the arrow-rs discord channel to see if anyone else is interested in reviewing)

Also, if he has time, I suspect @XiangpengHao may be interested in this post, and I would value his feedback


Chained filtering is a **hair-pulling** exercise in coordinate systems. "Row 1" in filter N might actually be "Row 10,001" in the file due to prior filters.

* **How do we keep the train on the rails?**: We [fuzz test] every `RowSelection` operation (`split_off`, `and_then`, `trim`). We need absolute certainty that our translation between relative and absolute offsets is pixel-perfect. This correctness is the bedrock that keeps the Reader stable under the triple threat of batch boundaries, sparse selections, and page pruning.
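To see why the coordinate translation in the quoted excerpt is subtle, here is a simplified, self-contained model of the idea behind `RowSelection::and_then` (a sketch only, not the actual arrow-rs implementation, which operates on runs directly rather than expanding to per-row booleans): the second selection is expressed relative to the rows *selected* by the first, and must be translated back into absolute offsets.

```rust
// Simplified model of a row selection: alternating runs of rows to
// skip or select, similar in spirit to parquet's `RowSelector` list.
// Illustrative sketch only, not the real arrow-rs implementation.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Run {
    row_count: usize,
    skip: bool,
}

// Compose two selections: `next` is expressed in the coordinate system
// of the rows selected by `base` ("Row 1" in filter N), and must be
// translated back to absolute offsets ("Row 10,001" in the file).
fn and_then(base: &[Run], next: &[Run]) -> Vec<Run> {
    // Expand runs to per-row booleans for clarity.
    let base_rows: Vec<bool> = base
        .iter()
        .flat_map(|r| std::iter::repeat(!r.skip).take(r.row_count))
        .collect();
    let next_rows: Vec<bool> = next
        .iter()
        .flat_map(|r| std::iter::repeat(!r.skip).take(r.row_count))
        .collect();

    // Walk absolute rows; consume one `next` entry per *selected* row,
    // so `next` indices stay relative to the filtered stream.
    let mut next_iter = next_rows.iter();
    let selected: Vec<bool> = base_rows
        .iter()
        .map(|&sel| sel && *next_iter.next().unwrap_or(&false))
        .collect();

    // Re-encode per-row booleans back into runs.
    let mut runs: Vec<Run> = Vec::new();
    for keep in selected {
        match runs.last_mut() {
            Some(r) if r.skip == !keep => r.row_count += 1,
            _ => runs.push(Run { row_count: 1, skip: !keep }),
        }
    }
    runs
}
```

For example, if `base` skips 2 rows and selects 3, then a `next` that selects its relative row 0 refers to absolute row 2; the composed result is skip 2, select 1, skip 2. The fuzz tests mentioned above exist precisely because this translation is easy to get off-by-one wrong.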

@pepijnve pepijnve Dec 8, 2025


Too many mixed metaphors in this paragraph, but that might just be personal taste.

Contributor Author


I think some metaphors make it easier to read.


That's fair, but 'pixel-perfect' in this context is a bit odd. I understand you can't make any mistakes here, but the description felt too hyperbolic to me for something that's relatively simple.
I'll accept that it's a matter of preference.

Contributor


Yeah, I think this is a stylistic thing -- while I likely would not have used this style, I think it does get the point across and lends a different voice (@hhhizzz's) to the narrative.

It is a good point

Contributor Author


Thanks! Appreciate the flexibility.


@pepijnve pepijnve left a comment


Nice blog post. Looks like a great improvement to the Parquet reader. I've added the notes I made while reading through the post.


@adriangb adriangb left a comment


This is an excellent blog post! From my read it seems good to ship as is.

Is there any thought on using runtime data to refine the approach taken?

I'm thinking something like: if we have 32 files, then after processing 2-3 of them we probably have a good idea of how selective each filter is and how large the columns are on average. Could we use this to tune filter evaluation order, how we represent the mask, etc.? Or even switch back to eager materialization, or drop specific filters entirely if they are not selective at all?
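To illustrate the suggestion above, here is a hypothetical sketch (this is not an existing arrow-rs feature; the struct and function names are invented for illustration) of tracking observed selectivity per filter across files and reordering so the most selective filter runs first:

```rust
// Hypothetical sketch of adaptive filter ordering (not an existing
// arrow-rs feature): record rows in/out per filter while processing
// early files, then reorder filters by observed selectivity.
#[derive(Clone)]
struct FilterStats {
    name: String,
    rows_in: u64,
    rows_out: u64,
}

impl FilterStats {
    // Fraction of rows that survive this filter; lower = more selective.
    fn selectivity(&self) -> f64 {
        if self.rows_in == 0 {
            1.0
        } else {
            self.rows_out as f64 / self.rows_in as f64
        }
    }
}

// Return filter names ordered so the most selective filter runs first,
// based on statistics gathered from the files processed so far.
fn reorder_by_selectivity(mut filters: Vec<FilterStats>) -> Vec<String> {
    filters.sort_by(|a, b| {
        a.selectivity()
            .partial_cmp(&b.selectivity())
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    filters.into_iter().map(|f| f.name).collect()
}
```

The same statistics could also drive the other ideas mentioned (choosing a mask representation, or falling back to eager materialization when no filter is selective).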


### 3.3 Smart Caching

Late materialization puts us in a bit of a **Catch-22**: arrow-rs evaluates predicates progressively on all rows in a row group. This approach uses a small number of large I/Os, which performs well for slow remote storage systems such as object storage. However, it means we may need to read the same column twice—first to filter it, and then again to produce the final rows necessary for the output projection. Without caching, you're **paying double** for the same data: decoding it once for the predicate, and again for the output. [`CachedArrayReader`], introduced in [#arrow-rs/7850], fixes this: **stash the batch the first time you see it, and reuse it later.**
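The "stash the batch the first time you see it" idea from the quoted excerpt can be sketched as a simple memoizing reader. This is a self-contained toy model of the caching concept only, not the actual `CachedArrayReader` code (the type, the byte-to-integer "decoding", and the batch keying are all simplifications):

```rust
use std::collections::HashMap;

// Toy model of the idea behind `CachedArrayReader` (not the real
// arrow-rs implementation): the first read of a column batch decodes
// and caches it; a later read for the output projection reuses the
// decoded values instead of paying for a second decode.
struct CachingColumnReader {
    // Simulated encoded column chunks, keyed by batch index.
    encoded: Vec<Vec<u8>>,
    cache: HashMap<usize, Vec<i64>>,
    decode_count: usize,
}

impl CachingColumnReader {
    fn new(encoded: Vec<Vec<u8>>) -> Self {
        Self { encoded, cache: HashMap::new(), decode_count: 0 }
    }

    // "Decoding" here just widens bytes to i64; the point is that it
    // runs at most once per batch, even if the batch is read twice
    // (once for the predicate, once for the output projection).
    fn read_batch(&mut self, batch: usize) -> Vec<i64> {
        if let Some(values) = self.cache.get(&batch) {
            return values.clone(); // cache hit: no second decode
        }
        self.decode_count += 1;
        let values: Vec<i64> =
            self.encoded[batch].iter().map(|&b| b as i64).collect();
        self.cache.insert(batch, values.clone());
        values
    }
}
```

Reading the same batch for the predicate pass and again for the final projection increments `decode_count` only once, which is the double-payment the excerpt describes being avoided.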


Does the cache also apply during the narrowing? I.e. if I have the filter `(a = 5 AND b = 6) OR (b = 6 AND c = 8)`, it sounds like this will produce two steps; will the result of `b = 6` or just the decoded `b` be cached between those steps?

Contributor Author


That’s a really good question. The RowSelection for b = 6 will be cached, and the decoded values of b will be cached. However, the engine will still perform a full filter on b = 6 again using the cached values. Avoiding this redundant filtering should be handled by the SQL engine’s optimizer. A storage engine like arrow-rs is only responsible for executing the already optimized predicate it receives.

@XiangpengHao

Great work @hhhizzz @alamb I enjoyed reading it! (left a minor comment)

Contributor

alamb commented Dec 10, 2025

I took another read through this blog and I think it looks great -- thank you again @hhhizzz @XiangpengHao @pepijnve and @adriangb

I'll plan to update the date and publish the post tomorrow, unless anyone would like more time to review

This is so great

@alamb alamb merged commit 934c8bc into apache:main Dec 11, 2025
3 checks passed
Contributor

alamb commented Dec 11, 2025

Thanks again @hhhizzz -- the blog is published here: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/

I plan to publicize this blog link around various social media sites. For linkedin in particular, is this the correct URL to use for your profile? https://www.linkedin.com/in/qiwei-huang-aa175811b

(I can omit such a link if you prefer)

Contributor Author

hhhizzz commented Dec 11, 2025

> Thanks again @hhhizzz -- the blog is published here: https://arrow.apache.org/blog/2025/12/11/parquet-late-materialization-deep-dive/
>
> I plan to publicize this blog link around various social media sites. For linkedin in particular, is this the correct URL to use for your profile? https://www.linkedin.com/in/qiwei-huang-aa175811b
>
> (I can omit such a link if you prefer)

Yes, that’s me, thank you!


Successfully merging this pull request may close these issues.

Blog post about efficient filter representation in Parquet filter pushdown

6 participants