Wordsmith introduction section by alamb · Pull Request #2 · hhhizzz/arrow-site

alamb · 2025-12-05T10:15:04Z

Here are some proposed "wordsmithing" changes to the introduction section of

Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads apache/arrow-site#740

I'll comment inline with the rationale

github-actions · 2025-12-05T10:15:12Z

Preview URL: https://alamb.github.io/arrow-site

If the preview URL doesn't work, you may have forgotten to configure your fork repository for preview.
See https://github.com/apache/arrow-site/blob/main/README.md#forks for how to configure it.

alamb · 2025-12-05T10:15:32Z

_posts/2025-12-03-parquet-late-materialization-deep-dive.md

 -->

-This article dives into the decisions and pitfalls of Late Materialization in `arrow-rs` (the engine powering DataFusion). We'll see how a humble file reader has evolved into something with the complex logic of a query engine—effectively becoming a **tiny query engine** in its own right.
+This article dives into the decisions and pitfalls of implementing Late Materialization in the [Apache Parquet] reader from [`arrow-rs`] (the reader powering [Apache DataFusion] among other projects). We'll see how a seemingly humble file reader requires complex logic to evaluate predicates—effectively becoming a **tiny query engine** in its own right.


I added some links and reworded this slightly to provide broader context

alamb · 2025-12-05T10:15:50Z

_posts/2025-12-03-parquet-late-materialization-deep-dive.md

 ## 1. Why Late Materialization?

-Columnar reads are a constant battle between **I/O bandwidth** and **CPU decode costs**. While skipping data is generally good, the act of skipping itself carries a computational cost. The goal in `arrow-rs` is **pipeline-style late materialization**: evaluate predicates first, then access projected columns, keeping the pipeline tight at the page level to ensure minimal reads and minimal decode work.
+Columnar reads are a constant battle between **I/O bandwidth** and **CPU decode costs**. While skipping data is generally good, the act of skipping itself carries a computational cost. The goal of the Parquet reader in `arrow-rs` is **pipeline-style late materialization**: evaluate predicates first, then access projected columns. For predicates that filter many rows, materializing after evaluation minimizes reads and decode work.


I tried to make the benefits a bit clearer
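
To make the "minimizes reads and decode work" claim concrete, here is a rough back-of-the-envelope sketch. It is not taken from the blog post or from `arrow-rs`: the function names and the cost model (one unit of work per decoded value, no cost for skipping) are assumptions for illustration only. Since the paragraph itself notes that skipping carries a cost, treat the result as an upper bound on the benefit.

```rust
/// Values decoded when every column is fully decoded before filtering.
/// (Hypothetical cost model: one unit of work per decoded value.)
fn eager_decoded_values(rows: u64, predicate_cols: u64, projected_cols: u64) -> u64 {
    rows * (predicate_cols + projected_cols)
}

/// Values decoded when predicate columns are decoded for all rows, and
/// projected columns are decoded only for rows that survive the predicate.
/// Skipping is treated as free here, which is an optimistic assumption.
fn late_decoded_values(
    rows: u64,
    predicate_cols: u64,
    projected_cols: u64,
    surviving_rows: u64,
) -> u64 {
    rows * predicate_cols + surviving_rows * projected_cols
}

fn main() {
    let rows = 1_000_000;
    let surviving_rows = 10_000; // the predicate keeps 1% of rows
    let eager = eager_decoded_values(rows, 1, 4);
    let late = late_decoded_values(rows, 1, 4, surviving_rows);
    println!("eager: {eager} values, late: {late} values");
}
```

Under these assumptions a predicate that keeps 1% of rows cuts decoded values from 5,000,000 to 1,040,000. A predicate that keeps nearly every row saves almost nothing, which is why the reworded sentence is scoped to "predicates that filter many rows".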

alamb · 2025-12-05T10:16:25Z

_posts/2025-12-03-parquet-late-materialization-deep-dive.md


-1.  Read column `A`, build a `RowSelection` (a sparse mask), and obtain the initial set of surviving rows.
-2.  Use that `RowSelection` to read column `B`, decoding and filtering on the fly to make the selection even sparser.
+1.  Read column `A` and evaluate `A > 10` to build a `RowSelection` (a sparse mask) representing the initial set of surviving rows.


I added the predicate evaluation explicitly into this example as I think that was easier to follow
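
For readers following the two-step example, the idea can be sketched in plain Rust. This is a simplified illustration, not the `arrow-rs` implementation: the `Run` enum is a hypothetical stand-in for a run-length row selection, columns are ordinary slices rather than Parquet pages, and `build_selection` and `read_selected` are names invented for this sketch.

```rust
/// Hypothetical stand-in for a run-length row selection: alternating runs
/// of rows to keep or skip. This is NOT the arrow-rs `RowSelection` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Run {
    Select(usize),
    Skip(usize),
}

/// Step 1: evaluate a predicate over column A and encode the result as runs.
fn build_selection(column_a: &[i64], predicate: impl Fn(i64) -> bool) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &value in column_a {
        let keep = predicate(value);
        // Extend the last run if it has the same kind as this row.
        let extended = match runs.last_mut() {
            Some(Run::Select(n)) if keep => {
                *n += 1;
                true
            }
            Some(Run::Skip(n)) if !keep => {
                *n += 1;
                true
            }
            _ => false,
        };
        // Otherwise start a new run of length one.
        if !extended {
            runs.push(if keep { Run::Select(1) } else { Run::Skip(1) });
        }
    }
    runs
}

/// Step 2: materialize column B only for selected rows, stepping over the rest.
fn read_selected(column_b: &[&str], selection: &[Run]) -> Vec<String> {
    let mut out = Vec::new();
    let mut offset = 0;
    for run in selection {
        match *run {
            Run::Select(n) => {
                out.extend(column_b[offset..offset + n].iter().map(|s| s.to_string()));
                offset += n;
            }
            Run::Skip(n) => offset += n,
        }
    }
    out
}

fn main() {
    let column_a: [i64; 6] = [5, 12, 40, 7, 3, 11];
    let column_b = ["a", "b", "c", "d", "e", "f"];
    let selection = build_selection(&column_a, |a| a > 10);
    let rows = read_selected(&column_b, &selection);
    println!("{selection:?} -> {rows:?}");
}
```

Here `A > 10` keeps rows 1, 2, and 5, so only three values of column `B` are materialized. A second predicate on `B` would be evaluated over just those rows and could only make the selection sparser.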

Wordsmith introduction section

c9ff5ea

updates

9594895

alamb mentioned this pull request Dec 5, 2025

Blog: Practical Dive Into Late Materialization in arrow-rs Parquet Reads apache/arrow-site#740

Merged

hhhizzz merged commit 05722c0 into hhhizzz:lm-pipeline-blog Dec 7, 2025

hhhizzz merged 2 commits into hhhizzz:lm-pipeline-blog from alamb:alamb/intro
