@hhhizzz hhhizzz commented Nov 30, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@alamb alamb left a comment

First of all, this is amazing @hhhizzz -- I really like how you frame the post / decoder as a mini query engine and then proceed to teach the reader what late materialization is and how it works.

The figures are also quite amazing.

I suggest you incorporate any suggestions you would like, and then we move this content to the arrow-site repo for final polish and iteration.

Also, if you are willing, I am happy to help work on / propose changes directly to the blog, but I don't want to make a bunch of conflicts if you still have changes you plan to make.

Mechanics of posting to Arrow Blog

To publish this content on the Arrow blog (https://arrow.apache.org/blog), we need to make a PR to https://github.com/apache/arrow-site. Once the PR to arrow-site is merged, it will be published automatically.

Here is an example of such a PR: apache/arrow-site#720 (I am also happy to help translate this content into a PR)

I found this figure slightly confusing due to the fact that the upper right was different than the other two images

3 3 2-fig2

Owner Author

Added the Query in image.

I didn't see this image referenced anywhere in the file

Owner Author

Added in the doc

I really like this image as a high level overview of the late materialization process -- but it didn't seem to be referenced anywhere in the text

Owner Author

Added.

I found this image a little confusing for two reasons:

  1. There is no Step 2
  2. The "Decoded B Values" didn't make sense -- there were 3 decoded values, but Row Mask 1 seems to have 5 that pass (so I think Decoded B Values should have 5 green boxes)

Owner Author

Fixed in the new image

This is (also) an amazing image. One minor suggestion is to use a term other than "Track"

Perhaps instead of "Track A (Selection)" you could use a term like "A > 10 Filter Results (Selection)"

And instead of "Track B" use a term like "B < 5 Filter Results"

And then instead of "Result Track" use a term "Final Filter Result"

Owner Author

Updated in the new image.

Starting from [v57.1.0](https://github.com/apache/arrow-rs/tree/57.1.0), `RowSelection` can switch between RLE (selectors) and bitmask. Bitmasks are faster when gaps are tiny and sparsity is high; RLE is friendlier to page-level skips. Details show up again in 3.3.
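
To make the two representations concrete, here is a minimal, std-only sketch of converting between run-length selectors and a boolean mask. The `Run` type and both functions are hypothetical stand-ins written for illustration; they are loosely modeled on the select/skip run idea and are not the arrow-rs `RowSelection` API.

```rust
/// Hypothetical run-length selector: select or skip `row_count` consecutive rows.
/// This is an illustrative stand-in, not the type used by arrow-rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Expand runs into one boolean per row (`true` means the row is selected).
fn runs_to_mask(runs: &[Run]) -> Vec<bool> {
    let mut mask = Vec::new();
    for run in runs {
        mask.extend(std::iter::repeat(!run.skip).take(run.row_count));
    }
    mask
}

/// Compress a per-row mask back into runs, merging adjacent rows of the same kind.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                // Same kind as the previous row: extend the current run.
                last.row_count += 1;
                continue;
            }
        }
        // Kind changed (or first row): start a new run.
        runs.push(Run { row_count: 1, skip });
    }
    runs
}
```

A mask costs one entry per row regardless of the pattern, while the run list grows with the number of select/skip transitions, which is one way to see why the better choice depends on how fragmented the selection is.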

**Execution sketch** (`SELECT * FROM table WHERE A > 10 AND B < 5`):
1. **Initial**: `selection = None` (equivalent to "select all").
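
The flow for this query can be sketched as follows. Only the initial step appears in the excerpt above, so the later steps here are an assumption about how the selection is narrowed: each predicate is evaluated only on rows that survived the previous one, and output values are gathered last. `Table`, `refine`, and `late_materialize` are hypothetical names, and plain in-memory vectors stand in for decoded Parquet columns.

```rust
/// Hypothetical in-memory columns; a real reader would decode these from pages.
struct Table {
    a: Vec<i64>,
    b: Vec<i64>,
}

/// Evaluate `pred` on `column`, looking only at rows that are still selected.
/// `selection == None` means "all rows", matching the initial state.
fn refine(
    column: &[i64],
    selection: Option<Vec<usize>>,
    pred: impl Fn(i64) -> bool,
) -> Vec<usize> {
    match selection {
        None => (0..column.len()).filter(|&i| pred(column[i])).collect(),
        Some(rows) => rows.into_iter().filter(|&i| pred(column[i])).collect(),
    }
}

/// Sketch of `SELECT * FROM table WHERE A > 10 AND B < 5`.
fn late_materialize(table: &Table) -> Vec<(i64, i64)> {
    // Initial: nothing has been filtered yet.
    let selection = None;
    // Evaluate `A > 10` over all rows.
    let selection = Some(refine(&table.a, selection, |a| a > 10));
    // Evaluate `B < 5` only on rows that passed the first predicate.
    let selection = refine(&table.b, selection, |b| b < 5);
    // Materialize output values only for the rows that survived both filters.
    selection.iter().map(|&i| (table.a[i], table.b[i])).collect()
}
```

For example, with `a = [5, 12, 20, 11, 3]` and `b = [1, 7, 2, 4, 0]`, the first predicate keeps rows 1, 2, and 3, and the second narrows that to rows 2 and 3, so column B is never inspected for rows 0 and 4.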

Eventually I suggest we add links to the code / docs for ArrayReader, ReadPlanBuilder, etc -- I suggest we do this after a few more passes on the content


---

## 2. Key Mechanics

I suggest calling this section "Late Materialization in the Rust Parquet Reader" and stopping after section 2.2, where you go through the actual code implementation that matches the high level overview you gave in section 1

(in other words, I suggest starting a new major section at what is currently "### 2.3 Smart caching")


![](fig3.png)

### 2.3 Smart caching

I think this would be a great place to start the "## 3. Engineering challenges" section

It would help, I think, if the introduction to that section said something like:

While relatively straightforward in theory, actually implementing Late Materialization in a production grade system such as arrow-rs requires significant engineering. Previously, the effort required meant that such technology was typically only available in proprietary engines and was a struggle to implement in the open source community (see the DataFusion ticket about enabling filter pushdown). Now, after several years of effort, we are finally close to having Late Materialization as fast or faster in all cases. Getting to this point required several major implementation details, which are described below.


![](fig4.png)

### 3.3 Adaptive RowSelection policy (bitmask vs. RLE)

I recommend moving this section to the top of the engineering challenges section -- and title it "Row Selection Representation"

The idea being that different RowSelection representations have different tradeoffs for different filter patterns, and that in order to improve performance, we needed to provide different implementations

```

#### 3.3.2 The bitmask trap: missing pages
The bitmask introduces a new failure mode. Consider a simple example:

If we want to include this section, I think we need to set it up a bit more carefully (give it some more context).

Maybe we can talk about it as "an example of why high performance engineering is so complicated" -- we can also link to the ticket for the follow-on work. I'll dig that up later

hhhizzz commented Dec 2, 2025

@alamb Thanks for the review! I’ve updated the images using the new drawing method. It’s getting late here (UTC+8), so I’ll update the rest tomorrow.
Feel free to make any changes directly! And it would be really helpful if you can help to transfer it to the arrow-site repo.

alamb commented Dec 2, 2025

Thank you!

I will try and make a PR in the arrow-site repo tomorrow. Note that if I create the PR, then in order for you to edit it, you will need to make PRs to my fork (which is fine)

If you would like to retain the ability to edit the post directly (which is probably good) you will need to open the PR.

I can create a branch and then you could potentially open a PR 🤔

hhhizzz commented Dec 3, 2025

I just realized I also want to publish a Chinese translation at the same time. I think it’s better if I create the PR and update both the English and Chinese versions based on your comments or anyone else’s feedback.

hhhizzz commented Dec 3, 2025

@alamb I submitted the PR, let me know if there is anything I can improve. apache/arrow-site#740

@hhhizzz hhhizzz closed this Dec 3, 2025
alamb added a commit to apache/arrow-site that referenced this pull request Dec 11, 2025
…ads (#740)

- closes apache/arrow-rs#8843

See preview URL: https://hhhizzz.github.io/arrow-site/blog/

cc @alamb 

Original blog and Chinese translation.

See earlier draft here: hhhizzz/arrow-rs#10

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>