forked from apache/arrow-site
Proposed edits to section 2.2 #4
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -99,32 +99,35 @@ I've drawn a simple flowchart to help you understand: | |
|
|
||
| Once the pipeline exists, the next question is **how to represent and combine these sparse selections** (the **Row Mask** in the diagram), which is where `RowSelection` comes in. | ||
|
|
||
| ### 2.2 Logical ops on row selectors (`RowSelection::and_then`) | ||
| ### 2.2 Combining row selectors (`RowSelection::and_then`) | ||
|
|
||
| `RowSelection`—defined in `selection.rs`—is the token that every stage passes around. It mostly uses RLE (`RowSelector::select/skip(len)`) to describe sparse ranges. `and_then` is the core operator for "apply one selection to another": left-hand side is "rows already allowed," right-hand side further filters those rows, and the output is their boolean AND. | ||
| [`RowSelection`] represents the set of rows that will eventually be produced. It currently uses RLE (`RowSelector::select/skip(len)`) to describe sparse ranges. [`RowSelection::and_then`] is the core operator for "apply one selection to another": the left-hand argument is "rows already passed" and the right-hand argument is "which of the passed rows also pass the second filter." The output is their boolean AND. | ||
|
|
||
| **Walkthrough**: | ||
| [`RowSelection`]: https://github.com/apache/arrow-rs/blob/ce4edd53203eb4bca96c10ebf3d2118299dad006/parquet/src/arrow/arrow_reader/selection.rs#L139 | ||
| [`RowSelection::and_then`]: https://github.com/apache/arrow-rs/blob/ce4edd53203eb4bca96c10ebf3d2118299dad006/parquet/src/arrow/arrow_reader/selection.rs#L345 | ||
|
|
||
| **Walkthrough Example**: | ||
|
|
||
| * **Input Selection A (already filtered)**: `[Skip 100, Select 50, Skip 50]` (physical rows 100-149, counting from zero, are selected) | ||
| * **Input Predicate B (filters within A)**: `[Select 10, Skip 40]` (within the 50 selected rows, only the first 10 survive B) | ||
| * **Selection B (filters within A)**: `[Select 10, Skip 40]` (within the 50 selected rows, only the first 10 survive B) | ||
| * **Result**: `[Skip 100, Select 10, Skip 90]`. | ||
|
|
||
| **How it runs**: | ||
| Think of it like a zipper: we traverse both lists simultaneously... | ||
| Think of it like a zipper: we traverse both lists simultaneously, as shown below: | ||
|
|
||
| 1. **First 100 rows**: A is Skip → result is Skip 100. | ||
| 2. **Next 50 rows**: A is Select. Look at B: | ||
| * B's first 10 are Select → result Select 10. | ||
| * B's remaining 40 are Skip → result Skip 40. | ||
| 3. **Final 50 rows**: A is Skip → result Skip 50. | ||
|
|
||
| **Result**: `[Skip 100, Select 10, Skip 90]`. | ||
|
|
||
| This keeps narrowing the filter while touching only lightweight metadata—no data copies. The implementation is a two-pointer linear scan; complexity is linear in selector segments. The sooner predicates shrink the selection, the cheaper later scans become. | ||
|
|
||
| <figure style="text-align: center;"> | ||
| <img src="{{ site.baseurl }}/img/late-materialization/fig3.jpg" alt="RowSelection logical AND walkthrough" width="100%" class="img-responsive"> | ||
| </figure> | ||
|
|
||
|
|
||
| This keeps narrowing the filter while touching only lightweight metadata—no data copies. The current implementation of `and_then` is a two-pointer linear scan; complexity is linear in selector segments. The sooner predicates shrink the selection, the cheaper later scans become. | ||
|
Author
I propose moving this paragraph below the diagram so the text that describes the diagram is immediately above it |
||
|
|
||
| ## 3. Engineering Challenges | ||
|
|
||
| It sounds simple enough in theory, but implementing Late Materialization in a production-grade system like `arrow-rs` is an absolute **engineering nightmare**. Historically, this stuff was so tricky that it was locked away in proprietary engines. In the open source world, we've been grinding away at this for years (just look at [the DataFusion ticket](https://github.com/apache/datafusion/issues/3463)), and finally, we can **flex our muscles** and go toe-to-toe with full materialization. To pull this off, we had to tackle some serious headaches. | ||
|
|
||
I tried to make it more clear what this code was referring to
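
For anyone reading section 2.2 without the source open, the zipper walkthrough can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `Run`, `select`, `skip`, `push`, and `and_then_sketch` are hypothetical names made up for this sketch, not the arrow-rs types or the actual `RowSelection::and_then` code. It assumes B covers every row that A selects, and it merges adjacent runs of the same kind so the output matches the walkthrough.

```rust
// Hypothetical stand-in for a run-length selector; NOT the arrow-rs `RowSelector`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    len: usize,
    skip: bool,
}

// Hypothetical helper constructors, named after the two run kinds.
fn select(len: usize) -> Run {
    Run { len, skip: false }
}

fn skip(len: usize) -> Run {
    Run { len, skip: true }
}

// Append a run, merging it into the previous run when both are the same kind.
fn push(out: &mut Vec<Run>, run: Run) {
    if run.len == 0 {
        return;
    }
    if let Some(last) = out.last_mut() {
        if last.skip == run.skip {
            last.len += run.len;
            return;
        }
    }
    out.push(run);
}

// Sketch of "apply B to the rows A already selected" as a two-pointer scan.
// Assumption: the runs in `b` cover every row selected by `a`.
fn and_then_sketch(a: &[Run], b: &[Run]) -> Vec<Run> {
    let mut out = Vec::new();
    let mut b_iter = b.iter().copied();
    // The unconsumed remainder of the current run of B.
    let mut cur_b = b_iter.next();

    for run_a in a {
        if run_a.skip {
            // Rows skipped by A stay skipped; B is not consulted for them.
            push(&mut out, skip(run_a.len));
            continue;
        }
        // Rows selected by A: consume runs of B until they are all covered.
        let mut remaining = run_a.len;
        while remaining > 0 {
            let rb = cur_b
                .as_mut()
                .expect("B must cover every row selected by A");
            let take = remaining.min(rb.len);
            push(&mut out, Run { len: take, skip: rb.skip });
            remaining -= take;
            rb.len -= take;
            if rb.len == 0 {
                cur_b = b_iter.next();
            }
        }
    }
    out
}

fn main() {
    let a = vec![skip(100), select(50), skip(50)];
    let b = vec![select(10), skip(40)];
    // Same inputs as the walkthrough example in section 2.2.
    println!("{:?}", and_then_sketch(&a, &b));
}
```

Each run of A and B is visited once, which is consistent with the "linear in selector segments" claim in the paragraph under review. The real implementation may handle edge cases (for example, B being shorter than A's selected rows) differently from the panic used here.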