[Parquet] Adaptive Parquet Predicate Pushdown #8733
The version without synthetic page:
😮 thank you @hhhizzz -- I plan to review this PR carefully, but it will likely take me a few days
fyi @zhuqi-lucas and @XiangpengHao
🤖: Benchmark completed
As for how I arrived at the average selector length threshold for using the mask, here are some statistics; you can check out https://github.com/hhhizzz/arrow-rs/tree/rowselectionempty-charts and run
One column
alamb left a comment
First of all, thank you so much @hhhizzz -- I think this is a really nice change and the code is well structured and a pleasure to read. Also thank you to @zhuqi-lucas for setting the stage for much of this work.
Given the performance results so far (basically as good as or better than the existing code) I think this PR is almost ready to go.
The only thing I am not sure about is the null page / skipping thing -- I left more comments inline.
I think there are several additional improvements that could be done as follow-on work:
- The heuristic for when to use the masking strategy can likely be improved based on the types of values being filtered (for example the number of columns or the inclusion of StringView)
- Avoid creating a RowSelection just to turn it back into a BooleanArray (I left comments inline)
I took a quick look. Whilst I think orchestrating this skipping at the RecordReader level does have a certain elegance, it runs into the issue that the masked selections aren't necessarily page-aligned.
By definition the mask selection strategy requests rows that weren't part of the original selection. The problem is that this could result in requesting rows for pages that we know are irrelevant. In some cases this just results in wasted IO; however, when using prefetching IO systems (such as AsyncParquetReader) this results in errors. The hack of creating empty pages I'm not a big fan of.
I think a better solution would be to ensure we only construct MaskChunks that don't cross page boundaries. Ideally this would be done on a per-leaf column basis, but tbh I suspect just doing it globally would probably work just fine.
Edit: If one was feeling fancy, one could ignore page boundaries where both pages were present in the original selection, although in practice I suspect this not to make a huge difference.
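A minimal sketch of the page-alignment idea above, applied globally rather than per leaf column. The helper name `split_at_page_boundaries` and its input (a sorted list of each page's first row index, as an offset index could supply) are assumptions for illustration, not part of the parquet crate:

```rust
use std::ops::Range;

/// Splits the row range `chunk` so that no returned piece crosses a page boundary.
/// `page_starts` holds the first row index of each page, sorted ascending
/// (hypothetical input; in practice it would be derived from the offset index).
fn split_at_page_boundaries(chunk: Range<usize>, page_starts: &[usize]) -> Vec<Range<usize>> {
    let mut pieces = Vec::new();
    let mut start = chunk.start;
    for &boundary in page_starts {
        // Only boundaries strictly inside the remaining range force a cut.
        if boundary > start && boundary < chunk.end {
            pieces.push(start..boundary);
            start = boundary;
        }
    }
    // Emit whatever is left after the last cut (nothing for an empty range).
    if start < chunk.end {
        pieces.push(start..chunk.end);
    }
    pieces
}
```

With pages starting at rows 0, 100, 200 and 300, a chunk covering rows 50..250 would be cut into 50..100, 100..200 and 200..250, so each piece can be checked against a single page.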
Thank you -- I plan to review this PR again shortly
alamb left a comment
Thank you so much for this PR @hhhizzz -- I think it is looking really nice now.
I have a few ideas / comments on how to improve the API, but they can all be done as follow on PRs.
The one thing I think we do need to be careful of is any newly added public API functions / enums (e.g. RowSelectionStrategy and the change to ReadPlanBuilder) -- this is because once we release a version of the parquet crate with these APIs we will not be able to change them again
As long as this PR shows performance improvements (I am rerunning the benchmark now) I think we should merge it in and keep iterating on additional improvements as follow on PRs
Notes for myself:
- Add some enums to the public SelectionStrategy (Auto and an explicit threshold)
- Update the ReadPlanBuilder to have an iterator directly in it
- Try and avoid the conversion back and forth to bitmasks
- Add some end-to-end tests in io.rs that have a predicate on one column and select from another (to add additional coverage for the time when all rows in a page are filtered and we are using masks)
    offset_index: Option<&[OffsetIndexMetaData]>,
) -> bool {
    match offset_index {
        Some(columns) => self.selection_skips_any_page(projection, columns),
This is a nice solution for the initial PR -- I think we can file a follow-on ticket to figure out how to optimize this further.
Given my reading above, I think we can probably move this logic into InMemoryRowGroup::fetch and then actually decide what to do with pages as necessary.
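To make the check under discussion concrete, here is a simplified sketch of detecting whether a selection leaves any page without a selected row. `Selector` is a stand-in for the crate's RowSelector, and `skips_any_page` with its `page_starts` / `total_rows` inputs is a hypothetical, single-column simplification of the PR's method, not its actual signature:

```rust
/// Simplified stand-in for the parquet crate's `RowSelector`.
#[derive(Clone, Copy, Debug)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Returns true if at least one page contains no selected rows.
/// `page_starts` holds each page's first row index (sorted ascending);
/// `total_rows` is the number of rows in the column chunk.
fn skips_any_page(selectors: &[Selector], page_starts: &[usize], total_rows: usize) -> bool {
    // Collect the selected row ranges from the run-length selectors.
    let mut selected = Vec::new();
    let mut pos = 0;
    for s in selectors {
        if !s.skip && s.row_count > 0 {
            selected.push(pos..pos + s.row_count);
        }
        pos += s.row_count;
    }
    // Assumption: rows past the end of the selectors count as not selected.
    page_starts.iter().enumerate().any(|(i, &start)| {
        let end = page_starts.get(i + 1).copied().unwrap_or(total_rows);
        // A non-empty page is skipped when no selected range overlaps [start, end).
        start < end && !selected.iter().any(|r| r.start < end && r.end > start)
    })
}
```

A reader could consult a signal like this before choosing the mask path, since a mask may ask for rows in pages that were never fetched.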
RowSelectionBacking::Selectors(selectors) => {
    let selector = selectors.pop_front()?;
    self.position += selector.row_count;
    Some(selector)
This always returns Some, so we could change the signature to return RowSelector rather than Option<RowSelector>.
/// Current to apply, includes all filters
selection: Option<RowSelection>,
/// Strategy to use when materialising the row selection
selection_strategy: RowSelectionStrategy,
I like this formulation as it makes this PR logic quite clear.
Also, as a follow-on PR we could potentially combine this logic so the ReadPlanBuilder holds an Option<RowSelectionCursor> rather than a selection and a selection_strategy.
This might allow us to both
- avoid converting BooleanArray --> RowSelection and then back again
- implement the page filtering in InMemoryRowGroup::fetch_ranges in terms of masked selections as well
However, we can totally do this as a follow on PR -- I will file a ticket when we merge this PR
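As a rough illustration of a cursor that hides which representation backs the selection, here is a sketch using plain standard-library types. `SelectionCursor`, `Selector` and `next_run` are hypothetical names chosen for this example; the PR's RowSelectionCursor may differ in shape and behaviour:

```rust
use std::collections::VecDeque;

/// Simplified stand-in for the parquet crate's `RowSelector`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Hypothetical cursor backed by either a boolean mask or a selector queue.
enum SelectionCursor {
    Mask { mask: Vec<bool>, position: usize },
    Selectors { selectors: VecDeque<Selector>, position: usize },
}

impl SelectionCursor {
    /// Returns the next run of rows sharing the same select/skip state.
    fn next_run(&mut self) -> Option<Selector> {
        match self {
            SelectionCursor::Mask { mask, position } => {
                let first = *mask.get(*position)?;
                // Count how many consecutive bits match the first one.
                let len = mask[*position..].iter().take_while(|&&b| b == first).count();
                *position += len;
                Some(Selector { row_count: len, skip: !first })
            }
            SelectionCursor::Selectors { selectors, position } => {
                let s = selectors.pop_front()?;
                *position += s.row_count;
                Some(s)
            }
        }
    }

    /// Number of rows consumed so far.
    fn position(&self) -> usize {
        match self {
            SelectionCursor::Mask { position, .. }
            | SelectionCursor::Selectors { position, .. } => *position,
        }
    }
}
```

Holding a single cursor like this in the plan would let callers advance through the selection without first converting between the two forms.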
I removed Auto from the current design and didn't add it to the Builder API because Mask is currently unstable, requiring us to fall back to Selection in certain cases. I believe this unstable setting shouldn't be directly exposed to users. Moving forward, we could consider:
- Allowing users to choose only Auto, Selection, or other stable strategies, implemented via runtime checks or a new enum. Mask would only be selected heuristically when Auto is chosen.
- Stabilizing Mask by updating the fetch process, ensuring all pages are fetched if the user selects Mask.
// it compiles down to a regular memory read with negligible performance overhead.
// The more expensive atomic operations with stronger ordering are only used in the
// test-only functions below.
static AVG_SELECTOR_LEN_MASK_THRESHOLD_OVERRIDE: AtomicUsize =
I am a little worried about this mechanism for testing -- I understand its appeal, but it might be better to avoid statics entirely and thread through the threshold value as an option on the selection strategy directly
For example
enum RowSelectionStrategy {
    /// Automatically pick the filter representation
    /// based on heuristics
    Auto,
    /// If the average number of rows selected is more than
    /// the threshold, uses the Mask policy, otherwise uses Selectors
    Threshold {
        threshold: usize,
    },
    /// Always use selectors
    Selection,
    /// Always use a mask
    Mask,
}
This is awesome! I've been struggling for days to implement the bench, and I never would have thought of using the RowSelectionStrategy enum.
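A sketch of how a threshold carried on the strategy itself (instead of a static override) might be resolved into a concrete representation. Everything here is hypothetical: the `Policy` and `Resolved` names, the `DEFAULT_THRESHOLD` value, and the fallback rule. It also assumes, following the PR description, that masks are preferred when the average selector length is below the threshold:

```rust
/// Hypothetical user-facing policy, modelled on the enum suggested in the review.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Policy {
    Auto,
    Threshold { threshold: usize },
    Selection,
    Mask,
}

/// Concrete representation used at execution time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Resolved {
    Selectors,
    Mask,
}

/// Placeholder default for illustration; a real value would come from benchmarking.
const DEFAULT_THRESHOLD: usize = 32;

/// Picks a representation from the policy and simple selection statistics.
fn resolve(policy: Policy, total_rows: usize, selector_count: usize, skips_page: bool) -> Resolved {
    let by_threshold = |threshold: usize| {
        // Average run length; an empty selection has no short runs to optimise.
        let avg_len = if selector_count == 0 { usize::MAX } else { total_rows / selector_count };
        if avg_len < threshold { Resolved::Mask } else { Resolved::Selectors }
    };
    let choice = match policy {
        Policy::Auto => by_threshold(DEFAULT_THRESHOLD),
        Policy::Threshold { threshold } => by_threshold(threshold),
        Policy::Selection => Resolved::Selectors,
        Policy::Mask => Resolved::Mask,
    };
    // Assumed guard: fall back to selectors when whole pages were skipped,
    // since a mask could request rows from pages that were never fetched.
    if skips_page { Resolved::Selectors } else { choice }
}
```

Because the threshold is an ordinary value, tests and benchmarks can pass `Policy::Threshold { .. }` directly without touching global state.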
/// Strategy for materialising [`RowSelection`] during execution.
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
pub enum RowSelectionStrategy {
I liked the idea previously of having an Auto mode here based on heuristics that we could change over time.
I can help propose an update to this API if you are amenable to this idea.
Yes, my charts indicate that there are many rules for setting the RowSelectionStrategy, like the column type, column count, string length, and their combinations... We can create tickets and collaborate on improving these over time.
🤖: Benchmark completed












Which issue does this PR close?
Rationale for this change
Predicate pushdown and row filters often yield very short alternating select/skip runs; materialising those runs through a VecDeque<RowSelector> is both allocation-heavy and prone to panics when entire pages are skipped and never decoded. This PR introduces a mask-backed execution path for short selectors, adds the heuristics and guards needed to decide between masks and selectors, and provides tooling to measure the trade-offs so we can tune the threshold confidently.
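To illustrate the two representations being compared, the following sketch expands run-length selectors into a per-row boolean mask and computes the average selector length. `Selector` is a simplified stand-in for RowSelector, and both function names are hypothetical:

```rust
/// Simplified stand-in for the parquet crate's `RowSelector`.
#[derive(Clone, Copy, Debug)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Expands run-length selectors into one boolean per row (true = selected).
fn selectors_to_mask(selectors: &[Selector]) -> Vec<bool> {
    let total: usize = selectors.iter().map(|s| s.row_count).sum();
    let mut mask = Vec::with_capacity(total);
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}

/// Average run length (integer division), or None for an empty selection.
fn avg_selector_len(selectors: &[Selector]) -> Option<usize> {
    if selectors.is_empty() {
        return None;
    }
    let total: usize = selectors.iter().map(|s| s.row_count).sum();
    Some(total / selectors.len())
}
```

When runs alternate every row or two, the selector list holds nearly as many entries as there are rows, which is the case where a flat mask is expected to be cheaper to walk.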
What changes are included in this PR?
- RowSelectionStrategy plus a RowSelectionCursor that can iterate either a boolean mask or the legacy selector queue, along with a public guard/override so tests and benchmarks can tweak the average-selector-length heuristic.
- ReadPlanBuilder now trims selections, estimates their density, and chooses the faster mask strategy when safe; ParquetRecordBatchReader streams mask chunks, skips contiguous gaps, filters the projected batch with Arrow's boolean kernel, and still falls back to selectors when needed.
- RowSelection can now inspect offset indexes to detect when a selection would skip entire pages; both the synchronous and asynchronous readers consult that signal so row filters no longer panic when predicate pruning drops whole pages.
- row_selection_state Criterion benchmark plus dev/row_selection_analysis.py to run the bench, export CSV summaries, and render comparative plots across selector lengths, column widths, and Utf8View payload sizes; wired the bench into parquet/Cargo.toml.
- Tests (test_row_selection_interleaved_skip, test_row_selection_mask_sparse_rows, test_row_filter_full_page_skip_is_handled and its async twin), and updated the push-decoder size assertion to reflect the new state.
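The mask-chunk streaming described in the list above (skip the contiguous gap, decode a span, then filter it) can be sketched roughly as follows. The field layout of `MaskChunk`, the `next_mask_chunk` planner and the `filter_values` helper are all assumptions made for this example; plain slices stand in for Arrow arrays and for Arrow's filter kernel:

```rust
/// One unit of work for the reader: skip `skip` rows, decode `span` rows,
/// then keep only the `selected` decoded rows whose mask bit is set.
#[derive(Debug, PartialEq, Eq)]
struct MaskChunk {
    skip: usize,
    span: usize,
    selected: usize,
}

/// Plans the next chunk starting at `position`, or None when no selected rows remain.
fn next_mask_chunk(mask: &[bool], position: usize, batch_size: usize) -> Option<MaskChunk> {
    let rest = mask.get(position..)?;
    // Skip the contiguous unselected gap without decoding it.
    let skip = rest.iter().take_while(|&&b| !b).count();
    let mut span = 0;
    let mut selected = 0;
    for &b in &rest[skip..] {
        if selected == batch_size {
            break;
        }
        span += 1;
        selected += b as usize;
    }
    if selected == 0 {
        return None;
    }
    Some(MaskChunk { skip, span, selected })
}

/// Filters decoded values with the matching mask slice
/// (a stand-in for filtering a batch with a boolean array).
fn filter_values(values: &[i32], mask: &[bool]) -> Vec<i32> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

The caller would advance its position by `skip + span` after each chunk; unselected rows inside the span are decoded and then dropped by the filter, which is the trade-off the mask path makes for short runs.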
Are these changes tested?
ReadPlanBuilder threshold tests; the Criterion bench + Python tooling provide manual validation for performance tuning. Full parquet/arrow test suites will still run in CI.
Are there any user-facing changes?
Exposes RowSelectionStrategy, RowSelectionCursor, and set_avg_selector_len_mask_threshold for experimentation, and developers gain the new benchmarking/plotting workflow. No breaking API changes were introduced.