Serialize dynamic filters on execution plan nodes (HashJoin, Aggregate, Sort) by jayshrivastava · Pull Request #2 · jayshrivastava/datafusion

jayshrivastava · 2026-02-20T19:52:17Z

Which issue does this PR close?

Informs: datafusion-contrib/datafusion-distributed#180
Follow up for: apache#20416

Rationale for this change

I'm interested in serializing a physical plan (post-physical optimizer) and executing it on a remote node. To do so, I need dynamic filters and references/pointers to dynamic filters to be preserved in the plan. Currently, nodes which produce filters such as HashJoinExec, AggregateExec, and SortExec, do not serialize their dynamic filters.

This change intends to update the above nodes to serialize dynamic filters and adds tests for the scenario above.

What changes are included in this PR?

Proto schema (datafusion.proto)

Added PhysicalExprNode dynamic_filter field to:

HashJoinExecNode (tag 11)
AggregateExecNode (tag 13)
SortExecNode (tag 5)

Plan node public API

Added with_dynamic_filter() and dynamic_filter() to HashJoinExec, AggregateExec, SortExec.

with_dynamic_filter() always

validates that the filter is valid for the plan node's schema
resets any internal state related to the dynamic filter

Serde

Using the new plan node public APIs above

Each node's try_from_* serialization function now reads dynamic_filter()
and serializes it via the proto converter
Each node's try_into_* deserialization function deserializes the field,
downcasts to DynamicFilterPhysicalExpr, and sets it on the node

Are these changes tested?

Added tests which create this plan and perform round-trip serialization on it (1 test each for HashJoinExec, AggregateExec, and SortExec).

    HashJoinExec ─── dynamic filter
         │               │  
         ▼               │
    FilterExec           | (optimizer pushes down this filter
         │               |
         ▼               ▼ 
    DataSourceExec  ─ dynamic filter

Added tests for with_dynamic_filter() and dynamic_filter() on dynamic_filter() to HashJoinExec, AggregateExec, SortExec.

jayshrivastava · 2026-02-23T21:48:10Z

Note for reviewers: I'm unsure if I should be using apply_expressions() or expressions() (see apache#20337) instead of with_dynamic_filter() and dynamic_filter()

LiaCastaneda · 2026-02-24T10:20:17Z

+    /// Returns the dynamic filter expression for this aggregate, if set.
+    pub fn dynamic_filter(&self) -> Option<&Arc<DynamicFilterPhysicalExpr>> {
+        self.dynamic_filter.as_ref().map(|df| &df.filter)
+    }


I think it would be cleaner to use apply_expressions (apache#20337), mainly because it's more generic and you can do basically anything with PhysicalExprs inside a plan, including detecting dynamic filters, and you wouldn't need to know beforehand which nodes are producers and consumers -- any custom logic can be done separately in the proto crate. It would also reduce overhead to people who wants to add a new ExecutionPlan that holds a DynamicFilterPhysicalExpr, they'd have to remember to also add the manual dynamic_filter() call here. Implementation for apply_expressions is part of ExecutionPlan and will not be optional, so users will not forget they have to do it in every node.

LiaCastaneda · 2026-02-24T10:46:45Z

+    pub fn with_dynamic_filter(
+        mut self,
+        filter: Arc<DynamicFilterPhysicalExpr>,
+    ) -> Result<Self> {


I see we do something similar for every producer/consumer, a more generic way to modify the expressions would probably implementing map_expressions like suggested here in ExecutionPlan to make it more generic?

Fixups for the cherry-picked commits from PRs apache#19437, apache#20037, apache#20416, and #2 to work with branch-52's partition-index APIs: - Update remap_children callers to use instance method signature - Adapt DynamicFilterUpdate::Global enum for new code paths - Add missing partitioned_exprs/runtime_partition fields to new constructors - Remove null_aware field (not on branch-52) - Replace FilterExecBuilder with FilterExec::try_new - Remove non-compiling tests that depend on upstream-only APIs - Fix duplicate imports in roundtrip test file Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fixups for the cherry-picked commits from PRs apache#19437, apache#20037, apache#20416, and jayshrivastava#2 to work with branch-52's partition-index APIs: - Update remap_children callers to use instance method signature - Adapt DynamicFilterUpdate::Global enum for new code paths - Add missing partitioned_exprs/runtime_partition fields to new constructors - Remove null_aware field (not on branch-52) - Replace FilterExecBuilder with FilterExec::try_new - Remove non-compiling tests that depend on upstream-only APIs - Fix duplicate imports in roundtrip test file Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

apache#20960) Reproducer for apache#20937

## Which issue does this PR close? ## Rationale for this change Spin-off of apache#21383 to have a bench for `First_Value`, `Last_Value` available before a PR with logic change. ## What changes are included in this PR? - Add benchmark for `GroupsAccumulator`. It's pretty complicated to test aggregates with grouping, since many operations are stateful, so I introduced end-to-end evaluate test (to actually test taking state) and convert_to_state (as in other benches) - A bench for a simple `Accumulator` ## Are these changes tested? - Manual bench run ## Are there any user-facing changes?

@latest

…e#21434) Bumps [taiki-e/install-action](https://github.com/taiki-e/install-action) from 2.70.3 to 2.74.0. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/taiki-e/install-action/releases">taiki-e/install-action's releases</a>.</em></p> <blockquote> <h2>2.74.0</h2> <ul> <li> <p>Support <code>cargo-deb</code>. (<a href="https://github.com/taiki-e/install-action/pull/1669">#1669</a>)</p> </li> <li> <p>Update <code>just@latest</code> to 1.49.0.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.4.</p> </li> </ul> <h2>2.73.0</h2> <ul> <li> <p>Introduce <a href="https://blog.yossarian.net/2025/11/21/We-should-all-be-using-dependency-cooldowns">dependency cooldown</a> when installing with <code>taiki-e/install-action@<tool_name></code>, <code>tool: <tool_name>@latest</code>, or <code>tool: <tool_name>@<omitted_version></code> to mitigate the risk of supply chain attacks by default. (<a href="https://github.com/taiki-e/install-action/pull/1666">#1666</a>)</p> <p>This action without this cooldown already takes a few hours to a few days for new releases to be reflected (as with other common package managers that verify checksums or signatures), so this should not affect most users.</p> <p>See the <a href="https://github.com/taiki-e/install-action#security">"Security" section in readme</a> for more details.</p> </li> <li> <p>Improve robustness for network failure.</p> </li> <li> <p>Documentation improvements.</p> </li> </ul> <h2>2.72.0</h2> <ul> <li> <p>Support <code>cargo-xwin</code>. (<a href="https://github.com/taiki-e/install-action/pull/1659">#1659</a>, thanks <a href="https://github.com/daxpedda"><code>@daxpedda</code></a>)</p> </li> <li> <p>Support trailing comma in <code>tool</code> input option.</p> </li> <li> <p>Update <code>tombi@latest</code> to 0.9.14.</p> </li> </ul> <h2>2.71.3</h2> <ul> <li> <p>Update <code>wasm-tools@latest</code> to 1.246.2.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.3.</p> </li> </ul> <h2>2.71.2</h2> <ul> <li> <p>Implement workaround for <a href="https://github.com/actions/partner-runner-images/issues/169">windows-11-arm runner bug</a> which sometimes causes installation failure. (<a href="https://github.com/taiki-e/install-action/pull/1657">#1657</a>)</p> <p>This addresses an issue that was attempted to be worked around in 2.71.0 but was insufficient.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.1.</p> </li> <li> <p>Update <code>uv@latest</code> to 0.11.3.</p> </li> </ul> <h2>2.71.1</h2> <ul> <li> <p>Fix a regression that caused an execution policy violation on self-hosted Windows runner due to use of non-default <code>powershell</code> shell, introduced in 2.71.0.</p> </li> <li> <p>Update <code>dprint@latest</code> to 0.53.2.</p> </li> </ul> <h2>2.71.0</h2> <ul> <li> <p>Support <code>wasm-tools</code>. (<a href="https://github.com/taiki-e/install-action/pull/1642">#1642</a>, thanks <a href="https://github.com/crepererum"><code>@crepererum</code></a>)</p> </li> <li> <p>Support <code>covgate</code>. (<a href="https://github.com/taiki-e/install-action/pull/1613">#1613</a>, thanks <a href="https://github.com/jesse-black"><code>@jesse-black</code></a>)</p> </li> <li> <p>Implement potential workaround for <a href="https://github.com/actions/partner-runner-images/issues/169">windows-11-arm runner bug</a> which sometimes causes issue that the action successfully completes but the tool is not installed. (<a href="https://github.com/taiki-e/install-action/pull/1647">#1647</a>)</p> </li> </ul>  </blockquote> <p>... (truncated)</p> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/taiki-e/install-action/blob/main/CHANGELOG.md">taiki-e/install-action's changelog</a>.</em></p> <blockquote> <h1>Changelog</h1> <p>All notable changes to this project will be documented in this file.</p> <p>This project adheres to <a href="https://semver.org">Semantic Versioning</a>.</p>  <h2>[Unreleased]</h2> <ul> <li>Update <code>tombi@latest</code> to 0.9.15.</li> </ul> <h2>[2.74.0] - 2026-04-06</h2> <ul> <li> <p>Support <code>cargo-deb</code>. (<a href="https://github.com/taiki-e/install-action/pull/1669">#1669</a>)</p> </li> <li> <p>Update <code>just@latest</code> to 1.49.0.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.4.</p> </li> </ul> <h2>[2.73.0] - 2026-04-05</h2> <ul> <li> <p>Introduce <a href="https://blog.yossarian.net/2025/11/21/We-should-all-be-using-dependency-cooldowns">dependency cooldown</a> when installing with <code>taiki-e/install-action@<tool_name></code>, <code>tool: <tool_name>@latest</code>, or <code>tool: <tool_name>@<omitted_version></code> to mitigate the risk of supply chain attacks by default. (<a href="https://github.com/taiki-e/install-action/pull/1666">#1666</a>)</p> <p>This action without this cooldown already takes a few hours to a few days for new releases to be reflected (as with other common package managers that verify checksums or signatures), so this should not affect most users.</p> <p>See the <a href="https://github.com/taiki-e/install-action#security">"Security" section in readme</a> for more details.</p> </li> <li> <p>Improve robustness for network failure.</p> </li> <li> <p>Documentation improvements.</p> </li> </ul> <h2>[2.72.0] - 2026-04-04</h2> <ul> <li> <p>Support <code>cargo-xwin</code>. (<a href="https://github.com/taiki-e/install-action/pull/1659">#1659</a>, thanks <a href="https://github.com/daxpedda"><code>@daxpedda</code></a>)</p> </li> <li> <p>Support trailing comma in <code>tool</code> input option.</p> </li> <li> <p>Update <code>tombi@latest</code> to 0.9.14.</p> </li> </ul> <h2>[2.71.3] - 2026-04-04</h2> <ul> <li> <p>Update <code>wasm-tools@latest</code> to 1.246.2.</p> </li> <li> <p>Update <code>mise@latest</code> to 2026.4.3.</p> </li> </ul> <h2>[2.71.2] - 2026-04-02</h2>  </blockquote> <p>... (truncated)</p> </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/taiki-e/install-action/commit/94cb46f8d6e437890146ffbd78a778b78e623fb2"><code>94cb46f</code></a> Release 2.74.0</li> <li><a href="https://github.com/taiki-e/install-action/commit/7fef44e1953572bcd24693fc866ad446fb1b4057"><code>7fef44e</code></a> Update changelog</li> <li><a href="https://github.com/taiki-e/install-action/commit/3bf2282bfd15630bbf9543653d4132bc64c9ca89"><code>3bf2282</code></a> Update mise manifest</li> <li><a href="https://github.com/taiki-e/install-action/commit/223b1d599eeacab3f4361624d257a1d50a152a7c"><code>223b1d5</code></a> Update tombi manifest</li> <li><a href="https://github.com/taiki-e/install-action/commit/fdcd834b4f2d5c0d663395c561633bbe19ecb08d"><code>fdcd834</code></a> Update <code>just@latest</code> to 1.49.0</li> <li><a href="https://github.com/taiki-e/install-action/commit/b45e8d6c436517e3d00a29c621a3534a176e4706"><code>b45e8d6</code></a> Update <code>mise@latest</code> to 2026.4.4</li> <li><a href="https://github.com/taiki-e/install-action/commit/4eac87a84609e7a285bcfd82df34e948017a9fcb"><code>4eac87a</code></a> ci: Update config</li> <li><a href="https://github.com/taiki-e/install-action/commit/5b413367489ec0bfe059fd6482a23cc544ed613e"><code>5b41336</code></a> Add issue template</li> <li><a href="https://github.com/taiki-e/install-action/commit/55a981690b2670493d925900a2569e5065371d31"><code>55a9816</code></a> Support cargo-deb</li> <li><a href="https://github.com/taiki-e/install-action/commit/7a562dfa955aa2e4d5b0fd6ebd57ff9715c07b0b"><code>7a562df</code></a> Release 2.73.0</li> <li>Additional commits viewable in <a href="https://github.com/taiki-e/install-action/compare/6ef672efc2b5aabc787a9e94baf4989aa02a97df...94cb46f8d6e437890146ffbd78a778b78e623fb2">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=taiki-e/install-action&package-manager=github_actions&previous-version=2.70.3&new-version=2.74.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

## Which issue does this PR close? - Closes apache#21354. ## Rationale for this change Currently, DataFusion supports 9 `datafusion.format.*` configs but their test coverage seem to be missed so this issue aims to add comprehensive test coverage for them. This is follow-up to recent `config framework` improvements: apache#20372 and apache#20816. ## What changes are included in this PR? New test coverage is being added for `datafusion.format.*` configs. ## Are these changes tested? Yes, new test coverage is being added for `datafusion.format.*` configs. ## Are there any user-facing changes? No

## Which issue does this PR close?  - Closes #. ## Rationale for this change This is an alternative approach to - apache#19687 Instead of reading the entire range in the json FileOpener, implement an AlignedBoundaryStream which scans the range for newlines as the FileStream requests data from the stream, by wrapping the original stream returned by the ObjectStore. This eliminated the overhead of the extra two get_opts requests needed by calculate_range and more importantly, it allows for efficient read-ahead implementations by the underlying ObjectStore. Previously this was inefficient because the streams opened by calculate_range included a stream from `(start - 1)` to file_size and another one from `(end - 1)` to end_of_file, just to find the two relevant newlines. ## What changes are included in this PR? Added the AlignedBoundaryStream which wraps a stream returned by the object store and finds the delimiting newlines for a particular file range. Notably it doesn't do any standalone reads (unlike the calculate_range function), eliminating two calls to get_opts. ## Are these changes tested? Yes, added unit tests.  ## Are there any user-facing changes? No --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

…20926) ## Which issue does this PR close? Part of apache#20766 ## Rationale for this change Grouped aggregations currently estimate output rows as input_rows, ignoring available NDV statistics. Spark's AggregateEstimation and Trino's AggregationStatsRule both use NDV products to tighten this estimate. This PR is highly referenced by both. - [Spark reference](https://github.com/apache/spark/blob/e8d8e6a8d040d26aae9571e968e0c64bda0875dc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/AggregateEstimation.scala#L38-L61) - [Trino reference](https://github.com/trinodb/trino/blob/43c8c3ba8bff814697c5926149ce13b9532f030b/core/trino-main/src/main/java/io/trino/cost/AggregationStatsRule.java#L92-L101) ## What changes are included in this PR? - Estimate aggregate output rows as min(input_rows, product(NDV_i + null_adj_i) * grouping_sets) - Cap by Top K limit when active since output row cannot be higher than K - Propagate distinct_count from child stats to group-by output columns ## Are these changes tested? Yes existing and new tests that cover different scenarios and edge cases ## Are there any user-facing changes? No

…e#21218) ## Which issue does this PR close?  - Closes apache#21217 ## What changes are included in this PR?  - Adds `ScalarUDFImpl::struct_field_mapping` - Adds logic in `ProjectionMapping` to decompose struct-producing functions into their field-level mapping entries so that orderings propagate through struct projections - Adds unit tests/SLT ## Are these changes tested? Yes. ## Are there any user-facing changes? N/A --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

## Which issue does this PR close?  - Closes apache#18816 . ## Rationale for this change  In `UserDefinedLogicalNodeCore`, the default implementation of `necessary_children_exprs ` returns `None`, which signals to the optimizer that it cannot determine which columns are required from the child. The optimizer takes a conservative approach and skips projection pruning for that node, leading to complex and redundant plans in the subtree. However, it would make more sense to assume all columns are required and let the optimizer proceed, rather than giving up on the entire subtree entirely. ## What changes are included in this PR? ```rust LogicalPlan::Extension(extension) => { if let Some(necessary_children_indices) = extension.node.necessary_children_exprs(indices.indices()) { ... } else { // Requirements from parent cannot be routed down to user defined logical plan safely // Assume it requires all input exprs here plan.inputs() .into_iter() .map(RequiredIndices::new_for_all_exprs) .collect() } } ``` instead of https://github.com/apache/datafusion/blob/b6d46a63824f003117297848d8d83b659ac2e759/datafusion/optimizer/src/optimize_projections/mod.rs#L331-L337  ## Are these changes tested? Yes. In addition to unit tests, I've also added a complete end-to-end integration test that reproduces the full scenario in the issue. This might seem redundant, bloated, or even unnecessary. Please let me know if I should remove these tests. An existing test is modified, but I think the newer behavior is expected.  ## Are there any user-facing changes? Yes. But I think the new implementation is the expected behavior.

…e#21436) ## Which issue does this PR close? - Closes #. ## Rationale for this change `Fix sort merge interleave overflow` (apache#20922) added a temporary `catch_unwind` shim around Arrow's `interleave` call because the upstream implementation still panicked on offset overflow at the time. Arrow 58.1.0 includes apache/arrow-rs#9549, which returns `ArrowError::OffsetOverflowError` directly instead of panicking. DataFusion main now depends on that release, so the panic recovery path is no longer needed and only broadens the set of failures we might accidentally treat as recoverable. ## What changes are included in this PR? - Remove the temporary panic-catching wrapper from `BatchBuilder::try_interleave_columns`. - Keep the existing retry logic, but trigger it only from the returned `OffsetOverflowError`. - Replace the panic-specific unit tests with a direct error-shape assertion. ## Are these changes tested? Yes. - `cargo test -p datafusion-physical-plan sorts::builder -- --nocapture` - `cargo test -p datafusion-physical-plan sorts:: -- --nocapture` - `./dev/rust_lint.sh` ## Are there any user-facing changes? No.

apache#21099) When the SQL unparser encountered a SubqueryAlias node whose direct child was an Aggregate (or other clause-building plan like Window, Sort, Limit, Union), it would flatten the subquery into a simple table alias, losing the aggregate entirely. For example, a plan representing: SELECT j1.col FROM j1 JOIN (SELECT max(id) AS m FROM j2) AS b ON j1.id = b.m would unparse to: SELECT j1.col FROM j1 INNER JOIN j2 AS b ON j1.id = b.m dropping the MAX aggregate and the subquery. Root cause: the SubqueryAlias handler in select_to_sql_recursively would call subquery_alias_inner_query_and_columns (which only unwraps Projection children) and unparse_table_scan_pushdown (which only handles TableScan/SubqueryAlias/Projection). When both returned nothing useful for an Aggregate child, the code recursed directly into the Aggregate, merging its GROUP BY into the outer SELECT instead of wrapping it in a derived subquery. The fix adds an early check: if the SubqueryAlias's direct child is a plan type that builds its own SELECT clauses (Aggregate, Window, Sort, Limit, Union), emit it as a derived subquery via self.derive() with the alias always attached, rather than falling through to the recursive path that would flatten it. ## Which issue does this PR close? - Closes apache#21098 ## Rationale for this change The SQL unparser silently drops subquery structure when a SubqueryAlias node directly wraps an Aggregate (or Window, Sort, Limit, Union). For example, a plan representing ```sql SELECT ... FROM j1 JOIN (SELECT max(id) FROM j2) AS b ... ``` unparses to ```sql SELECT ... FROM j1 JOIN j2 AS b ... ``` losing the aggregate entirely. This produces semantically incorrect SQL. ## What changes are included in this PR? In the SubqueryAlias handler within select_to_sql_recursively (`datafusion/sql/src/unparser/plan.rs`): - Added an early check: if the SubqueryAlias's direct child is a plan type that builds its own SELECT clauses (Aggregate, Window, Sort, Limit, Union) and cannot be reduced to a table scan, emit it as a derived subquery (SELECT ...) AS alias via self.derive() instead of recursing into the child and flattening it. - Added a helper requires_derived_subquery() that identifies plan types requiring this treatment. ## Are these changes tested? Yes. A new test test_unparse_manual_join_with_subquery_aggregate is added that constructs a SubqueryAlias > Aggregate plan (without an intermediate Projection) and asserts the unparsed SQL preserves the MAX() aggregate function call. This test fails without the fix. All current unparser tests succeed without modification ## Are there any user-facing changes? No. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Which issue does this PR close? - Closes apache#21410. ## Rationale for this change When `split_part` is invoked with a `StringViewArray`, we can avoid copying when constructing the return value by instead returning pointers into the view buffers of the input `StringViewArray`. This PR only applies this optimization to the code path for scalar `delimiter` and `position`, because that's the most common usage mode in practice. We could also optimize the array-args case but it didn't seem worth the extra code. Benchmarks (M4 Max): - scalar_utf8view_very_long_parts/pos_first: 102 µs → 68 µs (-33%) - scalar_utf8view_long_parts/pos_middle: 164 µs → 137 µs (-15%) - scalar_utf8_single_char/pos_first: 42.5 µs → 42.9 µs (no change) - scalar_utf8_single_char/pos_middle: 96.5 µs → 99.5 µs (noise) - scalar_utf8_single_char/pos_negative: 48.3 µs → 48.6 µs (no change) - scalar_utf8_multi_char/pos_middle: 132 µs → 132 µs (no change) - scalar_utf8_long_strings/pos_middle: 1.06 ms → 1.08 ms (noise) - array_utf8_single_char/pos_middle: 355 µs → 365 µs (noise) - array_utf8_multi_char/pos_middle: 357 µs → 360 µs (no change) ## What changes are included in this PR? * Implement optimization * Add benchmark that covers this case * Improve SLT test coverage for this code path ## Are these changes tested? Yes. ## Are there any user-facing changes? No.

@alamb

) ## Which issue does this PR close? This attempts to bridge the missing test coverage mentioned by @alamb on this issue apache#8791  - Closes #. ## Rationale for this change  ## What changes are included in this PR?  ## Are these changes tested? The changes are test  ## Are there any user-facing changes? no

…apache#21460) ## Which issue does this PR close?  - Closes apache#21459 ## Rationale for this change When a `ProjectionExec` sits on top of a `FilterExec` that already carries an explicit projection, the `ProjectionPushdown` optimizer attempts to swap them via `try_swapping_with_projection`. The swap replaces the `FilterExec's` input with the narrower `ProjectionExec`, but `FilterExecBuilder::from(self)` carried over the old projection indices (e.g. [0, 1, 2]). After the swap the new input only has the columns selected by the `ProjectionExec` (e.g. 2 columns), so .build() tries to validate the stale projection against the narrower schema and panics with "project index 2 out of bounds, max field 2". ## What changes are included in this PR? In `FilterExec::try_swapping_with_projection`, after replacing the input with the narrower ProjectionExec, clear the FilterExec's own projection via .`apply_projection(None)`. The ProjectionExec that is now the input already handles column selection, so the FilterExec no longer needs its own projection. ## Are these changes tested? yes, add test case ## Are there any user-facing changes?

- Closes apache#21938 See apache#21938 (comment) I feel like this is quite a useful check - and it's relatively small - let's run it always?

…protobuf (apache#21913) ## Which issue does this PR close? - apache#21911 ## Rationale for this change The breaking-change detector added in apache#21499 fails on fork PRs with HTTP 403: > The GITHUB_TOKEN has read-only permissions in pull requests from forked repositories. > > From [GitHub Docs](https://docs.github.com/en/actions/reference/events-that-trigger-workflows#pull_request) A read-only token can't post the sticky comment, so the workflow errors out at the `gh api … POST /comments` call. We can't switch to `pull_request_target` either - ASF infra policy forbids it for any workflow exposing `GITHUB_TOKEN` (https://infra.apache.org/github-actions-policy.html), and `cargo-semver-checks` compiles fork-controlled code (`build.rs`, proc macros) anyway, so granting it a write token would be unsafe. ## What changes are included in this PR? Split the comment posting into a companion `workflow_run` workflow: - `breaking_changes_detector.yml` keeps the `pull_request` trigger but only stages the result (`pr_number`, `result`, `logs`) and uploads it as an artifact. No write token, no comment posting from this workflow. - `breaking_changes_detector_comment.yml` triggers on `workflow_run`, runs in the base-repo context with `pull-requests: write`, downloads the artifact, validates the inputs, and upserts/deletes the sticky comment via `actions-cool/maintain-one-comment`. Never checks out PR code. The comment workflow uses a runtime-randomized heredoc delimiter when piping the fork-controlled logs into `$GITHUB_OUTPUT`, to stop log content from closing the heredoc early and overwriting the validated `result` output (or injecting other keys). Drops the now-unused `comment` subcommand from `ci/scripts/changed_crates.sh`. ---- also installed protobuf as noticed this failed when building subtrait in: - apache#15591 ## Are these changes tested? No, cant really test it ## Are there any user-facing changes? No --------- Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me> Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>

…#21854) Add a DataFusion-side trait that abstracts over the bulk-NULL string array builders (GenericStringArrayBuilder<O> and StringViewArrayBuilder), so that functions which dispatch over Utf8/LargeUtf8/Utf8View can adopt the new builders without giving up their single-bodied generic implementation. Convert `repeat` as the first call site. The output is null iff either input is null, so the per-row null match becomes a single NullBuffer::union over the input null buffers, evaluated once before the loop. Also mark the inherent append_value/append_placeholder methods on the new builders as #[inline]; without this, calls through the trait wrapper end up going through a non-inlined inherent and slow down small-output paths. ## Which issue does this PR close? - Closes apache#21853. ## Rationale for this change Optimize NULL handling in `repeat` using the bulk-NULL string builders that have recently been added. This requires adding `BulkNullStringArrayBuilder`, a trait that is similar in spirit to Arrow's `StringLikeArrayBuilder`. Benchmarks: - repeat_string overflow [size=1024, repeat_times=1073741824]: 1022.5ns → 1054.5ns (+3.13%) - repeat_string overflow [size=4096, repeat_times=1073741824]: 1016.6ns → 1055.3ns (+3.81%) - repeat_large_string [size=1024, repeat_times=3]: 32.4µs → 26.6µs (−17.90%) - repeat_large_string [size=4096, repeat_times=3]: 127.4µs → 104.0µs (−18.37%) - repeat_string [size=1024, repeat_times=3]: 32.6µs → 26.8µs (−17.79%) - repeat_string [size=4096, repeat_times=3]: 127.4µs → 105.5µs (−17.19%) - repeat_string_view [size=1024, repeat_times=3]: 37.3µs → 31.7µs (−15.01%) - repeat_string_view [size=4096, repeat_times=3]: 146.5µs → 124.5µs (−15.02%) - repeat_large_string [size=1024, repeat_times=30]: 82.0µs → 80.4µs (−1.95%) - repeat_large_string [size=4096, repeat_times=30]: 344.2µs → 338.7µs (−1.60%) - repeat_string [size=1024, repeat_times=30]: 81.7µs → 79.7µs (−2.45%) - repeat_string [size=4096, repeat_times=30]: 352.2µs → 334.7µs (−4.97%) - repeat_string_view [size=1024, repeat_times=30]: 88.1µs → 83.1µs (−5.68%) - repeat_string_view [size=4096, repeat_times=30]: 368.8µs → 342.6µs (−7.10%) - repeat/scalar_utf8: 174.7ns → 179.2ns (+2.58%) - repeat/scalar_utf8view: 174.5ns → 180.5ns (+3.44%) ## What changes are included in this PR? * Add `BulkNullStringArrayBuilder` * Optimize `repeat` using `BulkNullStringArrayBuilder` * Inline some functions in GenericStringBuilder; benchmarking suggests this is a win ## Are these changes tested? Yes. ## Are there any user-facing changes? No. --------- Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com> Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>

## Which issue does this PR close?  - Closes apache#20266  ## What changes are included in this PR? Created a new function to insert the separator (,) into the numbers if the flag is enabled  ## Are these changes tested? Tests passed locally Added unit tests to verify functionality  ## Are there any user-facing changes? No   No changes to public api ## Additional Info Claude was used to assist in identifying the source of the issue.

…che#21900) ## Which issue does this PR close? - Closes apache#21843. ## Rationale for this change Performance improvement on large hash-repartitions. TPC-H at bigger scale factor shows biggest benefit: <details> ``` -------------------- Benchmark tpch_sf10.json -------------------- ┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓ ┃ Query ┃ HEAD ┃ perf-strength-reduce-hash-partition ┃ Change ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩ │ QQuery 1 │ 327.64 / 329.66 ±1.44 / 331.84 ms │ 327.82 / 330.00 ±1.68 / 331.48 ms │ no change │ │ QQuery 2 │ 131.35 / 138.32 ±3.91 / 143.24 ms │ 125.94 / 126.49 ±0.61 / 127.42 ms │ +1.09x faster │ │ QQuery 3 │ 286.47 / 300.24 ±8.19 / 308.99 ms │ 273.76 / 276.01 ±1.80 / 277.98 ms │ +1.09x faster │ │ QQuery 4 │ 158.30 / 160.81 ±2.21 / 163.54 ms │ 137.56 / 138.79 ±0.98 / 139.80 ms │ +1.16x faster │ │ QQuery 5 │ 428.90 / 437.68 ±4.45 / 440.83 ms │ 390.52 / 396.51 ±4.30 / 403.67 ms │ +1.10x faster │ │ QQuery 6 │ 131.88 / 132.83 ±1.17 / 134.81 ms │ 133.06 / 134.70 ±1.22 / 135.83 ms │ no change │ │ QQuery 7 │ 541.09 / 545.88 ±4.21 / 552.67 ms │ 508.51 / 531.75 ±16.82 / 548.21 ms │ no change │ │ QQuery 8 │ 467.86 / 476.44 ±6.95 / 483.87 ms │ 427.19 / 439.03 ±9.88 / 453.56 ms │ +1.09x faster │ │ QQuery 9 │ 649.16 / 660.07 ±10.12 / 676.72 ms │ 605.25 / 611.87 ±5.72 / 620.70 ms │ +1.08x faster │ │ QQuery 10 │ 327.64 / 339.90 ±6.92 / 348.85 ms │ 321.66 / 330.67 ±4.76 / 334.89 ms │ no change │ │ QQuery 11 │ 104.93 / 107.54 ±1.71 / 110.18 ms │ 92.80 / 101.35 ±12.27 / 125.63 ms │ +1.06x faster │ │ QQuery 12 │ 198.96 / 202.37 ±2.45 / 206.21 ms │ 195.07 / 197.77 ±4.26 / 206.26 ms │ no change │ │ QQuery 13 │ 300.44 / 312.37 ±6.87 / 321.90 ms │ 291.85 / 308.47 ±10.23 / 317.55 ms │ no change │ │ QQuery 14 │ 188.06 / 193.71 ±4.81 / 200.69 ms │ 182.89 / 186.72 ±3.67 / 192.75 ms │ no change │ │ QQuery 15 │ 334.88 / 339.95 ±5.81 / 350.78 ms │ 330.79 / 336.21 ±4.31 / 342.71 ms │ no change │ │ QQuery 16 │ 78.38 / 81.25 ±2.51 / 84.55 ms │ 74.35 / 76.61 ±2.80 / 81.94 ms │ +1.06x faster │ │ QQuery 17 │ 744.08 / 761.70 ±12.84 / 781.69 ms │ 703.40 / 724.66 ±23.35 / 770.05 ms │ no change │ │ QQuery 18 │ 760.17 / 782.23 ±12.12 / 796.85 ms │ 725.45 / 744.71 ±15.59 / 765.59 ms │ no change │ │ QQuery 19 │ 267.90 / 280.99 ±14.61 / 306.80 ms │ 275.58 / 298.23 ±27.69 / 351.75 ms │ 1.06x slower │ │ QQuery 20 │ 311.46 / 323.12 ±10.13 / 341.26 ms │ 312.13 / 319.42 ±4.39 / 324.46 ms │ no change │ │ QQuery 21 │ 816.40 / 837.33 ±19.78 / 870.18 ms │ 766.23 / 778.58 ±8.98 / 792.31 ms │ +1.08x faster │ │ QQuery 22 │ 81.46 / 84.94 ±2.58 / 88.20 ms │ 75.31 / 77.73 ±1.39 / 79.55 ms │ +1.09x faster │ └───────────┴────────────────────────────────────┴─────────────────────────────────────┴───────────────┘ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ Benchmark Summary ┃ ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ Total Time (HEAD) │ 7829.34ms │ │ Total Time (perf-strength-reduce-hash-partition) │ 7466.29ms │ │ Average Time (HEAD) │ 355.88ms │ │ Average Time (perf-strength-reduce-hash-partition) │ 339.38ms │ │ Queries Faster │ 10 │ │ Queries Slower │ 1 │ │ Queries with No Change │ 11 │ │ Queries with Failure │ 0 │ └────────────────────────────────────────────────────┴───────────┘ ``` </details> ## What changes are included in this PR? Use strength-reduce to speed up hash % partition ## Are these changes tested? Existing tests ## Are there any user-facing changes? A small change to `new_hash_partitioner` to return a `Result` instead of panic during runtime

## Which issue does this PR close?  - Closes #. ## Rationale for this change  ### Reproducer Under `benchmarks/`, run `./bench.sh run tpch`, the generated result file won't get ignored by git ### Reason A recent PR has changed one entry in `.gitignore` and caused the issue - https://github.com/apache/datafusion/pull/21707/changes#diff-8ef3f336d18af2c481452ec156ec35b744a9c459c4e11f4bd72ceeb75ea6b6d3 ### PR This PR reverted this entry to the previous version. Test: the above reproducer is working as expected after the change ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?

## Which issue does this PR close?  - Part of apache#8229 ## Rationale for this change DataFusion already has shared logic for merging `Statistics`, but `UnionExec` and `InterleaveExec` still used their own local merge code. That left duplicated path in the codebase and kept the behavior less consistent than the other statistics aggregation paths.  ## What changes are included in this PR? - Reuse `Statistics::try_merge_iter` for `UnionExec` statistics merging - Reuse the same shared path for `InterleaveExec` statistics merging - Remove the local union-specific statistics merge helpers - Add tests for union and interleave statistics merging - Add a test for interleave partition-level statistics merging  ## Are these changes tested? Yes  ## Are there any user-facing changes? No

## Which issue does this PR close? N/A. This is a benchmark follow-up for apache#21637. ## Rationale for this change This adds a ClickBench extended query that exercises Parquet filter pushdown when row group statistics can prove a string range predicate matches every row in the row group. This case is useful for validating the optimization in apache#21637: when Parquet statistics prove a row group is fully matched, DataFusion can avoid evaluating the pushed-down RowFilter for that row group. ## What changes are included in this PR? - Add `benchmarks/queries/clickbench/extended/q13.sql`. - Document Q13 in the ClickBench query README. ## Are these changes tested? I ran a local synthetic-data comparison for this query. With `target_partitions=1`, the apache#21637 branch reduced scan processing time from about 85.82ms to 24.89ms, reduced `bytes_scanned` from 26.12M to 400.6K, and reduced `row_pushdown_eval_time` from 4.12ms to effectively zero. ## Are there any user-facing changes? No public API changes. This adds a benchmark query and benchmark documentation.

## Which issue does this PR close?  - Closes #. ## Rationale for this change `array_any_match` is a commonly supported higher-order function in systems like Spark (`exists`), Trino (`any_match`) among other engines. It seems like a natural first addition alongside `array_transform` and worth upstreaming I think. ## What changes are included in this PR? Adds `array_any_match(array, predicate)` as a new higher-order function (with aliases `any_match` and `list_any_match`). It returns: `true` if any element satisfies the predicate `false` if no element does (including empty arrays) `null` if the predicate returns null for some elements and false for all others ## Are these changes tested? Yes I added unit tests and sqllogic tests ## Are there any user-facing changes? Yes -- new SQL functions array_any_match, any_match, and list_any_match are available. --------- Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>

apache#21898) ## Summary Adds `WITH ORDER (x)` to `CREATE EXTERNAL TABLE skew_parquet` / `skew_parquet_single` in `explain_analyze.slt` so `FileScanConfig` preserves scan ordering (`preserve_order`), keeping per-partition `output_rows` stable under dynamic file scheduling (PR apache#21351). ## Related - Follow-up to flaky skew assertions discussed around apache#21866 / apache#21850. ## Testing - `cargo test -p datafusion-sqllogictest --test sqllogictests -- explain_analyze` (recommended before merge) Sqllogictest-only change. Made with [Cursor](https://cursor.com) --------- Co-authored-by: Yongting You <2010youy01@gmail.com> Co-authored-by: Raz Luvaton <16746759+rluvaton@users.noreply.github.com>

…oin plans (apache#21947) ## Which issue does this PR close? Closes apache#21946 ## Rationale for this change `adjust_input_keys_ordering` returns `Transformed::yes` unconditionally in the default else branch, even when `requirements.data` is empty and no changes were made. This triggers unnecessary `with_new_children` rebuilds on every node in the plan tree for non-join/non-aggregate queries. For plans with custom `ExecutionPlan` nodes whose `with_new_children` is expensive (e.g. nodes that re-evaluate cost functions on rebuild), this causes significant overhead. ## What changes are included in this PR? Add an early return with `Transformed::no` when `requirements.data.is_empty()` in the default else branch of `adjust_input_keys_ordering`. This skips the unnecessary plan tree rebuild for simple scan/filter/limit plans that have no join key reordering requirements. ## Are these changes tested? Yes, two unit tests added: - `adjust_input_keys_ordering_no_transform_for_scan` — verifies a bare parquet scan returns `Transformed::no` - `adjust_input_keys_ordering_no_transform_for_filter_scan` — verifies a filter→scan tree returns `Transformed::no` via `transform_down` ## Are there any user-facing changes? No. This is a performance optimization that does not change query results or plan structure.

## Which issue does this PR close? Closes apache#21784 ## Rationale for this change Apache Arrow added `BooleanArray::has_true()` and `has_false()` so callers can answer “any true/false?” without a full bit count. That can short-circuit and avoid unnecessary work compared to patterns like `true_count() == 0` or `true_count() > 0`. This PR applies those APIs across DataFusion where the logic is purely existential (or equivalent via null-safe “all true” / “no true” checks), matching the audit suggested in the issue. ## What changes are included in this PR? - Replace hot-path checks that only needed existence or emptiness with `has_true()` / `has_false()` (and `null_count()` where needed), including: - Nested/array helpers (`array_has`, list replace), Spark `array_contains` null-semantics fast path - Physical expressions: `evaluate_selection`, binary AND/OR short-circuit, CASE/IN list loops - `scatter` fast paths - Top-K filter handling, sort-merge join filter, nested-loop join bitmap checks - Parquet column stats (`metadata.rs`, `has_any_exact_match`) - Keep `true_count()` / `false_count()` where an actual count is required (row counts, metrics, selectivity, `to_array(n)`, etc.) - Import `arrow::array::Array` where `null_count()` is used on `BooleanArray` in trait-heavy paths ## Are these changes tested? Existing tests cover this behavior; the edits are semantics-preserving refactors (same conditions, cheaper primitives). No new tests were added. ## Are there any user-facing changes? No. Behavior should be unchanged; this is an internal performance/clarity improvement. --------- Co-authored-by: Raushan Prabhakar <ros@Raushans-MacBook-Air.local> Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me> Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>

…lue types that require an extra coercion step (apache#21924) The coerce_fn applied to REE values needs to be the higher-level coerce function so that any REE value can be coerced (not just primitive types). ## Which issue does this PR close?  - Closes apache#21923 ## Rationale for this change  Queries were failing ## What changes are included in this PR?  Correct unwrapping of REE values for regex/LIKE coercion ## Are these changes tested?  Yes, with slt tests. ## Are there any user-facing changes?  Queries that would previously error now pass.  --------- Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>

…he#21894) ## Which issue does this PR close?  - Closes #. ## Rationale for this change For sliding window aggregation, `retract_batch` removes outgoing rows from the aggregate state on every window slide. `median` and `percentile_cont` store primitive numeric values internally, but their retract paths converted values through `ScalarValue` before matching them. This PR keeps retract matching on native Arrow values, reducing conversion and hashing overhead in that hot path.  ## What changes are included in this PR? - Optimize `median` and `percentile_cont` `retract_batch` using `Hashable<T::Native>` keys. - Add sliding-window benchmarks for `median` and `percentile_cont` with window sizes `256`, `4096`, and `16384`. ### Benchmarks ``` group main optimized ----- ---- --------- median sliding_window f64 no_nulls window_size=16384 2.38 3.3±0.06ms ? ?/sec 1.00 1396.6±36.31µs ? ?/sec median sliding_window f64 no_nulls window_size=256 2.73 781.3±20.80µs ? ?/sec 1.00 286.0±10.52µs ? ?/sec median sliding_window f64 no_nulls window_size=4096 2.11 1052.2±27.13µs ? ?/sec 1.00 499.3±19.44µs ? ?/sec median sliding_window f64 with_nulls window_size=16384 2.52 3.0±0.06ms ? ?/sec 1.00 1173.1±36.86µs ? ?/sec median sliding_window f64 with_nulls window_size=256 2.67 728.6±20.07µs ? ?/sec 1.00 272.8±12.90µs ? ?/sec median sliding_window f64 with_nulls window_size=4096 2.11 954.8±27.37µs ? ?/sec 1.00 452.6±13.08µs ? ?/sec percentile_cont sliding_window f64 no_nulls window_size=16384 3.86 10.7±0.24ms ? ?/sec 1.00 2.8±0.05ms ? ?/sec percentile_cont sliding_window f64 no_nulls window_size=256 2.49 797.8±25.51µs ? ?/sec 1.00 320.1±58.86µs ? ?/sec percentile_cont sliding_window f64 no_nulls window_size=4096 3.44 3.2±0.12ms ? ?/sec 1.00 928.2±42.15µs ? ?/sec percentile_cont sliding_window f64 with_nulls window_size=16384 3.72 6.7±0.90ms ? ?/sec 1.00 1790.9±22.20µs ? ?/sec percentile_cont sliding_window f64 with_nulls window_size=256 2.51 721.0±25.52µs ? ?/sec 1.00 286.7±30.34µs ? ?/sec percentile_cont sliding_window f64 with_nulls window_size=4096 3.34 2.2±0.14ms ? ?/sec 1.00 667.1±20.87µs ? ?/sec ```  ## Are these changes tested? Yes. existed slt passed.  ## Are there any user-facing changes? No.   --------- Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>

We were missing a couple of branches to unwrap REE in type_union_resolution_coercion. ## Which issue does this PR close?  - Closes apache#21918 ## Rationale for this change  Fix an unexpected error ## What changes are included in this PR?  Type coercion match arms for REE ## Are these changes tested? Yes, via sql logic tests  ## Are there any user-facing changes?   Queries that errored now complete successfully Signed-off-by: Alfonso Subiotto Marques <alfonso.subiotto@polarsignals.com>

## Which issue does this PR close? - Refers apache#12709. ## Rationale for this change Binary arguments are supported for concat UDFs, but not for the pipe operator (`||`), which supports only text. ## What changes are included in this PR? - Support binary concat by providing specialised kernels for pure binary operations. Avoid support of mixed string/binary arguments as it doesn't match the behaviour of major DBs, except for Postgres (see the table in the linked ticket). - Add `concat_elements_binary_view_array` kernel - Refactor private `binary_coercion` to support symmetric BinaryLike + BinaryLike - required for the new codeflow Concat UDFs are out of scope and supported separately. ## Are these changes tested? - Existing SLTs - Moved a few tests to a more appropriate `binary.slt` - Added new unit tests ## Are there any user-facing changes? Concatenation `||` operator now allows binary+binary concatenation (`SELECT x'636166c3a9' || x'68656c6c6f'`), but denies mixed string+binary concatenation `SELECT x'636166c3a9' || 'hello'` --------- Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>

…1885) ## Which issue does this PR close?  - Closes apache#21871 ## Rationale for this change FilterExec supports two semantically different projection states: - None → return all columns (full projection) - Some(vec![]) → return no columns (empty projection) However, both cases were being serialized identically as an empty vector in the proto representation. During deserialization, an empty vector was always mapped back to None, meaning an empty projection would silently become a full projection after a serde round-trip. ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?  No

…ptionProperties fallible (apache#21603) ## Which issue does this PR close? - Closes apache#21602. ## Rationale for this change Fail quickly with a helpful error if we're unable to represent a `FileDecryptionProperties` instance as `ConfigFileDecryptionProperties` ## What changes are included in this PR? * Change the implementation of `From<&Arc<FileDecryptionProperties>>` for `ConfigFileDecryptionProperties` to `TryFrom`. * Fail the conversion if we can't get the footer key from the `FileDecryptionProperties` with empty metadata ## Are these changes tested? Yes I've added a new unit test. I also tested this with a branch of delta-rs that uses Datafusion with Parquet encryption, and this required only minor changes to tests and examples: corwinjoy/delta-rs@file_format_options_squashed...adamreeve:delta-rs:test-datafusion-change ## Are there any user-facing changes? Yes, this is a breaking API change. --------- Co-authored-by: Kumar Ujjawal <ujjawalpathak6@gmail.com> Co-authored-by: Nuno Faria <nunofpfaria@gmail.com>

## Which issue does this PR close?  Informs: datafusion-contrib/datafusion-distributed#180 Closes: apache#20418 ## Rationale for this change  Consider you have a plan with a `HashJoinExec` and `DataSourceExec` ``` HashJoinExec(dynamic_filter_1 on a@0) (...left side of join) ProjectionExec(a := Column("a", source_index)) DataSourceExec ParquetSource(predicate = dynamic_filter_2) ``` You serialize the plan, deserialize it, and execute it. What should happen is that the dynamic filter should "work", meaning: 1. When you deserialize the plan, both the `HashJoinExec` and `DataSourceExec` should have pointers to the same `DynamicFilterPhysicalExpr` 2. The `DynamicFilterPhysicalExpr` should be updated during execution by the `HashJoinExec` and the `DataSourceExec` should filter out rows This does not happen today for a few reasons, a couple of which this PR aims to address 1. `DynamicFilterPhysicalExpr` is not survive round-tripping. The internal exprs get inlined (ex. it may be serialized as `Literal`) due to the `PhysicalExpr::snapshot()` API 2. Even if `DynamicFilterPhysicalExpr` survives round-tripping, the one pushed down to the `DataSourceExec` often has different children. In this case, you have two `DynamicFilterPhysicalExpr` which do not survive deduping, causing referential integrity to be lost. ## What changes are included in this PR?  This PR aims to fix those problems by: 1. Removing the `snapshot()` call from the serialization process 2. Adding protos for `DynamicFilterPhysicalExpr` so it can be serialized and deserialized 3. Removing `Arc`-based deduplication. We now only dedupe on `expression_id` if the `PhysicalExpr` reports a `expression_id`. After this change, only `DynamicFilterPhysicalExpr` reports an `expression_id` to be deduped. 4. `expression_id` is now just a random u64. Since a given query likely only has a few `DynamicFilterPhysicalExpr` instances, the odds of a collision are very low 5. There's no need for a `DedupingSerializer` anymore since the `expression_id` is already stored in the dynamic filter proto itself Future work: 1. Serialize dynamic filters in `HashJoinExec`, `AggregateExec` and `SortExec` 2. Add tests which actually execute plans after deserialization and assert that dynamic filtering is functional 3. Add proto converters to the `PhysicalExtensionCodec` trait so implementors can utilize deduping logic ## Are these changes tested? - adds tests which roundtrip dynamic filters and assert that referential integrity is maintained - removes tests that test `Arc`-based deduplication and session id rotation since we don't support that anymore ## Are there any user-facing changes? - The default codec does not call `snapshot()` on `PhysicalExpr` during serialization anymore. This means that `DynamicFilterPhysicalExpr` are now serialized and deserialized without snapshotting. - All `PhysicalExpr` are not deduped anymore. Only `DynamicFilterPhysicalExpr` is --------- Co-authored-by: Dmitrii Blaginin <dmitrii@blaginin.me>

## Which issue does this PR close?  - Closes #. ## Rationale for this change Fix negative cases with `substring`, some tests were incorrect  ## What changes are included in this PR?  ## Are these changes tested?  ## Are there any user-facing changes?

Builds on the prior `DynamicFilterPhysicalExpr` proto serialization + dedupe work so plan-node references to a shared dynamic filter survive roundtrip. - Adds `dynamic_filter` to the proto messages for `SortExec`, `AggregateExec`, and `HashJoinExec` and wires it through to/from-proto. - Exposes `dynamic_filter()` / `with_dynamic_filter()` on those plan nodes so the dedupe deserializer can reattach the shared `DynamicFilterPhysicalExpr` after roundtrip. - Extracts `supported_accumulators_info()` on `AggregateExec` and uses it from `init_dynamic_filter` and `with_dynamic_filter`. - Adds `test_hash_join_with_dynamic_filter_roundtrip`, `test_aggregate_with_dynamic_filter_roundtrip`, and `test_sort_topk_with_dynamic_filter_roundtrip` to verify that the plan node and the pushdown-target `ParquetSource` predicate end up pointing at the same `expression_id` after roundtrip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added physical-plan core proto physical-expr labels Feb 20, 2026

jayshrivastava changed the title ~~wip~~ Serialize dynamic filters on execution plan nodes (HashJoin, Aggregate, Sort) Feb 23, 2026

jayshrivastava force-pushed the js/serialize-dynamic-filters-in-execution-plans-2 branch from ff17e8a to 00b2a63 Compare February 23, 2026 19:44

github-actions Bot added physical-expr and removed physical-expr labels Feb 23, 2026

LiaCastaneda reviewed Feb 24, 2026

View reviewed changes

jayshrivastava force-pushed the js/dedupe-dynamic-filter-inner-state branch from c5d0e2f to fef4259 Compare February 26, 2026 18:48

jayshrivastava force-pushed the js/serialize-dynamic-filters-in-execution-plans-2 branch 2 times, most recently from 4889d13 to ed4c611 Compare February 26, 2026 18:53

jayshrivastava force-pushed the js/dedupe-dynamic-filter-inner-state branch from cb23b01 to 18b0289 Compare March 19, 2026 15:04

jayshrivastava pushed a commit that referenced this pull request Mar 19, 2026

test: add reproducer for Dictionary InList pushdown type mismatch (#2… (

3ece9ec

apache#20960) Reproducer for apache#20937

theirix and others added 13 commits April 7, 2026 13:16

github-actions Bot added functions optimizer spark substrait sqllogictest labels Apr 29, 2026

blaginin and others added 18 commits April 29, 2026 21:52

dependencies check are now required to merge ci (apache#21940)

42cd2fa

- Closes apache#21938 See apache#21938 (comment) I feel like this is quite a useful check - and it's relatively small - let's run it always?

jayshrivastava changed the base branch from js/dedupe-dynamic-filter-inner-state to js/dedupe-dynamic-filter-inner-state-v2 May 1, 2026 14:43

jayshrivastava and others added 3 commits May 1, 2026 17:22

jayshrivastava force-pushed the js/serialize-dynamic-filters-in-execution-plans-2 branch from 90abf76 to c3aef20 Compare May 4, 2026 17:47

jayshrivastava changed the base branch from js/dedupe-dynamic-filter-inner-state-v2 to main May 4, 2026 17:47

jayshrivastava closed this May 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialize dynamic filters on execution plan nodes (HashJoin, Aggregate, Sort)#2

Serialize dynamic filters on execution plan nodes (HashJoin, Aggregate, Sort)#2
jayshrivastava wants to merge 1589 commits intomainfrom
js/serialize-dynamic-filters-in-execution-plans-2

jayshrivastava commented Feb 20, 2026 •

edited

Loading

Uh oh!

jayshrivastava commented Feb 23, 2026

Uh oh!

LiaCastaneda Feb 24, 2026

Uh oh!

LiaCastaneda Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

Conversation

jayshrivastava commented Feb 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Proto schema (datafusion.proto)

Plan node public API

Serde

Are these changes tested?

Uh oh!

jayshrivastava commented Feb 23, 2026

Uh oh!

LiaCastaneda Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

LiaCastaneda Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

jayshrivastava commented Feb 20, 2026 •

edited

Loading