Skip to content

Conversation

@delamarch3
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Account for columns in any comparisons to the equi_join_count. Eg a pair of columns can be compared to one equi_join. Then check for column pairs while creating the new on expression.

Are these changes tested?

Yes, I've added some unit tests.

Are there any user-facing changes?

with_new_exprs with a column will no longer return an error if the other conditions are correct.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Mar 6, 2025
assert!(expr.len() >= equi_expr_count);

let col_pair_count =
expr.iter().filter(|e| matches!(e, Expr::Column(_))).count() / 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filter can also be a column expression if it's a boolean column. And the on expression may not be a column expr, e.g. select * from t1 join t2 on t1.a+2 = t2.a+1 where t2.b.

I think we don't need to match expr types; we just extract them according to the format returned by apply_expressions(), where the first on.len() * 2 elements are the on-expression pairs, and the last one is the filter expression.

new_on.push((left, right));
}
_ => internal_err!(
"The front part expressions should be a binary equality expression or a column expression, actual:{equi_expr}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"The front part expressions should be a binary equality expression or a column expression, actual:{equi_expr}"
"The front part expressions should be a binary equality expression or a column expression, actual: {equi_expr}"

}

#[test]
fn test_join_with_new_exprs() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn test_join_with_new_exprs() {
fn test_join_with_new_exprs() -> Result<()> {

Make the function fallible so that many unwrap's can be replaced with ?

@delamarch3
Copy link
Contributor Author

Thanks for the reviews, I've pushed up your suggestions @jonahgao @niebayes

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM👍, thank you @delamarch3

// and the struct of each equi-expr is like `left-expr = right-expr`.
assert_eq!(expr.len(), equi_expr_count);
let new_on = expr.into_iter().map(|equi_expr| {
let mut new_on = Vec::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use with_capacity would be preferable.

 let mut new_on = Vec::with_capacity(on.len());

@jonahgao jonahgao merged commit 8356c94 into apache:main Mar 8, 2025
24 checks passed
@jonahgao
Copy link
Member

jonahgao commented Mar 8, 2025

Thanks @delamarch3 @niebayes

danila-b pushed a commit to danila-b/datafusion that referenced this pull request Mar 8, 2025
* handle columns in with_new_exprs with Join

* test doesn't return result

* take join from result

* clippy

* make test fallible

* accept any pair of expression for new_on in with_new_exprs for Join

* use with_capacity
avantgardnerio pushed a commit to coralogix/arrow-datafusion that referenced this pull request Nov 17, 2025
* handle columns in with_new_exprs with Join

* test doesn't return result

* take join from result

* clippy

* make test fallible

* accept any pair of expression for new_on in with_new_exprs for Join

* use with_capacity

(cherry picked from commit 8356c94)
Dandandan pushed a commit to coralogix/arrow-datafusion that referenced this pull request Nov 17, 2025
apache#15055

* handle columns in with_new_exprs with Join

* test doesn't return result

* take join from result

* clippy

* make test fallible

* accept any pair of expression for new_on in with_new_exprs for Join

* use with_capacity
Dandandan pushed a commit to coralogix/arrow-datafusion that referenced this pull request Nov 17, 2025
apache#15055

* handle columns in with_new_exprs with Join

* test doesn't return result

* take join from result

* clippy

* make test fallible

* accept any pair of expression for new_on in with_new_exprs for Join

* use with_capacity
Dandandan added a commit to coralogix/arrow-datafusion that referenced this pull request Nov 17, 2025
apache#15055

* handle columns in with_new_exprs with Join

* test doesn't return result

* take join from result

* clippy

* make test fallible

* accept any pair of expression for new_on in with_new_exprs for Join

* use with_capacity

Co-authored-by: delamarch3 <[email protected]>
Dandandan added a commit to coralogix/arrow-datafusion that referenced this pull request Nov 17, 2025
* Get working build

* Add pool_size method to MemoryPool (#218) (#230)

* Add pool_size method to MemoryPool

* Fix

* Fmt

Co-authored-by: Daniël Heres <[email protected]>

* Respect `IGNORE NULLS` flag in `ARRAY_AGG` (#260) (apache#15544) v48

* Hook for doing distributed `CollectLeft` joins (#269)

* Ignore writer shutdown error (#271)

* ignore writer shutdown error

* cargo check

* Fix  bug in `swap_hash_join` (#278)

* Try and fix swap_hash_join

* Only swap projections when join does not have projections

* just backport upstream fix

* remove println

* Support Duration in min/max agg functions (#283) (apache#15310) v47

* Support Duration in min/max agg functions

* Attempt to fix build

* Attempt to fix build - Fix chrono version

* Revert "Attempt to fix build - Fix chrono version"

This reverts commit fd76fe6.

* Revert "Attempt to fix build"

This reverts commit 9114b86.

---------

Co-authored-by: svranesevic <[email protected]>

* Fix panics in array_union (#287) (apache#15149) v48

* Drop rust-toolchain

* Fix panics in array_union

* Fix the chrono

* Backport `GroupsAccumulator` for Duration min/max agg (#288) (apache#15322) v47

* Fix array_sort for empty record batch (#290) (apache#15149) v48

* fix: rewrite fetch, skip of the Limit node in correct order (apache#14496) v46

* fix: rewrite fetch, skip of the Limit node in correct order

* style: fix clippy

* Support aliases in ConstEvaluator (apache#14734) (#281) v46

* Support aliases in ConstEvaluator (apache#14734)

Not sure why they are not supported. It seems that if we're not careful,
some transformations can introduce aliases nested inside other expressions.

* Format Cargo.toml

* Preserve the name of grouping sets in SimplifyExpressions (#282) (apache#14888) v46

Whenever we use `recompute_schema` or `with_exprs_and_inputs`,
this ensures that we obtain the same schema.

* Support Duration in min/max agg functions (#284) (apache#15310) v47

Co-authored-by: svranesevic <[email protected]>

* fix case_column_or_null with nullable when conditions (apache#13886) v45

* fix case_column_or_null with nullable when conditions

* improve sqllogictests for case_column_or_null

---------

Co-authored-by: zhangli20 <[email protected]>

* Fix casewhen (apache#14156) v45

* Cherry-pick topk limit pushdown fix (apache#14192) v45

* fix: FULL OUTER JOIN and LIMIT produces wrong results (apache#14338) v45

* fix: FULL OUTER JOIN and LIMIT produces wrong results

* Fix minor slt testing

* fix test

(cherry picked from commit ecc5694)

* Cherry-pick global limit fix (apache#14245) v45

* fix: Limits are not applied correctly (apache#14418) v46

* fix: Limits are not applied correctly

* Add easy fix

* Add fix

* Add slt testing

* Address comments

* Disable grouping set in CSE

* Fix spm + limit (apache#14569) v46

* prost 0.13 / fix parquet dep

* Delete unreliable checks

* Segfault in ByteGroupValueBuilder (#294) (apache#15968) v50

* test to demonstrate segfault in ByteGroupValueBuilder

* check for offset overflow

* clippy

(cherry picked from commit 5bdaeaf)

* Update arrow dependency to include rowid (#295)

* Update arrow version

* Feat: Add fetch to CoalescePartitionsExec (apache#14499) (#298) v46

* add fetch info to CoalescePartitionsExec

* use Statistics with_fetch API on CoalescePartitionsExec

* check limit_reached only if fetch is assigned

Co-authored-by: mertak-synnada <[email protected]>

* Fix `CoalescePartitionsExec` proto serialization (apache#15824) (#299) v48

* add fetch to CoalescePartitionsExecNode

* gen proto code

* Add test

* fix

* fix build

* Fix test build

* remove comments

Co-authored-by: 张林伟 <[email protected]>

* Add JoinContext with JoinLeftData to TaskContext in HashJoinExec (#300)

* Add JoinContext with JoinLeftData to TaskContext in HashJoinExec

* Expose random state as const

* re-export ahash::RandomState

* JoinContext default impl

* Add debug log when setting join left data

* Update arrow version for not preserving dict_id (#303)

* Use partial aggregation schema for spilling to avoid column mismatch in GroupedHashAggregateStream (apache#13995) (#302) v45

* Refactor spill handling in GroupedHashAggregateStream to use partial aggregate schema

* Implement aggregate functions with spill handling in tests

* Add tests for aggregate functions with and without spill handling

* Move test related imports into mod test

* Rename spill pool test functions for clarity and consistency

* Refactor aggregate function imports to use fully qualified paths

* Remove outdated comments regarding input batch schema for spilling in GroupedHashAggregateStream

* Update aggregate test to use AVG instead of MAX

* assert spill count

* Refactor partial aggregate schema creation to use create_schema function

* Refactor partial aggregation schema creation and remove redundant function

* Remove unused import of Schema from arrow::datatypes in row_hash.rs

* move spill pool testing for aggregate functions to physical-plan/src/aggregates

* Use Arc::clone for schema references in aggregate functions

(cherry picked from commit 81b50c4)

Co-authored-by: kosiew <[email protected]>

* Update tag

* Push limits past windows (#337) (apache#17347) v50

* Restore old method for DQE

* feat(optimizer): Enable filter pushdown on window functions (apache#14026) v45

* Avoid Aliased Window Expr Enter Unreachable Code (apache#14109) v45

(cherry picked from commit fda500a)

* Use `Expr::qualified_name()` and `Column::new()` to extract partition keys from window and aggregate operators (#355) (apache#17757) v51

* Update PR template to be relevant to our fork

* Make limit pushdown work for SortPreservingMergeExec (apache#17893) (#361)

* re-publicise functions DQE relies on

* Handle columns in with_new_exprs with a Join (apache#15055) (#384)

apache#15055

* handle columns in with_new_exprs with Join

* test doesn't return result

* take join from result

* clippy

* make test fallible

* accept any pair of expression for new_on in with_new_exprs for Join

* use with_capacity

Co-authored-by: delamarch3 <[email protected]>

---------

Co-authored-by: Georgi Krastev <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
Co-authored-by: Dan Harris <[email protected]>
Co-authored-by: Faiaz Sanaulla <[email protected]>
Co-authored-by: Sava Vranešević <[email protected]>
Co-authored-by: svranesevic <[email protected]>
Co-authored-by: Yingwen <[email protected]>
Co-authored-by: Zhang Li <[email protected]>
Co-authored-by: zhangli20 <[email protected]>
Co-authored-by: Aleksey Kirilishin <[email protected]>
Co-authored-by: xudong.w <[email protected]>
Co-authored-by: Qi Zhu <[email protected]>
Co-authored-by: Martins Purins <[email protected]>
Co-authored-by: mertak-synnada <[email protected]>
Co-authored-by: 张林伟 <[email protected]>
Co-authored-by: kosiew <[email protected]>
Co-authored-by: nuno-faria <[email protected]>
Co-authored-by: Berkay Şahin <[email protected]>
Co-authored-by: Mason Hall <[email protected]>
Co-authored-by: delamarch3 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

logical-expr Logical plan and expressions

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: calling "with_new_exprs" on join after optimization unexpectedly fails

3 participants