Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support null safe equals in ExtractEquijoinPredicate #12458

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

eejbyfeldt
Copy link
Contributor

Which issue does this PR close?

Closes #12299.

Rationale for this change

Allow more joins to be executed efficiently.

What changes are included in this PR?

Previously ExtractEquijoinPredicate did not inspect the value of nulls_equals_null for the join before extracting equi join predicates, which was likely a possible source of bugs. The new code will respect null_equals_null if set (and there is a equi join predicate) and if not it will first try to extra Operator::Eq predicates and if none is found it will try to extract Operator::IsNotDistinctFrom equi predicates.

Prev

Are these changes tested?

Yes, using existing and some new test cases.

Are there any user-facing changes?

For queries using IS NOT DISTINCT might have improved query plans.

@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Sep 13, 2024
@eejbyfeldt eejbyfeldt marked this pull request as ready for review September 14, 2024 14:23
Copy link
Contributor

@korowa korowa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @eejbyfeldt, this looks like a reasonable change.

I've left some comments, and it's probably also worth to check other logical optimizer rules, to ensure that if they somehow modify joins, it won't result in incorrect operator.

if join.filter.is_none() {
return Ok(Transformed::no(LogicalPlan::Join(join)));
}
let expr = join.filter.as_ref().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will let-else work here to avoid unwrap?

) -> Result<(Vec<EquijoinPredicate>, Option<Expr>)> {
let exprs = split_conjunction_owned(filter);
let exprs = split_conjunction(filter);
Copy link
Contributor

@korowa korowa Sep 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: this part (along with cloning exprs) in elses of this function was changed in #10165 as a part of "reducing expr cloning in optimizer" initiative, so, probably, in this place it's preferred to avoid cloning, unless it's required.

@@ -96,6 +96,7 @@ impl OptimizerRule for EliminateCrossJoin {
filter.input.as_ref(),
LogicalPlan::Join(Join {
join_type: JoinType::Inner,
null_equals_null: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rule also has it's own logic for identifying join keys in extract_possible_join_keys function. Shouldn't is not distinct from operator be supported there (perhaps by reusing logic from extract_equijoin_predicated_rule)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, seems like we should probably unify that logic before making this change.

@@ -96,6 +96,7 @@ impl OptimizerRule for EliminateCrossJoin {
filter.input.as_ref(),
LogicalPlan::Join(Join {
join_type: JoinType::Inner,
null_equals_null: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And also, not sure that this rule shouldn't have any interactions with null_equals_null = true joins -- what is the current behavior of such queries and won't this additional filter change it drastically?


# IS NOT DISTRINCT can be transformed into equijoin
query TT
EXPLAIN SELECT t1_id, t1_int, t2_int FROM t1 JOIN t2 ON t1_id IS NOT DISTINCT from t2_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be useful to also have the following test cases

  • demonstrating that equality predicates have priority over "is not distinct from" when picking ON conditions from the filter
  • how these joins will work with eliminate_cross_join rule (which if I'm not mistaken also works for filter over inner joins event with ON conditions)
  • how these join will be treated by push_down_filter -- in some cases it seems to be able to push filters into ON conditions -- if so, resulting query won't be correct in case of null_equals_null joins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestions for:

how these join will be treated by push_down_filter -- in some cases it seems to be able to push filters into ON conditions -- if so, resulting query won't be correct in case of null_equals_null joins

I thought these would be pushed into the filter and not the on (https://github.com/eejbyfeldt/datafusion/blob/1d4cf53169fd9a611a7e24e9ac4880d3e24b131e/datafusion/expr/src/logical_plan/plan.rs#L2965-L2968) and therefore still be correct.

@eejbyfeldt eejbyfeldt marked this pull request as draft September 16, 2024 17:43
@eejbyfeldt
Copy link
Contributor Author

Converted to draft until I have time to address the feedback and look more into if the requires changes in other rules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM
2 participants