Contributor

@peter-toth peter-toth commented Jan 8, 2026

What changes were proposed in this pull request?

Run NullPropagation after NOT IN subquery rewrite.

Why are the changes needed?

NOT IN subqueries like SELECT * FROM t1 WHERE c NOT IN (SELECT c FROM t2) are rewritten as a left anti join on t1.c = t2.c with an additional OR IsNull(t1.c = t2.c) condition, which prevents equi-join implementations from being used, so those joins end up as BroadcastNestedLoopJoin. When we know the columns can't be null, we can either drop those additional conditions during the subquery rewrite or call NullPropagation after the rewrite to simplify them to false. This PR does the latter.
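
As a rough illustration of why the extra condition matters, here is a minimal pure-Python model of SQL three-valued logic (not Spark code). `None` stands in for NULL, and the helper names `sql_eq`, `anti_join_condition`, and `not_in` are made up for this sketch. It models the rewritten condition t1.c = t2.c OR IsNull(t1.c = t2.c) and shows that the IsNull term can only fire when one side is NULL.

```python
from typing import List, Optional


def sql_eq(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """SQL-style equality: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def anti_join_condition(a: Optional[int], b: Optional[int]) -> bool:
    """Models `a = b OR IsNull(a = b)` from the NOT IN rewrite (hypothetical helper)."""
    eq = sql_eq(a, b)
    return eq is True or eq is None


def not_in(left: List[Optional[int]], right: List[Optional[int]]) -> List[Optional[int]]:
    """Left anti join: keep a left value only if no right value satisfies the condition."""
    return [a for a in left if not any(anti_join_condition(a, b) for b in right)]


# With non-null values the IsNull term never fires, so the condition is plain equality.
values = range(4)
reduces_to_equality = all(
    anti_join_condition(a, b) == (a == b) for a in values for b in values
)

print(not_in([1, 2, 3], [2]))        # [1, 3]
print(not_in([1, 2, 3], [2, None]))  # [] (a NULL on the right removes every row)
print(not_in([1, None], [2]))        # [1] (a NULL on the left is filtered out)
print(reduces_to_equality)           # True
```

When neither side can be NULL, the condition is equivalent to a plain equality, which is the shape an equi join needs.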

Please note that #29104 already optimized single-column NOT IN subqueries from BroadcastNestedLoopJoin to a "null aware" BroadcastHashJoin, but when the columns are not nullable we can optimize multi-column cases as well, and the join doesn't need to be "null aware".
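
To illustrate the multi-column case, the following pure-Python sketch compares two ways of evaluating a two-column NOT IN. It is not Spark code: `nested_loop_anti_join` and `hash_anti_join` are hypothetical helpers, and the per-column shape of the condition, (a = c OR IsNull(a = c)) AND (b = d OR IsNull(b = d)), is an assumption about how the rewrite generalizes. The pairwise evaluation and a plain hash lookup on the key tuple agree when no key is NULL and can diverge as soon as one is.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], Optional[str]]


def nested_loop_anti_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Evaluates the null-aware condition for every (left, right) pair."""
    def cond(l: Row, r: Row) -> bool:
        # Per column: `x = y OR IsNull(x = y)`, combined with AND across columns.
        return all(x is None or y is None or x == y for x, y in zip(l, r))

    return [l for l in left if not any(cond(l, r) for r in right)]


def hash_anti_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Plain equi anti join: one hash lookup per left row, no NULL handling."""
    keys = set(right)
    return [l for l in left if l not in keys]


left = [(1, "a"), (2, "b"), (3, "c")]

# No NULLs anywhere: both strategies return the same rows.
right_non_null = [(2, "b"), (3, "x")]
print(nested_loop_anti_join(left, right_non_null))  # [(1, 'a'), (3, 'c')]
print(hash_anti_join(left, right_non_null))         # [(1, 'a'), (3, 'c')]

# A NULL in the right side: the plain hash lookup keeps a row it should drop.
right_with_null = [(2, "b"), (None, "c")]
print(nested_loop_anti_join(left, right_with_null))  # [(1, 'a')]
print(hash_anti_join(left, right_with_null))         # [(1, 'a'), (3, 'c')]
```

The second pair of results shows why the plain hash-based form is only safe once the columns are known to be non-nullable.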

Does this PR introduce any user-facing change?

Yes, performance improvement.

How was this patch tested?

A new UT was added and some existing tests were adjusted to keep their validity.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot commented Jan 8, 2026

JIRA Issue Information

=== Improvement SPARK-54972 ===
Summary: Improve not-in subqueries with non nullable columns
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@github-actions github-actions bot added the SQL label Jan 8, 2026
// positive not in subquery case
var joinExec = assertJoin((
- "select * from testData where key not in (select a from testData2)",
+ "select * from testData where key not in (select b from testData3)",
Contributor Author

@peter-toth peter-toth Jan 8, 2026

testData2 columns are not nullable, but we need a nullable column to keep this test valid (assert(joinExec.asInstanceOf[BroadcastHashJoinExec].isNullAwareAntiJoin)).
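
A toy model of the simplification may help explain why the test query had to change. The sketch below is plain Python, not Catalyst: the classes `Col`, `Eq`, `IsNull`, `Or` and the function `simplify` are hypothetical stand-ins, and treating `key` as non-nullable is an assumption made for the example. It also folds `x OR false` in the same function for brevity, which a real optimizer may handle in a separate rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str
    nullable: bool


@dataclass(frozen=True)
class Eq:
    left: Col
    right: Col

    @property
    def nullable(self) -> bool:
        # An equality can only evaluate to NULL if one of its inputs can be NULL.
        return self.left.nullable or self.right.nullable


@dataclass(frozen=True)
class IsNull:
    child: Eq


@dataclass(frozen=True)
class Or:
    left: object
    right: object


def simplify(expr):
    """Toy null propagation: IsNull over a non-nullable child becomes False."""
    if isinstance(expr, IsNull):
        return expr if expr.child.nullable else False
    if isinstance(expr, Or):
        l, r = simplify(expr.left), simplify(expr.right)
        if l is False:
            return r
        if r is False:
            return l
        return Or(l, r)
    return expr


key = Col("key", nullable=False)  # assumption: testData.key is non-nullable
a = Col("a", nullable=False)      # testData2 columns are not nullable
b = Col("b", nullable=True)       # a nullable column, as the adjusted test needs

# Non-nullable columns: the IsNull branch disappears and a plain equality remains.
print(simplify(Or(Eq(key, a), IsNull(Eq(key, a)))) == Eq(key, a))  # True

# Nullable column: the condition is left unchanged, so null-aware handling is still needed.
cond = Or(Eq(key, b), IsNull(Eq(key, b)))
print(simplify(cond) == cond)  # True
```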

@peter-toth
Contributor Author

cc @cloud-fan , @dongjoon-hyun

@peter-toth peter-toth force-pushed the SPARK-54972-improve-not-in-with-non-nullables branch from 11e8cf1 to 6961295 Compare January 8, 2026 19:25
Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

Thank you for pinging me, @peter-toth. The PR itself looks reasonable to me. Could you simply re-trigger the failed CI job? The failure looks unrelated to this PR.

- SPARK-47148: AQE should avoid to submit shuffle job on cancellation *** FAILED *** (6 seconds, 125 milliseconds)

@peter-toth peter-toth closed this in 8e63b61 Jan 9, 2026
@peter-toth
Contributor Author

peter-toth commented Jan 9, 2026

Thanks @dongjoon-hyun for the review.

Merged to master (4.2.0).

Yicong-Huang pushed a commit to Yicong-Huang/spark that referenced this pull request Jan 9, 2026
### What changes were proposed in this pull request?

Run `NullPropagation` after NOT IN subquery rewrite.

### Why are the changes needed?

NOT IN subqueries like `SELECT * FROM t1 WHERE c NOT IN (SELECT c FROM t2)` are rewritten as a left anti join on `t1.c = t2.c` with an additional `OR IsNull(t1.c = t2.c)` condition, which prevents equi-join implementations from being used, so those joins end up as `BroadcastNestedLoopJoin`. When we know the columns can't be null, we can either drop those additional conditions during the subquery rewrite or call `NullPropagation` after the rewrite to simplify them to `false`. This PR does the latter.

Please note that apache#29104 already optimized single-column NOT IN subqueries from `BroadcastNestedLoopJoin` to a "null aware" `BroadcastHashJoin`, but when the columns are not nullable we can optimize multi-column cases as well, and the join doesn't need to be "null aware".

### Does this PR introduce _any_ user-facing change?

Yes, performance improvement.

### How was this patch tested?

A new UT was added and some existing tests were adjusted to keep their validity.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#53733 from peter-toth/SPARK-54972-improve-not-in-with-non-nullables.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Peter Toth <[email protected]>