Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

ThomasAWalker · 2024-09-03T08:14:43Z

Is your feature request related to a problem or challenge?

It would be nice if the optimiser could use hash joins when queries contain IS NOT DISTINCT FROM - with null_equals_null behaviour

e.g. as it is at the moment

SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label;

will use a hash join while

SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT DISTINCT FROM t2.label;

will use a nested loop

below

> CREATE table t1(label text, v1 int);
0 row(s) fetched. 
Elapsed 0.004 seconds.

> CREATE table t2(label text, v2 int);
0 row(s) fetched. 
Elapsed 0.002 seconds.

> EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label; 
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                           |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.label, t1.v1 * t2.v2                                                                            |
|               |   Inner Join: t1.label = t2.label                                                                              |
|               |     TableScan: t1 projection=[label, v1]                                                                       |
|               |     TableScan: t2 projection=[label, v2]                                                                       |
| physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@2 as t1.v1 * t2.v2]                                          |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                                  |
|               |     HashJoinExec: mode=Partitioned, join_type=Inner, on=[(label@0, label@0)], projection=[label@0, v1@1, v2@3] |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                              |
|               |         RepartitionExec: partitioning=Hash([label@0], 16), input_partitions=1                                  |
|               |           MemoryExec: partitions=1, partition_sizes=[0]                                                        |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                              |
|               |         RepartitionExec: partitioning=Hash([label@0], 16), input_partitions=1                                  |
|               |           MemoryExec: partitions=1, partition_sizes=[0]                                                        |
|               |                                                                                                                |
+---------------+----------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.004 seconds.

> EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT DISTINCT FROM t2.label; 
+---------------+--------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                 |
+---------------+--------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.label, t1.v1 * t2.v2                                                  |
|               |   Inner Join:  Filter: t1.label IS NOT DISTINCT FROM t2.label                        |
|               |     TableScan: t1 projection=[label, v1]                                             |
|               |     TableScan: t2 projection=[label, v2]                                             |
| physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@3 as t1.v1 * t2.v2]                |
|               |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1              |
|               |     NestedLoopJoinExec: join_type=Inner, filter=label@0 IS NOT DISTINCT FROM label@1 |
|               |       MemoryExec: partitions=1, partition_sizes=[0]                                  |
|               |       MemoryExec: partitions=1, partition_sizes=[0]                                  |
|               |                                                                                      |
+---------------+--------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.003 seconds.

Describe the solution you'd like

When optimising the logical plan if ExtractEquijoinPredicate encounters a join with no existing conditions in the "on" vector and no results for split_eq_and_noneq_join_predicate then do a similar check for "IS NOT DISTINCT FROM". If this returns some conditions then push these into the "on" vector and set null_equals_null to true.

We've got a change like this on our fork of DF. We also had to add a few checks to some optimiser rules to stop them rewriting joins without checking null_equals_null.

Describe alternatives you've considered

I don't know if this could potentially be left to a later optimisation pass.

Additional context

No response

The text was updated successfully, but these errors were encountered:

eejbyfeldt · 2024-09-13T15:14:10Z

I had a draft branch implementing this. I will go over it and put it up for review.

ThomasAWalker added the enhancement New feature or request label Sep 3, 2024

eejbyfeldt linked a pull request Sep 13, 2024 that will close this issue

feat: Support null safe equals in ExtractEquijoinPredicate #12458

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

ThomasAWalker commented Sep 3, 2024

eejbyfeldt commented Sep 13, 2024 •

edited

Loading

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

Comments

ThomasAWalker commented Sep 3, 2024

Is your feature request related to a problem or challenge?

Describe the solution you'd like

Describe alternatives you've considered

Additional context

eejbyfeldt commented Sep 13, 2024 • edited Loading

eejbyfeldt commented Sep 13, 2024 •

edited

Loading