Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

Open
ThomasAWalker opened this issue Sep 3, 2024 · 1 comment · May be fixed by #12458
Open

Use hash join with null_equals_null = true for joins containing IS NOT DISTINCT FROM #12299

ThomasAWalker opened this issue Sep 3, 2024 · 1 comment · May be fixed by #12458
Labels
enhancement New feature or request

Comments

@ThomasAWalker
Copy link

Is your feature request related to a problem or challenge?

It would be nice if the optimiser could use hash joins when queries contain IS NOT DISTINCT FROM - with null_equals_null behaviour

e.g. as it is at the moment

SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label;

will use a hash join while

SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT DISTINCT FROM t2.label;

will use a nested loop

below

> CREATE table t1(label text, v1 int);
0 row(s) fetched. 
Elapsed 0.004 seconds.

> CREATE table t2(label text, v2 int);
0 row(s) fetched. 
Elapsed 0.002 seconds.

> EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label=t2.label; 
+---------------+----------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                           |
+---------------+----------------------------------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.label, t1.v1 * t2.v2                                                                            |
|               |   Inner Join: t1.label = t2.label                                                                              |
|               |     TableScan: t1 projection=[label, v1]                                                                       |
|               |     TableScan: t2 projection=[label, v2]                                                                       |
| physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@2 as t1.v1 * t2.v2]                                          |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                                  |
|               |     HashJoinExec: mode=Partitioned, join_type=Inner, on=[(label@0, label@0)], projection=[label@0, v1@1, v2@3] |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                              |
|               |         RepartitionExec: partitioning=Hash([label@0], 16), input_partitions=1                                  |
|               |           MemoryExec: partitions=1, partition_sizes=[0]                                                        |
|               |       CoalesceBatchesExec: target_batch_size=8192                                                              |
|               |         RepartitionExec: partitioning=Hash([label@0], 16), input_partitions=1                                  |
|               |           MemoryExec: partitions=1, partition_sizes=[0]                                                        |
|               |                                                                                                                |
+---------------+----------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.004 seconds.

> EXPLAIN SELECT t1.label, t1.v1*t2.v2 FROM t1 JOIN t2 ON t1.label IS NOT DISTINCT FROM t2.label; 
+---------------+--------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                 |
+---------------+--------------------------------------------------------------------------------------+
| logical_plan  | Projection: t1.label, t1.v1 * t2.v2                                                  |
|               |   Inner Join:  Filter: t1.label IS NOT DISTINCT FROM t2.label                        |
|               |     TableScan: t1 projection=[label, v1]                                             |
|               |     TableScan: t2 projection=[label, v2]                                             |
| physical_plan | ProjectionExec: expr=[label@0 as label, v1@1 * v2@3 as t1.v1 * t2.v2]                |
|               |   RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1              |
|               |     NestedLoopJoinExec: join_type=Inner, filter=label@0 IS NOT DISTINCT FROM label@1 |
|               |       MemoryExec: partitions=1, partition_sizes=[0]                                  |
|               |       MemoryExec: partitions=1, partition_sizes=[0]                                  |
|               |                                                                                      |
+---------------+--------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.003 seconds.

Describe the solution you'd like

When optimising the logical plan if ExtractEquijoinPredicate encounters a join with no existing conditions in the "on" vector and no results for split_eq_and_noneq_join_predicate then do a similar check for "IS NOT DISTINCT FROM". If this returns some conditions then push these into the "on" vector and set null_equals_null to true.

We've got a change like this on our fork of DF. We also had to add a few checks to some optimiser rules to stop them rewriting joins without checking null_equals_null.

Describe alternatives you've considered

I don't know if this could potentially be left to a later optimisation pass.

Additional context

No response

@ThomasAWalker ThomasAWalker added the enhancement New feature or request label Sep 3, 2024
@eejbyfeldt
Copy link
Contributor

eejbyfeldt commented Sep 13, 2024

I had a draft branch implementing this. I will go over it and put it up for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants