Enable datafusion.optimizer.filter_null_join_keys by default #12369

Dandandan · 2024-09-07T07:57:50Z

Which issue does this PR close?

Rationale for this change

It is better to enable this by default: filtering out null join keys below the join generally saves more time than is saved by skipping the filter.
Besides that, in many cases the filter can be pushed down further to the scan, or might enable other optimizations (now or in the future):

Benefits: the input to the join is smaller, so the build is faster, no nulls need to be hashed, there is a lower chance of data skew, a different join can be planned, downstream kernels are faster, the filter can possibly be pushed down into the scan, etc. In a distributed setting, it might save a lot of IO as well.

Downside: the cost of evaluating the expression plus copying overhead if it doesn't filter out many rows. This might be slower than executing the join without prefiltering on nulls.
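The trade-off can be illustrated with a minimal, std-only sketch. This is not DataFusion's implementation: `JoinRow`, `filter_null_keys`, and `hash_join` are hypothetical names, and `Option<i32>` stands in for a nullable key column. It only shows that an inner equi-join returns the same rows with or without the prefilter, while the prefilter shrinks both inputs at the cost of one copy.

```rust
use std::collections::HashMap;

/// (join key, payload); `None` stands in for SQL NULL. Hypothetical row type.
type JoinRow = (Option<i32>, &'static str);

/// Drops rows whose key is NULL. This is the extra copy the "Downside"
/// refers to: every surviving row is copied into a new vector.
fn filter_null_keys(rows: &[JoinRow]) -> Vec<JoinRow> {
    rows.iter().filter(|row| row.0.is_some()).copied().collect()
}

/// Inner hash join on the key, assuming NULL keys never match each other.
fn hash_join(build: &[JoinRow], probe: &[JoinRow]) -> Vec<(i32, &'static str, &'static str)> {
    // Build phase: NULL keys are skipped, but each row must still be inspected.
    let mut table: HashMap<i32, Vec<&'static str>> = HashMap::new();
    for (key, payload) in build {
        if let Some(k) = key {
            table.entry(*k).or_default().push(*payload);
        }
    }
    // Probe phase: look up every non-NULL probe key.
    let mut out = Vec::new();
    for (key, payload) in probe {
        if let Some(k) = key {
            if let Some(matches) = table.get(k) {
                for m in matches {
                    out.push((*k, *m, *payload));
                }
            }
        }
    }
    out
}
```

Calling `hash_join` on the raw inputs and on the outputs of `filter_null_keys` gives the same result, so the decision is purely about cost: the prefilter pays off when many keys are null or when it can be evaluated further down the plan.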

What changes are included in this PR?

Enable datafusion.optimizer.filter_null_join_keys by default and fix the affected tests / expected plans.

Are these changes tested?

Existing tests

Are there any user-facing changes?

Only slightly different plans.


Dandandan · 2024-09-07T15:26:05Z

datafusion/optimizer/tests/optimizer_integration.rs

@@ -281,11 +278,9 @@ fn test_same_name_but_not_ambiguous() {
    let expected = "LeftSemi Join: t1.col_int32 = t2.col_int32\
    \n  Aggregate: groupBy=[[t1.col_int32]], aggr=[[]]\
    \n    SubqueryAlias: t1\
-    \n      Filter: test.col_int32 IS NOT NULL\


This was actually a regression in https://github.com/apache/datafusion/pull/

the link seems to be incomplete

It's #12348. I think I'll move the fix out of this PR so it can be reviewed separately.

alamb

Do we have any benchmarks showing that pushing these into Filters is faster than evaluating them in joins?

I am surprised that the overhead of filtering (and thus copying) all non-null rows outweighs the benefits

If the filter was pushed all the way to the scan, I could see it potentially helping. However, I don't see any plans where the filter is actually pushed into a scan (perhaps because all the tests operate on CSV / MemTable which don't support filters)

alamb · 2024-09-09T13:21:59Z

datafusion/optimizer/tests/optimizer_integration.rs

@@ -281,11 +278,9 @@ fn test_same_name_but_not_ambiguous() {
    let expected = "LeftSemi Join: t1.col_int32 = t2.col_int32\
    \n  Aggregate: groupBy=[[t1.col_int32]], aggr=[[]]\
    \n    SubqueryAlias: t1\
-    \n      Filter: test.col_int32 IS NOT NULL\


the link seems to be incomplete

alamb · 2024-09-09T13:24:52Z

datafusion/sqllogictest/test_files/group_by.slt

-07)------------TableScan: sales_global projection=[zip_code, country, sn, ts, currency]
-08)----------SubqueryAlias: e
-09)------------TableScan: sales_global projection=[sn, ts, currency, amount]
+07)------------Filter: sales_global.currency IS NOT NULL


> Benefits: Input to join is smaller, so smaller input, faster build, no nulls need to be hashed

But in order to skip hashing nulls, the input array would have to be "filtered" (aka copy the matching rows)

> lower chance for data skew, other join can be planned, downstream kernels are faster, can possibly be pushed down into scan, etc. In distributed setting, might save a lot of IO as well.

The argument in the distributed setting makes sense to me, but the other ones seem like they are all of the class "faster in some cases but slower in others"

> But in order to skip hashing nulls, the input array would have to be "filtered" (aka copy the matching rows)

Correct, but you save some copying in RepartitionExec / build side concatenate as well, and copying / checking columns of keys in probe side.
In case there aren't any nulls (even if column is nullable), there is no copying happening.

Even with CSV / MemTable, in many cases the null filter can be combined with existing filter expressions, so no extra copying happens (less copying, in fact, as fewer rows need to be copied).
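A rough sketch of that last point, using only the standard library: when a filter already exists on the input, adding the null check to the same predicate keeps a single pass and can only reduce the number of rows copied. The row type, the `amount > 10` predicate, and the function names are assumptions made for illustration, not taken from DataFusion or from the plans in this PR.

```rust
/// (join key, amount); `None` stands in for SQL NULL. Hypothetical row type.
type SalesRow = (Option<i32>, i64);

/// Applies `pred` in one pass and returns the surviving rows together with
/// the number of rows that had to be copied into the output.
fn filter_counting<F: Fn(&SalesRow) -> bool>(rows: &[SalesRow], pred: F) -> (Vec<SalesRow>, usize) {
    let kept: Vec<SalesRow> = rows.iter().filter(|r| pred(*r)).copied().collect();
    let copied = kept.len();
    (kept, copied)
}

/// The pre-existing filter alone (assumed predicate: amount > 10).
fn existing_filter_only(rows: &[SalesRow]) -> (Vec<SalesRow>, usize) {
    filter_counting(rows, |r| r.1 > 10)
}

/// The same filter with the null-key check folded into the predicate:
/// still one pass, and the output is a subset of the output above.
fn combined_filter(rows: &[SalesRow]) -> (Vec<SalesRow>, usize) {
    filter_counting(rows, |r| r.1 > 10 && r.0.is_some())
}
```

Because the combined predicate is strictly more selective, `combined_filter` never copies more rows than `existing_filter_only`.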

Dandandan · 2024-09-09T14:11:36Z

> Do we have any benchmarks showing that pushing these into Filters is faster than evaluating them in joins?
>
> I am surprised that the overhead of filtering (and thus copying) all non-null rows outweighs the benefits
>
> If the filter was pushed all the way to the scan, I could see it potentially helping. However, I don't see any plans where the filter is actually pushed into a scan (perhaps because all the tests operate on CSV / MemTable which don't support filters)

Thanks, I'll try and see if there are any benchmarks. I tried TPC-H, but there aren't any nulls in it, so the benchmark results are unchanged (as expected).

Filter null keys by default

0f8e9a8

github-actions bot added the common Related to common crate label Sep 7, 2024

null_equals_null

1624740

github-actions bot added the optimizer Optimizer rules label Sep 7, 2024

Docs

3de1017

github-actions bot added the documentation Improvements or additions to documentation label Sep 7, 2024

Dandandan added 3 commits September 7, 2024 10:24

Update filter_null_join_keys.rs

03784e2

Docs

b45a74f

:w:erge branch 'filter_null_join_keys' of github.com:Dandandan/arrow-…

ba73e8a

…datafusion into filter_null_join_keys

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 7, 2024

Dandandan added 4 commits September 7, 2024 15:45

Wip

df9e3db

Wip

dc876a9

WIP

5132ff8

Add constraints

245fc11

Dandandan changed the title ~~Filter null keys by default~~ Enable datafusion.optimizer.filter_null_join_keys by default Sep 7, 2024

Dandandan added 2 commits September 7, 2024 16:42

test failures

6b83f1a

Wip

8fa7295

Dandandan commented Sep 7, 2024


Dandandan added 8 commits September 7, 2024 17:29

Wip

0eca129

Wip

419165e

Wip

29f112a

Wip

30db2c9

Wip

b2c7412

Wip

001ad1a

Wip

835f94c

Wip

4427a9f

github-actions bot added the substrait label Sep 7, 2024

Dandandan added 2 commits September 7, 2024 21:15

Wipc

3645f41

Wip

54b344e

Dandandan marked this pull request as ready for review September 7, 2024 20:00

alamb reviewed Sep 9, 2024


Wip bench

f56293c

github-actions bot added the core Core DataFusion crate label Sep 9, 2024

alamb mentioned this pull request Sep 11, 2024

DataFusion weekly project plan (Andrew Lamb) - Sep 9, 2024 #12391

Closed

5 tasks
