Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support input reordering for NestedLoopJoinExec #9676

Merged
merged 8 commits into from
Apr 22, 2024

Conversation

korowa
Copy link
Contributor

@korowa korowa commented Mar 18, 2024

Which issue does this PR close?

Closes #8393.

Rationale for this change

Making plans, containing NestedLoopJoinExec optimizeable by

  1. adding them to join_selection rule of physical optimizer
  2. making NLJoin execution "type-agnostic" -- currently NLJoin build-side is chosen based on logical join type, so reordering join inputs won't help much without it

What changes are included in this PR?

  1. NestedLoopJoinExec covered by join_selection rule of physical optimizer
  2. NestedLoopJoinExec always picks left input as build side (which makes it consistent with e.g HashJoinExec operator), and reuses utility functions for other join implementations.

Are these changes tested?

Added tests for physical optimizer + NLJoinExec added to join_fuzz tests

Are there any user-facing changes?

In case both inputs have proper statistics, physical optimizer now picks build side properly. In addition, now there is an option to disable join_selection rule, and manually specify required join order.

@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 18, 2024
@korowa
Copy link
Contributor Author

korowa commented Mar 18, 2024

Regarding benchmarks -- this PR affects q11 and q22 in tpch, but the results differ much for tpch and tpch_mem (tpch_mem statistics estimations differ from ones in tpch over parquet files):

tpch_mem
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   master ┃ nl_join_reorder ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 11    │  40.20ms │         60.53ms │  1.51x slower │
│ QQuery 22    │  39.90ms │         48.82ms │  1.22x slower │
└──────────────┴──────────┴─────────────────┴───────────────┘
tpch
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   master ┃ nl_join_reorder ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 11    │  65.86ms │         37.58ms │ +1.75x faster │
│ QQuery 22    │  69.01ms │         72.40ms │     no change │
└──────────────┴──────────┴─────────────────┴───────────────┘

@Dandandan
Copy link
Contributor

let's run /benchmark

Copy link

Benchmark results

Benchmarks comparing 35ff7a6 (main) and 2da33f4 (PR)
Comparing 35ff7a6 and 2da33f4
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  35ff7a6 ┃  2da33f4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 435.69ms │ 452.72ms │     no change │
│ QQuery 2     │  57.86ms │  60.13ms │     no change │
│ QQuery 3     │ 145.82ms │ 145.12ms │     no change │
│ QQuery 4     │  85.83ms │  88.60ms │     no change │
│ QQuery 5     │ 202.61ms │ 203.92ms │     no change │
│ QQuery 6     │ 108.07ms │ 105.53ms │     no change │
│ QQuery 7     │ 286.26ms │ 281.52ms │     no change │
│ QQuery 8     │ 199.76ms │ 197.78ms │     no change │
│ QQuery 9     │ 306.42ms │ 308.94ms │     no change │
│ QQuery 10    │ 242.59ms │ 239.58ms │     no change │
│ QQuery 11    │  63.12ms │  42.20ms │ +1.50x faster │
│ QQuery 12    │ 123.95ms │ 127.16ms │     no change │
│ QQuery 13    │ 178.68ms │ 178.25ms │     no change │
│ QQuery 14    │ 130.45ms │ 131.45ms │     no change │
│ QQuery 15    │ 194.89ms │ 189.04ms │     no change │
│ QQuery 16    │  51.72ms │  50.79ms │     no change │
│ QQuery 17    │ 328.88ms │ 311.22ms │ +1.06x faster │
│ QQuery 18    │ 454.21ms │ 440.76ms │     no change │
│ QQuery 19    │ 234.51ms │ 232.58ms │     no change │
│ QQuery 20    │ 198.28ms │ 194.71ms │     no change │
│ QQuery 21    │ 329.77ms │ 324.89ms │     no change │
│ QQuery 22    │  53.36ms │  72.41ms │  1.36x slower │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (35ff7a6)   │ 4412.72ms │
│ Total Time (2da33f4)   │ 4379.31ms │
│ Average Time (35ff7a6) │  200.58ms │
│ Average Time (2da33f4) │  199.06ms │
│ Queries Faster         │         2 │
│ Queries Slower         │         1 │
│ Queries with No Change │        19 │
└────────────────────────┴───────────┘

--SortExec: TopK(fetch=10), expr=[value@1 DESC]
----ProjectionExec: expr=[ps_partkey@0 as ps_partkey, SUM(partsupp.ps_supplycost * partsupp.ps_availqty)@1 as value]
------NestedLoopJoinExec: join_type=Inner, filter=CAST(SUM(partsupp.ps_supplycost * partsupp.ps_availqty)@0 AS Decimal128(38, 15)) > SUM(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)@1
--------CoalescePartitionsExec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does seem to use run less in parallel?

Copy link
Contributor Author

@korowa korowa Mar 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, inner (always left) input must be collected into single partition, and outer (always right) input is executed in parallel. In case reordering is not happening due to lacking / misestimated statistics -- it'll also cause lack of parallelism.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case reordering is not happening due to lacking / misestimated statistics -- it'll also cause lack of parallelism.

should_swap_join_order is based on total_byte_size. It would become Precision::Absent after some operator like AggregateExec and ProjectionExec. I don't know whether it is a good idea to add a check by num_rows.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It already falls back to num_rows if any of join inputs doesn't provide bytes statistics.

--------------------HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(c_custkey@0, o_custkey@0)], projection=[c_phone@1, c_acctbal@2]
----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
------------------NestedLoopJoinExec: join_type=Inner, filter=CAST(c_acctbal@0 AS Decimal128(19, 6)) > AVG(customer.c_acctbal)@1
--------------------CoalescePartitionsExec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here as well. before we had no CoalescePartitionsExec here so from the plan it looks like this join was using more parallelism before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above -- the reason is that currently result of final aggregation (estimated as single row), cannot be reordered with result of partitioned aggregation (abscent statistics)

@Dandandan
Copy link
Contributor

QQuery 22

Here, QQuery 22 seems to run slower(1.36x slower) for tpch (non-memory) as well.

@korowa
Copy link
Contributor Author

korowa commented Mar 19, 2024

Here, QQuery 22 seems to run slower(1.36x slower) for tpch (non-memory) as well.

🤔 I'll recheck -- I've obtained my results from older version, and maybe statistics output has been affected by some commit since then

@korowa
Copy link
Contributor Author

korowa commented Mar 20, 2024

I'll recheck

Well, seems that I've screwed up while benchmarking locally as there is no significant runtime diff, and results are consistent with ones produced by GH action.

@Dandandan
Copy link
Contributor

/benchmark

@korowa
Copy link
Contributor Author

korowa commented Mar 31, 2024

Update: running tpch (parquet) on merged master with Semi/Anti join stats doesn't produce performance regressions for tpch anymore.

@metesynnada
Copy link
Contributor

/benchmark

Copy link

github-actions bot commented Apr 5, 2024

Benchmark results

Benchmarks comparing 2dad904 (main) and bdd5905 (PR)
Comparing 2dad904 and bdd5905
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  2dad904 ┃  bdd5905 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 289.48ms │ 291.06ms │     no change │
│ QQuery 2     │  41.13ms │  42.37ms │     no change │
│ QQuery 3     │  59.08ms │  61.88ms │     no change │
│ QQuery 4     │  77.32ms │  82.71ms │  1.07x slower │
│ QQuery 5     │ 128.18ms │ 109.04ms │ +1.18x faster │
│ QQuery 6     │  21.50ms │  16.55ms │ +1.30x faster │
│ QQuery 7     │ 232.67ms │ 250.17ms │  1.08x slower │
│ QQuery 8     │  44.14ms │  46.56ms │  1.05x slower │
│ QQuery 9     │ 123.62ms │ 126.51ms │     no change │
│ QQuery 10    │ 116.73ms │ 114.69ms │     no change │
│ QQuery 11    │  46.64ms │  76.38ms │  1.64x slower │
│ QQuery 12    │  60.57ms │  61.19ms │     no change │
│ QQuery 13    │ 104.27ms │ 117.77ms │  1.13x slower │
│ QQuery 14    │  19.37ms │  19.86ms │     no change │
│ QQuery 15    │  31.96ms │  32.22ms │     no change │
│ QQuery 16    │  49.26ms │  47.73ms │     no change │
│ QQuery 17    │ 143.90ms │ 148.92ms │     no change │
│ QQuery 18    │ 575.82ms │ 599.09ms │     no change │
│ QQuery 19    │  65.20ms │  66.44ms │     no change │
│ QQuery 20    │ 118.11ms │ 131.08ms │  1.11x slower │
│ QQuery 21    │ 351.78ms │ 339.24ms │     no change │
│ QQuery 22    │  39.74ms │  30.63ms │ +1.30x faster │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (2dad904)   │ 2740.46ms │
│ Total Time (bdd5905)   │ 2812.10ms │
│ Average Time (2dad904) │  124.57ms │
│ Average Time (bdd5905) │  127.82ms │
│ Queries Faster         │         3 │
│ Queries Slower         │         6 │
│ Queries with No Change │        13 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  2dad904 ┃  bdd5905 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 435.28ms │ 433.46ms │     no change │
│ QQuery 2     │  57.60ms │  60.11ms │     no change │
│ QQuery 3     │ 145.37ms │ 145.74ms │     no change │
│ QQuery 4     │  91.09ms │  90.39ms │     no change │
│ QQuery 5     │ 204.11ms │ 205.20ms │     no change │
│ QQuery 6     │ 108.91ms │ 108.76ms │     no change │
│ QQuery 7     │ 279.90ms │ 300.12ms │  1.07x slower │
│ QQuery 8     │ 195.70ms │ 202.59ms │     no change │
│ QQuery 9     │ 302.57ms │ 295.42ms │     no change │
│ QQuery 10    │ 234.85ms │ 242.60ms │     no change │
│ QQuery 11    │  63.57ms │  41.70ms │ +1.52x faster │
│ QQuery 12    │ 128.23ms │ 126.32ms │     no change │
│ QQuery 13    │ 182.22ms │ 187.13ms │     no change │
│ QQuery 14    │ 128.07ms │ 129.07ms │     no change │
│ QQuery 15    │ 194.21ms │ 197.92ms │     no change │
│ QQuery 16    │  50.48ms │  53.30ms │  1.06x slower │
│ QQuery 17    │ 308.33ms │ 319.43ms │     no change │
│ QQuery 18    │ 459.37ms │ 473.85ms │     no change │
│ QQuery 19    │ 235.09ms │ 232.97ms │     no change │
│ QQuery 20    │ 199.36ms │ 201.44ms │     no change │
│ QQuery 21    │ 335.03ms │ 335.01ms │     no change │
│ QQuery 22    │  55.36ms │  44.52ms │ +1.24x faster │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (2dad904)   │ 4394.71ms │
│ Total Time (bdd5905)   │ 4427.03ms │
│ Average Time (2dad904) │  199.76ms │
│ Average Time (bdd5905) │  201.23ms │
│ Queries Faster         │         2 │
│ Queries Slower         │         2 │
│ Queries with No Change │        18 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   2dad904 ┃   bdd5905 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 4261.78ms │ 4284.23ms │     no change │
│ QQuery 2     │  494.79ms │  537.10ms │  1.09x slower │
│ QQuery 3     │ 1750.07ms │ 1735.21ms │     no change │
│ QQuery 4     │  838.70ms │  832.06ms │     no change │
│ QQuery 5     │ 2240.24ms │ 2283.63ms │     no change │
│ QQuery 6     │ 1049.19ms │ 1046.87ms │     no change │
│ QQuery 7     │ 3777.45ms │ 3895.30ms │     no change │
│ QQuery 8     │ 2523.06ms │ 2522.64ms │     no change │
│ QQuery 9     │ 4261.89ms │ 4301.60ms │     no change │
│ QQuery 10    │ 2591.54ms │ 2610.27ms │     no change │
│ QQuery 11    │  572.57ms │  349.80ms │ +1.64x faster │
│ QQuery 12    │ 1221.10ms │ 1213.01ms │     no change │
│ QQuery 13    │ 2383.38ms │ 2394.47ms │     no change │
│ QQuery 14    │ 1286.08ms │ 1294.95ms │     no change │
│ QQuery 15    │ 1986.21ms │ 1981.72ms │     no change │
│ QQuery 16    │  520.46ms │  529.13ms │     no change │
│ QQuery 17    │ 5259.84ms │ 5349.47ms │     no change │
│ QQuery 18    │ 6952.12ms │ 7205.29ms │     no change │
│ QQuery 19    │ 2252.07ms │ 2239.19ms │     no change │
│ QQuery 20    │ 2642.42ms │ 2671.71ms │     no change │
│ QQuery 21    │ 4695.84ms │ 4525.04ms │     no change │
│ QQuery 22    │  573.27ms │  471.46ms │ +1.22x faster │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (2dad904)   │ 54134.08ms │
│ Total Time (bdd5905)   │ 54274.15ms │
│ Average Time (2dad904) │  2460.64ms │
│ Average Time (bdd5905) │  2467.01ms │
│ Queries Faster         │          2 │
│ Queries Slower         │          1 │
│ Queries with No Change │         19 │
└────────────────────────┴────────────┘

@alamb
Copy link
Contributor

alamb commented Apr 13, 2024

What is the current status of this PR? Is it ready to go?

@korowa
Copy link
Contributor Author

korowa commented Apr 14, 2024

What is the current status of this PR? Is it ready to go?

Join behavior is now consistent with HJ, and it doesn't introduce any performance regressions for tpch. The only issue is incorrect left/right placing in tpch_mem q11 caused by difference in parquet & memory table statistics content -- I don't think it is relevant to this PR, so I suppose this PR to be "ready for review".

(Some clarification regarding "I don't think it is relevant to this PR" -- even with inputs misplacing due to incorrect/absent statistics, this PR gives an option to disable optimizer rule and specify join inputs as required -- this option is not available in current NLJ implementation, as build-side is picked based on logical join type)

@alamb
Copy link
Contributor

alamb commented Apr 17, 2024

Ok, thanks @korowa -- I will try and find time to review it over the next day or so

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much @korowa -- this looks like a very nice improvement to the NestedLoopsJoinExec. I am sorry it took so long to find time to review it (I should have known reviewing this would be straightforward given your past history of writing well documented and reviewed code 🙏 )

I think the unit test (big_col > small_col vs big_col > big_col) is worth double checking but otherwise I think this PR is good to go.

Thanks again

datafusion/core/src/physical_optimizer/join_selection.rs Outdated Show resolved Hide resolved
@@ -73,7 +79,7 @@ async fn test_full_join_1k() {
}

#[tokio::test]
async fn test_semi_join_1k() {
async fn test_semi_join_10k() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a nice drive by cleanup

datafusion/physical-plan/src/joins/nested_loop_join.rs Outdated Show resolved Hide resolved
/// This step is also executed in parallel (once per probe input partition), and to avoid
/// duplicate output of unmatched data (due to shared nature build-side data), each thread
/// "reports" about probe phase completion (which means that "visited" bitmap won't be
/// updated anymore), and only the last thread, reporting about completion, will return output.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the documentation updates 👍

@alamb alamb merged commit 8f8e105 into apache:main Apr 22, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 22, 2024

Thanks again @korowa

@Dandandan
Copy link
Contributor

Thank you @korowa 🙏

ccciudatu pushed a commit to hstack/arrow-datafusion that referenced this pull request Apr 26, 2024
* support input reordering for NestedLoopJoinExec

* renamed variables and struct fields

* fixed nl join filter expression in tests

* Update datafusion/physical-plan/src/joins/nested_loop_join.rs

Co-authored-by: Andrew Lamb <[email protected]>

* typo fixed

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Range/inequality joins are slow
5 participants