Ensure Nested Loop Join output follows probe order #10651
pedroerp wants to merge 1 commit into facebookincubator:main
This pull request was exported from Phabricator. Differential Revision: D60685613
velox/exec/NestedLoopJoinProbe.cpp
Outdated
This will kill the performance. Can we accumulate buildRows as CopyRanges or toSourceRow and only call copy once per output batch?
Also, maybe we should consider merging the build-side vectors into one large vector; that way we would be able to wrap them in a dictionary. Is there any concern about this?
We discussed this offline. Depending on the size of the build side, it may be tricky to allocate a large enough contiguous block of memory. We decided to move on with this approach for now. I'll take a look at batching the copies like you suggested.
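To make the trade-off in this exchange concrete, here is a minimal sketch that uses plain `std::vector`s as stand-ins for Velox vectors. `DictionaryView`, `BuildRow`, and `gatherCopy` are hypothetical names chosen for illustration, not Velox APIs. A dictionary-style view stores only indices into one base buffer, so it cannot reference rows spread over several build vectors, while copying works for any number of sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a dictionary-encoded vector: no values are
// copied, only indices into a single shared base buffer.
struct DictionaryView {
  const std::vector<int64_t>* base;
  std::vector<std::size_t> indices;

  std::size_t size() const {
    return indices.size();
  }

  int64_t valueAt(std::size_t i) const {
    assert(i < indices.size() && indices[i] < base->size());
    return (*base)[indices[i]];
  }
};

// A (buildVectorIndex, rowIndex) pair identifying one build-side row.
using BuildRow = std::pair<std::size_t, std::size_t>;

// Copying fallback: gathers rows that may live in different build vectors.
std::vector<int64_t> gatherCopy(
    const std::vector<std::vector<int64_t>>& buildVectors,
    const std::vector<BuildRow>& rows) {
  std::vector<int64_t> out;
  out.reserve(rows.size());
  for (const BuildRow& row : rows) {
    out.push_back(buildVectors.at(row.first).at(row.second));
  }
  return out;
}
```

Merging all build vectors into one buffer would let every output use the index-only view, at the cost of the single large contiguous allocation mentioned above.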
4eea1c8 to
123b157
Compare
Just added the optimization suggested by @Yuhta. The code now keeps track of the ranges of rows from the build side to be copied to the output (merging them when possible), and then copies them to the output.
Another potential optimization: if an output buffer only needs copies from a single build buffer, we could create a dictionary instead. It's not clear how frequently this would happen, though, so it could be done as a follow-up.
We also validated that our internal Presto query, which was failing because of the issue described in prestodb/presto#22585, now works.
Cc: @amitkdutta @aditi-pandit
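The range tracking described in this comment can be sketched roughly as follows. This is an illustration only, not the code from the PR: `CopyRange`, `addBuildRow`, and `copyRanges` are hypothetical names, plain `std::vector<int>` stands in for Velox vectors, and a single source buffer is assumed (the operator would need one list of ranges per build vector).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical range descriptor: copy `count` rows starting at `sourceIndex`
// in a build vector to `targetIndex` in the output.
struct CopyRange {
  std::size_t sourceIndex;
  std::size_t targetIndex;
  std::size_t count;
};

// Records a single-row copy, extending the previous range when the new row
// is contiguous with it on both the source and the target side.
void addBuildRow(
    std::vector<CopyRange>& ranges,
    std::size_t sourceIndex,
    std::size_t targetIndex) {
  if (!ranges.empty()) {
    CopyRange& last = ranges.back();
    if (last.sourceIndex + last.count == sourceIndex &&
        last.targetIndex + last.count == targetIndex) {
      ++last.count;
      return;
    }
  }
  ranges.push_back(CopyRange{sourceIndex, targetIndex, 1});
}

// Applies all accumulated ranges in one pass, once per output batch.
void copyRanges(
    const std::vector<int>& source,
    const std::vector<CopyRange>& ranges,
    std::vector<int>& target) {
  for (const CopyRange& range : ranges) {
    assert(range.sourceIndex + range.count <= source.size());
    assert(range.targetIndex + range.count <= target.size());
    for (std::size_t i = 0; i < range.count; ++i) {
      target[range.targetIndex + i] = source[range.sourceIndex + i];
    }
  }
}
```

Recording build rows 2, 3, and 4 for output positions 0, 1, and 2 yields a single range of three rows, so the copy cost is paid per range rather than per row.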
123b157 to
33dcb48
Compare
velox/exec/NestedLoopJoinProbe.cpp
Outdated
Just moved this function up to make the code order more readable; no changes here.
velox/exec/NestedLoopJoinProbe.cpp
Outdated
Same for these two functions.
velox/exec/NestedLoopJoinProbe.cpp
Outdated
The actual changes start from here.
33dcb48 to
48c8e73
Compare
@pedroerp Thanks a lot, Pedro, for prioritizing and fixing this long-standing issue. Recently @kgpai made a PR to disable NLJ in the Prestissimo worker (prestodb/presto#23341). It would be great to run the e2e tests with Prestissimo to see if the queries work as expected. This would test this PR well and ensure that advancing Velox in PrestoDB won't create any unnecessary issues. CC: @aditi-pandit @majetideepak
@pedroerp: Agree. It would be great to re-enable NLJ and test all the queries in presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeGeneralQueries.java in https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1
aditi-pandit
left a comment
Thanks @pedroerp for the code and detailed comments.
Had some very minor comments.
Generally, HashJoin is used for predicates with equality. It might be better to use another comparison for more realistic use of NLJ.
velox/exec/NestedLoopJoinProbe.h
Outdated
Nit: This statement is confusing. The output follows the order of the "probe" side, right? Or do you mean something else?
Good catch, I indeed meant "probe" here. :)
velox/exec/NestedLoopJoinProbe.h
Outdated
Nit: grammar: "will collect all build matches at the end"
velox/exec/NestedLoopJoinProbe.cpp
Outdated
velox/exec/NestedLoopJoinProbe.cpp
Outdated
48c8e73 to
2a12284
Compare
mbasmanova
left a comment
It would be nice to update documentation in https://facebookincubator.github.io/velox/develop/operators.html#nestedloopjoinnode and in PlanNode.h
velox/exec/NestedLoopJoinProbe.cpp
Outdated
velox/exec/NestedLoopJoinProbe.cpp
Outdated
velox/exec/NestedLoopJoinProbe.cpp
Outdated
velox/exec/NestedLoopJoinProbe.h
Outdated
nit: drop "This class" and start with a verb "Implements ..."
velox/exec/NestedLoopJoinProbe.h
Outdated
Can this change be extracted into a separate PR?
Can we call loadedVectorShared once per column vs. once per column per row?
2a12284 to
f9c9ee3
Compare
Confirming that we ran the e2e tests in Meta's internal tests and they all pass with this PR.
f9c9ee3 to
190a041
Compare
velox/core/PlanNode.h
Outdated
It would be nice to also update documentation in https://facebookincubator.github.io/velox/develop/operators.html#nestedloopjoinnode
Should we clarify that the order is preserved only within a single thread of execution?
Thank you @mbasmanova @Yuhta and @aditi-pandit for the quick code review. All comments are addressed. Will merge it later today unless I hear any last objections :)
…#10651)
Summary: Pull Request resolved: facebookincubator#10651

To follow expectations from engines such as Presto, this changes the NLJ operator to emit output in the same order as probe inputs are read. This required changes to how the operator processes data internally (see the documentation added to the header file).

The main difference (other than a large code refactor) is that each probe record now needs to be processed entirely before we can move to the next; this means we can emit probe mismatches right away, in the right order.

As a downside, since we process probe records entirely, the output may contain records from multiple build vectors, so we can't just wrap them into dictionaries. We still produce dictionaries wrapped around probe columns, though.

Reviewed By: mbasmanova
Differential Revision: D60685613
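The processing order described in this summary can be illustrated with a small sketch. It is not the operator's implementation: `nestedLoopLeftJoin` and `JoinRow` are hypothetical names, integers stand in for rows, and a left join is assumed so that probe mismatches appear in the output. Following the review suggestion, the usage note below uses a non-equality condition.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// One output row: the probe value plus the matching build value, or
// std::nullopt when the probe row had no match (left join).
using JoinRow = std::pair<int, std::optional<int>>;

// Processes each probe row against all build batches before moving to the
// next probe row, so the output follows probe order.
std::vector<JoinRow> nestedLoopLeftJoin(
    const std::vector<int>& probe,
    const std::vector<std::vector<int>>& buildBatches,
    const std::function<bool(int, int)>& condition) {
  std::vector<JoinRow> output;
  for (int probeValue : probe) {
    bool matched = false;
    for (const std::vector<int>& batch : buildBatches) {
      for (int buildValue : batch) {
        if (condition(probeValue, buildValue)) {
          output.emplace_back(probeValue, buildValue);
          matched = true;
        }
      }
    }
    // The probe row is fully processed at this point, so a mismatch can be
    // emitted immediately, in the right position.
    if (!matched) {
      output.emplace_back(probeValue, std::nullopt);
    }
  }
  return output;
}
```

With probe rows `{5, 1, 3}`, build batches `{{2, 4}, {1}}`, and the condition `probe > build`, the rows for probe value 5 come first, then the unmatched row for 1, then the rows for 3. Note that the rows for a single probe value come from both build batches, which is why the build columns cannot simply be dictionary-wrapped.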
190a041 to
5ca9f68
Compare
This pull request has been merged in c0fa8f2.
Conbench analyzed the 1 benchmark run on commit. There were no benchmark performance regressions. 🎉 The full Conbench report has more details.
Thank you @amitkdutta!
Summary:
Recently, we found that the following benchmark SQL became much slower after we upgraded our system to the latest Velox.
```
select
ps_partkey,
sum(ps_supplycost * ps_availqty) as value
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
group by
ps_partkey having
sum(ps_supplycost * ps_availqty) > (
select
sum(ps_supplycost * ps_availqty) * 0.000001
from
partsupp,
supplier,
nation
where
ps_suppkey = s_suppkey
and s_nationkey = n_nationkey
and n_name = 'GERMANY'
)
order by
value desc;
```
Through flame-graph investigation, we noticed that `NestedLoopJoin::generateOutput` had become the bottleneck. So we went through the most recent changes to the NestedLoopJoin-related files and found this suspicious PR (#10651).
After reviewing the changes introduced by that PR, I found that it prepares a new output for each probe row even if the build table has only one row (we have a filter defined in our SQL). Even though each probe row uses only a small part of the output space, a new output is still created for the next probe row.
So this PR reuses the output across multiple probe rows and avoids unnecessary memory allocation for new outputs. With this PR, the benchmark SQL performance recovers.
Pull Request resolved: #12519
Reviewed By: Yuhta
Differential Revision: D71409408
Pulled By: xiaoxmeng
fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
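The idea behind the fix summarized above can be sketched as follows. This is a simplified illustration rather than the change made in #12519: `crossJoinBatched`, `OutputRow`, and `OutputBatch` are hypothetical names, the join is reduced to a cross join over integers, and `batchSize` stands in for the operator's output batch capacity.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using OutputRow = std::pair<int, int>; // (probe value, build value)
using OutputBatch = std::vector<OutputRow>;

// Cross-joins `probe` with `build`, packing rows from consecutive probe rows
// into the same batch and starting a new batch only when the current one
// already holds `batchSize` rows.
std::vector<OutputBatch> crossJoinBatched(
    const std::vector<int>& probe,
    const std::vector<int>& build,
    std::size_t batchSize) {
  assert(batchSize > 0);
  std::vector<OutputBatch> batches;
  for (int probeValue : probe) {
    for (int buildValue : build) {
      if (batches.empty() || batches.back().size() == batchSize) {
        batches.emplace_back();
        batches.back().reserve(batchSize);
      }
      batches.back().emplace_back(probeValue, buildValue);
    }
  }
  return batches;
}
```

With five probe rows, a single-row build side, and `batchSize` of 4, this allocates two batches. Allocating a fresh output for every probe row, as the summary describes for the regressed code path, would allocate five mostly empty ones. Probe order is still preserved because rows are appended in the order the probe rows are visited.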