fix: Re-use output across probe rows for NestedLoopJoin#12519
fix: Re-use output across probe rows for NestedLoopJoin#12519iamorchid wants to merge 2 commits intofacebookincubator:mainfrom
Conversation
|
Hi @iamorchid! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
✅ Deploy Preview for meta-velox canceled.
|
|
@pedroerp please help take a review at this PR, thanks |
|
@mbasmanova can you also help take a review at this PR ? |
pedroerp
left a comment
There was a problem hiding this comment.
A few small comments, but overall looks good to me. Thanks for the optimization!
Have you tried to run join fuzzer for a while a check if this not trigger any issues?
velox/exec/NestedLoopJoinProbe.cpp
Outdated
There was a problem hiding this comment.
for readability, could you wrap this logic in a helper method?
if (!readyToProduceOutput()) {
return nullptr;
}
output_->resize(numOutputRows_);
return std::move(output_);
or something similar
There was a problem hiding this comment.
It's a good advice to define readyToProduceOutput.
velox/exec/NestedLoopJoinProbe.cpp
Outdated
There was a problem hiding this comment.
we should probably add something in the method's documentation that is the caller's responsibility to resize the output now.
yes, the join fuzzer test passed. |
|
@mbasmanova @pedroerp can you help merge the PR if it looks ok ? |
|
@pedroerp can you help merge this PR if it looks OK ? |
|
@mbasmanova can you help merge this PR if it looks good ? |
|
@pedroerp Pedro, would you take another look? |
pedroerp
left a comment
There was a problem hiding this comment.
Looks good to me. Thank you for the PR!
|
@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
Hi @iamorchid, could you please rebase the PR onto the latest main? That's needed for merging it. Thanks! |
sure, updated |
|
@kagamiori Hi, wei. Is it OK to merge it now ? |
|
Hi @iamorchid, some unit tests fail internally. Could you please take a look? Also, |
|
@kagamiori sorry Wei, the UT break is my fault (forget to re-run UTs with changes to comments). I've checked that all relavent UTs passed with my second commit. |
|
@pedroerp we noted the issue in our TPCDS benchmark as well. We may need to go through every operator and make sure the output rowvector big enough. |
|
@kagamiori Is it OK to merge this PR now ? |
|
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
|
@xiaoxmeng merged this pull request in 96c011d. |
…bator#12519) Summary: Recently, we found the following benchmark sql became much slower after we upgraded our system to latest velox. ``` select ps_partkey, sum(ps_supplycost * ps_availqty) as value from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' group by ps_partkey having sum(ps_supplycost * ps_availqty) > ( select sum(ps_supplycost * ps_availqty) * 0.000001 from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' ) order by value desc; ``` Through the flame-graph investigations, we noticed that ```NestedLoopJoin::generateOutput``` became the bottle-neck. So we went through the most recent changes to NestedLoopJoin relevant files and found this suspicious PR (facebookincubator#10651). After reviewing the changes introduced by the PR, I find out that this PR prepares output for each probe row even if the build table has only one row (we have filter defined in our SQL). Even though each probe row only makes use of small part of the output space, a new output is still created for the next probe row. So my PR here is trying to re-use the output acorss multiple probe rows and avoid un-necessary memory allocating for new output. And with my PR, the beanchmark SQL performance recovers. Pull Request resolved: facebookincubator#12519 Reviewed By: Yuhta Differential Revision: D71409408 Pulled By: xiaoxmeng fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
…bator#12519) Summary: Recently, we found the following benchmark sql became much slower after we upgraded our system to latest velox. ``` select ps_partkey, sum(ps_supplycost * ps_availqty) as value from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' group by ps_partkey having sum(ps_supplycost * ps_availqty) > ( select sum(ps_supplycost * ps_availqty) * 0.000001 from partsupp, supplier, nation where ps_suppkey = s_suppkey and s_nationkey = n_nationkey and n_name = 'GERMANY' ) order by value desc; ``` Through the flame-graph investigations, we noticed that ```NestedLoopJoin::generateOutput``` became the bottle-neck. So we went through the most recent changes to NestedLoopJoin relevant files and found this suspicious PR (facebookincubator#10651). After reviewing the changes introduced by the PR, I find out that this PR prepares output for each probe row even if the build table has only one row (we have filter defined in our SQL). Even though each probe row only makes use of small part of the output space, a new output is still created for the next probe row. So my PR here is trying to re-use the output acorss multiple probe rows and avoid un-necessary memory allocating for new output. And with my PR, the beanchmark SQL performance recovers. Pull Request resolved: facebookincubator#12519 Reviewed By: Yuhta Differential Revision: D71409408 Pulled By: xiaoxmeng fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
Recently, we found the following benchmark sql became much slower after we upgraded our system to latest velox.
Through the flame-graph investigations, we noticed that
NestedLoopJoin::generateOutputbecame the bottle-neck. So we went through the most recent changes to NestedLoopJoin relevant files and found this suspicious PR (#10651).After reviewing the changes introduced by the PR, I find out that this PR prepares output for each probe row even if the build table has only one row (we have filter defined in our SQL). Even though each probe row only makes use of small part of the output space, a new output is still created for the next probe row.
So my PR here is trying to re-use the output acorss multiple probe rows and avoid un-necessary memory allocating for new output. And with my PR, the beanchmark SQL performance recovers.