Ensure Nested Loop Join output follows probe order by pedroerp · Pull Request #10651 · facebookincubator/velox

pedroerp · 2024-08-02T19:50:18Z

Summary:
To follow expectations from engines such as Presto, this changes the NLJ
operator to emit output in the same order as probe inputs are read. This
required changes to how the operator processes data internally (see the
documentation added to the header file).

The main difference (other than a large code refactor) is that each probe
record now needs to be processed entirely before we can move to the next; this
means we can emit probe mismatches right away, in the right order.

As a downside, since we process probe records entirely, the output may contain
records from multiple build vectors, so we can't just wrap them into
dictionaries. We still produce dictionaries wrapped around probe columns,
though.
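
As a rough illustration of the ordering described above, the sketch below joins plain `std::vector`s rather than Velox vectors. The names `OutputRow` and `probeOrderLeftJoin`, and the use of `std::optional` to mark a probe mismatch, are assumptions made for this example; it is not the operator's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// One output row: the probe row index plus the matching build position
// (build batch index, row within that batch). nullopt marks a mismatch.
struct OutputRow {
  std::size_t probeRow;
  std::optional<std::pair<std::size_t, std::size_t>> build;
};

// Left nested loop join over a build side stored as several batches.
// Each probe row is compared against every build batch before the next
// probe row is touched, so output follows probe order and a mismatch can
// be emitted as soon as the last build batch has been scanned.
std::vector<OutputRow> probeOrderLeftJoin(
    const std::vector<int>& probe,
    const std::vector<std::vector<int>>& buildBatches,
    const std::function<bool(int, int)>& condition) {
  std::vector<OutputRow> output;
  for (std::size_t p = 0; p < probe.size(); ++p) {
    bool matched = false;
    for (std::size_t b = 0; b < buildBatches.size(); ++b) {
      for (std::size_t r = 0; r < buildBatches[b].size(); ++r) {
        if (condition(probe[p], buildBatches[b][r])) {
          output.push_back({p, std::make_pair(b, r)});
          matched = true;
        }
      }
    }
    if (!matched) {
      output.push_back({p, std::nullopt});
    }
  }
  // A left join emits at least one row per probe row.
  assert(output.size() >= probe.size());
  return output;
}
```

Note that the matches for a single probe row can point into several build batches, which is why the build columns of an output batch cannot simply be one dictionary over one build vector.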

Differential Revision: D60685613

facebook-github-bot · 2024-08-02T19:50:37Z

This pull request was exported from Phabricator. Differential Revision: D60685613

netlify · 2024-08-02T19:50:37Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`5ca9f68`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/66b1738ee3832a0008ca35de

Yuhta · 2024-08-02T20:17:16Z

velox/exec/NestedLoopJoinProbe.cpp

This will kill the performance. Can we accumulate buildRows as CopyRanges or toSourceRow and only call copy once per output batch?

Yuhta · 2024-08-02T20:25:21Z

Also maybe we should consider merging build side vectors into one large vector, this way we would be able to wrap them in dictionary. Is there any concern about this?

pedroerp · 2024-08-02T23:48:46Z

> Also maybe we should consider merging build side vectors into one large vector, this way we would be able to wrap them in dictionary. Is there any concern about this?

We discussed this offline. Depending on the size of the build it may be tricky to allocate a large enough contiguous memory location. We decided to move on with this approach for now. I'll take a look at batching the copies like you suggested.

pedroerp · 2024-08-03T01:39:03Z

Just added the optimization suggested by @Yuhta. The code now keeps track of the ranges of build-side rows to be copied to the output (merging them when possible), then copies them to the output column by column using BaseVector::copyRanges().
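
A minimal sketch of the range merging described above, assuming a local `CopyRange` struct as a stand-in for whatever range type `BaseVector::copyRanges()` accepts. The field names and the helper `addBuildRow` are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a copy range: copy `count` rows starting at `sourceIndex`
// in a build vector to `targetIndex` in the output. Field names are
// assumptions for this sketch.
struct CopyRange {
  int32_t sourceIndex;
  int32_t targetIndex;
  int32_t count;
};

// Records that one build row should be copied to one output row. If the
// row extends the previous range on both the source and the target side,
// the previous range grows instead of a new one being appended.
void addBuildRow(
    std::vector<CopyRange>& ranges, int32_t sourceRow, int32_t targetRow) {
  assert(sourceRow >= 0 && targetRow >= 0);
  if (!ranges.empty()) {
    CopyRange& last = ranges.back();
    if (last.sourceIndex + last.count == sourceRow &&
        last.targetIndex + last.count == targetRow) {
      ++last.count;
      return;
    }
  }
  ranges.push_back({sourceRow, targetRow, 1});
}
```

Under this sketch one such list would be kept per build vector, so each output batch needs one copy call per column per contributing build vector rather than one per row.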

Another potential optimization would be to detect when an output buffer only needs copies from a single build buffer, in which case we could create a dictionary instead. It's not clear how frequently this would happen, though, so it could be done as a follow-up.
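
The follow-up idea above could look roughly like the check below. `BuildRowRef` and `dictionaryIndices` are hypothetical names for this sketch, and turning the returned indices into an actual dictionary vector is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Identifies one build row used by an output row: which build vector it
// lives in and its row number inside that vector.
struct BuildRowRef {
  std::size_t buildVector;
  int32_t row;
};

// Returns dictionary indices if every output row refers to the same build
// vector; returns nullopt if rows come from several build vectors (or if
// there are no rows), in which case the caller falls back to copying.
std::optional<std::vector<int32_t>> dictionaryIndices(
    const std::vector<BuildRowRef>& rows) {
  if (rows.empty()) {
    return std::nullopt;
  }
  std::vector<int32_t> indices;
  indices.reserve(rows.size());
  for (const auto& ref : rows) {
    if (ref.buildVector != rows.front().buildVector) {
      return std::nullopt;
    }
    assert(ref.row >= 0);
    indices.push_back(ref.row);
  }
  return indices;
}
```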

We also validated that our internal Presto query failing because of the issue described in prestodb/presto#22585 now works. Cc: @amitkdutta @aditi-pandit

pedroerp · 2024-08-03T01:42:09Z

velox/exec/NestedLoopJoinProbe.cpp

just moved this function up to make the code order more readable; no changes here.

pedroerp · 2024-08-03T01:42:28Z

velox/exec/NestedLoopJoinProbe.cpp

same for these two functions.

pedroerp · 2024-08-03T01:42:58Z

velox/exec/NestedLoopJoinProbe.cpp

from now on the changes start

amitkdutta · 2024-08-04T21:58:34Z

@pedroerp Thanks a lot, Pedro, for prioritizing and fixing this long-standing issue. Recently @kgpai made a PR to disable NLJ in the Prestissimo worker (prestodb/presto#23341). It would be great to run the e2e tests with Prestissimo to see if queries are working as expected. That would test this PR well, and advancing Velox in PrestoDB won't create any unnecessary issues. CC: @aditi-pandit @majetideepak

aditi-pandit · 2024-08-05T05:27:09Z

> @pedroerp Thanks a lot, Pedro, for prioritizing and fixing this long-standing issue. Recently @kgpai made a PR to disable NLJ in the Prestissimo worker (prestodb/presto#23341). It would be great to run the e2e tests with Prestissimo to see if queries are working as expected. That would test this PR well, and advancing Velox in PrestoDB won't create any unnecessary issues. CC: @aditi-pandit @majetideepak

@pedroerp : Agree. Would be great to re-enable NLJ and test all the queries in presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeGeneralQueries.java in https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1

aditi-pandit

Thanks @pedroerp for the code and detailed comments.

Had some very minor comments.

aditi-pandit · 2024-08-05T05:20:24Z

velox/exec/tests/NestedLoopJoinTest.cpp

Generally, HashJoin is used for predicates with equality. It might be better to use another comparison for more realistic use of NLJ.

aditi-pandit · 2024-08-05T05:38:28Z

velox/exec/NestedLoopJoinProbe.h

Nit: This statement is confusing. The output follows the order of the "probe" side, right? Or do you mean something else?

Good catch, I indeed meant "probe" here. :)

aditi-pandit · 2024-08-05T05:48:00Z

velox/exec/NestedLoopJoinProbe.h

Nit : grammar "will collect all build matches at the end"

aditi-pandit · 2024-08-05T07:07:04Z

velox/exec/NestedLoopJoinProbe.cpp

Nit : const

aditi-pandit · 2024-08-05T07:07:11Z

velox/exec/NestedLoopJoinProbe.cpp

Nit : const

mbasmanova

It would be nice to update documentation in https://facebookincubator.github.io/velox/develop/operators.html#nestedloopjoinnode and in PlanNode.h

mbasmanova · 2024-08-05T17:16:01Z

velox/exec/NestedLoopJoinProbe.cpp

typo: input -> output ?

mbasmanova · 2024-08-05T17:19:39Z

velox/exec/NestedLoopJoinProbe.cpp

typo: indice -> index ?

mbasmanova · 2024-08-05T17:19:49Z

velox/exec/NestedLoopJoinProbe.cpp

typo: indice -> index ?

mbasmanova · 2024-08-05T17:20:29Z

velox/exec/NestedLoopJoinProbe.h

nit: drop "This class" and start with a verb "Implements ..."

mbasmanova · 2024-08-05T17:20:51Z

velox/exec/NestedLoopJoinProbe.h

build -> probe

mbasmanova · 2024-08-05T17:27:39Z

velox/exec/tests/utils/QueryAssertions.cpp

Can this change be extracted into a separate PR?

Can we call loadedVectorShared once per column vs. once per column per row?

amitkdutta · 2024-08-05T21:40:08Z

> @pedroerp Thanks a lot, Pedro, for prioritizing and fixing this long-standing issue. Recently @kgpai made a PR to disable NLJ in the Prestissimo worker (prestodb/presto#23341). It would be great to run the e2e tests with Prestissimo to see if queries are working as expected. That would test this PR well, and advancing Velox in PrestoDB won't create any unnecessary issues. CC: @aditi-pandit @majetideepak
>
> @pedroerp : Agree. Would be great to re-enable NLJ and test all the queries in presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeGeneralQueries.java in https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1

Confirming that we ran the e2e tests in Meta-internal testing and they all pass with this PR.

mbasmanova · 2024-08-05T22:26:06Z

velox/core/PlanNode.h

It would be nice to also update documentation in https://facebookincubator.github.io/velox/develop/operators.html#nestedloopjoinnode

Should we clarify that the order is preserved only within a single thread of execution?

pedroerp · 2024-08-06T00:48:34Z

Thank you @mbasmanova @Yuhta and @aditi-pandit for the quick code review. All comments are addressed. Will merge it later today unless I hear any last objections :)

…#10651)

Summary:
Pull Request resolved: facebookincubator#10651

To follow expectations from engines such as Presto, this changes the NLJ operator to emit output in the same order as probe inputs are read. This required changes to how the operator processes data internally (see the documentation added to the header file).

The main difference (other than a large code refactor) is that each probe record now needs to be processed entirely before we can move to the next; this means we can emit probe mismatches right away, in the right order.

As a downside, since we process probe records entirely, the output may contain records from multiple build vectors, so we can't just wrap them into dictionaries. We still produce dictionaries wrapped around probe columns, though.

Reviewed By: mbasmanova

Differential Revision: D60685613

aditi-pandit

Thanks @pedroerp.

I don't think there is much use of NLJ in TPC-DS queries, but we will do a run and monitor any perf degradations.

#10343 describes the equivalent in Presto. We could consider implementing something like that in the future.

facebook-github-bot · 2024-08-06T06:30:27Z

This pull request has been merged in c0fa8f2.

conbench-facebook · 2024-08-06T07:07:10Z

Conbench analyzed the 1 benchmark run on commit c0fa8f23.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

amitkdutta · 2024-08-07T04:43:00Z

@aditi-pandit The PR for disabling NLJ has not been merged, and if this one is ready, it's not required.

https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1

Would be good to add this in a subsequent Prestissimo PR upgrading Velox.

@kgpai @pedroerp

Updated Velox and merged it in "[native] Advance velox." (prestodb/presto#23395). No errors in existing tests.
Copied tests from @aditi-pandit's "[Do not review][Native] Do not carry over partitioning in StreamPropertyDerivations for nested loop joins" (prestodb/presto#23315), which extensively tests this path, into "[native] Add e2e tests for corelated subqueries." (prestodb/presto#23396).

pedroerp · 2024-08-07T23:14:08Z

Thank you @amitkdutta !

Summary:
Recently, we found that the following benchmark SQL became much slower after we upgraded our system to the latest Velox.

```
select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from partsupp, supplier, nation
where ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by ps_partkey
having sum(ps_supplycost * ps_availqty) > (
  select sum(ps_supplycost * ps_availqty) * 0.000001
  from partsupp, supplier, nation
  where ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'GERMANY'
)
order by value desc;
```

Through flame-graph investigation, we noticed that `NestedLoopJoin::generateOutput` had become the bottleneck. So we went through the most recent changes to the NestedLoopJoin files and found this suspicious PR (#10651).

After reviewing the changes introduced by that PR, I found that it prepares a new output for each probe row even if the build table has only one row (we have a filter defined in our SQL). Even though each probe row only uses a small part of the output space, a new output is still created for the next probe row.

So my PR here re-uses the output across multiple probe rows and avoids unnecessary memory allocation for new outputs. With my PR, the benchmark SQL performance recovers.

Pull Request resolved: #12519
Reviewed By: Yuhta
Differential Revision: D71409408
Pulled By: xiaoxmeng
fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
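
To make the output re-use idea above concrete, here is a small sketch that accumulates rows from many probe rows into one batch and only starts a new batch when the current one is full. `OutputBatcher`, `Row`, and the fixed `capacity` are assumptions for this example; this is not the code from #12519.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Row = std::pair<int, int>; // (probe value, build value)

// Accumulates join output and only finishes a batch once it holds
// `capacity` rows, so that many probe rows with few matches each share
// one batch instead of each getting a freshly allocated one.
class OutputBatcher {
 public:
  explicit OutputBatcher(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
    current_.reserve(capacity_);
  }

  // Appends one output row; finishes the current batch when it is full.
  void add(Row row) {
    current_.push_back(row);
    if (current_.size() == capacity_) {
      flush();
    }
  }

  // Finishes a partially filled batch, for example at end of input.
  void flush() {
    if (!current_.empty()) {
      finished_.push_back(std::move(current_));
      current_.clear(); // bring the moved-from vector to a known state
      current_.reserve(capacity_);
    }
  }

  const std::vector<std::vector<Row>>& finished() const {
    return finished_;
  }

 private:
  std::size_t capacity_;
  std::vector<Row> current_;
  std::vector<std::vector<Row>> finished_;
};
```

With a capacity of 4 and six probe rows that each produce one match, this yields two batches (4 rows and 2 rows) rather than six single-row batches.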

facebook-github-bot added the CLA Signed label Aug 2, 2024

facebook-github-bot added the fb-exported label Aug 2, 2024

pedroerp requested review from aditi-pandit, karteekmurthys and mbasmanova August 2, 2024 19:53

Yuhta reviewed Aug 2, 2024

pedroerp force-pushed the export-D60685613 branch from 4eea1c8 to 123b157 Compare August 2, 2024 23:52

pedroerp force-pushed the export-D60685613 branch from 123b157 to 33dcb48 Compare August 3, 2024 01:40

pedroerp force-pushed the export-D60685613 branch from 33dcb48 to 48c8e73 Compare August 3, 2024 01:47

aditi-pandit reviewed Aug 5, 2024

pedroerp force-pushed the export-D60685613 branch from 48c8e73 to 2a12284 Compare August 5, 2024 17:26

mbasmanova reviewed Aug 5, 2024

pedroerp force-pushed the export-D60685613 branch from 2a12284 to f9c9ee3 Compare August 5, 2024 19:09

pedroerp force-pushed the export-D60685613 branch from f9c9ee3 to 190a041 Compare August 5, 2024 22:05

mbasmanova approved these changes Aug 5, 2024

pedroerp force-pushed the export-D60685613 branch from 190a041 to 5ca9f68 Compare August 6, 2024 00:51

aditi-pandit approved these changes Aug 6, 2024

facebook-github-bot closed this in c0fa8f2 Aug 6, 2024

facebook-github-bot added the Merged label Aug 6, 2024

amitkdutta mentioned this pull request Aug 7, 2024

[native] Add e2e tests for corelated subqueries. prestodb/presto#23396

Merged

iamorchid mentioned this pull request Mar 4, 2025

fix: Re-use output across probe rows for NestedLoopJoin #12519

Closed

zml1206 mentioned this pull request Jun 5, 2025

[VL] Add outputOrdering for VeloxBroadcastNestedLoopJoinExecTransformer apache/gluten#9872

Merged

zhangxffff mentioned this pull request Mar 17, 2026

[Bug] NestedLoopJoinProbe OOM with wide build-side rows bytedance/bolt#401

Closed
