Skip to content

Ensure Nested Loop Join output follows probe order#10651

Closed
pedroerp wants to merge 1 commit intofacebookincubator:mainfrom
pedroerp:export-D60685613
Closed

Ensure Nested Loop Join output follows probe order#10651
pedroerp wants to merge 1 commit intofacebookincubator:mainfrom
pedroerp:export-D60685613

Conversation

@pedroerp
Copy link
Copy Markdown
Contributor

@pedroerp pedroerp commented Aug 2, 2024

Summary:
To follow expectations from engines such as Presto, changing NLJ
operator to emit output in the same order as probe inputs are read. This
required us to make changes to how the operator processes data internally
(check the documentation added to the header file).
.
The main difference (other than a large code refactor), is the now each probe
record needs to be processed entirely before we can move to the next; this
means we can emit probe mismatches right away in the right order.
.
As a downside, since we process probe records entirely, the output may contain
records from multiple build vectors, so we can't just wrap them into
dictionaries. We still produce dictionaries wrapped around probe columns
though.

Differential Revision: D60685613

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 2, 2024
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

@netlify
Copy link
Copy Markdown

netlify bot commented Aug 2, 2024

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 5ca9f68
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66b1738ee3832a0008ca35de

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will kill the performance. Can we accumulate buildRows as CopyRanges or toSourceRow and only call copy once per output batch?

@Yuhta
Copy link
Copy Markdown
Contributor

Yuhta commented Aug 2, 2024

Also maybe we should consider merging build side vectors into one large vector, this way we would be able to wrap them in dictionary. Is there any concern about this?

@pedroerp
Copy link
Copy Markdown
Contributor Author

pedroerp commented Aug 2, 2024

Also maybe we should consider merging build side vectors into one large vector, this way we would be able to wrap them in dictionary. Is there any concern about this?

We discussed this offline. Depending on the size of the build it may be tricky to allocate a large enough contiguous memory location. We decided to move on with this approach for now. I'll take a look at batching the copies like you suggested.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

@pedroerp
Copy link
Copy Markdown
Contributor Author

pedroerp commented Aug 3, 2024

Just added the optimization suggested by @Yuhta . The code now keeps track of the ranges of rows from the build side to be copied to the output (merging them when possible), then copying them to the output using BaseVector::copyRanges() column-by-column.

Another potential optimization would be to figure out if only copies from the same build buffer are need for an output buffer, we could create a dictionary instead. But it's not clear how frequently this would happen. Could be done as a follow up.

We also validated that our internal Presto query failing because of the issue described in prestodb/presto#22585 now works. Cc: @amitkdutta @aditi-pandit

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just moved this function up to make the code order more readable; no changes here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for these two functions.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from now on the changes start

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

@amitkdutta
Copy link
Copy Markdown
Contributor

@pedroerp Thanks a lot Pedro for priortizing and fixing this long lasting issue. Recently @kgpai made a PR to disbale NLJ in Prestissimo worker (prestodb/presto#23341), it will be great to run the e2e tests with Prestissimo to see if queries are working expected. This will test this PR well and advancing velox in PrestoDb won't create any unnecessary issues. CC: @aditi-pandit @majetideepak

@aditi-pandit
Copy link
Copy Markdown
Collaborator

@pedroerp Thanks a lot Pedro for priortizing and fixing this long lasting issue. Recently @kgpai made a PR to disbale NLJ in Prestissimo worker (prestodb/presto#23341), it will be great to run the e2e tests with Prestissimo to see if queries are working expected. This will test this PR well and advancing velox in PrestoDb won't create any unnecessary issues. CC: @aditi-pandit @majetideepak

@pedroerp : Agree. Would be great to re-enable NLJ and test all the queries in presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeGeneralQueries.java in https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1

Copy link
Copy Markdown
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pedroerp for the code and detailed comments.

Had some very minor comments.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, HashJoin is used for predicates with equality. It might be better to use another comparison for more realistic use of NLJ.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : This statement is confusing. The output follows the order of the "probe" side right ? Or do you mean something else ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I indeed meant "probe" here. :)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : grammar "will collect all build matches at the end"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : const

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : const

Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to update documentation in https://facebookincubator.github.io/velox/develop/operators.html#nestedloopjoinnode and in PlanNode.h

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: input -> output ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: indice -> index ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: indice -> index ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: drop "This class" and start with a verb "Implements ..."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

build -> probe

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this change be extracted into a separate PR?

Can we call loadedVectorShared once per column vs. once per column per row?

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

@amitkdutta
Copy link
Copy Markdown
Contributor

@pedroerp Thanks a lot Pedro for priortizing and fixing this long lasting issue. Recently @kgpai made a PR to disbale NLJ in Prestissimo worker (prestodb/presto#23341), it will be great to run the e2e tests with Prestissimo to see if queries are working expected. This will test this PR well and advancing velox in PrestoDb won't create any unnecessary issues. CC: @aditi-pandit @majetideepak

@pedroerp : Agree. Would be great to re-enable NLJ and test all the queries in presto-native-execution/src/test/java/com/facebook/presto/nativeworker/AbstractTestNativeGeneralQueries.java in https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1

Confirming that we run the e2e tests in meta internal tests and they are all good in this PR.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to also update documentation in https://facebookincubator.github.io/velox/develop/operators.html#nestedloopjoinnode

Should we clarify that the order is preserved only within a single thread of execution?

@pedroerp
Copy link
Copy Markdown
Contributor Author

pedroerp commented Aug 6, 2024

Thank you @mbasmanova @Yuhta and @aditi-pandit for the quick code review. All comments are addressed. Will merge it later today unless I hear any last objections :)

…#10651)

Summary:
Pull Request resolved: facebookincubator#10651

To follow expectations from engines such as Presto, changing NLJ
operator to emit output in the same order as probe inputs are read. This
required us to make changes to how the operator processes data internally
(check the documentation added to the header file).
.
The main difference (other than a large code refactor), is the now each probe
record needs to be processed entirely before we can move to the next; this
means we can emit probe mismatches right away in the right order.
.
As a downside, since we process probe records entirely, the output may contain
records from multiple build vectors, so we can't just wrap them into
dictionaries. We still produce dictionaries wrapped around probe columns
though.

Reviewed By: mbasmanova

Differential Revision: D60685613
@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request was exported from Phabricator. Differential Revision: D60685613

Copy link
Copy Markdown
Collaborator

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pedroerp.

I don't think there is much use of NLJ in TPC-DS queries, but we will do a run and monitor any perf degradations.

#10343 describes the equivalent in Presto. We could consider implementing something like that in the future.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

This pull request has been merged in c0fa8f2.

@conbench-facebook
Copy link
Copy Markdown

Conbench analyzed the 1 benchmark run on commit c0fa8f23.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@amitkdutta
Copy link
Copy Markdown
Contributor

@aditi-pandit The pr for disabling NLJ has not been merged, and if this is ready its not required.

https://github.com/prestodb/presto/pull/23315/files#diff-a07cf9ace74d4a8c17fce624676de4770bbcf47547e69d35c70d12777d57d9a1

Would be good to add this in a subsequent Prestissimo PR upgrading Velox.

@kgpai @pedroerp

@pedroerp
Copy link
Copy Markdown
Contributor Author

pedroerp commented Aug 7, 2024

Thank you @amitkdutta !

facebook-github-bot pushed a commit that referenced this pull request Mar 31, 2025
Summary:
Recently, we found the following benchmark sql became much slower after we upgraded our system to latest velox.

```
select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > (
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )
order by
  value desc;
```

Through the flame-graph investigations, we noticed that ```NestedLoopJoin::generateOutput``` became the bottle-neck. So we went through the most recent changes to NestedLoopJoin relevant files and found this suspicious PR (#10651).

After reviewing the changes introduced by the PR, I find out that this PR prepares output for each probe row even if the build table has only one row (we have filter defined in our SQL). Even though each probe row only makes use of small part of the output space, a new output is still created for the next probe row.

So my PR here is trying to re-use the output acorss multiple probe rows and avoid un-necessary memory allocating for new output. And with my PR, the beanchmark SQL performance recovers.

Pull Request resolved: #12519

Reviewed By: Yuhta

Differential Revision: D71409408

Pulled By: xiaoxmeng

fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
zhouyuan pushed a commit to zhouyuan/velox that referenced this pull request Apr 9, 2025
…bator#12519)

Summary:
Recently, we found the following benchmark sql became much slower after we upgraded our system to latest velox.

```
select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > (
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )
order by
  value desc;
```

Through the flame-graph investigations, we noticed that ```NestedLoopJoin::generateOutput``` became the bottle-neck. So we went through the most recent changes to NestedLoopJoin relevant files and found this suspicious PR (facebookincubator#10651).

After reviewing the changes introduced by the PR, I find out that this PR prepares output for each probe row even if the build table has only one row (we have filter defined in our SQL). Even though each probe row only makes use of small part of the output space, a new output is still created for the next probe row.

So my PR here is trying to re-use the output acorss multiple probe rows and avoid un-necessary memory allocating for new output. And with my PR, the beanchmark SQL performance recovers.

Pull Request resolved: facebookincubator#12519

Reviewed By: Yuhta

Differential Revision: D71409408

Pulled By: xiaoxmeng

fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
zml1206 pushed a commit to zml1206/velox that referenced this pull request May 23, 2025
…bator#12519)

Summary:
Recently, we found the following benchmark sql became much slower after we upgraded our system to latest velox.

```
select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > (
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )
order by
  value desc;
```

Through the flame-graph investigations, we noticed that ```NestedLoopJoin::generateOutput``` became the bottle-neck. So we went through the most recent changes to NestedLoopJoin relevant files and found this suspicious PR (facebookincubator#10651).

After reviewing the changes introduced by the PR, I find out that this PR prepares output for each probe row even if the build table has only one row (we have filter defined in our SQL). Even though each probe row only makes use of small part of the output space, a new output is still created for the next probe row.

So my PR here is trying to re-use the output acorss multiple probe rows and avoid un-necessary memory allocating for new output. And with my PR, the beanchmark SQL performance recovers.

Pull Request resolved: facebookincubator#12519

Reviewed By: Yuhta

Differential Revision: D71409408

Pulled By: xiaoxmeng

fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants