fix: Re-use output across probe rows for NestedLoopJoin by iamorchid · Pull Request #12519 · facebookincubator/velox

iamorchid · 2025-03-04T03:37:09Z

Recently, we found that the following benchmark SQL became much slower after we upgraded our system to the latest Velox.

select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > ( 
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )   
order by
  value desc;

Through flame-graph investigation, we noticed that NestedLoopJoin::generateOutput had become the bottleneck. So we went through the most recent changes to the NestedLoopJoin-related files and found this suspicious PR (#10651).

After reviewing the changes introduced by that PR, I found that it prepares a new output for each probe row even if the build table has only one row (our SQL has a filter defined). Even though each probe row uses only a small part of the output space, a new output is still created for the next probe row.

So this PR re-uses the output across multiple probe rows and avoids unnecessary memory allocation for new output. With this PR, the benchmark SQL performance recovers.
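To illustrate the idea outside of Velox, here is a minimal, self-contained sketch that contrasts the two strategies on a toy cross join. Everything in it (`OutputBatch`, `JoinStats`, `joinPerProbeRow`, `joinReusingOutput`) is a hypothetical name made up for this example, not Velox code, and it only counts how many batches would be allocated.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output batch; not a Velox type.
using OutputBatch = std::vector<std::pair<int, int>>;

struct JoinStats {
  std::size_t batchesAllocated = 0;
  std::size_t rowsProduced = 0;
};

// Allocates a fresh batch for every probe row, even when that row
// fills only a small part of it.
JoinStats joinPerProbeRow(
    const std::vector<int>& probe,
    const std::vector<int>& build,
    std::size_t batchCapacity) {
  JoinStats stats;
  for (int p : probe) {
    OutputBatch batch;
    batch.reserve(batchCapacity);
    ++stats.batchesAllocated;
    for (int b : build) {
      if (batch.size() == batchCapacity) {
        // Batch is full: start a new one for the remaining build rows.
        batch = OutputBatch();
        batch.reserve(batchCapacity);
        ++stats.batchesAllocated;
      }
      batch.emplace_back(p, b);
      ++stats.rowsProduced;
    }
  }
  return stats;
}

// Keeps filling the same batch across probe rows and allocates a new
// one only when the current batch is full.
JoinStats joinReusingOutput(
    const std::vector<int>& probe,
    const std::vector<int>& build,
    std::size_t batchCapacity) {
  JoinStats stats;
  OutputBatch batch;
  bool hasBatch = false;
  for (int p : probe) {
    for (int b : build) {
      if (!hasBatch || batch.size() == batchCapacity) {
        batch = OutputBatch();
        batch.reserve(batchCapacity);
        ++stats.batchesAllocated;
        hasBatch = true;
      }
      batch.emplace_back(p, b);
      ++stats.rowsProduced;
    }
  }
  return stats;
}
```

With 1000 probe rows, a single build row, and a batch capacity of 1024, the first function allocates one batch per probe row while the second allocates a single batch for the same rows.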

facebook-github-bot · 2025-03-04T03:37:14Z

Hi @iamorchid!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

netlify · 2025-03-04T03:37:30Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`8baf921`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/67de7ab07623690008cf6c8e

iamorchid · 2025-03-04T03:53:29Z

@pedroerp could you please help review this PR? Thanks.

iamorchid · 2025-03-06T10:54:57Z

@mbasmanova could you also help review this PR?

pedroerp

A few small comments, but overall looks good to me. Thanks for the optimization!

Have you tried running the join fuzzer for a while to check that this does not trigger any issues?

pedroerp · 2025-03-06T15:43:59Z

velox/exec/NestedLoopJoinProbe.cpp

for readability, could you wrap this logic in a helper method?

if (!readyToProduceOutput()) {
  return nullptr;
}
output_->resize(numOutputRows_);
return std::move(output_);

or something similar

It's good advice to define readyToProduceOutput.
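For readers following this thread, a rough sketch of how such a helper could be wired up is shown below. It is not the code from this PR: `ProbeSketch`, `addRow`, `noMoreInput`, and the readiness condition (batch full, or no more input) are assumptions made for illustration; only the names `readyToProduceOutput`, `output_`, and `numOutputRows_` come from the suggestion above.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a probe operator; not Velox code.
class ProbeSketch {
 public:
  explicit ProbeSketch(std::size_t batchCapacity)
      : batchCapacity_(batchCapacity) {}

  // Writes one row into the shared output, allocating it lazily.
  // Returns false if the batch is full and must be fetched first.
  bool addRow(int value) {
    if (!output_) {
      output_ = std::make_unique<std::vector<int>>(batchCapacity_);
      numOutputRows_ = 0;
    }
    if (numOutputRows_ == batchCapacity_) {
      return false;
    }
    (*output_)[numOutputRows_++] = value;
    return true;
  }

  void noMoreInput() {
    noMoreInput_ = true;
  }

  // Returns the accumulated batch trimmed to the rows actually written,
  // or nullptr if the batch should keep accumulating. The resize happens
  // here, in the caller of the row-producing code.
  std::unique_ptr<std::vector<int>> getOutput() {
    if (!readyToProduceOutput()) {
      return nullptr;
    }
    output_->resize(numOutputRows_);
    numOutputRows_ = 0;
    return std::move(output_);
  }

 private:
  // Assumed condition: there is something to emit, and either the batch
  // is full or no further input will arrive.
  bool readyToProduceOutput() const {
    if (!output_ || numOutputRows_ == 0) {
      return false;
    }
    return numOutputRows_ == batchCapacity_ || noMoreInput_;
  }

  const std::size_t batchCapacity_;
  std::unique_ptr<std::vector<int>> output_;
  std::size_t numOutputRows_ = 0;
  bool noMoreInput_ = false;
};
```

Keeping the readiness check in one place makes the early-return path easy to read, and the explicit `resize` before handing the batch off reflects that the output is preallocated to full capacity and only partially filled.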

pedroerp · 2025-03-06T15:45:01Z

velox/exec/NestedLoopJoinProbe.cpp

we should probably add something to the method's documentation noting that it is now the caller's responsibility to resize the output.

It's updated.

iamorchid · 2025-03-07T05:43:59Z

> A few small comments, but overall looks good to me. Thanks for the optimization!
>
> Have you tried running the join fuzzer for a while to check that this does not trigger any issues?

Yes, the join fuzzer test passed.

iamorchid · 2025-03-08T02:05:52Z

@mbasmanova @pedroerp can you help merge the PR if it looks OK?

iamorchid · 2025-03-12T01:13:39Z

@pedroerp can you help merge this PR if it looks OK?

iamorchid · 2025-03-18T12:12:08Z

@mbasmanova can you help merge this PR if it looks good?

mbasmanova · 2025-03-18T12:28:05Z

@pedroerp Pedro, would you take another look?

pedroerp

Looks good to me. Thank you for the PR!

facebook-github-bot · 2025-03-18T18:17:50Z

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

kagamiori · 2025-03-18T18:18:23Z

Hi @iamorchid, could you please rebase the PR onto the latest main? That's needed for merging it. Thanks!

iamorchid · 2025-03-19T05:39:09Z

> Hi @iamorchid, could you please rebase the PR onto the latest main? That's needed for merging it. Thanks!

Sure, updated.

iamorchid · 2025-03-19T21:47:25Z

@kagamiori Hi, Wei. Is it OK to merge it now?

kagamiori · 2025-03-21T16:07:19Z

Hi @iamorchid, some unit tests fail internally. Could you please take a look?

Note: Google Test filter = NestedLoopJoinTest.basic
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NestedLoopJoinTest
[ RUN      ] NestedLoopJoinTest.basic
fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1182: Failure
Failed
Expected 15, got 15
15 extra rows, 15 missing rows
10 of extra rows:
	3049146574685680872 | 6364777458702462862
	3471844029210195232 | 6943465136533772555
	3471844029210195232 | 6943465136533772555
	5676477873606191036 | 6911158193961541679
	5676477873606191036 | 6911158193961541679
	7593276348121265214 | 259054925252220560
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797

10 of missing rows:
	259054925252220560 | 259054925252220560
	2511017508255745915 | 2511017508255745915
	6364777458702462862 | 6364777458702462862
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797

Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
DuckDB query: SELECT t0, u0 FROM t INNER JOIN u ON t.t0 = u.u0
Google Test trace:
fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:104: maxDrivers:1 joinType:INNER comparison:=

fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1182: Failure
Failed
Expected 90, got 90
15 extra rows, 15 missing rows
10 of extra rows:
	3049146574685680872 | 6364777458702462862
	3471844029210195232 | 6943465136533772555
	3471844029210195232 | 6943465136533772555
	5676477873606191036 | 6911158193961541679
	5676477873606191036 | 6911158193961541679
	7593276348121265214 | 259054925252220560
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797

10 of missing rows:
	259054925252220560 | 259054925252220560
	2511017508255745915 | 2511017508255745915
	6364777458702462862 | 6364777458702462862
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797

Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
DuckDB query: SELECT t0, u0 FROM t RIGHT JOIN u ON t.t0 = u.u0
Google Test trace:
fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:104: maxDrivers:1 joinType:RIGHT comparison:=


I0318 20:54:13.038564 144428 SharedArbitrator.cpp:295] [MEM] Start memory reclaim executor with 26 threads
I0318 20:54:13.038663 144428 SharedArbitrator.cpp:358] [MEM] Global arbitration abort capacity limits: 0
I0318 20:54:13.038817 144428 SharedArbitrator.cpp:300] [MEM] Shared arbitrator created with 6.00GB capacity, 0B reserved capacity
I0318 20:54:13.038868 144428 SharedArbitrator.cpp:305] [MEM] Arbitration config: max arbitration time 5m 0s, global memory reclaim percentage 10, global arbitration abort time ratio 0.5, global arbitration skip spill 0
I0318 20:54:13.039014 144432 SharedArbitrator.cpp:889] [MEM] Global arbitration controller started
I0318 20:54:13.038902 144428 SharedArbitrator.cpp:314] [MEM] Memory pool participant config: initCapacity 512.00MB, minCapacity 0B, fastExponentialGrowthCapacityLimit 512.00MB, slowCapacityGrowRatio 0.25, minFreeCapacity 128.00MB, minFreeCapacityRatio 0.25, minReclaimBytes 0B, abortCapacityLimit 0B
I0318 20:54:13.610507 144428 PeriodicStatsReporter.cpp:69] Starting PeriodicStatsReporter with options allocatorStatsIntervalMs:2000, cacheStatsIntervalMs:2000, arbitratorStatsIntervalMs:2000, spillStatsIntervalMs:2000
I0318 20:54:13.611109 144428 HiveConnector.cpp:56] Hive connector test-hive created with maximum of 20000 cached file handles.
I0318 20:54:13.611140 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:13.651444 144493 Task.cpp:2055] Terminating task test_cursor 1 with state Finished after running for 25ms
I0318 20:54:13.693948 144491 Task.cpp:2055] Terminating task test_cursor 2 with state Finished after running for 27ms
I0318 20:54:14.084604 144493 Task.cpp:2055] Terminating task test_cursor 3 with state Finished after running for 24ms
I0318 20:54:14.472589 144492 Task.cpp:2055] Terminating task test_cursor 4 with state Finished after running for 31ms
I0318 20:54:15.462435 144493 Task.cpp:2055] Terminating task test_cursor 5 with state Finished after running for 27ms
I0318 20:54:15.527756 144491 Task.cpp:2055] Terminating task test_cursor 6 with state Finished after running for 43ms
I0318 20:54:15.611277 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:15.982252 144490 Task.cpp:2055] Terminating task test_cursor 7 with state Finished after running for 31ms
I0318 20:54:16.437009 144491 Task.cpp:2055] Terminating task test_cursor 8 with state Finished after running for 33ms
I0318 20:54:17.538368 144492 Task.cpp:2055] Terminating task test_cursor 9 with state Finished after running for 20ms
I0318 20:54:17.593410 144491 Task.cpp:2055] Terminating task test_cursor 10 with state Finished after running for 38ms
I0318 20:54:17.611534 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:18.011440 144491 Task.cpp:2055] Terminating task test_cursor 11 with state Finished after running for 39ms
I0318 20:54:18.413925 144492 Task.cpp:2055] Terminating task test_cursor 12 with state Finished after running for 32ms
I0318 20:54:19.374485 144490 Task.cpp:2055] Terminating task test_cursor 13 with state Finished after running for 24ms
I0318 20:54:19.436452 144493 Task.cpp:2055] Terminating task test_cursor 14 with state Finished after running for 29ms
I0318 20:54:19.611694 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:19.866226 144492 Task.cpp:2055] Terminating task test_cursor 15 with state Finished after running for 26ms
I0318 20:54:20.259239 144490 Task.cpp:2055] Terminating task test_cursor 16 with state Finished after running for 31ms
AddressSanitizer:DEADLYSIGNAL
=================================================================
==144428==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000000e929f3 bp 0x7fd5f2bfac50 sp 0x7fd5f2bfac00 T58)
==144428==The signal is caused by a READ memory access.
==144428==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
SCARINESS: 20 (wild-addr-read)
I0318 20:54:21.611858 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
    #0 0xe929f3 in bool facebook::velox::bits::isBitSet<unsigned long>(unsigned long const*, unsigned long) fbcode/velox/common/base/BitUtil.h:53
    #1 0xe9292b in facebook::velox::bits::isBitNull(unsigned long const*, int) fbcode/velox/common/base/Nulls.h:38
    #2 0x1026c75 in facebook::velox::DecodedVector::isNullAt(int) const fbcode/velox/vector/DecodedVector.h:206
    #3 0x11526d0 in auto facebook::velox::FlatVector<long>::copyValuesAndNulls(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*)::'lambda2'(auto)::operator()<int>(auto) const fbcode/velox/vector/FlatVector-inl.h:245
    #4 0x1149ce0 in void facebook::velox::SelectivityVector::applyToSelected<facebook::velox::FlatVector<long>::copyValuesAndNulls(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*)::'lambda2'(auto)>(auto) const fbcode/velox/vector/SelectivityVector.h:446
    #5 0x1147d45 in facebook::velox::FlatVector<long>::copyValuesAndNulls(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*) fbcode/velox/vector/FlatVector-inl.h:244
    #6 0x10eae6a in facebook::velox::FlatVector<long>::copy(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*) fbcode/velox/vector/FlatVector.h:256
    #7 0x7fd683820ecc in facebook::velox::RowVector::copy(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*) fbcode/velox/vector/ComplexVector.cpp:219
    #8 0x7fd6838222c9 in facebook::velox::RowVector::copy(facebook::velox::BaseVector const*, int, int, int) fbcode/velox/vector/ComplexVector.cpp:182
    #9 0x7fd6840e86a6 in facebook::velox::exec::MultiThreadedTaskCursor::MultiThreadedTaskCursor(facebook::velox::exec::CursorParameters const&)::'lambda'(std::shared_ptr<facebook::velox::RowVector> const&, folly::SemiFuture<folly::Unit>*)::operator()(std::shared_ptr<facebook::velox::RowVector> const&, folly::SemiFuture<folly::Unit>*) const fbcode/velox/exec/Cursor.cpp:245
    #10 0x7fd6840e8199 in facebook::velox::exec::BlockingReason std::__invoke_impl<facebook::velox::exec::BlockingReason, facebook::velox::exec::MultiThreadedTaskCursor::MultiThreadedTaskCursor(facebook::velox::exec::CursorParameters const&)::'lambda'(std::shared_ptr<facebook::velox::Row
...
...
...
readPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::__call<void, 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/functional:420
    #27 0x7fd686a99e1d in void std::_Bind<void (folly::ThreadPoolExecutor::* (folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::operator()<void>() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/functional:503
    #28 0x7fd686a99b3c in void folly::detail::function::call_<std::_Bind<void (folly::ThreadPoolExecutor::* (folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>, true, false, void>(folly::detail::function::Data&) fbcode/folly/Function.h:341
    #29 0x2331676 in folly::detail::function::FunctionTraits<void ()>::operator()() fbcode/folly/Function.h:370
    #30 0x25151c4 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()::operator()() fbcode/folly/executors/thread_factory/NamedThreadFactory.h:40
    #31 0x2515084 in void std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61
    #32 0x2515044 in std::__invoke_result<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>::type std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:96
    #33 0x251501c in void std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>>::_M_invoke<0ul>(std::_Index_tuple<0ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:253
    #34 0x2514ff4 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>>::operator()() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:260
    #35 0x2514eb8 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:211
    #36 0x7fd658cdf5b4 in execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18
    #37 0x384200a in asan_thread_start(void*) ubsan.c
    #38 0x7fd65909abc8 in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8
    #39 0x7fd65912ce4b in __GI___clone3 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV fbcode/velox/common/base/BitUtil.h:53 in bool facebook::velox::bits::isBitSet<unsigned long>(unsigned long const*, unsigned long)
Thread T58 created by T0 here:
    #0 0x3829bfd in pthread_create (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/07838712e61ba1ef/velox/exec/tests/__velox_exec_test__/velox_exec_test+0x3829bfd)
    #1 0x7fd658cdf8de in __gthread_create /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/include/x86_64-facebook-linux/bits/gthr-default.h:663:35
    #2 0x7fd658cdf8de in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State>>, void (*)()) /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:147:37
    #3 0x25142ba in std::thread::thread<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'(), void>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:143
    #4 0x251399a in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&) fbcode/folly/executors/thread_factory/NamedThreadFactory.h:37
    #5 0x7fd686a95869 in folly::ThreadPoolExecutor::addThreads(unsigned long) fbcode/folly/executors/ThreadPoolExecutor.cpp:251
    #6 0x7fd686ac8150 in folly::ThreadPoolExecutor::ensureActiveThreads() fbcode/folly/executors/ThreadPoolExecutor.cpp:547
    #7 0x7fd686c45c57 in void folly::CPUThreadPoolExecutor::addImpl<false>(folly::Function<void ()>, signed char, std::chrono::duration<long, std::ratio<1l, 1000l>>, folly::Function<void ()>) fbcode/folly/executors/CPUThreadPoolExecutor.cpp:280
    #8 0x7fd686b4c3ad in folly::CPUThreadPoolExecutor::add(folly::Function<void ()>, std::chrono::duration<long, std::ratio<1l, 1000l>>, folly::Function<void ()>) fbcode/folly/executors/CPUThreadPoolExecutor.cpp:224
    #9 0x7fd686b4bc96 in folly::CPUThreadPoolExecutor::add(folly::Function<void ()>) fbcode/folly/executors/CPUThreadPoolExecutor.cpp:219
    #10 0x7fd67e7997b8 in facebook::velox::exec::Driver::enqueue(std::shared_ptr<facebook::velox::exec::Driver>) fbcode/velox/exec/Driver.cpp:243
    #11 0x7fd67f174c08 in facebook::velox::exec::Task::createAndStartDrivers(unsigned int) fbcode/velox/exec/Task.cpp:918
    #12 0x7fd67f170d6a in facebook::velox::exec::Task::start(unsigned int, unsigned int) fbcode/velox/exec/Task.cpp:803
    #13 0x7fd6840e1ae2 in facebook::velox::exec::MultiThreadedTaskCursor::start() fbcode/velox/exec/Cursor.cpp:272
    #14 0x7fd6840e1d69 in facebook::velox::exec::MultiThreadedTaskCursor::moveNext() fbcode/velox/exec/Cursor.cpp:280
    #15 0x7fd67665ce37 in facebook::velox::exec::test::readCursor(facebook::velox::exec::CursorParameters const&, std::function<void (facebook::velox::exec::Task*)>, unsigned long) fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1434
    #16 0x7fd676699c9f in facebook::velox::exec::test::assertQuery(facebook::velox::exec::CursorParameters const&, std::function<void (facebook::velox::exec::Task*)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, facebook::velox::exec::test::DuckDbQueryRunner&, std::optional<std::vector<unsigned int, std::allocator<unsigned int>>>) fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1545
    #17 0x13b2d88 in facebook::velox::exec::test::OperatorTestBase::assertQuery(facebook::velox::exec::CursorParameters const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) fbcode/velox/exec/tests/utils/OperatorTestBase.h:105
    #18 0x2621f52 in facebook::velox::exec::test::(anonymous namespace)::NestedLoopJoinTest::runTest(std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&, std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&, int, unsigned long) fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:120
    #19 0x261f3a3 in facebook::velox::exec::test::(anonymous namespace)::NestedLoopJoinTest::runSingleAndMultiDriverTest(std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&, std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&) fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:72
    #20 0x261f1b8 in facebook::velox::exec::test::(anonymous namespace)::NestedLoopJoinTest_basic_Test::TestBody() fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:217
    #21 0x7fd6873c01be in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2675
    #22 0x7fd6873bfa44 in testing::Test::Run() fbsource/src/gtest.cc:2692
    #23 0x7fd6873c568f in testing::TestInfo::Run() fbsource/src/gtest.cc:2841
    #24 0x7fd6873cd646 in testing::TestSuite::Run() fbsource/src/gtest.cc:3020
    #25 0x7fd687408fab in testing::internal::UnitTestImpl::RunAllTests() fbsource/src/gtest.cc:5925
    #26 0x7fd68740800b in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) fbsource/src/gtest.cc:2675
    #27 0x7fd687407549 in testing::UnitTest::Run() fbsource/src/gtest.cc:5489
    #28 0x239f2b0 in RUN_ALL_TESTS() fbsource/gtest/gtest.h:2317
    #29 0x239f171 in main fbcode/velox/exec/tests/Main.cpp:28
    #30 0x7fd65902c656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #31 0x7fd65902c717 in __libc_start_main@GLIBC_2.2.5 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3
    #32 0xe6cd50 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116

==144428==ABORTING

Also,

Note: Google Test filter = NestedLoopJoinTest.bigintArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NestedLoopJoinTest
[ RUN      ] NestedLoopJoinTest.bigintArray
fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1182: Failure
Failed
Expected 874, got 874
7 extra rows, 7 missing rows
7 of extra rows:
	null | 5746192603694578572
	2453124679116815258 | 6465094110047898509
	2498668260353398648 | null
	3653562531762506384 | null
	8093754489623884800 | 259054925252220560
	8693514054697472469 | 1879318768042108095
	8693514054697472469 | 1879318768042108095

7 of missing rows:
	259054925252220560 | 259054925252220560
	1879318768042108095 | 1879318768042108095
	1879318768042108095 | 1879318768042108095
	5746192603694578572 | 5746192603694578572
	6465094110047898509 | 6465094110047898509
	8820484975464199494 | null
	8901369548833288010 | null

Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
DuckDB query: SELECT t0, u0 FROM t FULL JOIN u ON t.t0 = u.u0
Google Test trace:
fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:104: maxDrivers:4 joinType:FULL comparison:=

[  FAILED  ] NestedLoopJoinTest.bigintArray (351 ms)
[----------] 1 test from NestedLoopJoinTest (351 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (581 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NestedLoopJoinTest.bigintArray

 1 FAILED TEST

iamorchid · 2025-03-22T08:35:40Z

@kagamiori Sorry Wei, the UT break is my fault (I forgot to re-run the UTs after making changes to address the comments). I've checked that all relevant UTs pass with my second commit.
@pedroerp can you help take another look at my changes? Thanks.

FelixYBW · 2025-03-25T17:34:43Z

@pedroerp we noted the issue in our TPCDS benchmark as well.

We may need to go through every operator and make sure the output RowVector is big enough.

iamorchid · 2025-03-26T12:00:48Z

@kagamiori Is it OK to merge this PR now?

facebook-github-bot · 2025-03-30T06:42:33Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2025-03-31T07:06:25Z

@xiaoxmeng merged this pull request in 96c011d.

…bator#12519)

Summary:
Recently, we found that the following benchmark SQL became much slower after we upgraded our system to the latest Velox.

```
select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > (
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )
order by
  value desc;
```

Through flame-graph investigation, we noticed that ```NestedLoopJoin::generateOutput``` had become the bottleneck. So we went through the most recent changes to the NestedLoopJoin-related files and found this suspicious PR (facebookincubator#10651).

After reviewing the changes introduced by that PR, I found that it prepares a new output for each probe row even if the build table has only one row (our SQL has a filter defined). Even though each probe row uses only a small part of the output space, a new output is still created for the next probe row.

So this PR re-uses the output across multiple probe rows and avoids unnecessary memory allocation for new output. With this PR, the benchmark SQL performance recovers.

Pull Request resolved: facebookincubator#12519

Reviewed By: Yuhta

Differential Revision: D71409408

Pulled By: xiaoxmeng

fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d

Yuhta requested a review from pedroerp March 4, 2025 22:46

iamorchid force-pushed the refine_nlj branch from 6e0757c to 840ef6c Compare March 6, 2025 09:23

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 6, 2025

mbasmanova requested review from Yuhta and xiaoxmeng March 6, 2025 11:10

mbasmanova changed the title ~~re-use output across probe rows for NestedLoopJoin~~ fix: Re-use output across probe rows for NestedLoopJoin Mar 6, 2025

pedroerp reviewed Mar 6, 2025

iamorchid force-pushed the refine_nlj branch from 840ef6c to 299d692 Compare March 7, 2025 02:39

iamorchid requested a review from pedroerp March 7, 2025 03:03

pedroerp approved these changes Mar 18, 2025

pedroerp added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Mar 18, 2025

iamorchid force-pushed the refine_nlj branch from 299d692 to 4ef170c Compare March 19, 2025 05:38

fix: Re-use output across probe rows for NestedLoopJoin

ddd66e8

iamorchid force-pushed the refine_nlj branch from 4ef170c to 53f4869 Compare March 22, 2025 08:26

iamorchid closed this Mar 22, 2025

iamorchid reopened this Mar 22, 2025

fix NestedLoopJoin UT break

8baf921

iamorchid force-pushed the refine_nlj branch from 53f4869 to 8baf921 Compare March 22, 2025 08:54

facebook-github-bot closed this in 96c011d Mar 31, 2025

facebook-github-bot added the Merged label Mar 31, 2025

ayushi-agarwal mentioned this pull request Jan 8, 2026

NestedLoopJoin is significantly slower than vanilla spark #12294

Open

zhangxffff mentioned this pull request Mar 17, 2026

[Bug] NestedLoopJoinProbe OOM with wide build-side rows bytedance/bolt#401

Closed

5 tasks


6 participants
