fix: Re-use output across probe rows for NestedLoopJoin #12519

Closed
iamorchid wants to merge 2 commits into facebookincubator:main from iamorchid:refine_nlj

@iamorchid
Contributor

@iamorchid iamorchid commented Mar 4, 2025

Recently, we found that the following benchmark SQL became much slower after we upgraded our system to the latest Velox.

select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > ( 
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )   
order by
  value desc;

Through flame-graph investigation, we noticed that NestedLoopJoin::generateOutput had become the bottleneck. We went through the most recent changes to the NestedLoopJoin-related files and found a suspicious PR (#10651).

After reviewing the changes introduced by that PR, I found that it prepares a new output for each probe row even if the build table has only one row (our SQL has a filter defined). Even though each probe row uses only a small part of the output space, a new output is still created for the next probe row.

This PR re-uses the output across multiple probe rows to avoid unnecessary memory allocation for new outputs. With this PR, the benchmark SQL performance recovers.
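
To make the cost difference concrete, here is a minimal standalone sketch of the two strategies. It is only a model of the idea described above, not the Velox implementation: `OutputBatch`, `JoinStats`, `joinPerProbeRow`, and `joinReusingBatch` are hypothetical names, and the "join" is an unfiltered cross product over plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an output batch; this is not a Velox type.
struct OutputBatch {
  explicit OutputBatch(std::size_t capacity) { rows.reserve(capacity); }
  std::vector<std::pair<int, int>> rows;
};

// Counters used to compare the two strategies.
struct JoinStats {
  std::size_t batchesAllocated = 0;
  std::size_t rowsProduced = 0;
};

// Strategy A: allocate a fresh batch for every probe row, even when the
// probe row fills only a few of the reserved slots.
JoinStats joinPerProbeRow(
    const std::vector<int>& probe,
    const std::vector<int>& build,
    std::size_t batchCapacity) {
  assert(batchCapacity > 0);
  JoinStats stats;
  for (int p : probe) {
    auto batch = std::make_unique<OutputBatch>(batchCapacity);
    ++stats.batchesAllocated;
    for (int b : build) {
      if (batch->rows.size() == batchCapacity) {
        // Batch is full: emit it and start a new one for the same probe row.
        stats.rowsProduced += batch->rows.size();
        batch = std::make_unique<OutputBatch>(batchCapacity);
        ++stats.batchesAllocated;
      }
      batch->rows.emplace_back(p, b);
    }
    // Emit the (possibly mostly empty) batch before moving to the next probe row.
    stats.rowsProduced += batch->rows.size();
  }
  return stats;
}

// Strategy B: keep filling the same batch across probe rows and emit it
// only once it is full or the input is exhausted.
JoinStats joinReusingBatch(
    const std::vector<int>& probe,
    const std::vector<int>& build,
    std::size_t batchCapacity) {
  assert(batchCapacity > 0);
  JoinStats stats;
  std::unique_ptr<OutputBatch> batch;
  for (int p : probe) {
    for (int b : build) {
      if (!batch) {
        batch = std::make_unique<OutputBatch>(batchCapacity);
        ++stats.batchesAllocated;
      }
      batch->rows.emplace_back(p, b);
      if (batch->rows.size() == batchCapacity) {
        stats.rowsProduced += batch->rows.size();
        batch.reset();
      }
    }
  }
  if (batch) {
    // Emit the final, partially filled batch.
    stats.rowsProduced += batch->rows.size();
  }
  return stats;
}
```

In this model, 1000 probe rows joined against a single build row with a batch capacity of 1024 allocate 1000 batches under strategy A and a single batch under strategy B, while both produce the same 1000 rows.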

@facebook-github-bot
Contributor

Hi @iamorchid!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@netlify

netlify bot commented Mar 4, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 8baf921
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67de7ab07623690008cf6c8e

@iamorchid
Contributor Author

@pedroerp could you please review this PR? Thanks.

@Yuhta Yuhta requested a review from pedroerp March 4, 2025 22:46
@facebook-github-bot added the CLA Signed label Mar 6, 2025
@iamorchid
Contributor Author

@mbasmanova could you also help review this PR?

@mbasmanova mbasmanova requested review from Yuhta and xiaoxmeng March 6, 2025 11:10
@mbasmanova changed the title from "re-use output across probe rows for NestedLoopJoin" to "fix: Re-use output across probe rows for NestedLoopJoin" Mar 6, 2025
Contributor

@pedroerp pedroerp left a comment

A few small comments, but overall looks good to me. Thanks for the optimization!

Have you tried running the join fuzzer for a while to check that this doesn't trigger any issues?

Contributor

for readability, could you wrap this logic in a helper method?

if (!readyToProduceOutput()) {
  return nullptr;
}
output_->resize(numOutputRows_);
return std::move(output_);

or something similar
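
For reference, here is how that helper pattern could look in a small standalone model. This is a sketch under stated assumptions, not the code in this PR: `ToyJoinOperator`, `addRow`, and the readiness rule (batch full, or input finished with rows pending) are hypothetical; only `readyToProduceOutput`, `output_`, and `numOutputRows_` come from the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy operator; it models only the control flow and is not
// Velox's NestedLoopJoin.
class ToyJoinOperator {
 public:
  explicit ToyJoinOperator(std::size_t batchCapacity)
      : batchCapacity_(batchCapacity) {
    assert(batchCapacity_ > 0);
  }

  // Writes one row into the shared batch, allocating it at full capacity on
  // first use. The caller must drain a full batch via getOutput() before
  // adding more rows.
  void addRow(int value) {
    if (!output_) {
      output_ = std::make_unique<std::vector<int>>(batchCapacity_);
      numOutputRows_ = 0;
    }
    assert(numOutputRows_ < batchCapacity_);
    (*output_)[numOutputRows_++] = value;
  }

  void noMoreInput() { noMoreInput_ = true; }

  // Returns the batch trimmed to the rows actually written, or nullptr when
  // the batch should keep accumulating rows.
  std::unique_ptr<std::vector<int>> getOutput() {
    if (!readyToProduceOutput()) {
      return nullptr;
    }
    // The batch was allocated at full capacity, so it has to be resized to
    // the number of rows written before it is handed out.
    output_->resize(numOutputRows_);
    numOutputRows_ = 0;
    return std::move(output_);
  }

 private:
  // Assumed readiness rule: the batch is full, or input is finished and
  // some rows are pending.
  bool readyToProduceOutput() const {
    if (!output_ || numOutputRows_ == 0) {
      return false;
    }
    return numOutputRows_ == batchCapacity_ || noMoreInput_;
  }

  std::size_t batchCapacity_;
  std::size_t numOutputRows_ = 0;
  bool noMoreInput_ = false;
  std::unique_ptr<std::vector<int>> output_;
};
```

Keeping the readiness check in one predicate means the resize and hand-off happen in exactly one place, which is where a forgotten resize would otherwise expose unwritten slots to downstream consumers.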

Contributor Author

@iamorchid iamorchid Mar 7, 2025

Defining readyToProduceOutput is good advice.

Contributor

We should probably add something to the method's documentation saying that it is now the caller's responsibility to resize the output.

Contributor Author

It's updated.

@iamorchid
Contributor Author

> A few small comments, but overall looks good to me. Thanks for the optimization!
>
> Have you tried running the join fuzzer for a while to check that this doesn't trigger any issues?

Yes, the join fuzzer test passed.

@iamorchid
Contributor Author

@mbasmanova @pedroerp can you help merge the PR if it looks OK?

@iamorchid
Contributor Author

@pedroerp can you help merge this PR if it looks OK?

@iamorchid
Contributor Author

@mbasmanova can you help merge this PR if it looks good?

@mbasmanova
Contributor

@pedroerp Pedro, would you take another look?

Contributor

@pedroerp pedroerp left a comment

Looks good to me. Thank you for the PR!

@pedroerp added the ready-to-merge label Mar 18, 2025
@facebook-github-bot
Contributor

@kagamiori has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kagamiori
Contributor

Hi @iamorchid, could you please rebase the PR onto the latest main? That's needed for merging it. Thanks!

@iamorchid
Contributor Author

> Hi @iamorchid, could you please rebase the PR onto the latest main? That's needed for merging it. Thanks!

Sure, updated.

@iamorchid
Contributor Author

@kagamiori Hi Wei, is it OK to merge this now?

@kagamiori
Contributor

Hi @iamorchid, some unit tests fail internally. Could you please take a look?

Note: Google Test filter = NestedLoopJoinTest.basic
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NestedLoopJoinTest
[ RUN      ] NestedLoopJoinTest.basic
fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1182: Failure
Failed
Expected 15, got 15
15 extra rows, 15 missing rows
10 of extra rows:
	3049146574685680872 | 6364777458702462862
	3471844029210195232 | 6943465136533772555
	3471844029210195232 | 6943465136533772555
	5676477873606191036 | 6911158193961541679
	5676477873606191036 | 6911158193961541679
	7593276348121265214 | 259054925252220560
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797

10 of missing rows:
	259054925252220560 | 259054925252220560
	2511017508255745915 | 2511017508255745915
	6364777458702462862 | 6364777458702462862
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797

Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
DuckDB query: SELECT t0, u0 FROM t INNER JOIN u ON t.t0 = u.u0
Google Test trace:
fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:104: maxDrivers:1 joinType:INNER comparison:=

fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1182: Failure
Failed
Expected 90, got 90
15 extra rows, 15 missing rows
10 of extra rows:
	3049146574685680872 | 6364777458702462862
	3471844029210195232 | 6943465136533772555
	3471844029210195232 | 6943465136533772555
	5676477873606191036 | 6911158193961541679
	5676477873606191036 | 6911158193961541679
	7593276348121265214 | 259054925252220560
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797
	7593276348121265214 | 6628868324100099797

10 of missing rows:
	259054925252220560 | 259054925252220560
	2511017508255745915 | 2511017508255745915
	6364777458702462862 | 6364777458702462862
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797
	6628868324100099797 | 6628868324100099797

Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
DuckDB query: SELECT t0, u0 FROM t RIGHT JOIN u ON t.t0 = u.u0
Google Test trace:
fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:104: maxDrivers:1 joinType:RIGHT comparison:=


I0318 20:54:13.038564 144428 SharedArbitrator.cpp:295] [MEM] Start memory reclaim executor with 26 threads
I0318 20:54:13.038663 144428 SharedArbitrator.cpp:358] [MEM] Global arbitration abort capacity limits: 0
I0318 20:54:13.038817 144428 SharedArbitrator.cpp:300] [MEM] Shared arbitrator created with 6.00GB capacity, 0B reserved capacity
I0318 20:54:13.038868 144428 SharedArbitrator.cpp:305] [MEM] Arbitration config: max arbitration time 5m 0s, global memory reclaim percentage 10, global arbitration abort time ratio 0.5, global arbitration skip spill 0
I0318 20:54:13.039014 144432 SharedArbitrator.cpp:889] [MEM] Global arbitration controller started
I0318 20:54:13.038902 144428 SharedArbitrator.cpp:314] [MEM] Memory pool participant config: initCapacity 512.00MB, minCapacity 0B, fastExponentialGrowthCapacityLimit 512.00MB, slowCapacityGrowRatio 0.25, minFreeCapacity 128.00MB, minFreeCapacityRatio 0.25, minReclaimBytes 0B, abortCapacityLimit 0B
I0318 20:54:13.610507 144428 PeriodicStatsReporter.cpp:69] Starting PeriodicStatsReporter with options allocatorStatsIntervalMs:2000, cacheStatsIntervalMs:2000, arbitratorStatsIntervalMs:2000, spillStatsIntervalMs:2000
I0318 20:54:13.611109 144428 HiveConnector.cpp:56] Hive connector test-hive created with maximum of 20000 cached file handles.
I0318 20:54:13.611140 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:13.651444 144493 Task.cpp:2055] Terminating task test_cursor 1 with state Finished after running for 25ms
I0318 20:54:13.693948 144491 Task.cpp:2055] Terminating task test_cursor 2 with state Finished after running for 27ms
I0318 20:54:14.084604 144493 Task.cpp:2055] Terminating task test_cursor 3 with state Finished after running for 24ms
I0318 20:54:14.472589 144492 Task.cpp:2055] Terminating task test_cursor 4 with state Finished after running for 31ms
I0318 20:54:15.462435 144493 Task.cpp:2055] Terminating task test_cursor 5 with state Finished after running for 27ms
I0318 20:54:15.527756 144491 Task.cpp:2055] Terminating task test_cursor 6 with state Finished after running for 43ms
I0318 20:54:15.611277 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:15.982252 144490 Task.cpp:2055] Terminating task test_cursor 7 with state Finished after running for 31ms
I0318 20:54:16.437009 144491 Task.cpp:2055] Terminating task test_cursor 8 with state Finished after running for 33ms
I0318 20:54:17.538368 144492 Task.cpp:2055] Terminating task test_cursor 9 with state Finished after running for 20ms
I0318 20:54:17.593410 144491 Task.cpp:2055] Terminating task test_cursor 10 with state Finished after running for 38ms
I0318 20:54:17.611534 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:18.011440 144491 Task.cpp:2055] Terminating task test_cursor 11 with state Finished after running for 39ms
I0318 20:54:18.413925 144492 Task.cpp:2055] Terminating task test_cursor 12 with state Finished after running for 32ms
I0318 20:54:19.374485 144490 Task.cpp:2055] Terminating task test_cursor 13 with state Finished after running for 24ms
I0318 20:54:19.436452 144493 Task.cpp:2055] Terminating task test_cursor 14 with state Finished after running for 29ms
I0318 20:54:19.611694 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
I0318 20:54:19.866226 144492 Task.cpp:2055] Terminating task test_cursor 15 with state Finished after running for 26ms
I0318 20:54:20.259239 144490 Task.cpp:2055] Terminating task test_cursor 16 with state Finished after running for 31ms
AddressSanitizer:DEADLYSIGNAL
=================================================================
==144428==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000000e929f3 bp 0x7fd5f2bfac50 sp 0x7fd5f2bfac00 T58)
==144428==The signal is caused by a READ memory access.
==144428==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
SCARINESS: 20 (wild-addr-read)
I0318 20:54:21.611858 144489 PeriodicStatsReporter.cpp:252] Spill memory usage: current[0B] peak[0B]
    #0 0xe929f3 in bool facebook::velox::bits::isBitSet<unsigned long>(unsigned long const*, unsigned long) fbcode/velox/common/base/BitUtil.h:53
    #1 0xe9292b in facebook::velox::bits::isBitNull(unsigned long const*, int) fbcode/velox/common/base/Nulls.h:38
    #2 0x1026c75 in facebook::velox::DecodedVector::isNullAt(int) const fbcode/velox/vector/DecodedVector.h:206
    #3 0x11526d0 in auto facebook::velox::FlatVector<long>::copyValuesAndNulls(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*)::'lambda2'(auto)::operator()<int>(auto) const fbcode/velox/vector/FlatVector-inl.h:245
    #4 0x1149ce0 in void facebook::velox::SelectivityVector::applyToSelected<facebook::velox::FlatVector<long>::copyValuesAndNulls(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*)::'lambda2'(auto)>(auto) const fbcode/velox/vector/SelectivityVector.h:446
    #5 0x1147d45 in facebook::velox::FlatVector<long>::copyValuesAndNulls(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*) fbcode/velox/vector/FlatVector-inl.h:244
    #6 0x10eae6a in facebook::velox::FlatVector<long>::copy(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*) fbcode/velox/vector/FlatVector.h:256
    #7 0x7fd683820ecc in facebook::velox::RowVector::copy(facebook::velox::BaseVector const*, facebook::velox::SelectivityVector const&, int const*) fbcode/velox/vector/ComplexVector.cpp:219
    #8 0x7fd6838222c9 in facebook::velox::RowVector::copy(facebook::velox::BaseVector const*, int, int, int) fbcode/velox/vector/ComplexVector.cpp:182
    #9 0x7fd6840e86a6 in facebook::velox::exec::MultiThreadedTaskCursor::MultiThreadedTaskCursor(facebook::velox::exec::CursorParameters const&)::'lambda'(std::shared_ptr<facebook::velox::RowVector> const&, folly::SemiFuture<folly::Unit>*)::operator()(std::shared_ptr<facebook::velox::RowVector> const&, folly::SemiFuture<folly::Unit>*) const fbcode/velox/exec/Cursor.cpp:245
    #10 0x7fd6840e8199 in facebook::velox::exec::BlockingReason std::__invoke_impl<facebook::velox::exec::BlockingReason, facebook::velox::exec::MultiThreadedTaskCursor::MultiThreadedTaskCursor(facebook::velox::exec::CursorParameters const&)::'lambda'(std::shared_ptr<facebook::velox::Row
...
...
...
readPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::__call<void, 0ul, 1ul>(std::tuple<>&&, std::_Index_tuple<0ul, 1ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/functional:420
    #27 0x7fd686a99e1d in void std::_Bind<void (folly::ThreadPoolExecutor::* (folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>::operator()<void>() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/functional:503
    #28 0x7fd686a99b3c in void folly::detail::function::call_<std::_Bind<void (folly::ThreadPoolExecutor::* (folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)>, true, false, void>(folly::detail::function::Data&) fbcode/folly/Function.h:341
    #29 0x2331676 in folly::detail::function::FunctionTraits<void ()>::operator()() fbcode/folly/Function.h:370
    #30 0x25151c4 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()::operator()() fbcode/folly/executors/thread_factory/NamedThreadFactory.h:40
    #31 0x2515084 in void std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61
    #32 0x2515044 in std::__invoke_result<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>::type std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:96
    #33 0x251501c in void std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>>::_M_invoke<0ul>(std::_Index_tuple<0ul>) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:253
    #34 0x2514ff4 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>>::operator()() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:260
    #35 0x2514eb8 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:211
    #36 0x7fd658cdf5b4 in execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18
    #37 0x384200a in asan_thread_start(void*) ubsan.c
    #38 0x7fd65909abc8 in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8
    #39 0x7fd65912ce4b in __GI___clone3 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV fbcode/velox/common/base/BitUtil.h:53 in bool facebook::velox::bits::isBitSet<unsigned long>(unsigned long const*, unsigned long)
Thread T58 created by T0 here:
    #0 0x3829bfd in pthread_create (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/07838712e61ba1ef/velox/exec/tests/__velox_exec_test__/velox_exec_test+0x3829bfd)
    #1 0x7fd658cdf8de in __gthread_create /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/include/x86_64-facebook-linux/bits/gthr-default.h:663:35
    #2 0x7fd658cdf8de in std::thread::_M_start_thread(std::unique_ptr<std::thread::_State, std::default_delete<std::thread::_State>>, void (*)()) /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:147:37
    #3 0x25142ba in std::thread::thread<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'(), void>(folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::'lambda'()&&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_thread.h:143
    #4 0x251399a in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&) fbcode/folly/executors/thread_factory/NamedThreadFactory.h:37
    #5 0x7fd686a95869 in folly::ThreadPoolExecutor::addThreads(unsigned long) fbcode/folly/executors/ThreadPoolExecutor.cpp:251
    #6 0x7fd686ac8150 in folly::ThreadPoolExecutor::ensureActiveThreads() fbcode/folly/executors/ThreadPoolExecutor.cpp:547
    #7 0x7fd686c45c57 in void folly::CPUThreadPoolExecutor::addImpl<false>(folly::Function<void ()>, signed char, std::chrono::duration<long, std::ratio<1l, 1000l>>, folly::Function<void ()>) fbcode/folly/executors/CPUThreadPoolExecutor.cpp:280
    #8 0x7fd686b4c3ad in folly::CPUThreadPoolExecutor::add(folly::Function<void ()>, std::chrono::duration<long, std::ratio<1l, 1000l>>, folly::Function<void ()>) fbcode/folly/executors/CPUThreadPoolExecutor.cpp:224
    #9 0x7fd686b4bc96 in folly::CPUThreadPoolExecutor::add(folly::Function<void ()>) fbcode/folly/executors/CPUThreadPoolExecutor.cpp:219
    #10 0x7fd67e7997b8 in facebook::velox::exec::Driver::enqueue(std::shared_ptr<facebook::velox::exec::Driver>) fbcode/velox/exec/Driver.cpp:243
    #11 0x7fd67f174c08 in facebook::velox::exec::Task::createAndStartDrivers(unsigned int) fbcode/velox/exec/Task.cpp:918
    #12 0x7fd67f170d6a in facebook::velox::exec::Task::start(unsigned int, unsigned int) fbcode/velox/exec/Task.cpp:803
    #13 0x7fd6840e1ae2 in facebook::velox::exec::MultiThreadedTaskCursor::start() fbcode/velox/exec/Cursor.cpp:272
    #14 0x7fd6840e1d69 in facebook::velox::exec::MultiThreadedTaskCursor::moveNext() fbcode/velox/exec/Cursor.cpp:280
    #15 0x7fd67665ce37 in facebook::velox::exec::test::readCursor(facebook::velox::exec::CursorParameters const&, std::function<void (facebook::velox::exec::Task*)>, unsigned long) fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1434
    #16 0x7fd676699c9f in facebook::velox::exec::test::assertQuery(facebook::velox::exec::CursorParameters const&, std::function<void (facebook::velox::exec::Task*)>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, facebook::velox::exec::test::DuckDbQueryRunner&, std::optional<std::vector<unsigned int, std::allocator<unsigned int>>>) fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1545
    #17 0x13b2d88 in facebook::velox::exec::test::OperatorTestBase::assertQuery(facebook::velox::exec::CursorParameters const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&) fbcode/velox/exec/tests/utils/OperatorTestBase.h:105
    #18 0x2621f52 in facebook::velox::exec::test::(anonymous namespace)::NestedLoopJoinTest::runTest(std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&, std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&, int, unsigned long) fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:120
    #19 0x261f3a3 in facebook::velox::exec::test::(anonymous namespace)::NestedLoopJoinTest::runSingleAndMultiDriverTest(std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&, std::vector<std::shared_ptr<facebook::velox::RowVector>, std::allocator<std::shared_ptr<facebook::velox::RowVector>>> const&) fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:72
    #20 0x261f1b8 in facebook::velox::exec::test::(anonymous namespace)::NestedLoopJoinTest_basic_Test::TestBody() fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:217
    #21 0x7fd6873c01be in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2675
    #22 0x7fd6873bfa44 in testing::Test::Run() fbsource/src/gtest.cc:2692
    #23 0x7fd6873c568f in testing::TestInfo::Run() fbsource/src/gtest.cc:2841
    #24 0x7fd6873cd646 in testing::TestSuite::Run() fbsource/src/gtest.cc:3020
    #25 0x7fd687408fab in testing::internal::UnitTestImpl::RunAllTests() fbsource/src/gtest.cc:5925
    #26 0x7fd68740800b in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) fbsource/src/gtest.cc:2675
    #27 0x7fd687407549 in testing::UnitTest::Run() fbsource/src/gtest.cc:5489
    #28 0x239f2b0 in RUN_ALL_TESTS() fbsource/gtest/gtest.h:2317
    #29 0x239f171 in main fbcode/velox/exec/tests/Main.cpp:28
    #30 0x7fd65902c656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #31 0x7fd65902c717 in __libc_start_main@GLIBC_2.2.5 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3
    #32 0xe6cd50 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116

==144428==ABORTING

Also,

Note: Google Test filter = NestedLoopJoinTest.bigintArray
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from NestedLoopJoinTest
[ RUN      ] NestedLoopJoinTest.bigintArray
fbcode/velox/exec/tests/utils/QueryAssertions.cpp:1182: Failure
Failed
Expected 874, got 874
7 extra rows, 7 missing rows
7 of extra rows:
	null | 5746192603694578572
	2453124679116815258 | 6465094110047898509
	2498668260353398648 | null
	3653562531762506384 | null
	8093754489623884800 | 259054925252220560
	8693514054697472469 | 1879318768042108095
	8693514054697472469 | 1879318768042108095

7 of missing rows:
	259054925252220560 | 259054925252220560
	1879318768042108095 | 1879318768042108095
	1879318768042108095 | 1879318768042108095
	5746192603694578572 | 5746192603694578572
	6465094110047898509 | 6465094110047898509
	8820484975464199494 | null
	8901369548833288010 | null

Note: DuckDB only supports timestamps of millisecond precision. If this test involves timestamp inputs, please make sure you use the right precision.
DuckDB query: SELECT t0, u0 FROM t FULL JOIN u ON t.t0 = u.u0
Google Test trace:
fbcode/velox/exec/tests/NestedLoopJoinTest.cpp:104: maxDrivers:4 joinType:FULL comparison:=

[  FAILED  ] NestedLoopJoinTest.bigintArray (351 ms)
[----------] 1 test from NestedLoopJoinTest (351 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (581 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NestedLoopJoinTest.bigintArray

 1 FAILED TEST

@iamorchid
Contributor Author

iamorchid commented Mar 22, 2025

@kagamiori Sorry Wei, the UT break is my fault (I forgot to re-run the UTs after the changes made in response to the review comments). I've checked that all relevant UTs pass with my second commit.
@pedroerp can you help take another look at my changes? Thanks.

@FelixYBW

@pedroerp we noticed the issue in our TPC-DS benchmark as well.

We may need to go through every operator and make sure the output RowVector is big enough.

@iamorchid
Contributor Author

@kagamiori Is it OK to merge this PR now?

@facebook-github-bot
Contributor

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@xiaoxmeng merged this pull request in 96c011d.

zhouyuan pushed a commit to zhouyuan/velox that referenced this pull request Apr 9, 2025
…bator#12519)

Summary:
Recently, we found that the following benchmark SQL became much slower after we upgraded our system to the latest Velox.

```
select
  ps_partkey,
  sum(ps_supplycost * ps_availqty) as value
from
  partsupp,
  supplier,
  nation
where
  ps_suppkey = s_suppkey
  and s_nationkey = n_nationkey
  and n_name = 'GERMANY'
group by
  ps_partkey having
    sum(ps_supplycost * ps_availqty) > (
      select
        sum(ps_supplycost * ps_availqty) * 0.000001
      from
        partsupp,
        supplier,
        nation
      where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'GERMANY'
    )
order by
  value desc;
```

Through flame-graph investigation, we noticed that ```NestedLoopJoin::generateOutput``` had become the bottleneck. We went through the most recent changes to the NestedLoopJoin-related files and found a suspicious PR (facebookincubator#10651).

After reviewing the changes introduced by that PR, I found that it prepares a new output for each probe row even if the build table has only one row (our SQL has a filter defined). Even though each probe row uses only a small part of the output space, a new output is still created for the next probe row.

This PR re-uses the output across multiple probe rows to avoid unnecessary memory allocation for new outputs. With this PR, the benchmark SQL performance recovers.

Pull Request resolved: facebookincubator#12519

Reviewed By: Yuhta

Differential Revision: D71409408

Pulled By: xiaoxmeng

fbshipit-source-id: fdff86af2d4146e9e8a81ae33368db152bf9965d
zml1206 pushed a commit to zml1206/velox that referenced this pull request May 23, 2025
…bator#12519)