
feat(cudf): Add cuDF based OrderBy operator#12735

Closed
devavret wants to merge 40 commits into facebookincubator:main from devavret:cmake-upstreaming

Conversation

@devavret
Collaborator

This PR adds a cuDF based OrderBy operator and tooling to replace existing
Velox based operators. This includes:

  • CudfVector class that holds a cudf::table and is a replacement for Velox's
    RowVector when dealing with cuDF.
  • Interop code to convert between Velox and cuDF RowVectors.
  • CudfToVelox and CudfFromVelox operators that sit between the cuDF and Velox
    operators and handle the conversion of RowVectors to cudf::table and back.
  • A cuDF driver adapter that converts Velox operators to cuDF operators.
  • NVTX tooling to help with profiling.

@netlify

netlify bot commented Mar 20, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 13a4a87
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67f8c9cd683dc40008e0c62c

Collaborator

@bdice bdice left a comment


Leaving a few explanatory comments for reviewers. Thanks in advance for reviewing!

Lots of credit to @karthikeyann, @devavret, @mhaseeb123, @GregoryKimball on the cuDF side, and lots of appreciation to @oerling @pedroerp @Yuhta @kgpai @assignUser (and more!) on the Meta / Voltron side for all your assistance. We are looking forward to upstreaming more features after this initial PR lands.

Collaborator


We are pinning this to specific commits of cuDF and its dependencies to avoid breakage from any final changes in the 25.04 release. Once the RAPIDS 25.04 release is out (currently targeting April 9-10), we can remove a lot of this logic for rapids-cmake, rmm, and kvikio -- and just pin cuDF to the stable release.

Comment thread CMakeLists.txt
endif()
find_package(CUDAToolkit REQUIRED)
if(VELOX_ENABLE_CUDF)
set(VELOX_ENABLE_ARROW ON)
Collaborator


cuDF itself does not need Arrow (cuDF uses nanoarrow), but the Velox-cuDF interop requires Arrow functionality in Velox.

Comment thread scripts/setup-centos9.sh
dnf_install autoconf automake python3-devel pip libtool

-pip install cmake==3.28.3
+pip install cmake==3.30.4
Collaborator


cuDF and its dependencies require CMake 3.30.4. That CMake version shipped with a fix for finding some CUDA Toolkit components that cuDF and its dependencies use.

Collaborator


Could you add a cmake_minimum_required to the top of cudf.cmake with a comment?

@facebook-github-bot
Contributor

Hi @devavret!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

Comment thread velox/experimental/cudf/exec/CudfConversion.cpp Outdated
Comment thread velox/experimental/cudf/exec/CudfConversion.cpp Outdated
@facebook-github-bot
Contributor

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 20, 2025
Comment thread velox/experimental/cudf/exec/CudfConversion.cpp Outdated
Comment thread velox/experimental/cudf/exec/CudfConversion.cpp Outdated
Comment thread velox/experimental/cudf/exec/CudfOrderBy.cpp Outdated

DECLARE_bool(velox_cudf_enabled);
DECLARE_string(velox_cudf_memory_resource);

Collaborator


Can you also declare `velox_cudf_debug` here? I also need to set the flag.

Collaborator Author


added in 5575205

return nullptr;
}
finished_ = noMoreInput_;
return outputTable_;
Collaborator


The output table might be a very big table, containing all the input data, which is not allowed.

@jinchengchenghh
Collaborator

Fails in the PlanNode destructor:

(gdb) bt 
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007effa62bbe73 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
#2  0x00007effa626eb46 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007effa6258833 in __GI_abort () at abort.c:79
#4  0x00007eff20678b21 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#5  0x00007efedde3c24d in folly::exception_tracer::(anonymous namespace)::terminateHandler ()
    at /code/gluten/ep/build-velox/build/velox_ep/deps-download/folly/folly/debugging/exception_tracer/ExceptionTracer.cpp:226
#6  0x00007eff2068453c in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
#7  0x00007eff20683509 in __cxa_call_terminate (ue_header=0x7effa02e3120) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#8  0x00007eff20683c8a in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=6, exception_class=5138137972254386944, ue_header=0x7effa02e3120, context=<optimized out>)
    at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:685
#9  0x00007eff2de142d4 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x7effa02e3120, context=context@entry=0x7fffaab195c0, frames_p=frames_p@entry=0x7fffaab196b0) at ../../../libgcc/unwind.inc:64
#10 0x00007eff2de14971 in _Unwind_RaiseException (exc=0x7effa02e3120) at ../../../libgcc/unwind.inc:136
#11 0x00007eff206847fc in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7efedff712d8, dest=0x7efedf1f4b5e <gluten::GlutenException::~GlutenException()>)
    at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:90
#12 0x00007efedf29c18f in attachCurrentThreadAsDaemonOrThrow (vm=0x7effa61cc1c0 <main_vm>, out=0x7fffaab19948) at /code/gluten/cpp/core/jni/JniCommon.h:115
#13 0x00007efedf29c969 in gluten::JniColumnarBatchIterator::~JniColumnarBatchIterator (this=0x5633d6fc5f60, __in_chrg=<optimized out>) at /code/gluten/cpp/core/jni/JniCommon.cc:98
#14 0x00007efedf29c9de in gluten::JniColumnarBatchIterator::~JniColumnarBatchIterator (this=0x5633d6fc5f60, __in_chrg=<optimized out>) at /code/gluten/cpp/core/jni/JniCommon.cc:102
#15 0x00007efedf241028 in std::default_delete<gluten::ColumnarBatchIterator>::operator() (this=0x5633d6e985f0, __ptr=0x5633d6fc5f60) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/unique_ptr.h:95
#16 0x00007efedf23db70 in std::unique_ptr<gluten::ColumnarBatchIterator, std::default_delete<gluten::ColumnarBatchIterator> >::~unique_ptr (this=0x5633d6e985f0, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/unique_ptr.h:396
#17 0x00007efedf24d998 in gluten::ResultIterator::~ResultIterator (this=0x5633d6e985f0, __in_chrg=<optimized out>) at /code/gluten/cpp/core/compute/ResultIterator.h:30
#18 0x00007efedf24d9b3 in std::_Destroy<gluten::ResultIterator> (__pointer=0x5633d6e985f0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
#19 0x00007efedf24d7f6 in std::allocator_traits<std::allocator<void> >::destroy<gluten::ResultIterator> (__p=0x5633d6e985f0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:648
#20 0x00007efedf24d45b in std::_Sp_counted_ptr_inplace<gluten::ResultIterator, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5633d6e985e0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:613
#21 0x00007efec7443381 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5633d6e985e0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:346
#22 0x00007efec7449765 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5633d6cfe300, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#23 0x00007efec7470438 in std::__shared_ptr<gluten::ResultIterator, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5633d6cfe2f8, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#24 0x00007efec7470454 in std::shared_ptr<gluten::ResultIterator>::~shared_ptr (this=0x5633d6cfe2f8, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#25 0x00007efec788fcc0 in gluten::ValueStreamNode::~ValueStreamNode (this=0x5633d6cfe2c0, __in_chrg=<optimized out>) at /code/gluten/cpp/velox/operators/plannodes/RowVectorStream.h:111
#26 0x00007efec7892beb in std::_Destroy<gluten::ValueStreamNode> (__pointer=0x5633d6cfe2c0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
#27 0x00007efec7892554 in std::allocator_traits<std::allocator<void> >::destroy<gluten::ValueStreamNode> (__p=0x5633d6cfe2c0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:648
#28 0x00007efec7890229 in std::_Sp_counted_ptr_inplace<gluten::ValueStreamNode, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5633d6cfe2b0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:613
#29 0x00007efec7443381 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5633d6cfe2b0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:346
#30 0x00007efec7449765 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5633d6cfdcb8, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#31 0x00007efec746f33c in std::__shared_ptr<facebook::velox::core::PlanNode const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5633d6cfdcb0, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#32 0x00007efec746f358 in std::shared_ptr<facebook::velox::core::PlanNode const>::~shared_ptr (this=0x5633d6cfdcb0, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#33 0x00007efec74aff4f in std::_Destroy<std::shared_ptr<facebook::velox::core::PlanNode const> > (__pointer=0x5633d6cfdcb0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
#34 0x00007efec74ab512 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<facebook::velox::core::PlanNode const>*> (__first=0x5633d6cfdcb0, __last=0x5633d6cfdcc0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:163
#35 0x00007efec74a6dbe in std::_Destroy<std::shared_ptr<facebook::velox::core::PlanNode const>*> (__first=0x5633d6cfdcb0, __last=0x5633d6cfdcc0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:196
#36 0x00007efec74a1bb9 in std::_Destroy<std::shared_ptr<facebook::velox::core::PlanNode const>*, std::shared_ptr<facebook::velox::core::PlanNode const> > (__first=0x5633d6cfdcb0, __last=0x5633d6cfdcc0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:850
#37 0x00007efec749d8fb in std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > >::~vector (this=0x5633d6fc5ef0, 
    __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_vector.h:730
#38 0x00007efec7a34f6a in facebook::velox::core::OrderByNode::~OrderByNode (this=0x5633d6fc5e90, __in_chrg=<optimized out>) at /code/gluten/ep/build-velox/build/velox_ep/./velox/core/PlanNode.h:2002
#39 0x00007efec7892c83 in std::_Destroy<facebook::velox::core::OrderByNode> (__pointer=0x5633d6fc5e90) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
#40 0x00007efec789263c in std::allocator_traits<std::allocator<void> >::destroy<facebook::velox::core::OrderByNode> (__p=0x5633d6fc5e90) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:648
#41 0x00007efec78907c1 in std::_Sp_counted_ptr_inplace<facebook::velox::core::OrderByNode, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x5633d6fc5e80)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:613
#42 0x00007efec7443381 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5633d6fc5e80) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:346
#43 0x00007efec7449765 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x5633d6cf60a8, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#44 0x00007efec746f33c in std::__shared_ptr<facebook::velox::core::PlanNode const, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x5633d6cf60a0, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#45 0x00007efec746f358 in std::shared_ptr<facebook::velox::core::PlanNode const>::~shared_ptr (this=0x5633d6cf60a0, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#46 0x00007efec74aff4f in std::_Destroy<std::shared_ptr<facebook::velox::core::PlanNode const> > (__pointer=0x5633d6cf60a0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
--Type <RET> for more, q to quit, c to continue without paging--
#47 0x00007efec74ab512 in std::_Destroy_aux<false>::__destroy<std::shared_ptr<facebook::velox::core::PlanNode const>*> (__first=0x5633d6cf60a0, __last=0x5633d6cf60b0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:163
#48 0x00007efec74a6dbe in std::_Destroy<std::shared_ptr<facebook::velox::core::PlanNode const>*> (__first=0x5633d6cf6090, __last=0x5633d6cf60b0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:196
#49 0x00007efec74a1bb9 in std::_Destroy<std::shared_ptr<facebook::velox::core::PlanNode const>*, std::shared_ptr<facebook::velox::core::PlanNode const> > (__first=0x5633d6cf6090, __last=0x5633d6cf60b0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:850
#50 0x00007efec749d8fb in std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > >::~vector (this=0x7effa390c250, 
    __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_vector.h:730
#51 0x00007efecf2ae92d in std::_Destroy<std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > > > (__pointer=0x7effa390c250)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
#52 0x00007efecf2ae912 in std::allocator_traits<std::allocator<void> >::destroy<std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > > > (__p=0x7effa390c250) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:648
#53 0x00007efecf2ae807 in std::_Sp_counted_ptr_inplace<std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > >, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x7effa390c240) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:613
#54 0x00007efec7443381 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7effa390c240) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:346
#55 0x00007efec7449765 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7effa390d998, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1071
#56 0x00007efecf2a750a in std::__shared_ptr<std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > >, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7effa390d990, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr_base.h:1524
#57 0x00007efecf2a7526 in std::shared_ptr<std::vector<std::shared_ptr<facebook::velox::core::PlanNode const>, std::allocator<std::shared_ptr<facebook::velox::core::PlanNode const> > > >::~shared_ptr (
    this=0x7effa390d990, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/shared_ptr.h:175
#58 0x00007efecf2a7682 in facebook::velox::cudf_velox::cudfDriverAdapter::~cudfDriverAdapter (this=0x7effa390d980, __in_chrg=<optimized out>)
    at /code/gluten/ep/build-velox/build/velox_ep/velox/experimental/cudf/exec/ToCudf.cpp:264
#59 0x00007efecf2acf3c in std::_Function_base::_Base_manager<facebook::velox::cudf_velox::cudfDriverAdapter>::_M_destroy (__victim=...) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/std_function.h:175
#60 0x00007efecf2ac098 in std::_Function_base::_Base_manager<facebook::velox::cudf_velox::cudfDriverAdapter>::_M_manager (__dest=..., __source=..., __op=std::__destroy_functor)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/std_function.h:203
#61 0x00007efecf2aa73a in std::_Function_handler<void (facebook::velox::core::PlanFragment const&), facebook::velox::cudf_velox::cudfDriverAdapter>::_M_manager(std::_Any_data&, std::_Any_data const&, std::_Manager_operation) (__dest=..., __source=..., __op=std::__destroy_functor) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/std_function.h:282
#62 0x00007efec7443b7b in std::_Function_base::~_Function_base (this=0x7effa38ec590, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/std_function.h:244
#63 0x00007efecea8039a in std::function<void (facebook::velox::core::PlanFragment const&)>::~function() (this=0x7effa38ec590, __in_chrg=<optimized out>)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/std_function.h:334
#64 0x00007efecea82728 in facebook::velox::exec::DriverAdapter::~DriverAdapter (this=0x7effa38ec570, __in_chrg=<optimized out>) at /code/gluten/ep/build-velox/build/velox_ep/./velox/exec/Driver.h:611
#65 0x00007efecea852a7 in std::_Destroy<facebook::velox::exec::DriverAdapter> (__pointer=0x7effa38ec570) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:151
#66 0x00007efecea8457b in std::_Destroy_aux<false>::__destroy<facebook::velox::exec::DriverAdapter*> (__first=0x7effa38ec570, __last=0x7effa38ec5d0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:163
#67 0x00007efecea8277a in std::_Destroy<facebook::velox::exec::DriverAdapter*> (__first=0x7effa38ec570, __last=0x7effa38ec5d0) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_construct.h:196
#68 0x00007efecea80635 in std::_Destroy<facebook::velox::exec::DriverAdapter*, facebook::velox::exec::DriverAdapter> (__first=0x7effa38ec570, __last=0x7effa38ec5d0)
    at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/alloc_traits.h:850
#69 0x00007efecea856a7 in std::vector<facebook::velox::exec::DriverAdapter, std::allocator<facebook::velox::exec::DriverAdapter> >::~vector (
    this=0x7efed7ffba10 <facebook::velox::exec::DriverFactory::adapters>, __in_chrg=<optimized out>) at /opt/rh/gcc-toolset-12/root/usr/include/c++/12/bits/stl_vector.h:730
#70 0x00007effa62712dd in __run_exit_handlers (status=0, listp=0x7effa6429838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:126
#71 0x00007effa6271430 in __GI_exit (status=<optimized out>) at exit.c:156
#72 0x00007effa62595d7 in __libc_start_call_main (main=main@entry=0x5633b3200720, argc=argc@entry=68, argv=argv@entry=0x7fffaab1a408) at ../sysdeps/nptl/libc_start_call_main.h:74
#73 0x00007effa6259680 in __libc_start_main_impl (main=0x5633b3200720, argc=68, argv=0x7fffaab1a408, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffaab1a3f8)
    at ../csu/libc-start.c:389

bool operator()(const exec::DriverFactory& factory, exec::Driver& driver) {
auto state = CompileState(factory, driver, *planNodes_);
// Stored planNodes_ from inspect.
auto res = state.compile();
Collaborator


Calling `planNodes_->clear();` here can solve this issue: #12735 (comment)

Collaborator Author


I don't believe this is quite right. In plans involving multiple pipelines, compile is called multiple times, once for each pipeline, so we need these stored plan nodes in all subsequent calls. Would adding a separate clear() method to the cuDF driver adapter help? If you know when the task is finished, you can manually clear out the adapter.

Collaborator


Do you mean the Task parallel execution mode? I read the Wave DriverAdapter (https://github.com/facebookincubator/velox/blob/main/velox/experimental/wave/exec/ToWave.cpp#L251), and it does not need to store the plan nodes. Is there any difference?

Collaborator


Gluten uses the single-thread task mode, so it's OK to clear it here. And if we have multiple drivers, can we clear the planNodes after createAndStartDrivers in Task::start?

Is there any difference between the planNodes stored in DriverFactory and CompileState?

struct DriverFactory {
  std::vector<std::shared_ptr<const core::PlanNode>> planNodes;

Collaborator


The adapters are saved in a static area; if we call clear after the task is finished, we don't know which driver adapter is attached to the task. Gluten has parallel tasks on a single machine.

Collaborator Author


Do you mean the Task parallel execution mode?

I meant plans that, say, have a join in them. Those would need to be run through compile at least twice, once for each branch.

Collaborator Author

@devavret devavret Apr 7, 2025


The adapters are saved in a static area; if we call clear after the task is finished, we don't know which driver adapter is attached to the task. Gluten has parallel tasks on a single machine.

You wouldn't need to wait until a task is finished. You only need to wait until operator replacements have been made in it.

To be clear, I am not standing my ground here. I just think we need a different solution to the one suggested.

Collaborator Author


Is there any difference between the planNodes stored in DriverFactory and CompileState?

Yes, the CompileState stores planNodes from the whole task while planNodes stored in DriverFactory are only from the current pipeline. The latter sometimes excludes planNodes which contribute operators to multiple pipelines like partition and hash join.

But I observed now that if the required planNode cannot be found in DriverFactory::planNodes then it's usually found in DriverFactory::consumerNode. I've made this change in 5575205 and it seems to work fine both for this PR and for our internal fork with all tpch operators replaced.

@karthikeyann since you originally wrote this piece, can you verify this commit?

Collaborator


It looks good. I will verify this with all TPC-H queries (with partition and hash join).

Collaborator


I verified with the TPC-H benchmarks. It works.

tableViews, stream, cudf::get_current_device_resource_ref());
}

std::unique_ptr<cudf::table> getConcatenatedTable(
Collaborator


What if the number of concatenated table rows is beyond the range? We have a case where the accumulated rows in OrderBy exceed the vector_size_t range: #10848

Collaborator

@karthikeyann karthikeyann Apr 7, 2025


@jinchengchenghh how does Velox handle inputs beyond the vector_size_t range (all inputs together) for OrderBy?

Collaborator


Velox stores the input in a RowContainer and stores the sorted row pointers in a std::vector, so I assume the number of input rows it can sort is bounded by size_t. It extracts the output rows with the vector_size_t data type, producing several batches whose sizes satisfy the output-rows config.

@devavret
Collaborator Author

devavret commented Apr 8, 2025

@jinchengchenghh Unfortunately, my attempts at simple fixes for the decimal issue have been unsuccessful. I'm going to have to postpone decimal support to a follow-up PR. The output size limit problem is also on the list of work for the follow-up rather than here.

The planNode lifetime issue should be fixed but I have no way of verifying that.

Can you take another look to see if there are any showstopping bugs in this PR?

devavret and others added 3 commits April 8, 2025 20:01
@GregoryKimball
Collaborator

@Yuhta would you please share your review for this work? Are there CI or other blockers at this stage? (+ @pedroerp)

@Yuhta Yuhta added the ready-to-merge PR that have been reviewed and are ready for merging. PRs with this tag notify the Velox Meta oncall label Apr 14, 2025
@facebook-github-bot
Contributor

@pedroerp has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@pedroerp merged this pull request in 0cba113.

@jinchengchenghh
Collaborator

Thanks, I will verify it. @devavret

zhanglistar pushed a commit to bigo-sg/velox that referenced this pull request Apr 22, 2025
Summary:
This PR adds a cuDF based OrderBy operator and tooling to replace existing
Velox based operators. This includes:
- CudfVector class that holds a cudf::table and is a replacement for Velox's
  RowVector when dealing with cuDF.
- Interop code to convert between Velox and cuDF RowVectors.
- CudfToVelox and CudfFromVelox operators that sit between the cuDF and Velox
  operators and handle the conversion of RowVectors to cudf::table and back.
- A cuDF driver adapter that converts Velox operators to cuDF operators.
- Nvtx tooling to help with profiling

Pull Request resolved: facebookincubator#12735

Reviewed By: Yuhta

Differential Revision: D73003714

Pulled By: pedroerp

fbshipit-source-id: 5ac1e3db2d3754528802f51ded42b43e7250f191
@jinchengchenghh
Collaborator

Thanks for your active updates. I have verified in Gluten; all the issues are resolved. @devavret

Operators before adapting for cuDF: count [2]
  Operator: ID 0: ValueStream[0] 0
  Operator: ID 1: OrderBy[1] 1
Operators after adapting for cuDF: count [4]
  Operator: ID 0: ValueStream[0] 0
  Operator: ID 1: CudfFromVelox[1-from-velox] 1
  Operator: ID 2: CudfOrderBy[1] 2
  Operator: ID 3: CudfToVelox[1-to-velox] 3

