[native] Enable the use of the cuDF parquet reader#25899
[native] Enable the use of the cuDF parquet reader#25899devavret wants to merge 1 commit intoprestodb:masterfrom
Conversation
|
|
|
Are there any published benchmarks around gpu usage in scans vs cpu based machines? |
|
Commenting a local patch until proper fix from Velox. It's useful for anyone trying to reproduce presto GPU runs. diff --git a/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt b/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
index 45cba0ced7..6132746932 100644
--- a/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
+++ b/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
@@ -12,6 +12,8 @@
add_library(presto_connectors Registration.cpp PrestoToVeloxConnector.cpp
SystemConnector.cpp)
+add_compile_definitions(presto_connectors PUBLIC VELOX_ENABLE_BACKWARD_COMPATIBILITY)
+
if(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR)
add_subdirectory(arrow_flight)
target_compile_definitions(presto_connectors |
|
My coordinator crashes on first request. @devavret Do you have any local fixes besides what is in this PR?
|
|
Discussion: let's please add the additional configs for cuDF TableScan to the Presto config registry |
|
|
||
| if(PRESTO_ENABLE_CUDF) | ||
| target_link_libraries(presto_connectors velox_cudf_hive_connector | ||
| cudf::cudf) |
There was a problem hiding this comment.
Is it required for cudf::cudf to be explicit here? Should we be adding it to velox_cudf_hive_connector instead?
| void registerConnectorFactories() { | ||
| // These checks for connector factories can be removed after we remove the | ||
| // registrations from the Velox library. | ||
| #ifdef PRESTO_ENABLE_CUDF |
There was a problem hiding this comment.
We can likely push this #ifdef inside the if check and reduce some code. Please add a brief comment stating that CudfHiveConnectorFactory has been extended from the HiveConnectorFactory, and any unsupported feature will fall back to it for execution on the CPU.
|
|
||
| add_library(presto_types PrestoToVeloxQueryPlan.cpp VeloxPlanValidator.cpp | ||
| PrestoToVeloxSplit.cpp) | ||
| target_compile_definitions(presto_types PUBLIC VELOX_ENABLE_BACKWARD_COMPATIBILITY) |
There was a problem hiding this comment.
Can this be removed?
Reviewer's GuideThis PR conditionally switches the Hive connector to a GPU-accelerated Cudf-based implementation when built with CUDF support, updating connector registration logic and CMake build files to include the necessary libraries and compile definitions. File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey there - I've reviewed your changes - here's some feedback:
- Consider unifying the duplicated connector registration logic in Registration.cpp between the cuDF and non-cuDF paths to reduce code duplication.
- Add a startup log statement indicating whether the CudfHiveConnector or the default HiveConnector was registered to simplify troubleshooting.
- Document the rationale behind adding VELOX_ENABLE_BACKWARD_COMPATIBILITY in the presto_types CMakeLists, as it appears unrelated to the cuDF integration.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider unifying the duplicated connector registration logic in Registration.cpp between the cuDF and non-cuDF paths to reduce code duplication.
- Add a startup log statement indicating whether the CudfHiveConnector or the default HiveConnector was registered to simplify troubleshooting.
- Document the rationale behind adding VELOX_ENABLE_BACKWARD_COMPATIBILITY in the presto_types CMakeLists, as it appears unrelated to the cuDF integration.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
@devavret can you also complete the CLA? Thanks. |
Co-authored-by: Devavret Makkar <dmakkar@nvidia.com>
|
@devavret I am not able to push to branch. permission denied. Please rebase and squash as single commit (presto requires this). Here are commands that I used, |
d0e98bf to
4479b24
Compare
|
The connector factory API changed and this code is not valid any more. I will add support for this here #26156 |
Description
Supersedes #25376
This PR lets the native worker use the cuDF parquet reader by using a
CudfHiveConnectorthat makes a datasource that producesCudfVectorbatches when cudf integration is enabled in velox.CudfHiveConnectorextends theHiveConnectorwhich allows fallback to the CPU for any unsupported features.Motivation and Context
By default the Velox readers produce a RowVector which needs to be both converted and copied into a CudfVector. This conversion can be very expensive and we'd like to avoid this. By leveraging the GPU side Parquet code in cuDF, plus leveraging GPUDirect RDMA, we can skip the CPU path entirely.
Impact
No impact when presto is compiled with cmake flag
PRESTO_ENABLE_CUDF=OFF.When this flag is turned on, we register a
CudfHiveConnectorFactoryinstead ofHiveConnectorFactory. ThisCudfHiveConnectorFactoryonly createsCudfVectorproducing datasource iffacebook::velox::cudf_velox::cudfIsRegistered()returns true.It defaults to the HiveConnector behavior otherwise.
Test Plan
I've run TPC-H queries both with, and without the cuDF tablescan enabled.
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.