[native] Enable the use of the cuDF parquet reader#25899

Closed
devavret wants to merge 1 commit into prestodb:master from devavret:cudf-scans

@devavret devavret commented Aug 27, 2025

Description

Supersedes #25376

This PR lets the native worker use the cuDF Parquet reader. It adds a CudfHiveConnector whose data source produces CudfVector batches when cuDF integration is enabled in Velox.
CudfHiveConnector extends HiveConnector, which allows fallback to the CPU for any unsupported features.
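
The fallback-by-inheritance idea can be sketched roughly as follows. This is only an illustration: every type here (`SplitInfo`, `DataSource`, `HiveConnectorSketch`, `CudfHiveConnectorSketch`) is a hypothetical stand-in, not the Velox or cuDF API, and the "only Parquet is GPU-readable" check is an assumption made for the example, since the PR does not list which features are unsupported.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration; these are NOT Velox or cuDF types.
struct SplitInfo {
  std::string fileFormat;  // e.g. "parquet" or "orc"
};

struct DataSource {
  virtual ~DataSource() = default;
  // Names the kind of batch this source would hand to downstream operators.
  virtual std::string producedBatchType() const = 0;
};

struct CpuDataSource : DataSource {
  std::string producedBatchType() const override { return "RowVector"; }
};

struct GpuDataSource : DataSource {
  std::string producedBatchType() const override { return "CudfVector"; }
};

// Base connector: always reads on the CPU.
class HiveConnectorSketch {
 public:
  virtual ~HiveConnectorSketch() = default;
  virtual std::unique_ptr<DataSource> createDataSource(
      const SplitInfo& /*split*/) const {
    return std::make_unique<CpuDataSource>();
  }
};

// Derived connector: takes the GPU path when it can, otherwise defers to the
// base class so that unsupported cases still run on the CPU.
class CudfHiveConnectorSketch : public HiveConnectorSketch {
 public:
  std::unique_ptr<DataSource> createDataSource(
      const SplitInfo& split) const override {
    // Assumption for this sketch: only Parquet splits are GPU-readable.
    if (split.fileFormat == "parquet") {
      return std::make_unique<GpuDataSource>();
    }
    return HiveConnectorSketch::createDataSource(split);
  }
};
```

With this shape, a Parquet split would yield a source reporting `CudfVector`, and any other format would go through the inherited CPU implementation unchanged.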

Motivation and Context

By default, the Velox readers produce a RowVector, which must be both converted and copied into a CudfVector. This conversion can be very expensive, and we'd like to avoid it. By using the GPU-side Parquet code in cuDF together with GPUDirect RDMA, we can skip the CPU path entirely.

Impact

No impact when Presto is compiled with the CMake flag PRESTO_ENABLE_CUDF=OFF.
When this flag is turned on, we register a CudfHiveConnectorFactory instead of HiveConnectorFactory. This CudfHiveConnectorFactory creates a CudfVector-producing data source only if facebook::velox::cudf_velox::cudfIsRegistered() returns true.
Otherwise, it defaults to the HiveConnector behavior.
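
The two-level gating described above (build flag first, runtime check second) could look something like this sketch. `cudfIsRegisteredSketch()` and `cudfRegisteredFlag()` are hypothetical stand-ins for `facebook::velox::cudf_velox::cudfIsRegistered()`, and the `builtWithCudf` parameter models the `PRESTO_ENABLE_CUDF` CMake flag as a plain boolean so the logic is easy to follow; none of these names come from the PR.

```cpp
// Hypothetical stand-in for the runtime registration state that
// facebook::velox::cudf_velox::cudfIsRegistered() reports. The real
// function's implementation is not shown in this PR.
inline bool& cudfRegisteredFlag() {
  static bool registered = false;
  return registered;
}

inline bool cudfIsRegisteredSketch() {
  return cudfRegisteredFlag();
}

enum class ScanPath { kCpuRowVector, kGpuCudfVector };

// builtWithCudf models the PRESTO_ENABLE_CUDF build flag (an assumption made
// so the decision can be shown as ordinary code rather than preprocessor logic).
inline ScanPath chooseScanPath(bool builtWithCudf) {
  if (!builtWithCudf) {
    // Flag OFF: behavior is identical to the plain HiveConnector.
    return ScanPath::kCpuRowVector;
  }
  // Flag ON: the GPU data source is used only if cuDF is registered at runtime.
  return cudfIsRegisteredSketch() ? ScanPath::kGpuCudfVector
                                  : ScanPath::kCpuRowVector;
}
```

The point of the sketch is that a cuDF-enabled build still takes the CPU path unless the runtime check passes.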

Test Plan

I've run TPC-H queries both with and without the cuDF table scan enabled.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Improve GPU TableScan performance: if Presto C++ is built with cuDF integration and cuDF operator replacement is enabled, Hive splits can be read by cuDF's parquet library instead of Velox's parquet library


linux-foundation-easycla bot commented Aug 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: karthikeyann / name: Karthikeyan (4479b24)

@jaystarshot
Member

Are there any published benchmarks comparing GPU-based scans with scans on CPU-based machines?

@karthikeyann
Contributor

Posting a local patch to use until a proper fix lands in Velox. It's useful for anyone trying to reproduce Presto GPU runs.

diff --git a/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt b/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
index 45cba0ced7..6132746932 100644
--- a/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
+++ b/presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt
@@ -12,6 +12,8 @@
 add_library(presto_connectors Registration.cpp PrestoToVeloxConnector.cpp
                               SystemConnector.cpp)
 
+target_compile_definitions(presto_connectors PUBLIC VELOX_ENABLE_BACKWARD_COMPATIBILITY)
+
 if(PRESTO_ENABLE_ARROW_FLIGHT_CONNECTOR)
   add_subdirectory(arrow_flight)
   target_compile_definitions(presto_connectors

@karthikeyann
Contributor

karthikeyann commented Sep 11, 2025

My coordinator crashes on the first request. @devavret, do you have any local fixes besides what is in this PR?

docker logs presto-coordinator output

2025-09-11T06:48:32.898Z	ERROR	main	com.facebook.presto.server.PrestoServer	Function already registered: presto.default.array_split_into_chunks<T>(array(T),integer):array(array(T))
java.lang.IllegalArgumentException: Function already registered: presto.default.array_split_into_chunks<T>(array(T),integer):array(array(T))

@GregoryKimball

Discussion: let's please add the additional configs for cuDF TableScan to the Presto config registry


if(PRESTO_ENABLE_CUDF)
  target_link_libraries(presto_connectors velox_cudf_hive_connector
                        cudf::cudf)
Collaborator

Is it required for cudf::cudf to be explicit here? Should we be adding it to velox_cudf_hive_connector instead?

void registerConnectorFactories() {
  // These checks for connector factories can be removed after we remove the
  // registrations from the Velox library.
#ifdef PRESTO_ENABLE_CUDF
Collaborator

We can likely push this #ifdef inside the if check and reduce some code. Please add a brief comment stating that CudfHiveConnectorFactory has been extended from the HiveConnectorFactory, and any unsupported feature will fall back to it for execution on the CPU.
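
One possible shape of this suggestion, as a rough sketch only: the existence check is written once, and the `#ifdef` selects just the factory type inside it. The registry and factory types here (`factoryRegistry`, `ConnectorFactorySketch`, `CudfConnectorFactorySketch`) are hypothetical stand-ins, and the assumption that the existing code guards registration with an "is it already registered" check comes from the code comment quoted above, not from the full source.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for the connector factory types and the registry.
struct ConnectorFactorySketch {
  virtual ~ConnectorFactorySketch() = default;
  virtual std::string kind() const { return "hive-cpu"; }
};

struct CudfConnectorFactorySketch : ConnectorFactorySketch {
  std::string kind() const override { return "hive-cudf"; }
};

inline std::map<std::string, std::shared_ptr<ConnectorFactorySketch>>&
factoryRegistry() {
  static std::map<std::string, std::shared_ptr<ConnectorFactorySketch>> registry;
  return registry;
}

inline void registerHiveFactorySketch() {
  const std::string name = "hive";
  // Single existence check shared by both build configurations.
  if (factoryRegistry().count(name) == 0) {
#ifdef PRESTO_ENABLE_CUDF
    // The cuDF factory extends the CPU factory, so any unsupported feature
    // falls back to it for execution on the CPU.
    factoryRegistry()[name] = std::make_shared<CudfConnectorFactorySketch>();
#else
    factoryRegistry()[name] = std::make_shared<ConnectorFactorySketch>();
#endif
  }
}
```

Calling `registerHiveFactorySketch()` more than once leaves the first registration in place, which mirrors the guarded-registration pattern the comment refers to.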


add_library(presto_types PrestoToVeloxQueryPlan.cpp VeloxPlanValidator.cpp
                         PrestoToVeloxSplit.cpp)
target_compile_definitions(presto_types PUBLIC VELOX_ENABLE_BACKWARD_COMPATIBILITY)
Collaborator

Can this be removed?

@majetideepak majetideepak marked this pull request as ready for review September 22, 2025 09:44
@majetideepak majetideepak requested review from a team as code owners September 22, 2025 09:44
@sourcery-ai
Contributor

sourcery-ai bot commented Sep 22, 2025

Reviewer's Guide

This PR conditionally switches the Hive connector to a GPU-accelerated, cuDF-based implementation when built with cuDF support. It updates the connector registration logic and the CMake build files to include the necessary libraries and compile definitions.

File-Level Changes

Change: Conditional registration of the cuDF-based Hive connector
Details:
  • Add PRESTO_ENABLE_CUDF guard around connector registration
  • Include CudfHiveConnector header under the cuDF flag
  • Register CudfHiveConnectorFactory instead of default HiveConnectorFactory when enabled
Files: presto-native-execution/presto_cpp/main/connectors/Registration.cpp

Change: Link cuDF libraries for connectors when cuDF is enabled
Details:
  • Add PRESTO_ENABLE_CUDF block in connectors/CMakeLists.txt
  • Link velox_cudf_hive_connector and cudf::cudf under the cuDF flag
Files: presto-native-execution/presto_cpp/main/connectors/CMakeLists.txt

Change: Enable backward compatibility in the types module
Details:
  • Add VELOX_ENABLE_BACKWARD_COMPATIBILITY compile definition for presto_types
Files: presto-native-execution/presto_cpp/main/types/CMakeLists.txt

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes - here's some feedback:

  • Consider unifying the duplicated connector registration logic in Registration.cpp between the cuDF and non-cuDF paths to reduce code duplication.
  • Add a startup log statement indicating whether the CudfHiveConnector or the default HiveConnector was registered to simplify troubleshooting.
  • Document the rationale behind adding VELOX_ENABLE_BACKWARD_COMPATIBILITY in the presto_types CMakeLists, as it appears unrelated to the cuDF integration.

@majetideepak
Collaborator

@devavret can you also complete the CLA? Thanks.
#25899 (comment)

Co-authored-by: Devavret Makkar <dmakkar@nvidia.com>
@karthikeyann
Contributor

karthikeyann commented Sep 23, 2025

@devavret I am not able to push to the branch (permission denied). Please rebase and squash into a single commit (Presto requires this). Here are the commands that I used:

git fetch --all
git pull origin master --ff
git reset --soft origin/master
git commit -S -m "cudf scan using CudfHiveConnector, Add linking to cudf"
git push -f

@majetideepak
Collaborator

majetideepak commented Oct 7, 2025

The connector factory API has changed, and this code is no longer valid. I will add support for this in #26156.

5 participants