Skip to content

Comments

Use KvikIO's unified interface to create remote I/O endpoints#19788

Merged
rapids-bot[bot] merged 29 commits intorapidsai:branch-25.10from
kingcrimsontianyu:use-remote-io-easy-interface
Sep 9, 2025
Merged

Use KvikIO's unified interface to create remote I/O endpoints#19788
rapids-bot[bot] merged 29 commits intorapidsai:branch-25.10from
kingcrimsontianyu:use-remote-io-easy-interface

Conversation

@kingcrimsontianyu
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu commented Aug 25, 2025

Description

With rapidsai/kvikio#793, KvikIO can infer the endpoint type from the URL, supporting creation of a broader range of remote resources via a single interface. This PR updates the cuDF data source accordingly. Specifically, pylibcudf now can read from WebHDFS, S3, S3 presigned URL resources.

Partially addresses #19633

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@kingcrimsontianyu kingcrimsontianyu requested review from a team as code owners August 25, 2025 18:21
@kingcrimsontianyu kingcrimsontianyu added the libcudf Affects libcudf (C++/CUDA) code. label Aug 25, 2025
@kingcrimsontianyu kingcrimsontianyu added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 25, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 25, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu
Copy link
Contributor Author

kingcrimsontianyu commented Aug 25, 2025

static std::regex const pattern{R"(^s3://)", std::regex_constants::icase};
return std::regex_search(url, pattern);
// Regular expression to match the URL scheme conforming to RFC 3986
static std::regex const pattern{R"(^[a-zA-Z][a-zA-Z0-9+.-]*://)", std::regex_constants::icase};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the behavior today with file:/// URLs, which typically mean absolute path on a user's machine (not sure if that's a convention or a spec)? Would that hit this code path, and is it OK to pass that through as a "maybe remote url"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question to consider. I checked the relevant information. The file URI is specified by RFC 8089, and is indeed supported by libcurl. KvikIO does not support this scheme. So when a file URI is passed to libcudf in this PR, it will be considered a remote source, and KvikIO will reject it with Unsupported endpoint URL.

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test c338a74

try {
return std::make_unique<remote_file_source>(filepath.c_str());
} catch (std::exception const& ex) {
CUDF_FAIL("Error accessing the remote file \"" + filepath + "\". Reason: " + ex.what(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we avoid including the user-provided file paths in logs/errors for security reasons. Might be okay to just skip it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. File path removed from the message.

Copy link
Contributor Author

@kingcrimsontianyu kingcrimsontianyu Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also require KvikIO to not attach URL to its exception message. I have to check if this is being satisfied. Perhaps CUDF_FAIL should not be used here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New change has been made that explicitly redacts the remote file path on the cuDF side from the exception message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sample output for the URL "xxx://path.bin" (temporarily modified KvikIO so that exception message includes the remote file path):

CUDF failure at:/home/coder/cudf/cpp/src/io/utilities/datasource.cpp:435: Error accessing the remote file. 
Reason: KvikIO failure at: /home/coder/cudf/cpp/build/pip/cuda-12.9/release/_deps/kvikio-
src/cpp/src/remote_handle.cpp:606: Unsupported endpoint URL.<redacted-remote-file-path>

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 2e84038

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 5, 2025

/ok to test 2e84038

@kingcrimsontianyu, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 2f1b12e

Comment on lines +428 to +429
redacted_msg =
std::regex_replace(ex.what(), std::regex{filepath}, "<redacted-remote-file-path>");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow :D

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test c1c2d79

@Matt711 Matt711 removed their assignment Sep 8, 2025
@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 4fe15e3

@kingcrimsontianyu
Copy link
Contributor Author

The branch is on top of another branch (now merged) with unverified commits. So CI doesn't start automatically :/

@Matt711 Matt711 mentioned this pull request Sep 8, 2025
3 tasks
@kingcrimsontianyu
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit b851bc3 into rapidsai:branch-25.10 Sep 9, 2025
254 of 256 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

7 participants