Use KvikIO's unified interface to create remote I/O endpoints #19788
| static std::regex const pattern{R"(^s3://)", std::regex_constants::icase}; | ||
| return std::regex_search(url, pattern); | ||
| // Regular expression to match the URL scheme conforming to RFC 3986 | ||
| static std::regex const pattern{R"(^[a-zA-Z][a-zA-Z0-9+.-]*://)", std::regex_constants::icase}; |
What's the behavior today with file:/// URLs, which typically mean an absolute path on a user's machine (not sure if that's a convention or a spec)? Would that hit this code path, and is it OK to pass that through as a "maybe remote URL"?
That's a good question to consider. I checked the relevant information. The file URI is specified by RFC 8089, and is indeed supported by libcurl. KvikIO does not support this scheme. So when a file URI is passed to libcudf in this PR, it will be considered a remote source, and KvikIO will reject it with Unsupported endpoint URL.
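To make the point above concrete, here is a minimal sketch of the scheme check using the same regular expression as the diff. The function name `looks_like_remote_url` is hypothetical and chosen only for illustration; it is not claimed to be the name used in cuDF.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustrative name) that applies the pattern from the diff.
// It only checks whether the string starts with "<scheme>://"; it does not check
// whether KvikIO actually supports that scheme.
bool looks_like_remote_url(std::string const& url)
{
  // RFC 3986 scheme: a letter, then letters, digits, '+', '.', or '-', followed by "://"
  static std::regex const pattern{R"(^[a-zA-Z][a-zA-Z0-9+.-]*://)", std::regex_constants::icase};
  return std::regex_search(url, pattern);
}
```

With this check, `"s3://bucket/key"` and `"file:///tmp/data.bin"` both match, while a plain path such as `"/tmp/data.bin"` does not. A file URI is therefore classified as "maybe remote", and rejecting it is left to the later stage.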
/ok to test c338a74
cpp/src/io/utilities/datasource.cpp
Outdated
| try { | ||
| return std::make_unique<remote_file_source>(filepath.c_str()); | ||
| } catch (std::exception const& ex) { | ||
| CUDF_FAIL("Error accessing the remote file \"" + filepath + "\". Reason: " + ex.what(), |
We avoid including user-provided file paths in logs/errors for security reasons. Might be okay to just skip it here.
Done. File path removed from the message.
This would also require KvikIO not to attach the URL to its exception message. I have to check whether that is the case. Perhaps CUDF_FAIL should not be used here.
A new change has been made that explicitly redacts the remote file path from the exception message on the cuDF side.
Sample output for the URL "xxx://path.bin" (temporarily modified KvikIO so that exception message includes the remote file path):
CUDF failure at:/home/coder/cudf/cpp/src/io/utilities/datasource.cpp:435: Error accessing the remote file.
Reason: KvikIO failure at: /home/coder/cudf/cpp/build/pip/cuda-12.9/release/_deps/kvikio-src/cpp/src/remote_handle.cpp:606: Unsupported endpoint URL.<redacted-remote-file-path>
/ok to test 2e84038
@kingcrimsontianyu, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test 2f1b12e
| redacted_msg = | ||
| std::regex_replace(ex.what(), std::regex{filepath}, "<redacted-remote-file-path>"); |
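One thing to keep in mind with the snippet above: constructing `std::regex` from the file path treats the path as a pattern, so characters such as `.`, `?`, or `+` in a URL act as regex metacharacters. The following is a minimal sketch of a literal (non-regex) replacement. The helper `redact_literal` is hypothetical and is not the code merged in this PR.

```cpp
#include <string>

// Hypothetical helper (not part of cuDF): replaces every literal occurrence of
// `secret` in `msg` with `placeholder`, without interpreting regex metacharacters.
std::string redact_literal(std::string msg,
                           std::string const& secret,
                           std::string const& placeholder = "<redacted-remote-file-path>")
{
  if (secret.empty()) { return msg; }  // nothing to redact; also avoids an endless loop
  std::size_t pos = 0;
  while ((pos = msg.find(secret, pos)) != std::string::npos) {
    msg.replace(pos, secret.size(), placeholder);
    pos += placeholder.size();  // resume the search after the inserted placeholder
  }
  return msg;
}
```

Called with the message `"Unsupported endpoint URL.xxx://path.bin"` and the secret `"xxx://path.bin"`, this yields `"Unsupported endpoint URL.<redacted-remote-file-path>"`, matching the sample output shown earlier in the thread.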
/ok to test c1c2d79
/ok to test 4fe15e3
The branch is on top of another branch (now merged) with unverified commits. So CI doesn't start automatically :/
/merge
b851bc3
into
rapidsai:branch-25.10
- Closes #19633
- Depends on #19788

Authors:
- Matthew Murray (https://github.com/Matt711)
- Tianyu Liu (https://github.com/kingcrimsontianyu)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Tom Augspurger (https://github.com/TomAugspurger)
- James Lamb (https://github.com/jameslamb)

URL: #19921
Description
With rapidsai/kvikio#793, KvikIO can infer the endpoint type from the URL, supporting creation of a broader range of remote resources via a single interface. This PR updates the cuDF data source accordingly. Specifically,
pylibcudf can now read from WebHDFS, S3, and S3 presigned URL resources.
Partially addresses #19633