Add a unified remote I/O interface that infers the endpoint type from URL (1/2): C++ implementation #793

Merged
rapids-bot[bot] merged 62 commits into rapidsai:branch-25.10 from kingcrimsontianyu:remote-io-easy-interface
Aug 25, 2025

Contributor

@kingcrimsontianyu kingcrimsontianyu commented Aug 7, 2025

This PR adds a new remote I/O utility function RemoteHandle::open(url) that infers the remote endpoint type from the URL to facilitate RemoteHandle creation.

  • Supported endpoint types include S3, S3 with presigned URL, WebHDFS, and generic HTTP/HTTPS.
  • Optionally, instead of letting open infer the endpoint type, users can specify it explicitly by passing a RemoteEndpointType enum argument.
  • Optionally, users can provide an allowlist that restricts the endpoint candidates.
  • Optionally, users can specify the expected file size. This keeps open fully compatible with the existing constructor overload RemoteHandle(endpoint, nbytes).
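
To make the inference idea concrete, here is a minimal standalone sketch of picking an endpoint type from the URL text, with an optional allowlist that restricts the candidates. It is not the code in this PR: EndpointKind, matches, and infer_endpoint_kind are hypothetical names, and the matching rules (an s3:// scheme, an X-Amz-Signature query parameter for presigned URLs, a /webhdfs/v1/ path prefix for WebHDFS) are simplifying assumptions based on the public conventions of those services.

```cpp
#include <algorithm>
#include <array>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical names for illustration only; not KvikIO's actual API.
enum class EndpointKind { S3, S3Presigned, WebHdfs, Http };

// Case-sensitive prefix test (std::string_view::starts_with needs C++20).
inline bool has_prefix(std::string_view s, std::string_view prefix)
{
  return s.substr(0, prefix.size()) == prefix;
}

inline bool is_http(std::string_view url)
{
  return has_prefix(url, "http://") || has_prefix(url, "https://");
}

// Assumed, simplified rule for each candidate endpoint kind.
inline bool matches(EndpointKind kind, std::string_view url)
{
  auto const query_pos = url.find('?');
  auto const before_query = url.substr(0, query_pos);
  auto const query =
    query_pos == std::string_view::npos ? std::string_view{} : url.substr(query_pos + 1);

  switch (kind) {
    case EndpointKind::S3: return has_prefix(url, "s3://");
    case EndpointKind::S3Presigned:
      return is_http(url) && query.find("X-Amz-Signature=") != std::string_view::npos;
    case EndpointKind::WebHdfs:
      return is_http(url) && before_query.find("/webhdfs/v1/") != std::string_view::npos;
    case EndpointKind::Http: return is_http(url);
  }
  return false;
}

// Try candidates from most to least specific; an empty allowlist allows all.
inline std::optional<EndpointKind> infer_endpoint_kind(
  std::string_view url, std::vector<EndpointKind> const& allowlist = {})
{
  static constexpr std::array<EndpointKind, 4> candidates{
    EndpointKind::S3, EndpointKind::S3Presigned, EndpointKind::WebHdfs, EndpointKind::Http};

  for (auto const kind : candidates) {
    bool const allowed = allowlist.empty() ||
                         std::find(allowlist.begin(), allowlist.end(), kind) != allowlist.end();
    if (allowed && matches(kind, url)) { return kind; }
  }
  return std::nullopt;  // No allowed candidate accepted the URL.
}
```

In this sketch the generic HTTP candidate is tried last because it accepts any http(s) URL. If the allowlist contains only the generic HTTP kind, a presigned URL falls through to it instead of being classified as S3 presigned.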

A byproduct of this PR is an internal utility class UrlParser that uses the idiomatic libcurl URL API to validate the URL against "RFC 3986 plus".

This PR depends on

@kingcrimsontianyu kingcrimsontianyu added improvement Improves an existing functionality non-breaking Introduces a non-breaking change c++ Affects the C++ API of KvikIO labels Aug 7, 2025

copy-pr-bot bot commented Aug 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu kingcrimsontianyu changed the title Add a unified remote I/O interface that infers the endpoint type from URL Add a unified remote I/O interface that infers the endpoint type from URL (1/2): C++ implementation Aug 7, 2025
@kingcrimsontianyu
Contributor Author

/ok to test 64e8713

@kingcrimsontianyu
Contributor Author

/ok to test 68e3a16

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review August 23, 2025 04:26
@kingcrimsontianyu kingcrimsontianyu requested review from a team as code owners August 23, 2025 04:26
@kingcrimsontianyu
Contributor Author

An issue that this PR does not address and that needs more thought in the future:

On the current 25.10 branch, endpoints perform only a light syntax check on the URL using regular expressions. Should we add more extensive URL validation (RFC 3986 plus) inside the constructor, or should we keep validation and construction separate?

In this PR, the open function explicitly calls is_url_valid to check the URL against a given endpoint type and, if the URL is valid, proceeds to instantiate the endpoint. Adding validation to the constructor would therefore duplicate the check. Should we simply tolerate an ABI change and add an optional (and perhaps unsightly) bool validate_url=true/false parameter to the endpoint constructor?
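
For discussion, the trade-off can be sketched in isolation as follows. EndpointSketch and open_sketch are hypothetical stand-ins rather than the real endpoint classes, the prefix test is only a placeholder for real URL validation, and the counter exists purely to make duplicate validation visible.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for an endpoint class; not KvikIO's real type.
class EndpointSketch {
 public:
  // Counts validation passes so that duplicate work is observable.
  static inline int validation_count = 0;

  // Placeholder check (assumption): a real validator would parse the URL.
  static bool is_url_valid(std::string_view url)
  {
    ++validation_count;
    return url.rfind("http://", 0) == 0 || url.rfind("https://", 0) == 0;
  }

  // The defaulted flag lets a caller that has already validated skip the check.
  explicit EndpointSketch(std::string url, bool validate_url = true) : _url{std::move(url)}
  {
    if (validate_url && !is_url_valid(_url)) {
      throw std::invalid_argument("invalid URL: " + _url);
    }
  }

  std::string const& url() const { return _url; }

 private:
  std::string _url;
};

// open-style helper: validate once, then construct without re-validating.
inline std::optional<EndpointSketch> open_sketch(std::string const& url)
{
  if (!EndpointSketch::is_url_valid(url)) { return std::nullopt; }
  return EndpointSketch{url, /*validate_url=*/false};
}
```

With the flag, open_sketch validates exactly once, while direct construction still validates by default. The cost is the one raised above: adding a parameter changes the constructor's signature, so it is an ABI change even though existing call sites keep compiling.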

@rapidsai rapidsai deleted a comment from copy-pr-bot bot Aug 23, 2025
@kingcrimsontianyu
Contributor Author

/ok to test a498a83

@kingcrimsontianyu
Contributor Author

/ok to test bba7253

Member

@madsbk madsbk left a comment

Looks good, I only have minor suggestions

@kingcrimsontianyu
Contributor Author

/ok to test 357a615

@kingcrimsontianyu
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 46fa7dd into rapidsai:branch-25.10 Aug 25, 2025
77 checks passed
rapids-bot bot pushed a commit that referenced this pull request Aug 27, 2025
… URL (2/2): Python binding (#808)

This PR adds Python binding to #793
Closes #807

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #808
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Sep 9, 2025
With rapidsai/kvikio#793, KvikIO can infer the endpoint type from the URL, which allows a broader range of remote resources to be created through a single interface. This PR updates the cuDF data source accordingly. Specifically, `pylibcudf` can now read from WebHDFS, S3, and S3 presigned URL resources.

Partially addresses #19633

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Nghia Truong (https://github.com/ttnghia)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Matthew Murray (https://github.com/Matt711)

URL: #19788