Skip to content

Support WebHDFS (1/2): C++ implementation#788

Merged
rapids-bot[bot] merged 20 commits intorapidsai:branch-25.10from
kingcrimsontianyu:web-hdfs
Aug 8, 2025
Merged

Support WebHDFS (1/2): C++ implementation#788
rapids-bot[bot] merged 20 commits intorapidsai:branch-25.10from
kingcrimsontianyu:web-hdfs

Conversation

@kingcrimsontianyu
Copy link
Contributor

@kingcrimsontianyu kingcrimsontianyu commented Jul 30, 2025

Summary

This PR adds WebHDFS support to KvikIO. The background information is available at #787.

Limitations

This PR does not address:

  • Idiomatic and secure URL parsing and validation
  • Testing on URL encoding/decoding (which means percent-decoded URL may or may not work at the moment)
  • Advanced authentication such as Kerberos

These features will be added in the future.

Partially addresses #787

@kingcrimsontianyu kingcrimsontianyu added feature request New feature or request non-breaking Introduces a non-breaking change c++ Affects the C++ API of KvikIO labels Jul 30, 2025
@copy-pr-bot
Copy link

copy-pr-bot bot commented Jul 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 46c1cbb

@kingcrimsontianyu kingcrimsontianyu marked this pull request as ready for review August 5, 2025 14:42
@kingcrimsontianyu kingcrimsontianyu requested review from a team as code owners August 5, 2025 14:42
@kingcrimsontianyu
Copy link
Contributor Author

/ok to test

@copy-pr-bot
Copy link

copy-pr-bot bot commented Aug 6, 2025

/ok to test

@kingcrimsontianyu, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test c3881e0

Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I haven't tested the hdfs protocol but the libcurl usage looks sound.

…destroy the remote file only once per test suite instead of per test
kingcrimsontianyu and others added 2 commits August 7, 2025 10:09
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
@kingcrimsontianyu
Copy link
Contributor Author

An update to the unit test in this PR:
Since gtest does not guarantee that different TEST_Fs from the same test suite are always run sequentially (e.g. by using --gtest_parallel), it is necessary to create/upload/destroy the remote test file "/tmp/kvikio-test-webhdfs.bin" only once per test suite, instead of per test, to avoid potential race condition. So static void SetUpTestSuite() and static void TearDownTestSuite() are used to replace SetUp() and TearDown().

@kingcrimsontianyu kingcrimsontianyu requested a review from a team as a code owner August 7, 2025 14:18
@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 9c6477c

else:
- ${{ pin_compatible("cuda-version", upper_bound="12.2.0a0", lower_bound="12.0") }}
- cuda-cudart
- libcurl ${{ libcurl_version }}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixing the "overlinking" error from conda according to Lawrence's (@wence- ) suggestion. One remaining question raised is why libcurl is not present in libkvikio's run section. 🤔

Copy link
Contributor Author

@kingcrimsontianyu kingcrimsontianyu Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: The generated build.ninja file indicates the use of static library libcurl-d.a (under debug build). This may be related to libcurl not being listed under run but ignore_run_exports section.

build gtests/HDFS_TEST: CXX_EXECUTABLE_LINKER__HDFS_TEST_Debug tests/CMakeFiles/HDFS_TEST.dir/test_hdfs.cpp.o tests/CMakeFiles/HDFS_TEST.dir/utils/hdfs_helper.cpp.o | libkvikio.so lib/libgmock.a lib/libgmock_main.a lib/libgtest.a lib/libgtest_main.a /usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudart.so _deps/curl-build/lib/libcurl-d.a lib/libgmock.a lib/libgtest.a /usr/lib/x86_64-linux-gnu/librt.a /usr/lib/x86_64-linux-gnu/libssl.so /usr/lib/x86_64-linux-gnu/libcrypto.so /usr/lib/x86_64-linux-gnu/libz.so || _deps/curl-build/lib/libcurl-d.a lib/libgmock.a lib/libgmock_main.a lib/libgtest.a lib/libgtest_main.a libkvikio.so

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dependencies that are required at build/compile time should be listed under host. Dependencies required at runtime should be listed under run. Plenty of dependencies require both.
Do we need libcurl at runtime, or just for the build?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ignore_run_exports is a bit different. A package (like libcurl) can define its own runtime dependencies that you may or may not want to accept. In the case of libkvikio, we need libcurl at build time but we don't want or need the explicit runtime dependency on libcurl, so we add it to the ignore_run_exports.

You probably want to do the same thing for the libkvikio-tests output

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need libcurl at runtime, or just for the build?

We need it a build time, and naively at runtime because we depend on libcurl. But we end up not needing it at runtime because we statically link libcurl into libkvikio.so.

If we were to change to dynamically linking, I presume we would also need it at runtime.

The above applies mutatis mutandis to the libkvikio-tests package that in this PR now also directly depends on libcurl at build time.

Copy link
Contributor

@gforsyth gforsyth Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is statically linked then we'd want to explicitly disable the run-export so we don't enforce an (admittedly light) unneeded dependency downstream.

Although for the tests package, I'm much less concerned about carrying around a runtime dep of libcurl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated the C++ artifacts from 0507d78 (https://github.com/rapidsai/kvikio/actions/runs/16811025193/artifacts/3712834228).

The HDFS_TEST binary does link to libcurl.so as shown in the CI output. However, I think we're passing the overlinking check because libcurl is in the host environment. This is probably not a good solution and leads to finding libcurl.so.4 from my system rather than conda.

In fact, the same thing is happening for libkvikio.so (it links to libcurl.so.4 but doesn't have a run dependency on libcurl).

On further inspection, I think we're only statically linking libcurl in wheels but not conda packages where we still expect the shared library to be present. We haven't run into issues because libcurl.so.4 is pretty widely available in conda environments and Linux OSes. We probably need to keep the ignore_run_exports here because the run-export pinnings are really tight (exact) but we should add an explicit runtime pinning to avoid this problem in both libkvikio and libkvikio-tests. I'll push a commit here to do that.

Additional references:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in cb24034.

Copy link
Contributor

@bdice bdice Aug 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked myself again on this -- I couldn't believe that libcurl's run-exports were exact. I was wrong and they are not. It says pin_subpackage('libcurl'), not pin_subpackage('libcurl', exact=True). I just misread it earlier. We often use exact=True so I overlooked it at first.

The run-exports of the package we use at build time (libcurl 8.5.0) are {"weak": ["libcurl >=8.5.0,<9.0a0"]}. This is totally fine to use, we don't need to ignore libcurl's run-exports and add our own pinnings. I pushed another commit fixing this (and simplifying what I did in the previous commit): 70be8f9

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bdice!

@kingcrimsontianyu
Copy link
Contributor Author

/ok to test 0507d78

@bdice
Copy link
Contributor

bdice commented Aug 7, 2025

/ok to test 8a7ff55

@bdice
Copy link
Contributor

bdice commented Aug 7, 2025

/ok to test 4daf55e

@kingcrimsontianyu
Copy link
Contributor Author

Requested an approval of this PR if no further change is needed.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libcurl packaging now works as intended for libkvikio and libkvikio-tests. Approving!

@kingcrimsontianyu
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit acada8d into rapidsai:branch-25.10 Aug 8, 2025
49 checks passed
rapids-bot bot pushed a commit that referenced this pull request Aug 15, 2025
## Summary
This PR adds Python binding for the WebHDFS support

Depends on PR #788

Closes #787

Python's built-in package `http.server` is well suited to server mocking. It enables high-level testing for the client. Closes #634 too.

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #791
rapids-bot bot pushed a commit that referenced this pull request Aug 25, 2025
… URL (1/2): C++ implementation (#793)

This PR adds a new remote I/O utility function `RemoteHandle::open(url)` that infers the remote endpoint type from the URL to facilitate `RemoteHandle` creation.

- Supported endpoint types include S3, S3 with presigned URL, WebHDFS, and generic HTTP/HTTPS.
- Optionally, instead of letting `open` figure it out, users can explicitly specify the endpoint type by passing an enum argument `RemoteEndpointType`.
- Optionally, users can provide an allowlist that restricts the endpoint candidates
- Optionally, users can specify the expected file size. This design is to fully support the existing constructor overload `RemoteHandle(endpoint, nbytes)`.

A byproduct of this PR is an internal utility class `UrlParser` that uses the idiomatic libcurl URL API to validate the URL against "[RFC 3986 plus](https://curl.se/docs/url-syntax.html)".

## This PR depends on
- [x] #791
- [x] #788

Authors:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #793
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

c++ Affects the C++ API of KvikIO feature request New feature or request non-breaking Introduces a non-breaking change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants