Support WebHDFS (1/2): C++ implementation #788
rapids-bot[bot] merged 20 commits into rapidsai:branch-25.10
madsbk
left a comment
Looks great, I haven't tested the hdfs protocol but the libcurl usage looks sound.
…destroy the remote file only once per test suite instead of per test
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
An update to the unit test in this PR:
conda/recipes/libkvikio/recipe.yaml (outdated)
else:
  - ${{ pin_compatible("cuda-version", upper_bound="12.2.0a0", lower_bound="12.0") }}
  - cuda-cudart
  - libcurl ${{ libcurl_version }}
Fixing the "overlinking" error from conda according to Lawrence's (@wence-) suggestion. One remaining question is why libcurl is not present in libkvikio's run section. 🤔
PS: The generated build.ninja file indicates that the static library libcurl-d.a is used (under a debug build). This may be related to libcurl being listed under the ignore_run_exports section rather than under run.
build gtests/HDFS_TEST: CXX_EXECUTABLE_LINKER__HDFS_TEST_Debug tests/CMakeFiles/HDFS_TEST.dir/test_hdfs.cpp.o tests/CMakeFiles/HDFS_TEST.dir/utils/hdfs_helper.cpp.o | libkvikio.so lib/libgmock.a lib/libgmock_main.a lib/libgtest.a lib/libgtest_main.a /usr/local/cuda-12.9/targets/x86_64-linux/lib/libcudart.so _deps/curl-build/lib/libcurl-d.a lib/libgmock.a lib/libgtest.a /usr/lib/x86_64-linux-gnu/librt.a /usr/lib/x86_64-linux-gnu/libssl.so /usr/lib/x86_64-linux-gnu/libcrypto.so /usr/lib/x86_64-linux-gnu/libz.so || _deps/curl-build/lib/libcurl-d.a lib/libgmock.a lib/libgmock_main.a lib/libgtest.a lib/libgtest_main.a libkvikio.so
Dependencies that are required at build/compile time should be listed under host. Dependencies required at runtime should be listed under run. Plenty of dependencies require both.
Do we need libcurl at runtime, or just for the build?
The ignore_run_exports is a bit different. A package (like libcurl) can define its own runtime dependencies that you may or may not want to accept. In the case of libkvikio, we need libcurl at build time but we don't want or need the explicit runtime dependency on libcurl, so we add it to the ignore_run_exports.
You probably want to do the same thing for the libkvikio-tests output
> Do we need libcurl at runtime, or just for the build?
We need it at build time, and naively at runtime too because we depend on libcurl. But we end up not needing it at runtime because we statically link libcurl into libkvikio.so.
If we were to change to dynamically linking, I presume we would also need it at runtime.
The above applies mutatis mutandis to the libkvikio-tests package that in this PR now also directly depends on libcurl at build time.
If it is statically linked then we'd want to explicitly disable the run-export so we don't enforce an (admittedly light) unneeded dependency downstream.
Although for the tests package, I'm much less concerned about carrying around a runtime dep of libcurl
I investigated the C++ artifacts from 0507d78 (https://github.com/rapidsai/kvikio/actions/runs/16811025193/artifacts/3712834228).
The HDFS_TEST binary does link to libcurl.so as shown in the CI output. However, I think we're passing the overlinking check because libcurl is in the host environment. This is probably not a good solution and leads to finding libcurl.so.4 from my system rather than conda.
In fact, the same thing is happening for libkvikio.so (it links to libcurl.so.4 but doesn't have a run dependency on libcurl).
On further inspection, I think we're only statically linking libcurl in wheels but not conda packages where we still expect the shared library to be present. We haven't run into issues because libcurl.so.4 is pretty widely available in conda environments and Linux OSes. We probably need to keep the ignore_run_exports here because the run-export pinnings are really tight (exact) but we should add an explicit runtime pinning to avoid this problem in both libkvikio and libkvikio-tests. I'll push a commit here to do that.
Additional references:
- https://curl.se/libcurl/abi.html (libcurl ABI compatibility promises are pretty good)
I checked myself again on this -- I couldn't believe that libcurl's run-exports were exact. I was wrong and they are not. It says pin_subpackage('libcurl'), not pin_subpackage('libcurl', exact=True). I just misread it earlier. We often use exact=True so I overlooked it at first.
The run-exports of the package we use at build time (libcurl 8.5.0) are {"weak": ["libcurl >=8.5.0,<9.0a0"]}. This is totally fine to use, so we don't need to ignore libcurl's run-exports and add our own pinnings. I pushed another commit fixing this (and simplifying what I did in the previous commit): 70be8f9
Requesting approval of this PR if no further changes are needed.
bdice
left a comment
libcurl packaging now works as intended for libkvikio and libkvikio-tests. Approving!
## Summary
This PR adds Python binding for the WebHDFS support
Depends on PR #788
Closes #787
Python's built-in package `http.server` is well suited to server mocking. It enables high-level testing for the client.
Closes #634 too.
Authors:
- Tianyu Liu (https://github.com/kingcrimsontianyu)
Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)
URL: #791
… URL (1/2): C++ implementation (#793)
This PR adds a new remote I/O utility function `RemoteHandle::open(url)` that infers the remote endpoint type from the URL to facilitate `RemoteHandle` creation.
- Supported endpoint types include S3, S3 with presigned URL, WebHDFS, and generic HTTP/HTTPS.
- Optionally, instead of letting `open` figure it out, users can explicitly specify the endpoint type by passing an enum argument `RemoteEndpointType`.
- Optionally, users can provide an allowlist that restricts the endpoint candidates
- Optionally, users can specify the expected file size. This design is to fully support the existing constructor overload `RemoteHandle(endpoint, nbytes)`.
A byproduct of this PR is an internal utility class `UrlParser` that uses the idiomatic libcurl URL API to validate the URL against "[RFC 3986 plus](https://curl.se/docs/url-syntax.html)".
## This PR depends on
- [x] #791
- [x] #788
Authors:
- Tianyu Liu (https://github.com/kingcrimsontianyu)
Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)
URL: #793
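To make the endpoint inference described in the #793 commit message more concrete, here is a minimal, standard-library-only sketch of how a URL might be mapped to an endpoint type and checked against an allowlist. The names `EndpointType` and `infer_endpoint_type` are hypothetical, and the matching rules (an `s3://` scheme, an `X-Amz-Signature=` query parameter, a `/webhdfs/v1/` path prefix) are assumptions made for illustration. They are not taken from KvikIO's actual `RemoteHandle::open` implementation, which validates URLs through libcurl.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an endpoint-type enum; not KvikIO's actual definition.
enum class EndpointType { S3, S3_PRESIGNED, WEBHDFS, HTTP };

namespace detail {
// True if `s` begins with `prefix`.
inline bool starts_with(const std::string& s, const std::string& prefix)
{
  return s.compare(0, prefix.size(), prefix) == 0;
}

// True if `needle` occurs anywhere in `s`.
inline bool contains(const std::string& s, const std::string& needle)
{
  return s.find(needle) != std::string::npos;
}
}  // namespace detail

// Infer the endpoint type from simple textual patterns in the URL.
// The rules below are illustrative assumptions, checked from most to least specific.
inline EndpointType infer_endpoint_type(const std::string& url)
{
  using namespace detail;
  if (starts_with(url, "s3://")) { return EndpointType::S3; }
  if (starts_with(url, "http://") || starts_with(url, "https://")) {
    if (contains(url, "X-Amz-Signature=")) { return EndpointType::S3_PRESIGNED; }
    if (contains(url, "/webhdfs/v1/")) { return EndpointType::WEBHDFS; }
    return EndpointType::HTTP;
  }
  throw std::invalid_argument("Unsupported URL: " + url);
}

// Same inference, but reject any result that is not in the caller's allowlist.
inline EndpointType infer_endpoint_type(const std::string& url,
                                        const std::vector<EndpointType>& allowlist)
{
  EndpointType const type = infer_endpoint_type(url);
  if (std::find(allowlist.begin(), allowlist.end(), type) == allowlist.end()) {
    throw std::invalid_argument("Endpoint type not allowed for URL: " + url);
  }
  return type;
}
```

Under these assumed rules, a caller that only wants S3 traffic could pass `{EndpointType::S3, EndpointType::S3_PRESIGNED}` as the allowlist, so that a plain HTTP URL is rejected instead of silently falling through to the generic endpoint.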
Summary
This PR adds WebHDFS support to KvikIO. The background information is available at #787.
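For readers unfamiliar with the WebHDFS REST interface, the sketch below shows how request URLs for a ranged read (`op=OPEN` with `offset` and `length`) and a file-status query (`op=GETFILESTATUS`) are typically formed. The helper names are hypothetical and are not part of this PR. The sketch assumes plain HTTP, a path that starts with `/` and is already percent-encoded, and no authentication parameters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: base URL of a file on a WebHDFS endpoint.
// Assumes `path` starts with '/' and is already percent-encoded.
inline std::string make_webhdfs_base_url(const std::string& host, int port, const std::string& path)
{
  return "http://" + host + ":" + std::to_string(port) + "/webhdfs/v1" + path;
}

// Hypothetical helper: URL that reads `length` bytes starting at byte `offset`.
inline std::string make_webhdfs_open_url(const std::string& host,
                                         int port,
                                         const std::string& path,
                                         std::size_t offset,
                                         std::size_t length)
{
  return make_webhdfs_base_url(host, port, path) + "?op=OPEN&offset=" + std::to_string(offset) +
         "&length=" + std::to_string(length);
}

// Hypothetical helper: URL that queries file metadata; the JSON response
// includes the file length in bytes.
inline std::string make_webhdfs_status_url(const std::string& host, int port, const std::string& path)
{
  return make_webhdfs_base_url(host, port, path) + "?op=GETFILESTATUS";
}
```

An `OPEN` request sent to the NameNode is normally answered with a redirect to a DataNode, so an HTTP client such as libcurl would need to be configured to follow redirects for the read to succeed.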
Limitations
This PR does not address:
These features will be added in the future.
Partially addresses #787