Support WebHDFS (2/2): Python binding#791
rapids-bot[bot] merged 43 commits into rapidsai:branch-25.10
Conversation
…destroy the remote file only once per test suite instead of per test
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Thanks for making me aware of the … It took some debugging to get it to work, documented as follows.
Debugging the local hang
Initially, testing KvikIO's WebHDFS Python binding using …
/ok to test f74a230
/ok to test 05e31f8
/ok to test d2ca076
There is still a hang on CI, which I cannot reproduce locally:
/ok to test 8f43cbc
The hang persists on CI. I think …
/ok to test 7c9a1ff
/ok to test 82512bd
/ok to test fe87dcb
The CI hang has been fixed. Previously, …
python/kvikio/tests/test_hdfs_io.py
Outdated
    def find_free_port() -> int:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind((LOCALHOST, 0))
            s.listen(1)
            port = s.getsockname()[1]
        return port
Let's unify with the fixtures in test_s3_io.py and move them to conftest.py?
kvikio/python/kvikio/tests/test_s3_io.py
Lines 28 to 40 in a35bf58
Thanks. Good idea.
I made them free functions and put them in a new file, utils.py. I didn't put them in conftest.py because I'm not aware of a flexible way to change a fixture's scope on a per-file basis. In test_s3_io.py they have session scope, whereas in test_hdfs_io.py I prefer function scope: if we run tests in parallel in the future, several servers may coexist, and each must use a different port on localhost.
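The arrangement described in this reply could look roughly like the sketch below. It is only an illustration, not the contents of the actual utils.py: `find_free_port` and `LOCALHOST` are taken from the snippet under review, while the value `"127.0.0.1"` and the fixture names shown in the comments are assumptions.

```python
import socket

LOCALHOST = "127.0.0.1"  # assumed value; the reviewed snippet only shows the name


def find_free_port(host: str = LOCALHOST) -> int:
    """Ask the OS for a TCP port that is currently unused on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS pick a free port
        return s.getsockname()[1]


# Because the helper is a plain function, each test module can wrap it in a
# fixture with whatever scope it needs (hypothetical fixture names):
#
#   # test_s3_io.py
#   @pytest.fixture(scope="session")
#   def endpoint_port():
#       return find_free_port()
#
#   # test_hdfs_io.py
#   @pytest.fixture(scope="function")
#   def hdfs_port():
#       return find_free_port()
```

Note that the port is only guaranteed free at the moment it is queried; another process could claim it before the mock server binds, so tests that start many servers in parallel may still want a retry.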
/ok to test 326c482
/ok to test 51be21c
Final improvement: used the neat fixture …
/merge
… URL (1/2): C++ implementation (#793)

This PR adds a new remote I/O utility function `RemoteHandle::open(url)` that infers the remote endpoint type from the URL to facilitate `RemoteHandle` creation.

- Supported endpoint types include S3, S3 with presigned URL, WebHDFS, and generic HTTP/HTTPS.
- Optionally, instead of letting `open` figure it out, users can explicitly specify the endpoint type by passing an enum argument `RemoteEndpointType`.
- Optionally, users can provide an allowlist that restricts the endpoint candidates.
- Optionally, users can specify the expected file size. This design is to fully support the existing constructor overload `RemoteHandle(endpoint, nbytes)`.

A byproduct of this PR is an internal utility class `UrlParser` that uses the idiomatic libcurl URL API to validate the URL against "[RFC 3986 plus](https://curl.se/docs/url-syntax.html)".

## This PR depends on
- [x] #791
- [x] #788

Authors:
- Tianyu Liu (https://github.com/kingcrimsontianyu)

Approvers:
- Mads R. B. Kristensen (https://github.com/madsbk)

URL: #793
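To picture what "inferring the endpoint type from the URL" with an optional allowlist might involve, here is a small Python sketch. It is not KvikIO's implementation or API: the `EndpointType` enum, the `infer_endpoint_type` function, and the specific heuristics (an `s3` scheme, an `X-Amz-Signature` query parameter, a `/webhdfs/v1/` path prefix) are assumptions chosen for illustration.

```python
from enum import Enum, auto
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlparse


class EndpointType(Enum):
    """Hypothetical stand-in for the endpoint kinds listed in the commit message."""
    S3 = auto()
    S3_PRESIGNED_URL = auto()
    WEBHDFS = auto()
    HTTP = auto()


def infer_endpoint_type(
    url: str, allowlist: Optional[Iterable[EndpointType]] = None
) -> EndpointType:
    """Guess the endpoint type from the URL; heuristics are illustrative only."""
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    if scheme == "s3":
        candidate = EndpointType.S3
    elif scheme in ("http", "https"):
        query_keys = {key.lower() for key in parse_qs(parts.query)}
        if "x-amz-signature" in query_keys:
            # Presigned S3 URLs carry their signature in the query string.
            candidate = EndpointType.S3_PRESIGNED_URL
        elif parts.path.startswith("/webhdfs/v1/"):
            # WebHDFS REST endpoints live under the /webhdfs/v1 prefix.
            candidate = EndpointType.WEBHDFS
        else:
            candidate = EndpointType.HTTP
    else:
        raise ValueError(f"Unsupported URL scheme: {scheme!r}")
    if allowlist is not None and candidate not in set(allowlist):
        raise ValueError(f"{candidate.name} is not in the allowlist")
    return candidate
```

Checking the more specific patterns before falling back to generic HTTP matters here, since presigned S3 and WebHDFS URLs are themselves valid HTTP/HTTPS URLs.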

Summary
This PR adds the Python binding for WebHDFS support.
Depends on PR #788
Closes #787
Python's built-in package `http.server` is well suited to server mocking. It enables high-level testing of the client. Closes #634 too.
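As a rough illustration of server mocking with `http.server`, the sketch below serves two WebHDFS REST operations (`GETFILESTATUS` and `OPEN`) from an in-memory byte string. It is not the test code from this PR: the handler class, helper names, file path, and file content are all hypothetical, and a real WebHDFS `OPEN` normally redirects to a datanode, which this mock skips.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlparse

FILE_CONTENT = b"0123456789"  # hypothetical remote file content


class MockWebHdfsHandler(BaseHTTPRequestHandler):
    """Answers a tiny subset of the WebHDFS REST API from memory."""

    def do_GET(self):
        url = urlparse(self.path)
        query = parse_qs(url.query)
        op = query.get("op", [""])[0].upper()
        if not url.path.startswith("/webhdfs/v1/"):
            self.send_error(404)
            return
        if op == "GETFILESTATUS":
            status = {"FileStatus": {"length": len(FILE_CONTENT), "type": "FILE"}}
            body, ctype = json.dumps(status).encode(), "application/json"
        elif op == "OPEN":
            offset = int(query.get("offset", ["0"])[0])
            length = int(query.get("length", [str(len(FILE_CONTENT))])[0])
            body = FILE_CONTENT[offset:offset + length]
            ctype = "application/octet-stream"
        else:
            self.send_error(400)
            return
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_mock_server() -> ThreadingHTTPServer:
    """Start the mock on an OS-chosen port in a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), MockWebHdfsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(port: int, query: str) -> bytes:
    """Plain-HTTP client standing in for the real client under test."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = f"http://127.0.0.1:{port}/webhdfs/v1/tmp/data.bin?{query}"
    with opener.open(url, timeout=10) as response:
        return response.read()
```

A test would start the server, point the client at `server.server_address[1]`, and call `server.shutdown()` followed by `server.server_close()` during teardown.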