Merged
62 commits
d406591
Support WebHDFS: C++ implementation
kingcrimsontianyu Aug 5, 2025
2e8328f
Update
kingcrimsontianyu Aug 5, 2025
46c1cbb
Move advanced URL handling to a separate PR to reduce the scope of cu…
kingcrimsontianyu Aug 5, 2025
8fd6a70
Update
kingcrimsontianyu Aug 5, 2025
cecd49c
Add clarifying comments
kingcrimsontianyu Aug 5, 2025
e6e4d56
Add more comments
kingcrimsontianyu Aug 5, 2025
97da554
Update
kingcrimsontianyu Aug 5, 2025
6b48777
Update
kingcrimsontianyu Aug 5, 2025
e4fbd35
Add default arg
kingcrimsontianyu Aug 5, 2025
3792549
Add Python binding for WebHDFS
kingcrimsontianyu Aug 5, 2025
87603f0
Fix a bug where too large file causes string-to-size conversion to fail
kingcrimsontianyu Aug 5, 2025
95e1c91
Merge branch 'web-hdfs' into python-web-hdfs
kingcrimsontianyu Aug 5, 2025
c3881e0
Merge branch 'branch-25.10' into web-hdfs
kingcrimsontianyu Aug 5, 2025
c91fcfb
Merge branch 'web-hdfs' into python-web-hdfs
kingcrimsontianyu Aug 5, 2025
cef9556
Unified interface for various endpoint types
kingcrimsontianyu Aug 6, 2025
e659c2f
Update
kingcrimsontianyu Aug 7, 2025
bb256f2
For WebHDFS unit test, fix segfault when the test is skipped; create/…
kingcrimsontianyu Aug 7, 2025
969f6d0
Remove unnecessary ntvx annotation
kingcrimsontianyu Aug 7, 2025
377d88e
Update cpp/tests/CMakeLists.txt
kingcrimsontianyu Aug 7, 2025
9c6477c
Attempt to fix CI overlinking error
kingcrimsontianyu Aug 7, 2025
fee32db
Merge branch 'web-hdfs' into python-web-hdfs
kingcrimsontianyu Aug 7, 2025
0507d78
Remove libcurl from libkvikio-tests' run section
kingcrimsontianyu Aug 7, 2025
ec0871e
Merge branch 'web-hdfs' into python-web-hdfs
kingcrimsontianyu Aug 7, 2025
59c683d
Merge branch 'python-web-hdfs' into remote-io-easy-interface
kingcrimsontianyu Aug 7, 2025
7936f64
Merge branch 'branch-25.10' into python-web-hdfs
kingcrimsontianyu Aug 8, 2025
9d70828
Reformat
kingcrimsontianyu Aug 8, 2025
799db68
Add pytest-httpserver dependency for webhdfs testing
kingcrimsontianyu Aug 12, 2025
2a67c55
Investigate http server hang. DO NOT MERGE
kingcrimsontianyu Aug 12, 2025
d2e958f
Revert some changes
kingcrimsontianyu Aug 13, 2025
e53aee4
Merge branch 'branch-25.10' into python-web-hdfs
kingcrimsontianyu Aug 13, 2025
ddf21e6
Revert some auto-reformatting changes to pyproject.toml
kingcrimsontianyu Aug 13, 2025
d4bfaea
Add unit test for 'get file size'
kingcrimsontianyu Aug 13, 2025
258bf63
Update
kingcrimsontianyu Aug 14, 2025
8d683b2
Improve test organization
kingcrimsontianyu Aug 14, 2025
27a8467
Cleanup
kingcrimsontianyu Aug 14, 2025
cc4f75a
Add partial read test
kingcrimsontianyu Aug 14, 2025
f74a230
Remove the debug script
kingcrimsontianyu Aug 14, 2025
05e31f8
Add missing type hint wherever possible
kingcrimsontianyu Aug 14, 2025
d2ca076
Attempt to fix the MultiDict issue
kingcrimsontianyu Aug 14, 2025
3782a95
Merge branch 'branch-25.10' into remote-io-easy-interface
kingcrimsontianyu Aug 14, 2025
4bcbb87
Merge branch 'python-web-hdfs' into remote-io-easy-interface
kingcrimsontianyu Aug 14, 2025
f7bd741
Fixes
kingcrimsontianyu Aug 14, 2025
0c5aec8
Fix a critical bug in unit test
kingcrimsontianyu Aug 14, 2025
8f694a0
Merge branch 'branch-25.10' into remote-io-easy-interface
kingcrimsontianyu Aug 15, 2025
a93f178
Revert some changes to build
kingcrimsontianyu Aug 15, 2025
f17e228
Update
kingcrimsontianyu Aug 18, 2025
3130143
Merge branch 'branch-25.10' into remote-io-easy-interface
kingcrimsontianyu Aug 19, 2025
d8f78f9
Update
kingcrimsontianyu Aug 20, 2025
fd56b03
Update
kingcrimsontianyu Aug 20, 2025
d1c1f12
Merge branch 'branch-25.10' into remote-io-easy-interface
kingcrimsontianyu Aug 20, 2025
1fa36d9
Merge branch 'branch-25.10' into remote-io-easy-interface
kingcrimsontianyu Aug 21, 2025
618fe2b
Add unit tests
kingcrimsontianyu Aug 22, 2025
9d3bfc8
Improve implementation
kingcrimsontianyu Aug 22, 2025
64e8713
Improve test and impl
kingcrimsontianyu Aug 22, 2025
68e3a16
Add more comments. Improve test
kingcrimsontianyu Aug 22, 2025
b5dbcbb
Add comments to the unified open function
kingcrimsontianyu Aug 23, 2025
2b526e2
Improve doxygen comment
kingcrimsontianyu Aug 23, 2025
d44e165
Improve comment
kingcrimsontianyu Aug 23, 2025
0d3ab0b
Update doc
kingcrimsontianyu Aug 23, 2025
a498a83
Fix doc inaccuracy
kingcrimsontianyu Aug 23, 2025
bba7253
Address review comments
kingcrimsontianyu Aug 25, 2025
357a615
Address review comments
kingcrimsontianyu Aug 25, 2025
2 changes: 1 addition & 1 deletion cpp/CMakeLists.txt
@@ -163,7 +163,7 @@ set(SOURCES

 if(KvikIO_REMOTE_SUPPORT)
   list(APPEND SOURCES "src/hdfs.cpp" "src/remote_handle.cpp" "src/detail/remote_handle.cpp"
-       "src/shim/libcurl.cpp"
+       "src/detail/url.cpp" "src/shim/libcurl.cpp"
   )
 endif()

199 changes: 199 additions & 0 deletions cpp/include/kvikio/detail/url.hpp
@@ -0,0 +1,199 @@
/*
* Copyright (c) 2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <optional>
#include <string>

#include <curl/curl.h>

namespace kvikio::detail {
/**
* @brief RAII wrapper for libcurl's URL handle (CURLU)
*
* This class provides automatic resource management for libcurl URL handles,
* ensuring proper cleanup when the handle goes out of scope. The class is
* move-only to prevent accidental sharing of the underlying resource.
*/
class CurlUrlHandle {
private:
CURLU* _handle{nullptr};

public:
/**
* @brief Create a new libcurl URL handle
*
* @exception std::runtime_error if libcurl cannot allocate the handle (usually due to out of
* memory)
*/
CurlUrlHandle();

/**
* @brief Clean up the underlying URL handle
*/
~CurlUrlHandle() noexcept;

CurlUrlHandle(CurlUrlHandle const&) = delete;
CurlUrlHandle& operator=(CurlUrlHandle const&) = delete;

CurlUrlHandle(CurlUrlHandle&& other) noexcept;
CurlUrlHandle& operator=(CurlUrlHandle&& other) noexcept;

/**
* @brief Get the underlying libcurl URL handle
*
* @return Pointer to the underlying libcurl URL handle
* @note The returned pointer should not be freed manually as it is managed by this class
*/
CURLU* get() const;
};

/**
* @brief URL parsing utility using libcurl's URL API
*
* This class provides static methods for parsing URLs into their constituent
* components (scheme, host, port, path, query, fragment).
*
* @note This class uses libcurl's URL parsing, which follows what curl calls "RFC 3986 plus". See
* https://curl.se/docs/url-syntax.html
*
* Example:
* @code{.cpp}
* auto components = UrlParser::parse("https://example.com:8080/path?query=1#frag");
* if (components.scheme.has_value()) {
* std::cout << "Scheme: " << components.scheme.value() << std::endl;
* }
* if (components.host.has_value()) {
* std::cout << "Host: " << components.host.value() << std::endl;
* }
* @endcode
*/
class UrlParser {
public:
/**
* @brief Container for parsed URL components
*/
struct UrlComponents {
/**
* @brief The URL scheme (e.g., "http", "https", "ftp"). May be empty for scheme-relative URLs
* or paths.
*/
std::optional<std::string> scheme;

/**
* @brief The hostname or IP address. May be empty for URLs without an authority component
* (e.g., "file:///path").
*/
std::optional<std::string> host;

/**
* @brief The port number as a string. Will be empty if no explicit port is specified in the
* URL.
* @note Default ports (e.g., 80 for HTTP, 443 for HTTPS) are not automatically filled in.
*/
std::optional<std::string> port;

/**
* @brief The path component of the URL. Libcurl always reports a path: for a URL with no
* explicit path, such as "http://example.com", the path is "/".
*/
std::optional<std::string> path;

/**
* @brief The query string (without the leading "?"). Empty if no query parameters are present.
*/
std::optional<std::string> query;

/**
* @brief The fragment identifier (without the leading "#"). Empty if no fragment is present.
*/
std::optional<std::string> fragment;
};

/**
* @brief Parses the given URL according to RFC 3986 plus and extracts its components.
*
* @param url The URL string to parse
* @param bitmask_url_flags Optional flags for URL parsing. Common flags include:
* - CURLU_DEFAULT_SCHEME: Allows URLs without schemes
* - CURLU_NON_SUPPORT_SCHEME: Accept non-supported schemes
* - CURLU_URLENCODE: URL encode the path
* @param bitmask_component_flags Optional flags for component extraction. Common flags include:
* - CURLU_URLDECODE: URL decode the component
* - CURLU_PUNYCODE: Return host as punycode
*
* @return UrlComponents structure containing the parsed URL components
*
* @throw std::runtime_error if the URL cannot be parsed or if component extraction fails
*
* Example:
* @code{.cpp}
* // Basic parsing
* auto components = UrlParser::parse("https://api.example.com/v1/users?page=1");
*
* // Parsing with URL decoding
* auto decoded = UrlParser::parse(
* "https://example.com/hello%20world",
* std::nullopt,
* CURLU_URLDECODE
* );
*
* // Allow non-standard schemes
* auto custom = UrlParser::parse(
* "myscheme://example.com",
* CURLU_NON_SUPPORT_SCHEME
* );
* @endcode
*/
static UrlComponents parse(std::string const& url,
std::optional<unsigned int> bitmask_url_flags = std::nullopt,
std::optional<unsigned int> bitmask_component_flags = std::nullopt);

/**
* @brief Extract a specific component from a CurlUrlHandle
*
* @param handle The CurlUrlHandle containing the parsed URL
* @param part The URL part to extract (e.g., CURLUPART_SCHEME)
* @param bitmask_component_flags Flags controlling extraction behavior
* @param allowed_err_code Optional error code to treat as valid (e.g., CURLUE_NO_SCHEME)
* @return The extracted component as a string, or std::nullopt if not present
* @throw std::runtime_error if extraction fails with an unexpected error
*/
static std::optional<std::string> extract_component(
CurlUrlHandle const& handle,
CURLUPart part,
std::optional<unsigned int> bitmask_component_flags = std::nullopt,
std::optional<CURLUcode> allowed_err_code = std::nullopt);

/**
* @brief Extract a specific component from a URL string
*
* @param url The URL string from which to extract a component
* @param part The URL part to extract
* @param bitmask_url_flags Optional flags for URL parsing.
* @param bitmask_component_flags Flags controlling extraction behavior
* @param allowed_err_code Optional error code to treat as valid
* @return The extracted component as a string, or std::nullopt if not present
* @throw std::runtime_error if extraction fails with an unexpected error
*/
static std::optional<std::string> extract_component(
std::string const& url,
CURLUPart part,
std::optional<unsigned int> bitmask_url_flags = std::nullopt,
std::optional<unsigned int> bitmask_component_flags = std::nullopt,
std::optional<CURLUcode> allowed_err_code = std::nullopt);
};
} // namespace kvikio::detail
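Review note: the header declares `CurlUrlHandle` as move-only, but `src/detail/url.cpp` is not part of the diff shown here, so the move semantics themselves are not visible. The sketch below is one way such a wrapper could be written; it is not the PR's implementation. `FakeUrl`, `fake_url_new`, `fake_url_cleanup`, `g_live_handles`, and `UrlHandleSketch` are hypothetical stand-ins (for `CURLU`, `curl_url()`, and `curl_url_cleanup()`), used only so the example builds without libcurl.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for libcurl's opaque CURLU type.
struct FakeUrl {};

// Number of handles allocated and not yet released (for illustration only).
static int g_live_handles = 0;

// Hypothetical stand-ins for curl_url() and curl_url_cleanup().
static FakeUrl* fake_url_new()
{
  ++g_live_handles;
  return new FakeUrl{};
}

static void fake_url_cleanup(FakeUrl* handle)
{
  if (handle != nullptr) {
    --g_live_handles;
    delete handle;
  }
}

// Move-only RAII wrapper: exactly one object owns the handle at any time.
class UrlHandleSketch {
 private:
  FakeUrl* _handle{nullptr};

 public:
  UrlHandleSketch() : _handle{fake_url_new()}
  {
    if (_handle == nullptr) { throw std::runtime_error("cannot allocate URL handle"); }
  }

  ~UrlHandleSketch() noexcept { fake_url_cleanup(_handle); }

  UrlHandleSketch(UrlHandleSketch const&)            = delete;
  UrlHandleSketch& operator=(UrlHandleSketch const&) = delete;

  // Steal the handle and leave the source empty so its destructor is a no-op.
  UrlHandleSketch(UrlHandleSketch&& other) noexcept
    : _handle{std::exchange(other._handle, nullptr)}
  {
  }

  // Release the handle currently held before taking over the other one.
  UrlHandleSketch& operator=(UrlHandleSketch&& other) noexcept
  {
    if (this != &other) {
      fake_url_cleanup(_handle);
      _handle = std::exchange(other._handle, nullptr);
    }
    return *this;
  }

  FakeUrl* get() const { return _handle; }
};
```

The point of `std::exchange` here is that a moved-from wrapper holds `nullptr`, so the handle is released exactly once no matter how often ownership is transferred.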
8 changes: 8 additions & 0 deletions cpp/include/kvikio/hdfs.hpp
@@ -58,5 +58,13 @@ class WebHdfsEndpoint : public RemoteEndpoint {
   std::string str() const override;
   std::size_t get_file_size() override;
   void setup_range_request(CurlHandle& curl, std::size_t file_offset, std::size_t size) override;
+
+  /**
+   * @brief Check whether the given URL is valid for a WebHDFS endpoint.
+   *
+   * @param url The URL to check.
+   * @return true if the URL is valid for a WebHDFS endpoint, false otherwise.
+   */
+  static bool is_url_valid(std::string const& url) noexcept;
 };
 } // namespace kvikio
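Review note: `is_url_valid` is declared `noexcept`, while the URL parsing helpers in `url.hpp` are documented to throw `std::runtime_error`. The diff shown here does not include the definition, so the following is only a sketch of how a non-throwing check could wrap a throwing parser. `ComponentsSketch`, `parse_sketch`, and `is_url_valid_sketch` are hypothetical names, the toy parser is not libcurl, and the acceptance rule (http or https scheme, non-empty host, path under `/webhdfs/v1/`, the WebHDFS REST API prefix) is an assumption rather than the rule the PR applies.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical, reduced stand-in for UrlParser::UrlComponents.
struct ComponentsSketch {
  std::optional<std::string> scheme;
  std::optional<std::string> host;
  std::optional<std::string> path;
};

// Toy parser (NOT libcurl): throws std::runtime_error when "://" is missing.
static ComponentsSketch parse_sketch(std::string const& url)
{
  auto const scheme_end = url.find("://");
  if (scheme_end == std::string::npos || scheme_end == 0) {
    throw std::runtime_error("malformed URL: " + url);
  }

  ComponentsSketch result;
  result.scheme = url.substr(0, scheme_end);

  // The authority runs from after "://" up to the first '/' (or end of string).
  auto const authority_begin = scheme_end + 3;
  auto const path_begin      = url.find('/', authority_begin);
  auto const authority =
    (path_begin == std::string::npos)
      ? url.substr(authority_begin)
      : url.substr(authority_begin, path_begin - authority_begin);

  // Drop an explicit ":port" suffix, if any.
  result.host = authority.substr(0, authority.find(':'));

  // The path stops at the query ('?') or fragment ('#'); default to "/".
  if (path_begin == std::string::npos) {
    result.path = "/";
  } else {
    auto const path_end = url.find_first_of("?#", path_begin);
    result.path         = (path_end == std::string::npos)
                            ? url.substr(path_begin)
                            : url.substr(path_begin, path_end - path_begin);
  }
  return result;
}

// Non-throwing validity check: any exception is translated into `false`.
static bool is_url_valid_sketch(std::string const& url) noexcept
{
  try {
    auto const components = parse_sketch(url);

    auto const scheme = components.scheme.value_or("");
    if (scheme != "http" && scheme != "https") { return false; }

    if (!components.host.has_value() || components.host->empty()) { return false; }

    // Assumed rule: the path must start with the WebHDFS REST prefix.
    auto const path = components.path.value_or("");
    return path.rfind("/webhdfs/v1/", 0) == 0;
  } catch (...) {
    return false;
  }
}
```

Catching everything inside the function keeps the `noexcept` promise: a malformed URL is reported as "not valid" instead of terminating the program.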