Merged
29 commits
f16c37d
Use KvikIO's implementation of file-backed memory mapping
kingcrimsontianyu Jun 13, 2025
a09c563
Update
kingcrimsontianyu Jun 13, 2025
51ee3b2
Update
kingcrimsontianyu Jun 13, 2025
f75e3dc
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jun 21, 2025
63807a0
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jul 7, 2025
8f015f4
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jul 14, 2025
f315954
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jul 23, 2025
bae6b6d
Update
kingcrimsontianyu Jul 23, 2025
bccb387
Merge remote-tracking branch 'origin/use-kvikio-mmap' into use-kvikio…
kingcrimsontianyu Jul 23, 2025
8350456
Fix unit test error
kingcrimsontianyu Jul 24, 2025
5ea88c7
Merge branch 'branch-25.10' into use-kvikio-mmap
kingcrimsontianyu Jul 24, 2025
ab0d28c
Merge branch 'branch-25.10' into use-kvikio-mmap
kingcrimsontianyu Aug 5, 2025
513cfa0
Merge branch 'branch-25.10' into use-kvikio-mmap
vuule Aug 21, 2025
9a9a423
Merge branch 'branch-25.10' into use-kvikio-mmap
kingcrimsontianyu Aug 23, 2025
d8869aa
Use KvikIO's versatile remote file interface to infer the endpoint type
kingcrimsontianyu Aug 24, 2025
cd15bc6
Cherry-pick build fix
kingcrimsontianyu Aug 25, 2025
10bf781
Revert temp changes to jitify and kvikio cmake files
kingcrimsontianyu Aug 25, 2025
3d9dc0d
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Aug 25, 2025
0a908f4
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Aug 28, 2025
67ef92b
Update pylibcudf
kingcrimsontianyu Aug 28, 2025
0047a6d
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Sep 2, 2025
0bcee77
Merge remote-tracking branch 'origin/use-remote-io-easy-interface' in…
kingcrimsontianyu Sep 2, 2025
a4c3321
Prepend additional message for remote file exception
kingcrimsontianyu Sep 4, 2025
c338a74
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Sep 4, 2025
09378bf
Remove filepath from error message
kingcrimsontianyu Sep 4, 2025
2e84038
Redact remote file path
kingcrimsontianyu Sep 5, 2025
2f1b12e
Remove an unused header
kingcrimsontianyu Sep 5, 2025
c1c2d79
Remove another unused header that is automatically added by reformatter
kingcrimsontianyu Sep 5, 2025
4fe15e3
Merge branch 'branch-25.10' into use-remote-io-easy-interface
Matt711 Sep 8, 2025
45 changes: 30 additions & 15 deletions cpp/src/io/utilities/datasource.cpp
@@ -368,27 +368,29 @@ class user_datasource_wrapper : public datasource {
* @brief Remote file source backed by KvikIO, which handles S3 filepaths seamlessly.
*/
class remote_file_source : public kvikio_source<kvikio::RemoteHandle> {
static auto create_s3_handle(char const* filepath)
public:
explicit remote_file_source(char const* filepath)
: kvikio_source{kvikio::RemoteHandle::open(filepath)}
{
return kvikio::RemoteHandle{
std::make_unique<kvikio::S3Endpoint>(kvikio::S3Endpoint::parse_s3_url(filepath))};
}

public:
explicit remote_file_source(char const* filepath) : kvikio_source{create_s3_handle(filepath)} {}

~remote_file_source() override = default;

/**
* @brief Is `url` referring to a remote file supported by KvikIO?
* @brief Checks if a path has a URL scheme format that could indicate a remote resource
*
* For now, only S3 urls (urls starting with "s3://") are supported.
* @note Strictly speaking, there is no definitive way to tell if a given file path refers to a
* remote or local file. For instance, it is legal to have a local directory named `s3:` and its
* file accessed by `s3://<sub-dir>/<file-name>` (the double slash is collapsed into a single
* slash), coincidentally taking on the remote S3 format. Here we ignore this special case and use
* a more practical approach: a file path is considered remote simply if it has a RFC
* 3986-conformant URL scheme.
*/
static bool is_supported_remote_url(std::string const& url)
static bool could_be_remote_url(std::string const& filepath)
{
// Regular expression to match "s3://"
static std::regex const pattern{R"(^s3://)", std::regex_constants::icase};
return std::regex_search(url, pattern);
// Regular expression to match the URL scheme conforming to RFC 3986
static std::regex const pattern{R"(^[a-zA-Z][a-zA-Z0-9+.-]*://)", std::regex_constants::icase};
Contributor
What's the behavior today with file:/// URLs, which typically mean an absolute path on the user's machine (I'm not sure whether that is a convention or a spec)? Would they hit this code path, and is it OK to pass them through as a "maybe remote URL"?

Contributor Author
That's a good question. I checked: the file URI scheme is specified by RFC 8089 and is supported by libcurl, but KvikIO does not support it. So when a file URI is passed to libcudf in this PR, it is treated as a remote source, and KvikIO rejects it with "Unsupported endpoint URL".
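
To make the thread above concrete, here is a minimal standalone sketch of the scheme check. It copies the regular expression from the diff into a free function; the sample paths and the `scheme_check_examples` helper are illustrative assumptions, not part of the PR. It only shows which strings the pattern classifies as "could be remote", not what KvikIO does with them afterwards.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Standalone copy of the check from the diff: a scheme as defined by
// RFC 3986 (a letter, then letters, digits, '+', '.', or '-') followed by "://".
bool could_be_remote_url(std::string const& filepath)
{
  static std::regex const pattern{R"(^[a-zA-Z][a-zA-Z0-9+.-]*://)", std::regex_constants::icase};
  return std::regex_search(filepath, pattern);
}

// Hypothetical sample inputs, chosen only for illustration.
void scheme_check_examples()
{
  assert(could_be_remote_url("s3://bucket/key.parquet"));   // S3 URL
  assert(could_be_remote_url("https://host/data.parquet")); // any scheme followed by "://" matches
  assert(could_be_remote_url("file:///tmp/data.csv"));      // file URI is also classified as remote
  assert(!could_be_remote_url("/tmp/data.csv"));            // plain absolute path
  assert(!could_be_remote_url("s3:/bucket/key"));           // single slash after the colon
  assert(!could_be_remote_url("C:\\data\\file.csv"));       // drive letter, no "//"
}
```

Note that the `file:///` case returns true here, which matches the reply above: the path is handed to the remote code path, and rejecting it is left to KvikIO.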

return std::regex_search(filepath, pattern);
}
};
#else
@@ -398,7 +400,7 @@ class remote_file_source : public kvikio_source<kvikio::RemoteHandle> {
class remote_file_source : public file_source {
public:
explicit remote_file_source(char const* filepath) : file_source(filepath) {}
static constexpr bool is_supported_remote_url(std::string const&) { return false; }
static constexpr bool could_be_remote_url(std::string const&) { return false; }
};
#endif
} // namespace
@@ -415,8 +417,21 @@ std::unique_ptr<datasource> datasource::create(std::string const& filepath,

CUDF_FAIL("Invalid LIBCUDF_MMAP_ENABLED value: " + policy);
}();
if (remote_file_source::is_supported_remote_url(filepath)) {
return std::make_unique<remote_file_source>(filepath.c_str());

if (remote_file_source::could_be_remote_url(filepath)) {
try {
return std::make_unique<remote_file_source>(filepath.c_str());
} catch (std::exception const& ex) {
std::string redacted_msg;
try {
// For security reasons, redact the file path if any from KvikIO's exception message
redacted_msg =
std::regex_replace(ex.what(), std::regex{filepath}, "<redacted-remote-file-path>");
Comment on lines +428 to +429
Contributor
wow :D
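
One property of the commented lines is worth spelling out: `std::regex{filepath}` treats the path as a pattern, so metacharacters in it (for example `+` or `(`) are interpreted rather than matched literally. A path such as `s3://bucket/a+b.parquet` would then not match itself, and a path with an unbalanced `(` would throw `std::regex_error`, which the inner `catch` turns into the "unknown" message. Below is a minimal sketch of a literal substring replacement that sidesteps this. `redact_literal` and the sample error message are hypothetical and are not what the PR does.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in the PR): replace every literal occurrence of
// `secret` in `message` with `placeholder`, with no regex interpretation.
std::string redact_literal(std::string message,
                           std::string const& secret,
                           std::string const& placeholder = "<redacted-remote-file-path>")
{
  if (secret.empty()) { return message; }  // nothing to redact
  std::size_t pos = 0;
  while ((pos = message.find(secret, pos)) != std::string::npos) {
    message.replace(pos, secret.size(), placeholder);
    pos += placeholder.size();  // continue searching after the inserted placeholder
  }
  return message;
}

// Illustrative usage with a made-up path and a made-up error message.
void redact_literal_example()
{
  std::string const path = "s3://bucket/a+b(1).parquet";
  std::string const msg  = "cannot open " + path;
  assert(redact_literal(msg, path) == "cannot open <redacted-remote-file-path>");
}
```

Because no regex is compiled, this variant cannot throw `std::regex_error`, so the nested `try`/`catch` would only need to guard against allocation failures.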

} catch (std::exception const& ex) {
redacted_msg = " unknown due to additional process error";
}
CUDF_FAIL("Error accessing the remote file. Reason: " + redacted_msg, std::runtime_error);
}
} else if (use_memory_mapping) {
return std::make_unique<memory_mapped_source>(filepath.c_str(), offset, max_size_estimate);
} else {
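
The selection order in `datasource::create` after this change can be summarized with the sketch below. It is a simplified model under stated assumptions: `source_kind` and `select_source` are hypothetical names, the KvikIO-enabled build is assumed (in the other build `could_be_remote_url` always returns false), and `host_read` merely stands in for the final `else` branch, which the excerpt does not show.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical labels for the three branches of datasource::create.
enum class source_kind { remote, memory_mapped, host_read };

// Simplified model of the branch order in the diff: the URL-scheme check
// comes first, then the memory-mapping policy, then the fallback branch.
source_kind select_source(std::string const& filepath, bool use_memory_mapping)
{
  static std::regex const scheme{R"(^[a-zA-Z][a-zA-Z0-9+.-]*://)", std::regex_constants::icase};
  if (std::regex_search(filepath, scheme)) { return source_kind::remote; }
  return use_memory_mapping ? source_kind::memory_mapped : source_kind::host_read;
}

// Illustrative checks with made-up paths.
void select_source_examples()
{
  // A scheme-prefixed path goes to the remote branch regardless of the mmap policy.
  assert(select_source("s3://bucket/key.parquet", true) == source_kind::remote);
  assert(select_source("/data/file.parquet", true) == source_kind::memory_mapped);
  assert(select_source("/data/file.parquet", false) == source_kind::host_read);
}
```

One consequence of this ordering is that a scheme-prefixed path never falls back to the local branches: if opening it as a remote file fails, the diff raises an error instead.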
2 changes: 1 addition & 1 deletion python/pylibcudf/pylibcudf/io/types.pyx
@@ -468,7 +468,7 @@ cdef class SourceInfo:
different types of sources will raise a `ValueError`.
"""
# Regular expression that match remote file paths supported by libcudf
_is_remote_file_pattern = re.compile(r"^s3://", re.IGNORECASE)
_is_remote_file_pattern = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", re.IGNORECASE)

def __init__(self, list sources):
if not sources: