Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
f16c37d
Use KvikIO's implementation of file-backed memory mapping
kingcrimsontianyu Jun 13, 2025
a09c563
Update
kingcrimsontianyu Jun 13, 2025
51ee3b2
Update
kingcrimsontianyu Jun 13, 2025
f75e3dc
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jun 21, 2025
63807a0
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jul 7, 2025
8f015f4
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jul 14, 2025
f315954
Merge branch 'branch-25.08' into use-kvikio-mmap
kingcrimsontianyu Jul 23, 2025
bae6b6d
Update
kingcrimsontianyu Jul 23, 2025
bccb387
Merge remote-tracking branch 'origin/use-kvikio-mmap' into use-kvikio…
kingcrimsontianyu Jul 23, 2025
8350456
Fix unit test error
kingcrimsontianyu Jul 24, 2025
5ea88c7
Merge branch 'branch-25.10' into use-kvikio-mmap
kingcrimsontianyu Jul 24, 2025
ab0d28c
Merge branch 'branch-25.10' into use-kvikio-mmap
kingcrimsontianyu Aug 5, 2025
513cfa0
Merge branch 'branch-25.10' into use-kvikio-mmap
vuule Aug 21, 2025
9a9a423
Merge branch 'branch-25.10' into use-kvikio-mmap
kingcrimsontianyu Aug 23, 2025
d8869aa
Use KvikIO's versatile remote file interface to infer the endpoint type
kingcrimsontianyu Aug 24, 2025
cd15bc6
Cherry-pick build fix
kingcrimsontianyu Aug 25, 2025
10bf781
Revert temp changes to jitify and kvikio cmake files
kingcrimsontianyu Aug 25, 2025
3d9dc0d
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Aug 25, 2025
0a908f4
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Aug 28, 2025
67ef92b
Update pylibcudf
kingcrimsontianyu Aug 28, 2025
0047a6d
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Sep 2, 2025
0bcee77
Merge remote-tracking branch 'origin/use-remote-io-easy-interface' in…
kingcrimsontianyu Sep 2, 2025
a4c3321
Prepend additional message for remote file exception
kingcrimsontianyu Sep 4, 2025
c338a74
Merge branch 'branch-25.10' into use-remote-io-easy-interface
kingcrimsontianyu Sep 4, 2025
09378bf
Remove filepath from error message
kingcrimsontianyu Sep 4, 2025
2e84038
Redact remote file path
kingcrimsontianyu Sep 5, 2025
2f1b12e
Remove an unused header
kingcrimsontianyu Sep 5, 2025
c1c2d79
Remove another unused header that is automatically added by reformatter
kingcrimsontianyu Sep 5, 2025
4fe15e3
Merge branch 'branch-25.10' into use-remote-io-easy-interface
Matt711 Sep 8, 2025
6919b20
Remote IO support in cudf-polars
Matt711 Sep 8, 2025
ec2eb92
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 9, 2025
e23bca8
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 16, 2025
fd9f140
add tests
Matt711 Sep 16, 2025
1b94d1a
code coverage, and missing package dependency
Matt711 Sep 17, 2025
c5374f2
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 17, 2025
01890d3
reuse remote uri logic
Matt711 Sep 17, 2025
675b33f
address review
Matt711 Sep 17, 2025
ef1e0a2
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 17, 2025
2249e15
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 18, 2025
4e4cb87
xfail tests
Matt711 Sep 18, 2025
276585c
address review
Matt711 Sep 18, 2025
4cb3909
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 18, 2025
0d6922e
address review
Matt711 Sep 18, 2025
482e70d
Merge branch 'fea/polars/remote-io' of github.com:Matt711/cudf into f…
Matt711 Sep 18, 2025
30a0594
Merge branch 'branch-25.10' into fea/polars/remote-io
Matt711 Sep 18, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions conda/environments/all_cuda-129_arch-aarch64.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ dependencies:
- pytest-benchmark
- pytest-cases>=3.8.2
- pytest-cov
- pytest-httpserver
- pytest-rerunfailures!=16.0.0
- pytest-xdist
- python-confluent-kafka>=2.8.0,<2.9.0a0
Expand Down
1 change: 1 addition & 0 deletions conda/environments/all_cuda-129_arch-x86_64.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,7 @@ dependencies:
- pytest-benchmark
- pytest-cases>=3.8.2
- pytest-cov
- pytest-httpserver
- pytest-rerunfailures!=16.0.0
- pytest-xdist
- python-confluent-kafka>=2.8.0,<2.9.0a0
Expand Down
1 change: 1 addition & 0 deletions conda/environments/all_cuda-130_arch-aarch64.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,7 @@ dependencies:
- pytest-benchmark
- pytest-cases>=3.8.2
- pytest-cov
- pytest-httpserver
- pytest-rerunfailures!=16.0.0
- pytest-xdist
- python-confluent-kafka>=2.8.0,<2.9.0a0
Expand Down
1 change: 1 addition & 0 deletions conda/environments/all_cuda-130_arch-x86_64.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,7 @@ dependencies:
- pytest-benchmark
- pytest-cases>=3.8.2
- pytest-cov
- pytest-httpserver
- pytest-rerunfailures!=16.0.0
- pytest-xdist
- python-confluent-kafka>=2.8.0,<2.9.0a0
Expand Down
2 changes: 2 additions & 0 deletions dependencies.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -97,6 +97,7 @@ files:
- test_python_common
- test_python_cudf_common
- test_python_pylibcudf
- test_python_cudf_polars
- depends_on_cudf
- depends_on_dask_cuda
- depends_on_pylibcudf
Expand Down Expand Up @@ -891,6 +892,7 @@ dependencies:
- output_types: [conda, requirements, pyproject]
packages:
- rich
- pytest-httpserver
test_python_narwhals:
common:
- output_types: [conda, requirements, pyproject]
Expand Down
22 changes: 17 additions & 5 deletions python/cudf_polars/cudf_polars/dsl/ir.py
Original file line number Diff line number Diff line change
Expand Up @@ -325,18 +325,30 @@ def __init__(
raise NotImplementedError(
"Read from cloud storage"
) # pragma: no cover; no test yet
if any(
str(p).startswith("https:/" if POLARS_VERSION_LT_131 else "https://")
for p in self.paths
):
if (
any(str(p).startswith("https:/") for p in self.paths)
and POLARS_VERSION_LT_131
): # pragma: no cover; polars passed us the wrong URI
# https://github.com/pola-rs/polars/issues/22766
raise NotImplementedError("Read from https")
if any(
str(p).startswith("file:/" if POLARS_VERSION_LT_131 else "file://")
for p in self.paths
):
# TODO: removing the file:// may work
raise NotImplementedError("Read from file URI")
if self.typ == "csv":
if any(plc.io.SourceInfo._is_remote_uri(p) for p in self.paths):
# This works fine when the file has no leading blank lines,
# but currently we do some file introspection
# to skip blanks before parsing the header.
# For remote files we cannot determine if leading blank lines
# exist, so we're punting on CSV support.
# TODO: Once the CSV reader supports skipping leading
# blank lines natively, we can remove this guard.
Comment on lines +341 to +347
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @vuule

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise NotImplementedError(
"Reading CSV from remote is not yet supported"
)

if self.reader_options["skip_rows_after_header"] != 0:
raise NotImplementedError("Skipping rows after header in CSV reader")
parse_options = self.reader_options["parse_options"]
Expand Down
1 change: 1 addition & 0 deletions python/cudf_polars/pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ test = [
"numpy>=1.23,<3.0a0",
"pytest",
"pytest-cov",
"pytest-httpserver",
"pytest-xdist",
"rich",
] # This list was generated by `rapids-dependency-file-generator`. To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`.
Expand Down
116 changes: 116 additions & 0 deletions python/cudf_polars/tests/test_scan.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
from typing import TYPE_CHECKING

import pytest
from werkzeug import Response

import polars as pl

Expand All @@ -14,10 +15,14 @@
assert_ir_translation_raises,
)
from cudf_polars.testing.io import make_partitioned_source
from cudf_polars.utils.versions import POLARS_VERSION_LT_131

if TYPE_CHECKING:
from pathlib import Path

from pytest_httpserver import HTTPServer
from werkzeug import Request


NO_CHUNK_ENGINE = pl.GPUEngine(raise_on_fail=True, parquet_options={"chunked": False})

Expand Down Expand Up @@ -487,3 +492,114 @@ def test_scan_from_file_uri(tmp_path: Path) -> None:
df.write_parquet(path)
q = pl.scan_parquet(f"file://{path}")
assert_ir_translation_raises(q, NotImplementedError)


@pytest.mark.parametrize("chunked", [False, True])
def test_scan_parquet_remote(
request, tmp_path: Path, df: pl.DataFrame, httpserver: HTTPServer, *, chunked: bool
) -> None:
request.applymarker(
pytest.mark.xfail(
condition=POLARS_VERSION_LT_131,
reason="remote IO not supported",
)
)
path = tmp_path / "foo.parquet"
df.write_parquet(path)
bytes_ = path.read_bytes()
size = len(bytes_)

def head_handler(_: Request) -> Response:
return Response(
status=200,
headers={
"Content-Type": "parquet",
"Accept-Ranges": "bytes",
"Content-Length": size,
},
)

def get_handler(req: Request) -> Response:
# parse bytes=200-500 for example (the actual data)
rng = req.headers.get("Range")
if rng and rng.startswith("bytes="):
start, end = map(int, req.headers["Range"][6:].split("-"))
mv = memoryview(bytes_)[start : end + 1]
return Response(
mv.tobytes(),
status=206,
headers={
"Content-Type": "parquet",
"Accept-Ranges": "bytes",
"Content-Length": len(mv),
"Content-Range": f"bytes {start}-{end}/{size}",
},
)
return Response(
bytes_,
status=200,
headers={
"Content-Type": "parquet",
"Accept-Ranges": "bytes",
"Content-Length": size,
},
)

server_path = "/foo.parquet"
httpserver.expect_request(server_path, method="HEAD").respond_with_handler(
head_handler
)
httpserver.expect_request(server_path, method="GET").respond_with_handler(
get_handler
)

q = pl.scan_parquet(httpserver.url_for(server_path))

assert_gpu_result_equal(
q, engine=pl.GPUEngine(raise_on_fail=True, parquet_options={"chunked": chunked})
)


def test_scan_ndjson_remote(
request, tmp_path: Path, df: pl.LazyFrame, httpserver: HTTPServer
) -> None:
request.applymarker(
pytest.mark.xfail(
condition=POLARS_VERSION_LT_131,
reason="remote IO not supported",
)
)
path = tmp_path / "foo.jsonl"
df.write_ndjson(path)
bytes_ = path.read_bytes()
size = len(bytes_)

def head_handler(_: Request) -> Response:
return Response(
status=200,
headers={
"Content-Type": "ndjson",
"Content-Length": size,
},
)

def get_handler(_: Request) -> Response:
return Response(
bytes_,
status=200,
headers={
"Content-Type": "ndjson",
"Content-Length": size,
},
)

server_path = "/foo.jsonl"
httpserver.expect_request(server_path, method="HEAD").respond_with_handler(
head_handler
)
httpserver.expect_request(server_path, method="GET").respond_with_handler(
get_handler
)

q = pl.scan_ndjson(httpserver.url_for(server_path))
assert_gpu_result_equal(q)
12 changes: 9 additions & 3 deletions python/pylibcudf/pylibcudf/io/types.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -301,6 +301,7 @@ cdef class TableInputMetadata:
for i in range(self.c_obj.column_metadata.size())
]


cdef class TableWithMetadata:
"""A container holding a table and its associated metadata
(e.g. column names)
Expand Down Expand Up @@ -467,8 +468,6 @@ cdef class SourceInfo:
A homogeneous list of sources to read from. Mixing
different types of sources will raise a `ValueError`.
"""
# Regular expression that match remote file paths supported by libcudf
_is_remote_file_pattern = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", re.IGNORECASE)

def __init__(self, list sources):
if not sources:
Expand All @@ -483,7 +482,7 @@ cdef class SourceInfo:
for src in sources:
if not isinstance(src, (os.PathLike, str)):
raise ValueError("All sources must be of the same type!")
if not (os.path.isfile(src) or self._is_remote_file_pattern.match(src)):
if not (os.path.isfile(src) or SourceInfo._is_remote_uri(src)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if not (os.path.isfile(src) or SourceInfo._is_remote_uri(src)):
if not (os.path.isfile(src) or self._is_remote_uri(src)):

Nit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since _is_remote_uri is a staticmethod, SourceInfo._is_remote_uri is preferred, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a particular preference. Generally using self would still automatically respect a subclass' override of _is_remote_uri which is a nice guarantee.

Probably not worth another commit & CI run if this PR is close to merging, but good to consider in the future.

raise FileNotFoundError(
errno.ENOENT, os.strerror(errno.ENOENT), src
)
Expand Down Expand Up @@ -538,6 +537,13 @@ cdef class SourceInfo:

self.c_obj = source_info(host_span[host_span[const_byte]](self._hspans))

@staticmethod
def _is_remote_uri(path: str | os.PathLike) -> bool:
# Regular expression that match remote file paths supported by libcudf
return re.compile(
r"^[a-zA-Z][a-zA-Z0-9+.\-]*://", re.IGNORECASE
).match(str(path)) is not None

def _init_byte_like_sources(self, list sources, type expected_type):
cdef const unsigned char[::1] c_buffer
cdef bint empty_buffer = True
Expand Down