What happened + What you expected to happen
I noticed that I cannot read 16 or more HTTP paths using ray.data.read_json.
This example works as expected:
import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls[:15],
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()
However, reading 16 paths raises FileNotFoundError: https:/data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26 (the reported path is the parent directory of the files, with only one slash after https:):
import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls[:16],
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()
The bug seems to be in _expand_paths in file_meta_provider.py and appears only when the number of paths is >= FILE_SIZE_FETCH_PARALLELIZATION_THRESHOLD (apparently 16, judging by the two examples above).
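One plausible explanation for the mangled path, offered as a sketch rather than a confirmed diagnosis: if _expand_paths derives a shared parent directory with os.path.commonpath (an assumption, not verified against the 2.41.0 source), the URL scheme would be damaged, because on POSIX commonpath drops empty path components and so collapses `//` to `/`. The snippet below uses only the standard library and the URLs from this report; it does not call Ray.

```python
import posixpath

# The 16 URLs from the report above.
data_urls = [
    "https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/"
    "crawl=CC-MAIN-2022-49/1669446711712.26/"
    f"CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz"
    for i in range(16)
]

# posixpath is what os.path resolves to on macOS and Linux.
# commonpath splits on "/" and discards empty components, so the
# "//" after the scheme is reduced to a single "/".
common = posixpath.commonpath(data_urls)
print(common)
```

The printed value has the same shape as the path in the FileNotFoundError: the parent directory of the files, prefixed with `https:/` instead of `https://`.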
Versions / Dependencies
ray: 2.41.0
python: 3.11
macOS Sequoia 15.2
Reproduction script
import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls,
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()
Issue Severity
Medium: It is a significant difficulty but I can work around it.
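A possible workaround, sketched under the assumption that the failure only occurs when a single read_json call receives 16 or more paths: read the URLs in batches of at most 15 and combine the results with Dataset.union. The helper names `chunked` and `read_json_in_batches` are hypothetical, and this sketch has not been run against the URLs above.

```python
from typing import Iterator, List


def chunked(paths: List[str], size: int = 15) -> Iterator[List[str]]:
    """Yield consecutive slices of `paths` with at most `size` items each."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def read_json_in_batches(paths: List[str], size: int = 15):
    """Read gzip-compressed JSONL files in small batches and union them.

    Assumption: keeping each call below 16 paths avoids the failing
    code path described in this report.
    """
    # Imported here so that `chunked` can be used without Ray installed.
    import ray

    parts = [
        ray.data.read_json(
            batch,
            arrow_open_stream_args={"compression": "gzip"},
        )
        for batch in chunked(paths, size)
    ]
    if not parts:
        raise ValueError("paths must not be empty")
    first, rest = parts[0], parts[1:]
    # Dataset.union accepts one or more other datasets.
    return first.union(*rest) if rest else first
```

Calling `read_json_in_batches(data_urls)` with the 16 URLs from the reproduction script would issue two reads, one with 15 paths and one with a single path.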
GregorZiegltrumAA added the bug and triage labels on Jan 27, 2025