
[Data] read_json does not work for 16 or more URLs #50080

Open
GregorZiegltrumAA opened this issue Jan 27, 2025 · 0 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


GregorZiegltrumAA commented Jan 27, 2025

What happened + What you expected to happen

I noticed that I cannot read 16 or more HTTP URLs with read_json.

This example works as expected:

import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls[:15],
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()

However, this raises FileNotFoundError: https:/data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26 (note the single slash after "https:" in the reported path):

import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls[:16],
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()

The bug seems to be in _expand_paths in file_meta_provider.py, which takes a different code path when the number of paths is >= FILE_SIZE_FETCH_PARALLELIZATION_THRESHOLD.
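The mangled https:/ prefix in the error message suggests that the parallel code path may be running the URLs through filesystem-style path normalization, which collapses the double slash after the scheme. This is my guess at the mechanism, not confirmed against the Ray source, but it is easy to illustrate:

```python
import posixpath

# posixpath.normpath collapses repeated slashes, so the "//" in a URL
# scheme becomes "/" - the same malformed prefix the error reports.
url = "https://data.commoncrawl.org/contrib/datacomp/DCLM-pool"
print(posixpath.normpath(url))
# https:/data.commoncrawl.org/contrib/datacomp/DCLM-pool
```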

Versions / Dependencies

ray: 2.41.0
python: 3.11
macOS Sequoia 15.2

Reproduction script

import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls,
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()

Issue Severity

Medium: It is a significant difficulty but I can work around it.
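For anyone hitting the same problem, one possible workaround sketch (hypothetical helper names, assuming the failure only triggers at 16 or more paths per call) is to read the URLs in chunks below the threshold and union the resulting datasets:

```python
def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]

def read_json_chunked(urls, chunk_size=15, **read_kwargs):
    """Hypothetical workaround: keep each read_json call under the
    threshold that triggers the failing parallel metadata fetch,
    then combine the pieces with Dataset.union."""
    import ray

    datasets = [
        ray.data.read_json(part, **read_kwargs)
        for part in chunk(urls, chunk_size)
    ]
    if len(datasets) > 1:
        return datasets[0].union(*datasets[1:])
    return datasets[0]
```

Unioning many small datasets has some overhead, but it avoids the failing code path entirely.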

@GregorZiegltrumAA GregorZiegltrumAA added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 27, 2025
@jcotant1 jcotant1 added the data Ray Data-related issues label Jan 27, 2025