
[Data] read_json does not work for 16 or more URLs #50080

Open
GregorZiegltrumAA opened this issue Jan 27, 2025 · 0 comments
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues triage Needs triage (eg: priority, bug/not-bug, and owning component)

Comments


GregorZiegltrumAA commented Jan 27, 2025

What happened + What you expected to happen

I noticed that I cannot read 16 or more HTTP URLs with read_json.

This example works as expected:

import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls[:15],
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()

However, this raises FileNotFoundError: https:/data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26 (note the single slash after "https:" in the reported path):

import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls[:16],
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()

The bug seems to be in _expand_paths in file_meta_provider.py, which takes a different code path when the number of paths is >= FILE_SIZE_FETCH_PARALLELIZATION_THRESHOLD.
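The mangled https:/ prefix in the error message suggests that the parallel code path may be running the URLs through filesystem-style path normalization, which collapses the double slash after the scheme. This is my guess at the mechanism, not confirmed against the Ray source, but it is easy to illustrate:

```python
import posixpath

# posixpath.normpath collapses repeated slashes, so the "//" in a URL
# scheme becomes "/" - the same malformed prefix the error reports.
url = "https://data.commoncrawl.org/contrib/datacomp/DCLM-pool"
print(posixpath.normpath(url))
# https:/data.commoncrawl.org/contrib/datacomp/DCLM-pool
```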

Versions / Dependencies

ray: 2.41.0
python: 3.11
macOS Sequoia 15.2

Reproduction script

import ray
data_urls = [f"https://data.commoncrawl.org/contrib/datacomp/DCLM-pool/crawl=CC-MAIN-2022-49/1669446711712.26/CC-MAIN-20221210042021-20221210072021-000{i:02}.jsonl.gz" for i in range(16)]
ds = ray.data.read_json(
    data_urls,
    arrow_open_stream_args={"compression": "gzip"},
)
ds.show()

Issue Severity

Medium: It is a significant difficulty but I can work around it.
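For anyone hitting the same problem, one possible workaround sketch (hypothetical helper names, assuming the failure only triggers at 16 or more paths per call) is to read the URLs in chunks below the threshold and union the resulting datasets:

```python
def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i : i + size] for i in range(0, len(items), size)]

def read_json_chunked(urls, chunk_size=15, **read_kwargs):
    """Hypothetical workaround: keep each read_json call under the
    threshold that triggers the failing parallel metadata fetch,
    then combine the pieces with Dataset.union."""
    import ray

    datasets = [
        ray.data.read_json(part, **read_kwargs)
        for part in chunk(urls, chunk_size)
    ]
    if len(datasets) > 1:
        return datasets[0].union(*datasets[1:])
    return datasets[0]
```

Unioning many small datasets has some overhead, but it avoids the failing code path entirely.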

@GregorZiegltrumAA GregorZiegltrumAA added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jan 27, 2025
@jcotant1 jcotant1 added the data Ray Data-related issues label Jan 27, 2025