HfHubHTTPError: 500 Server Error: Internal Server Error #2559

Closed
zqhuang211 opened this issue Sep 21, 2024 · 17 comments
Labels
bug Something isn't working

Comments

@zqhuang211

Describe the bug

I have been encountering this error frequently while running model training. Despite its high frequency (multiple times a day), I cannot consistently replicate the error, as it often resolves after additional attempts and then reappears randomly with a different Parquet file in the same dataset or a different dataset.

I am unsure whether this issue is related to how we created and/or uploaded the dataset to HF or if it stems from HF’s internal servers.

Any insights and assistance with this issue would be greatly appreciated.
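For context, here is a minimal sketch of the kind of streaming setup involved (the config/split names below are inferred from the failing URL in the traceback; our actual training code wraps this iterable in its own dataset classes and a PyTorch DataLoader):

    # Minimal sketch (assumed): stream the dataset from the Hub instead of downloading it.
    # Config/split names are inferred from the failing URL in the traceback below.
    from datasets import load_dataset

    ds = load_dataset(
        "fixie-ai/librispeech_asr", "clean", split="train.100", streaming=True
    )

    # In the real training code this iterable is wrapped in our own dataset classes
    # and consumed through a PyTorch DataLoader with multiple workers.
    for sample in ds:
        ...  # preprocess / feed into the training step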

[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank0]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
[rank0]:     data.append(next(self.dataset_iter))
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/ultravox-expts/ultravox/data/datasets.py", line 1080, in <genexpr>
[rank0]:     return (self._process(sample) for sample in self._dataset)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/ultravox-expts/ultravox/data/datasets.py", line 1049, in __iter__
[rank0]:     yield next(iters[iter_index])
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/ultravox-expts/ultravox/data/datasets.py", line 440, in __iter__
[rank0]:     for _, row in enumerate(self._dataset):
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1368, in __iter__
[rank0]:     yield from self._iter_pytorch()
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 1303, in _iter_pytorch
[rank0]:     for key, example in ex_iterable:
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 987, in __iter__
[rank0]:     for x in self.ex_iterable:
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 486, in __iter__
[rank0]:     yield from ex_iterable
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/iterable_dataset.py", line 282, in __iter__
[rank0]:     for key, pa_table in self.generate_tables_fn(**self.kwargs):
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/packaged_modules/parquet/parquet.py", line 90, in _generate_tables
[rank0]:     for batch_idx, record_batch in enumerate(
[rank0]:   File "pyarrow/_parquet.pyx", line 1621, in iter_batches
[rank0]:   File "pyarrow/error.pxi", line 89, in pyarrow.lib.check_status
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 1101, in read_with_retries
[rank0]:     out = read(*args, **kwargs)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/huggingface_hub/hf_file_system.py", line 757, in read
[rank0]:     return super().read(length)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/fsspec/spec.py", line 1846, in read
[rank0]:     out = self.cache._fetch(self.loc, self.loc + length)
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/fsspec/caching.py", line 189, in _fetch
[rank0]:     self.cache = self.fetcher(start, end)  # new block replaces old
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/huggingface_hub/hf_file_system.py", line 720, in _fetch_range
[rank0]:     hf_raise_for_status(r)
[rank0]:   File "/root/.cache/pypoetry/virtualenvs/ultravox-XehuBpN1-py3.11/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
[rank0]:     raise _format(HfHubHTTPError, str(e), response) from e
[rank0]: huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/fixie-ai/librispeech_asr/resolve/ea1da511055b346c29e03ed435f5b276d1de5ac2/clean/train.100-00005-of-00024.parquet (Request ID: Root=1-66ef1f5e-7a30da9e07abcf7b0ab91451;2eae3886-588c-402a-8784-b771a3a04e76)

[rank0]: Internal Error - We're working hard to fix this as soon as possible!

Reproduction

No response

Logs

No response

System info

- huggingface_hub version: 0.25.0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.11.10
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: zqhuang
- Configured git credential helpers: 
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.4.1
- Jinja2: 3.1.4
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.4.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: 2.6.2.2
- numpy: 2.0.2
- pydantic: 2.9.2
- aiohttp: 3.10.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /root/.cache/huggingface/hub
- HF_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 300
zqhuang211 added the bug label Sep 21, 2024
@hanouticelina
Contributor

Hello @zqhuang211, thank you for reporting this issue! We're looking into it internally to find the root cause. Are you using any proxy or specific network setup? This info will help us investigate.

@zqhuang211
Author

@hanouticelina Thanks! I should have added that this is on the Mosaic AI Model Training platform from Databricks.

@Wauplin
Contributor

Wauplin commented Sep 23, 2024

We can't find any error in our logs with request ID Root=1-66ef1f5e-7a30da9e07abcf7b0ab91451;2eae3886-588c-402a-8784-b771a3a04e76, or even any HTTP 500 on the URL https://huggingface.co/datasets/fixie-ai/librispeech_asr/resolve/ea1da511055b346c29e03ed435f5b276d1de5ac2/clean/train.100-00005-of-00024.parquet. This suggests that the request never reached our servers. Could it be that MosaicML has a mirror/proxy that handles calls to the Hugging Face Hub?

@zqhuang211
Author

Hi @Wauplin, thanks for looking into this. Could you also take a look at the following requests? If none of them reached the HF servers, then it is more likely a MosaicML issue.

huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/fixie-ai/librispeech_asr/resolve/ea1da511055b346c29e03ed435f5b276d1de5ac2/clean/train.100-00022-of-00024.parquet (Request ID: Root=1-66f0b061-0927d6a9661325ee2f4c8d9c;618c5d6b-0585-4041-ad3f-77ad6f646798)

huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/api/datasets/fixie-ai/librispeech_asr/revision/ea1da511055b346c29e03ed435f5b276d1de5ac2 (Request ID: Root=1-66f07748-78e6c8544ee3ff2b361c7e15;4b331c17-deae-485b-9031-a99a9dd79745)

huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/fixie-ai/librispeech_asr/resolve/ea1da511055b346c29e03ed435f5b276d1de5ac2/clean/train.100-00006-of-00024.parquet (Request ID: Root=1-66f075a1-32c5a47f4ab69fbd14de03f6;79d75d37-6ca1-41ab-990c-625343a8e24f)

huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/fixie-ai/librispeech_asr/resolve/ea1da511055b346c29e03ed435f5b276d1de5ac2/clean/train.360-00073-of-00084.parquet (Request ID: Root=1-66f0008e-419451b101b5c6c415e6dfa7;0d8f2f7d-e01a-48d2-ad7f-3e2727b48299)

huggingface_hub.errors.HfHubHTTPError: 500 Server Error: Internal Server Error for url: https://huggingface.co/datasets/fixie-ai/librispeech_asr/resolve/ea1da511055b346c29e03ed435f5b276d1de5ac2/clean/train.360-00028-of-00084.parquet (Request ID: Root=1-66ef9cc7-46a0f9771edda74a071949e3;77465ded-1272-4344-aa95-5e4e72686f2a)

@Wauplin
Contributor

Wauplin commented Sep 23, 2024

Same here, we have no logs with these request IDs internally.
For the record, the request IDs are created by the client library when sending the requests. The ID is logged as soon as the request reaches our Hub (i.e. before anything else is done), so the fact that we don't see them at all makes me think something is unusual in the environment.
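If it helps debugging, here is a small sketch (not an official helper; the header names are an assumption about what the Hub returns) of how you could surface the request ID client-side when the error is caught, to match it against server-side logs:

    # Hypothetical debugging helper (not part of huggingface_hub): print the request ID
    # from a failed Hub call so it can be matched against server-side logs.
    # Assumes the ID is exposed via the x-request-id / X-Amzn-Trace-Id response headers.
    from huggingface_hub.errors import HfHubHTTPError

    def log_request_id(exc: HfHubHTTPError) -> None:
        resp = exc.response  # HfHubHTTPError wraps the underlying requests.Response
        if resp is not None:
            print("status:", resp.status_code)
            print("x-request-id:", resp.headers.get("x-request-id"))
            print("amzn-trace-id:", resp.headers.get("X-Amzn-Trace-Id"))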

@zqhuang211
Author

Thank you. This is helpful. I will work with the Mosaic folks to figure out what is going on and will report back.

@juberti

juberti commented Sep 24, 2024

@Wauplin, does the presence of a 500 mean that the request did make it to an HF server but was rejected immediately? Or could the request have failed to connect and the library synthesized a 500 internally?

If the answer is that the request was rejected at the HF server, what could explain that?

@juberti

juberti commented Sep 25, 2024

I looked closer at the datasets HTTP code and I didn't see a case where these 500 errors could be generated within the library.

Also, I talked to MosaicML support and they indicated that there shouldn't be any proxy in the path between our training node and the HF Hub, which suggests that the 500s are most likely coming from the HF Hub.

Thoughts?

@hlky
Contributor

hlky commented Sep 25, 2024

Looks like there's no retry on 500 in _fetch_range. For uploads there is a retry on 500; this could be the same kind of transient S3 error.
It could be useful to log length in HfFileSystemFile's read, or start and end in _fetch_range, in case there's some issue with the file object.
Also, from the line numbers (datasets/utils/file_utils.py, line 1101, in read_with_retries) it looks like you're using datasets 2.19.2; there are some changes to these functions in newer versions which might help.
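As a stopgap, something along these lines on the caller side could paper over transient 500s (illustrative only; retry_on_500 is a hypothetical helper, not an existing huggingface_hub or datasets API):

    # Illustrative caller-side workaround: retry a flaky Hub call on HTTP 500 with
    # exponential backoff. retry_on_500 is a hypothetical helper, not a library API.
    import time
    from huggingface_hub.errors import HfHubHTTPError

    def retry_on_500(fn, *args, max_retries=5, base_delay=1.0, **kwargs):
        for attempt in range(max_retries):
            try:
                return fn(*args, **kwargs)
            except HfHubHTTPError as e:
                status = e.response.status_code if e.response is not None else None
                if status != 500 or attempt == max_retries - 1:
                    raise
                time.sleep(base_delay * 2**attempt)  # back off before retrying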

@Wauplin
Contributor

Wauplin commented Sep 25, 2024

@juberti thanks for asking around. In the end we finally tracked down the logs. We did have a few HTTP 500s over the last 10 days on the /resolve endpoints due to a Gitaly issue. It should be fixed by now.

Or could the request have failed to connect and the library synthesized a 500 internally?

An HTTP 500 can only be returned by the server (the library itself only forwards it to the user).
So the real explanation is that the server indeed failed. Nothing on MosaicML's side, sorry for that assumption 🙏

@Wauplin
Contributor

Wauplin commented Sep 25, 2024

So @juberti @zqhuang211, could you let us know if the issue arises again in the future, or whether things have been stable since?

@zqhuang211
Author

@Wauplin @hlky Thank you both. We are adding 500 to the retry list and will report back after running more experiments. We will let the MosaicML folks know too.

@juberti

juberti commented Sep 25, 2024

Thanks for digging in here, glad we were able to identify the root cause. We'll send a PR to retry on 500, which should be an effective workaround if this issue recurs.

@farzadab
Contributor

@Wauplin I created a one-line PR here that you can take a look at: #2567

@Wauplin
Contributor

Wauplin commented Sep 27, 2024

If that's fine with you, I'll close this issue :)
#2567 has been merged to retry on HTTP 500 when streaming a dataset. The root cause should also be mitigated server-side so you shouldn't be bothered anymore. Let me know if it's not the case and I'll reopen :)

Wauplin closed this as completed Sep 27, 2024
@zqhuang211
Author

@Wauplin yes, the issue can be closed. I have run many training runs with the retry fix and haven’t had any problems. Thanks!

@Wauplin
Contributor

Wauplin commented Sep 27, 2024

Good news!
