Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 500 HTTP Error to retry list #2567

Merged
merged 3 commits into from
Sep 27, 2024
Merged

Add 500 HTTP Error to retry list #2567

merged 3 commits into from
Sep 27, 2024

Conversation

farzadab
Copy link
Contributor

As reported in #2559, reading from HF Hub datasets can sometimes lead to 500 Internal Server Errors, which even if they're rare, can cause issues on large training runs.

The source of this error is not yet identified, but a simple fix seems to be to add the missing 500 error code to the retry list.

@Wauplin
Copy link
Contributor

Wauplin commented Sep 26, 2024

hI @farzadab, thanks for the PR. http_backoff is called 3 times in hf_file_system.py to download files in different ways. Could you update all 3 of them? Thanks!

@farzadab
Copy link
Contributor Author

My bad. Done.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Looking good now :)

For the record, we've discussed this change internally (private slack). We had some concerns that retrying on HTTP 500 without knowing the root cause is clunky (can be transient error or not) but since:

  • we retry only up to 5 times (so ~1min)
  • and only when using hf_file_system so basically when streaming a dataset from the Hub

Then it's fine do to it. The cost of breaking a training process because of transient HTTP 500 is higher than when breaking a download process when instantiating a model once.

All of this to say, it's good to merge! 🤗

@Wauplin Wauplin merged commit 476fa0b into huggingface:main Sep 27, 2024
16 checks passed
@farzadab farzadab deleted the patch-1 branch September 30, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants