HfHubHTTPError: 500 Server Error: Internal Server Error #2559
Comments
Hello @zqhuang211, thank you for reporting this issue! We're looking into it internally to find the root cause. Are you using a proxy or any specific network setup? This information will help us investigate. |
@hanouticelina Thanks! I should have added that this is on the Mosaic AI model training platform from Databricks. |
We can't find any error in our logs with request ID |
Hi @Wauplin, Thanks for looking into this. Could you also take a look at the following requests? If none had reached HF servers, then it is more likely a Mosaic ML issue.
|
Same, we have no logs with these request ids internally. |
Thank you. This is helpful. I will work with the Mosaic folks to figure out what is going on and will report back. |
@Wauplin, does the presence of a 500 mean that the request did make it to a HF server but was rejected immediately? Or could the request have failed to connect and the library synthesized a 500 internally? If the answer is that the request was rejected at the HF server, what could explain that? |
I looked closer at the datasets HTTP code and I didn't see a case where these 500 errors could be generated within the library. Also, I talked to MosaicML support and they indicated that there shouldn't be any proxy in the path between our training node and HF Hub - which suggests that the 500 most likely is coming from HF Hub. Thoughts? |
Looks like there's no retry on |
@juberti thanks for asking around. In the end we found the logs. We have had a few HTTP 500s over the last 10 days on the /resolve endpoints due to a Gitaly issue. It should be fixed by now.
For sure an HTTP 500 can only be returned by the server (the library itself only forwards it to the user). |
So @juberti @zqhuang211 could you let us know if the issue arises again in the future, or whether things have stabilized? |
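To illustrate the distinction made above between a 500 returned by the server and a request that never connected, here is a minimal sketch using only Python's standard `urllib`. It is not how `huggingface_hub` or `datasets` handle errors; `classify_failure` is a hypothetical helper introduced for illustration. The point is that an HTTP status code only exists when a server actually answered, while a connection failure surfaces as a different exception with no status code at all.

```python
import urllib.error


def classify_failure(exc: Exception) -> str:
    """Describe whether a failed request got an HTTP response (hypothetical helper)."""
    # HTTPError is a subclass of URLError, so it must be checked first.
    if isinstance(exc, urllib.error.HTTPError):
        # A status code is present, so some server produced this response.
        return f"server responded with HTTP {exc.code}"
    if isinstance(exc, urllib.error.URLError):
        # DNS failure, refused connection, timeout, etc.: no status code exists.
        return f"no HTTP response: {exc.reason}"
    return f"unrelated error: {exc!r}"
```

When reporting a failure like the one in this issue, including which of these cases occurred (and the request ID, if a response was received) makes it easier to tell a server-side problem from a network one.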
Thanks for digging in here, glad we were able to identify the root cause. We'll send a PR to retry on 500 which should be an effective workaround if this issue recurs. |
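For readers hitting the same problem, the retry-on-500 workaround mentioned above could look roughly like the sketch below. This is not the actual PR; `ServerError` and `retry_on_5xx` are hypothetical names, and `ServerError` stands in for whatever exception your HTTP client raises with a status code attached. It retries only 5xx responses, with exponential backoff, and re-raises client errors immediately.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ServerError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status_code: int, message: str = "") -> None:
        super().__init__(f"{status_code} {message}".strip())
        self.status_code = status_code


def retry_on_5xx(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying with exponential backoff on 5xx errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ServerError as err:
            # 4xx errors will not fix themselves, so re-raise them at once.
            is_transient = 500 <= err.status_code < 600
            if not is_transient or attempt == max_attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable: the loop always returns or raises")
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. In a long training run, a small number of attempts with a short base delay is usually enough to ride out an intermittent server-side failure like the one described here.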
If that's fine with you, I'll close this issue :) |
@Wauplin yes, the issue can be closed. I have run many training runs with the retry fix and haven’t had any problems. Thanks! |
Good news! |
Describe the bug
I have been encountering this error frequently while running model training. Despite its high frequency (multiple times a day), I cannot consistently replicate the error, as it often resolves after additional attempts and then reappears randomly with a different Parquet file in the same dataset or a different dataset.
I am unsure whether this issue is related to how we created and/or uploaded the dataset to HF or if it stems from HF’s internal servers.
Any insights and assistance with this issue would be greatly appreciated.
Reproduction
No response
Logs
No response
System info