[Core] Restore worker silently fails and the program is stuck #24248
Comments
@rkooo567 do you think you could take this one, since you worked on the spill error?
Yeah, I'd love to. It seems to be part of the better-error-messages work as well (which I worked on for a bit and then paused). Is P2 the correct priority?
I'm upgrading this to P1, since it is actually pretty bad for debugging if we're silently dropping errors. @rkooo567 do you think you have bandwidth to take this in the next week or two?
Hmm, I will be a bit busy until June 21st, I think. I can share some thoughts if there's anybody on the Shuffle project's side who has bandwidth instead?
I think we can find someone to work on it. But please share what you know about the issue so far!
Could you point us to the related PRs / any examples of bubbling up errors from workers? @rkooo567
@kennethlien Spilling is needed when ray.put is called. Right now, if spilling fails, it is retried later when another spill happens. (Ideally, we'd like to fail ray.put if this happens a lot, but I think that is P2.) For restoration, we have the hanging issue described here: a restore failure is only written to the IO worker's log, the restore is retried a few times, and the system eventually gives up, so the caller hangs with no error.
Ideally, we'd like to catch these failures and raise an exception for users when restoration keeps failing; see the sketch after this comment.
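A minimal sketch of that retry-and-raise idea (not from the thread; `restore_with_retry` and `restore_fn` are hypothetical stand-ins, since the real retry loop lives in the raylet rather than in a Python helper):

```python
# Hypothetical sketch: bound the number of restore attempts and surface the
# last error to the caller instead of retrying (and hanging) forever.
MAX_RESTORE_ATTEMPTS = 3

def restore_with_retry(restore_fn, *args):
    last_error = None
    for _ in range(MAX_RESTORE_ATTEMPTS):
        try:
            return restore_fn(*args)
        except Exception as e:  # e.g. the spilled file was deleted
            last_error = e
    # After exhausting retries, fail loudly; chaining with `from` keeps the
    # original restore error visible in the user's traceback.
    raise RuntimeError(
        f"Restoring the spilled object failed after {MAX_RESTORE_ATTEMPTS} attempts"
    ) from last_error
```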
Working on this at #25000.
Is there a way to do this without changing the return value of line 916 in c74886a?
@kennethlien I think here you can just throw the exception like we do in the
@rkooo567 We're not sure how to get
How about we do both? Log the error in the restore handler, and then on the raylet side mark the object as failed with a generic "OBJECT_RESTORATION_FAILED" error. That way we shouldn't have endless restore retries.
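A rough sketch of that "do both" approach (hypothetical wrapper; the actual restore entry point, `restore_spilled_objects` in `external_storage.py`, may have a different signature):

```python
import logging

logger = logging.getLogger(__name__)

def restore_spilled_objects_logged(storage, object_refs, urls_with_offsets):
    """Hypothetical wrapper around the restore handler in external_storage.py."""
    try:
        return storage.restore_spilled_objects(object_refs, urls_with_offsets)
    except Exception:
        # Log the full traceback to the IO worker's log so the failure is no
        # longer silent, then re-raise so the raylet side can mark the object
        # as failed with a generic OBJECT_RESTORATION_FAILED error instead of
        # retrying forever.
        logger.exception("Restoring spilled objects failed")
        raise
```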
I think if we can raise an exception that includes the underlying error in its message, it is not necessary to stream logs? (I am okay with this approach if adding the exception message to the error is really difficult.) @kennethlien how big is the change? I am also happy to do a 1:1 to discuss more options to move this forward faster.
What happened + What you expected to happen
When the restore worker encounters an error (e.g. in #24196, or in general, for example when the spilled file has been removed), it fails silently: it only prints an error to /tmp/ray/session_latest/logs/io-*.err. It looks like the system schedules a few retries to restore the object but eventually gives up. On the application side, the program is stuck: no progress is made, no error is thrown, and there is no error message. It would be nice if the error could be surfaced back to the application. For example, when the spill worker errors due to disk space or another reason, the application receives a Python error. Can we do the same for restore workers?
Versions / Dependencies
master
Reproduction script
You can manually make the restore worker throw in external_storage.py, then test its behavior.
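A rough reproduction sketch along those lines (not from the issue; it assumes Ray's documented filesystem `object_spilling_config` and deletes the spill directory so every restore fails):

```python
import json
import shutil

import numpy as np
import ray

SPILL_DIR = "/tmp/ray_spill_repro"  # spill directory we control (assumption)

# Use a small object store plus a filesystem spill directory so that
# ray.put is forced to spill.
ray.init(
    object_store_memory=75 * 1024 * 1024,
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": SPILL_DIR}}
        )
    },
)

# Put more data than the object store holds, forcing spills to SPILL_DIR.
refs = [ray.put(np.zeros(10 * 1024 * 1024, dtype=np.uint8)) for _ in range(20)]

# Delete the spilled files out from under Ray so restoration must fail.
shutil.rmtree(SPILL_DIR)

# Today this hangs, with errors only in /tmp/ray/session_latest/logs/io-*.err;
# the desired behavior is a Python exception raised here instead.
print(ray.get(refs[0]).shape)
```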
Issue Severity
Low: It annoys or frustrates me.