[core] Fix "Check failed: it->second.num_retries_left == -1" #54116
Conversation
Pull Request Overview
This PR addresses a crash caused by resubmitting a task after it’s been canceled by adding a guard in ResubmitTask and a corresponding test.
- Return `false` early in `TaskManager::ResubmitTask` when a task is canceled.
- Add `TestResubmitCanceledTask` to verify that resubmitting a canceled task fails gracefully.
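The guard described above can be modeled in Python. This is a sketch of the control flow only: the real implementation is C++ in `src/ray/core_worker/task_manager.cc`, and the names here are illustrative.

```python
from dataclasses import dataclass


# Hypothetical Python model of the early-return guard; not Ray's real API.
@dataclass
class TaskEntry:
    num_retries_left: int  # -1 means unlimited retries


class TaskManager:
    def __init__(self):
        self.submissible_tasks = {}  # task_id -> TaskEntry

    def resubmit_task(self, task_id):
        entry = self.submissible_tasks.get(task_id)
        if entry is None:
            return False
        if entry.num_retries_left == 0:
            # The new guard: a canceled task has its retries zeroed out,
            # so bail out instead of tripping CHECK(num_retries_left == -1).
            return False
        if entry.num_retries_left > 0:
            entry.num_retries_left -= 1
        return True


tm = TaskManager()
tm.submissible_tasks["t1"] = TaskEntry(num_retries_left=0)  # canceled
tm.submissible_tasks["t2"] = TaskEntry(num_retries_left=2)
assert tm.resubmit_task("t1") is False  # fails gracefully, no crash
assert tm.resubmit_task("t2") is True
assert tm.submissible_tasks["t2"].num_retries_left == 1
```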
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| task_manager.cc | Added a check to bail out when num_retries_left is zero |
| task_manager_test.cc | Added TestResubmitCanceledTask to cover the new cancellation path |
cc @dayshah
src/ray/core_worker/task_manager.cc
Outdated
I think we should do more than just return false here, because we set the object error, which is visible to the user, based on this. If we just return false directly, the error will become OBJECT_UNRECONSTRUCTABLE_MAX_ATTEMPTS_EXCEEDED, but the status should be whatever we set it to when the cancel happens, not this.
makes sense, i'll add another error type
why do we need to do this in the test?
this test suite requires every test case to clean up after itself (line 153)
😨
@dayshah's comments
I'm missing something really obvious here, but what's the context for this? How does the object_ref become available if the task was canceled? The possibilities I can think of are:
In all of those cases, should objects be reconstructed if the user intended for the task to be canceled?
Force-pushed 89dcc22 to 26d6a1d
src/ray/core_worker/task_manager.cc
Outdated
```cpp
  return ResubmitTaskResult::FAILED_MAX_ATTEMPT_EXCEEDED;
}

if (it->second.num_retries_left == 0) {
```
The main change is this condition; the rest is boilerplate due to the addition of a new field in common.proto
src/ray/core_worker/task_manager.cc
Outdated
```cpp
if (it->second.num_retries_left == 0) {
  // This can happen when the task has been marked for cancellation.
  return ResubmitTaskResult::FAILED_TASK_CANCELED;
```
i was thinking we could just use the existing cancel error type; we don't need a new custom one just for this situation.
ray/src/ray/protobuf/common.proto
Lines 203 to 204 in 95d41f6
```proto
// Indicates that an object has been cancelled.
TASK_CANCELLED = 5;
```
The user is trying to cancel, we just need to honor that cancel and not resubmit
lol that's much better
I think the description needs to be rewritten a bit. The reason for this is that one thread, the thread the user actually calls ray.cancel on, can start the cancel, while another thread (the io_service of the reconstruction periodical runner) decides to resubmit the task. They both use the task manager. Cancelling doesn't atomically use the task manager to cancel, so you could do the first part of the cancel, set

ray/src/ray/core_worker/transport/normal_task_submitter.cc, lines 710 to 726 in 95d41f6
@can-anyscale should probably leave a comment in the test too on how these things are called and how this interleaving can happen. Also might be worth thinking about any higher-level fixes that could fix the submissible_tasks_ check failure issue too. I haven't put too much thought into what the implications of an atomic cancel might be.
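The interleaving described in this thread can be replayed deterministically in a small Python sketch (illustrative names only; the real code paths are the C++ TaskManager and the reconstruction periodical runner's io_service):

```python
# Hypothetical, deterministic replay of the two-thread interleaving.
class TaskEntry:
    def __init__(self, retries):
        self.num_retries_left = retries


entry = TaskEntry(retries=3)

# Step 1 - thread A (the user's ray.cancel call): the first half of the
# cancel runs and zeroes the retry count via the task manager.
entry.num_retries_left = 0

# Step 2 - thread B (the reconstruction runner): decides to resubmit the
# same task. The old code hit CHECK(num_retries_left == -1) on this path
# and crashed; the fixed path honors the in-progress cancel instead.
def resubmit(e):
    if e.num_retries_left == 0:
        return "FAILED_TASK_CANCELED"
    if e.num_retries_left > 0:
        e.num_retries_left -= 1
    return "SUCCESS"


assert resubmit(entry) == "FAILED_TASK_CANCELED"
```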
return TASK_CANCELED as the error type
@israbbani - I'll answer yours first, since I think it's simpler for me to explain. The crash was caused by the third case you mentioned: when "the task finished before it was canceled." My fix still prevents the object from being reconstructed (as defined by the API contract here). Previously, without my fix, Ray would crash. With the fix, object reconstruction still fails, but the failure is now properly propagated as a TaskCanceled exception instead of causing a crash.
@dayshah - after closer investigation, I don't think this is caused by the race in the cancelling logic (basically not the race between https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/normal_task_submitter.cc#L711 and https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/normal_task_submitter.cc#L740). This crash only happens when the task is NOT pending, so the call to

Another fix I considered was having

But open for other suggestions.
@can-anyscale thanks for the explanation! I think the PR description can be updated to state these points more clearly:
What happens in the other cases?
@dayshah what's the submissible_tasks_ check failure issue?
@dayshah, @israbbani: i added a python e2e test that reproduces the issue without the fix and passes with the fix; hopefully that makes the situation clearer. Without the fix, the
Signed-off-by: can <[email protected]>
Force-pushed 329797c to 7474054
src/ray/core_worker/task_manager.h
Outdated
fix this while i'm here
Discussed offline with @dayshah and @israbbani: we'll add a boolean is_cancelled field to the TaskEntry object instead of relying on num_retries_left to represent cancellation state, as the latter is error-prone. The rest of the logic is similar to before.
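The adopted design can be sketched as follows (a Python model with illustrative names; the real TaskEntry and mutex live in the C++ task_manager.h):

```python
import threading


# Sketch of the adopted design: an explicit is_cancelled flag on the
# task entry, checked under the task manager's lock on resubmission.
class TaskEntry:
    def __init__(self, num_retries_left=3):
        self.num_retries_left = num_retries_left
        self.is_cancelled = False


class TaskManager:
    def __init__(self):
        self._mu = threading.Lock()
        self._tasks = {}

    def mark_task_canceled(self, task_id):
        with self._mu:
            entry = self._tasks.get(task_id)
            if entry is not None:
                # Explicit state instead of overloading num_retries_left.
                entry.is_cancelled = True

    def resubmit_task(self, task_id):
        with self._mu:
            entry = self._tasks.get(task_id)
            if entry is None or entry.is_cancelled:
                return False  # canceled tasks are never retried
            if entry.num_retries_left > 0:
                entry.num_retries_left -= 1
            return True


tm = TaskManager()
tm._tasks["t"] = TaskEntry()
tm.mark_task_canceled("t")
assert tm.resubmit_task("t") is False
```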
dayshah
left a comment
can you also write up some context on why we're not doing this check earlier in the call stack.
And also, what happens if the cancel comes after releasing the task manager mutex and before resubmitting? The retry will run to completion.
needed?
python/ray/tests/test_cancel.py
Outdated
nit but would prefer calling it producer/consumer
+1. It's easier to understand and it matches what Dhyey did with the other tests.
@dayshah's comments
src/ray/core_worker/task_manager.cc
Outdated
@dayshah: answered your questions via a comment here
edoakes
left a comment
LGTM. Please update the description to reflect the commentary you provided offline.
src/ray/core_worker/task_manager.h
Outdated
```diff
 // Whether the task has been marked for cancellation.
+// Cancelled tasks will never be retried.
 bool is_cancelled;
```
please also specify this in the MarkTaskCanceled comment
@edoakes's comments
israbbani
left a comment
LGTM. 🚢 A few easy nits.
src/ray/core_worker/task_manager.h
Outdated
```cpp
enum class ResubmitTaskResult {
  SUCCESS,
  FAILED_MAX_ATTEMPT_EXCEEDED,
  FAILED_TASK_CANCELED
};
```
I like the idea of the enum and it's in the right direction. Should this be a Status instead?
i ended up not using this; Dhyey's PR reuses the rpc status code instead, which is good
python/ray/tests/test_cancel.py
Outdated
```diff
-2. Submit a task to the worker node to generate a big object.
+2. Submit a task to the worker node to generate an object big enough to store in plasma.
```
python/ray/tests/test_cancel.py
Outdated
```diff
-6. Force a retry task to be scheduled on the new worker node to regenerate the big object.
+6. Force a retry task to be scheduled on the new worker node to reconstruct the object.
```
python/ray/tests/test_cancel.py
Outdated
why num_cpus=2?
just so that it doesn't run on the head node
src/ray/core_worker/task_manager.cc
Outdated
TODOs should have linked issues that provide additional context. We can make this a little more concise e.g.
```diff
-// TODO(can-anyscale): There is an existing race condition (even before this comment
-// was added) where a task can still be retried after its retry count has reached
-// zero—for example, in the case of task cancellation—because this code is still
-// reachable. The root cause is that we cannot guard the callback invocation with
-// the TaskManager mutex, since the callback itself acquires that mutex. As a result,
-// we cannot guarantee the task's state before the callback runs.
+// TODO(can-anyscale): There is a race condition where a task can still be retried
+// after its retry count has reached zero. Additional information in <ISSUE>
```
src/ray/core_worker/task_manager.h
Outdated
Nit: cancelled is the British spelling. We should stick with Canceled (e.g. TaskManager::MarkTaskCanceled)
@israbbani's comments
Signed-off-by: can <[email protected]>

A crash occurs inside TaskManager when the ObjectRecoveryManager attempts to reconstruct an object for a task that has already been canceled. At a high level, we considered two options:
The original solution followed the second approach, and we ultimately adopted it, with one key refinement: we introduced an explicit cancellation state within TaskManager, instead of relying on the number of retries left as an indirect indicator of cancellation.
We didn't choose the first approach because the interaction between ObjectRecoveryManager and the task cancellation methods (which eventually lead to calls inside TaskManager) can be interleaved in arbitrary ways. This makes it impractical to place a reliable synchronization check early; the only effective place to validate the state is right at the crash site.
Concretely the sequencing is:
This happens to
This fix still prevents the object from being reconstructed (as defined by the API contract here). Previously, without my fix, Ray would crash. With the fix, object reconstruction still fails, but the failure is now properly propagated as a TaskCanceled exception instead of causing a crash.
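The user-visible contract can be illustrated with a small Python model. Note that `TaskCanceled` and `recover_object` here are stand-ins for illustration, not Ray's real exception class or recovery API:

```python
# Hypothetical model of the user-visible behavior after the fix.
class TaskCanceled(Exception):
    pass


def recover_object(entry):
    # Before the fix, reaching this path for a canceled task tripped the
    # "Check failed: it->second.num_retries_left == -1" assertion and
    # crashed the worker. After the fix, reconstruction fails gracefully.
    if entry["is_cancelled"]:
        raise TaskCanceled("task was canceled; object will not be reconstructed")
    entry["num_retries_left"] -= 1
    return "resubmitted"


try:
    recover_object({"is_cancelled": True, "num_retries_left": 0})
    outcome = "resubmitted"
except TaskCanceled:
    outcome = "TaskCanceled propagated"

assert outcome == "TaskCanceled propagated"
```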
I added a
that failed before the fix with the check failure, and passes afterwards.
Stack trace:
Test: