[core] Make CancelTask RPC Fault Tolerant #58018
Merged: edoakes merged 34 commits into ray-project:master from Sparks0219:joshlee/make-cancel-task-fault-tolerant on Dec 8, 2025
Commits
f8150c0 Make CancelTask RPC Fault Tolerant (Sparks0219)
0a630a7 Addressing comments (Sparks0219)
8ae4e3a clean up and cpp test failures (Sparks0219)
a733422 Addressing comments (Sparks0219)
8a2e428 Fix broken cpp tests (Sparks0219)
901099d Fix merge conflicts (Sparks0219)
7d4ab2e Clean up (Sparks0219)
9070db5 lint (Sparks0219)
dcec398 Addressing comments (Sparks0219)
9df37aa Fix cpp test failures (Sparks0219)
d846b90 Addressing comments (Sparks0219)
873a17c Addressing comments (Sparks0219)
430a4a6 Merge remote-tracking branch 'upstream/master' into joshlee/make-canc… (Sparks0219)
a253c81 fix build error (Sparks0219)
9d5cf6f Addressing comments (Sparks0219)
d0fddda Merge conflicts (Sparks0219)
c8e0ed6 Addressing comments (Sparks0219)
49250fb Bad merge conflict fix (Sparks0219)
3dbcc22 Addressing comments (Sparks0219)
73445ab Fix cpp test (Sparks0219)
f737eef Addressing comments (Sparks0219)
0429f79 Addressing comments (Sparks0219)
2f9c24e Addressing comments (Sparks0219)
bed7884 Fix cpp test error (Sparks0219)
2a66834 Addressing comments (Sparks0219)
0fb240e Merge remote-tracking branch 'upstream/master' into joshlee/make-canc… (Sparks0219)
6bab852 Removing io context posts now that accessor node cache is thread safe (Sparks0219)
c1f1e0f Merge branch 'master' into joshlee/make-cancel-task-fault-tolerant (edoakes)
758ecd6 Merge branch 'master' into joshlee/make-cancel-task-fault-tolerant (jjyao)
22a53f5 Deflake serve test (Sparks0219)
e053c2f AI comment (Sparks0219)
6b2674b Addressing AI comments (Sparks0219)
07da167 More AI comments (Sparks0219)
41fa586 Addressing comments (Sparks0219)
@edoakes It looks like this test got flakier with my changes.
Before my changes, the order I observed was:
1.) Proxy Actor sends a CancelTask RPC to ServeReplica
2.) ServeReplica processes the CancelTask RPC
3.) SignalActor.send.remote() gets sent
4.) CancelChildren doesn't find any pending child tasks to cancel
With my changes, steps 3 and 4 are flipped: CancelChildren cancels the queued send.remote() task before it fires, so the test times out. It looks like you ran into the same issue here: https://github.com/ray-project/ray/pull/43320/files#diff-463bbcf17174b07dd1780cae9d6b719b248a0245fa029f8d8f280bf092d4db45R336 and fixed it for the other serve cancellation tests, so I moved this one to use send_signal_on_cancellation as well.
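For illustration only, the ordering problem can be mimicked with plain asyncio. This is a rough analogy, not Ray's cancellation path: `send`, `run_ordering`, `signal_on_cancellation`, and `run_with_context_manager` are hypothetical names, and a local `asyncio.Event` stands in for SignalActor. The sketch assumes only that a queued task which is cancelled before it starts never runs its body, and that a context manager can instead emit the signal once cancellation is actually observed. It does not reproduce how the real `send_signal_on_cancellation` helper interacts with Ray.

```python
import asyncio
from contextlib import asynccontextmanager


async def send(signal: asyncio.Event) -> None:
    # Hypothetical stand-in for SignalActor.send.remote(): sets a local event.
    signal.set()


async def run_ordering(cancel_children_first: bool) -> bool:
    """Return True if the signal was delivered under the given ordering."""
    signal = asyncio.Event()
    child = asyncio.create_task(send(signal))  # queued; has not run yet
    if cancel_children_first:
        child.cancel()  # cancelled while still queued, so send() never runs
        try:
            await child
        except asyncio.CancelledError:
            pass
    else:
        await child     # the send fires first
        child.cancel()  # no-op: nothing is pending any more
    return signal.is_set()


@asynccontextmanager
async def signal_on_cancellation(signal: asyncio.Event):
    # Hypothetical helper: signal only after cancellation reaches the body.
    try:
        yield
    except asyncio.CancelledError:
        signal.set()
        raise


async def run_with_context_manager() -> bool:
    signal = asyncio.Event()

    async def handler() -> None:
        async with signal_on_cancellation(signal):
            await asyncio.sleep(3600)  # block until cancelled

    task = asyncio.create_task(handler())
    await asyncio.sleep(0)  # let the handler start and block inside the body
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return signal.is_set()


if __name__ == "__main__":
    print(asyncio.run(run_ordering(cancel_children_first=False)))
    print(asyncio.run(run_ordering(cancel_children_first=True)))
    print(asyncio.run(run_with_context_manager()))
```

In this sketch the signal is lost only when the cancel runs before the queued send, which matches the flipped ordering of steps 3 and 4; tying the signal to the cancellation itself removes the dependence on that ordering.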
Still trying to figure out why it got more flaky. I reverted to the last point at which this PR passed CI, but the test is still flaky locally for me at that commit. I'd expect the timing to change a bit because of my cancellation path changes, but I would have thought they'd slow the cancellation path (because of the node status cache access in the actor/normal task submitters), so the ordering of steps 3 and 4 should have been less flaky, not more 🤔
Is it deflaked after using the context manager?