Fix for dereg failure on manual stream close #144
Conversation
Force-pushed from 43b0ee0 to 7b0e30c
There was code from the last de-registration fix PR that I had commented out (to do with shielding arbiter dereg steps in `Actor._async_main()`) because the block didn't seem to make a difference under infinite streaming tests. Turns out it **for sure** is needed under certain conditions (likely if the actor's root nursery is cancelled prior to actor nursery exit). This was an attempt to simulate the failure mode where you manually close the stream **before** cancelling the containing **actor**. More tests to come, I guess.
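For reference, a minimal sketch of what the shielded dereg step looks like (`unregister_with_arbiter` is a hypothetical stand-in for the actual arbiter RPC, not tractor's real internals):

```python
import trio

async def deregister_on_teardown(unregister_with_arbiter) -> None:
    # If the root nursery was already cancelled, a plain ``await``
    # here would raise ``Cancelled`` before the dereg RPC completes;
    # the shield lets this step run through outer cancellation.
    with trio.CancelScope(shield=True):
        await unregister_with_arbiter()
```

Without the shield, any cancellation already in effect on the surrounding scope aborts the dereg call, which is exactly the stale-registration failure this PR targets.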
Force-pushed from 7b0e30c to d2d8860
The real issue is the root nursery getting cancelled prior to de-registration with the arbiter. This doesn't seem easy to reproduce as a side effect of a KBI, however that is how it was discovered in practice.
Always shield waiting for the process and always run ``trio.Process.__aexit__()`` on teardown. This enforces that shutdown happens due to cancellation triggered inside the sub-actor rather than the process being killed externally by the parent.
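A rough sketch of that pattern (assuming a trio version where ``Process`` supports the async context manager protocol, as the commit references; ``wait_then_teardown`` is an illustrative name):

```python
import trio

async def wait_then_teardown(proc: trio.Process) -> None:
    try:
        # Shield the wait so parent-side cancellation can't kill the
        # child externally; the child should exit from cancellation
        # delivered *inside* the sub-actor instead.
        with trio.CancelScope(shield=True):
            await proc.wait()
    finally:
        # Always run the async-exit teardown, mirroring the commit's
        # "always run ``trio.Process.__aexit__()``" guarantee.
        await proc.__aexit__(None, None, None)
```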
In an effort to acquire more deterministic actor cancellation, this adds a clearer and more resilient (though possibly a bit slower) internal nursery structure with explicit semantics for the task-scope shutdown sequence. Namely, on cancellation, the explicit steps are now:

- cancel all currently running rpc tasks and wait for them to complete
- cancel the channel server and wait for it to complete
- cancel the msg loop for the channel with the immediate parent
- de-register with the arbiter if possible
- wait on remaining connections to release
- exit the process

To accomplish this, add a new nursery called the "service nursery" which spawns all rpc tasks **instead of using** the "root nursery". The root is now used solely for async launching the msg loop for the primary channel with the parent such that it is (nearly) the last thing torn down on cancellation; a sketch of the layout follows below. In the future it should also be possible to have `self.cancel()` return a result to the parent once the runtime is sure that the rest of the shutdown is atomic; this would allow for a true unbounded shield in `Portal.cancel_actor()`. This will likely require that the error handling blocks in `Actor._async_main()` are moved "inside" the root nursery block such that the msg loop with the parent truly is the last thing to terminate.
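A minimal sketch of that two-nursery layout (all names are illustrative stand-ins, not the actual `Actor` internals):

```python
import trio

async def actor_main_sketch(msg_loop, serve_rpc, deregister) -> None:
    # the three coroutines are hypothetical stand-ins; this only
    # demonstrates the scope nesting and shutdown ordering
    async with trio.open_nursery() as root_nursery:
        # the msg loop with the immediate parent runs in the *root*
        # nursery so it's (nearly) the last thing torn down
        root_nursery.start_soon(msg_loop)

        async with trio.open_nursery() as service_nursery:
            # the channel server and all rpc tasks spawn here; a
            # cancel request cancels *this* scope first, and the
            # nursery waits for every rpc task to complete
            service_nursery.start_soon(serve_rpc)

        # service nursery exited: rpc tasks and server are done;
        # now dereg with the arbiter, shielded from cancellation
        with trio.CancelScope(shield=True):
            await deregister()

        # finally tear down the parent msg loop and exit
        root_nursery.cancel_scope.cancel()
```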
My apologies, this ended up turning into a much larger change set which hardens the teardown machinery quite a bit. Basically I reorganized the internal nursery scoping to keep the channel with the parent actor up as long as possible so that cancellation is more deterministic and we get far fewer race conditions. The CI should (hopefully) demonstrate this moving forward.
Oof, definitely some grammar issues in some commit messages 😆
tractor/_actor.py
Outdated
f"Failed to connect to parent @ {parent_addr}," | ||
" closing server") | ||
await self.cancel() | ||
# self._parent_chan = None |
Missed this.
Yeah, I'm not a useful reviewer for this. 😆 And obviously the one comment I made isn't even about the changes under review here.
There was code from the last de-registration fix PR (#141, #143) that I had commented out (to do with shielding arbiter dereg steps in `Actor._async_main()`) because the block didn't seem to make a difference under infinite streaming tests.

Turns out it for sure is needed if you manually close the stream before cancelling the containing nursery.
Pushing up a test to demonstrate the failure, then will push up a fix to correct it; a rough sketch of the failure mode follows below.
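The tractor calls below (`open_nursery()`, `start_actor()`, `portal.run()`, `portal.cancel_actor()`) follow the project's public API at the time, but the streamer function and module wiring are illustrative assumptions, not the actual test:

```python
import tractor

async def stream_forever():
    # illustrative infinite streamer run inside a sub-actor
    while True:
        yield 'yo'

async def main():
    async with tractor.open_nursery() as n:
        portal = await n.start_actor(
            'streamer',
            rpc_module_paths=[__name__],
        )
        agen = await portal.run(__name__, 'stream_forever')
        async for val in agen:
            break

        # manually close the stream *before* cancelling the actor;
        # without the shielded dereg step the actor can be left
        # registered with the arbiter after teardown
        await agen.aclose()
        await portal.cancel_actor()

tractor.run(main)
```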
UPDATE: this ended up being a much bigger change set; see the comment below. It should eventually allow calling `Portal.cancel_actor()` without a timeout in the future. See the commit messages for some commentary on this.