Skip to content

Fix for dereg failure on manual stream close #144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Aug 13, 2020
Merged

Conversation

goodboy
Copy link
Owner

@goodboy goodboy commented Aug 7, 2020

There was code from the last de-registration fix PR (#141, #143) that I had commented
(to do with shielding arbiter dereg steps in Actor._async_main()) because
the block didn't seem to make a difference under infinite streaming
tests.

Turns out it for sure is needed if you manually close the stream before cancelling the containing nursery.

Pushing up a test to demonstrate the failure then will push up a fix to correct.

UPDATE: this ended up being a much bigger change set see comment below.

  • Long story short, getting things to work reliably led to some reworking of cancellation machinery in terms of internal nursery composition.
  • I tried bringing code from newer feature branches to solve this de-registration issue but realized there was too much raciness at teardown to even reproduce the issue easily.
  • This rework should ready the core for even more deterministic actor cancellation soon. Ideally we can fully shield a call to Portal.cancel_actor() without a timeout in the future. See the commit messages for some commentary on this.
  • overall the runtime startup and teardown logic should be much easier to read and understand now.

@goodboy goodboy changed the title Add test for dereg failure on manual stream close Fix for dereg failure on manual stream close Aug 7, 2020
@goodboy goodboy force-pushed the dereg_on_channel_aclose branch 2 times, most recently from 43b0ee0 to 7b0e30c Compare August 7, 2020 05:22
There was code from the last de-registration fix PR that I had commented
(to do with shielding arbiter dereg steps in `Actor._async_main()`) because
the block didn't seem to make a difference under infinite streaming
tests. Turns out it **for sure** is needed under certain conditions (likely
if the actor's root nursery is cancelled prior to actor nursery exit).
This was an attempt to simulate the failure mode if you manually close the
stream **before** cancelling the containing **actor**.

More tests to come I guess.
@goodboy goodboy force-pushed the dereg_on_channel_aclose branch from 7b0e30c to d2d8860 Compare August 7, 2020 13:19
goodboy added 12 commits August 7, 2020 11:34
The real issue is if the root nursery gets cancelled prior to
de-registration with the arbiter. This doesn't seem easy to
reproduce by side effect of a KBI however that is how it was
discovered in practise.
Always shield waiting for he process and always run
``trio.Process.__aexit__()`` on teardown. This enforces
that shutdown happens to due cancellation triggered inside
the sub-actor instead of the process being killed externally
by the parent.
In an effort acquire more deterministic actor cancellation,
this adds a clearer and more resilient (whilst possibly a bit
slower) internal nursery structure with explicit semantics for
clarifying the task-scope shutdown sequence.

Namely, on cancellation, the explicit steps are now:
- cancel all currently running rpc tasks and wait
  for them to complete
- cancel the channel server and wait for it to complete
- cancel the msg loop for the channel with the immediate parent
- de-register with arbiter if possible
- wait on remaining connections to release
- exit process

To accomplish this add a new nursery called the "service nursery" which
spawns all rpc tasks **instead of using** the "root nursery". The root
is now used solely for async launching the msg loop for the primary
channel with the parent such that it is (nearly) the last thing torn
down on cancellation.

In the future it should also be possible to have `self.cancel()` return
a result to the parent once the runtime is sure that the rest of the
shutdown is atomic; this would allow for a true unbounded shield in
`Portal.cancel_actor()`. This will likely require that the error
handling blocks in `Actor._async_main()` are moved "inside" the root
nursery block such that the msg loop with the parent truly is the last
thing to terminate.
@goodboy
Copy link
Owner Author

goodboy commented Aug 9, 2020

My apologies, this ended up turning into a much larger change set which hardens the teardown machinery quite a bit.

Basically I reorged the internal nursery scoping to keep the channel with the parent actor up as long as possible such that cancellation is more deterministic and we get way fewer race conditions. The CI should (hopefully) demonstrate this moving forward.

@goodboy
Copy link
Owner Author

goodboy commented Aug 9, 2020

Oof definitely some grammar issues in some commit messages 😆

@goodboy goodboy mentioned this pull request Aug 11, 2020
f"Failed to connect to parent @ {parent_addr},"
" closing server")
await self.cancel()
# self._parent_chan = None
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed this.

Copy link
Collaborator

@ryanhiebert ryanhiebert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not a useful reviewer for this. 😆 And obviously the one comment I made isn't even about the changes under review here.

@goodboy goodboy merged commit 4da1632 into master Aug 13, 2020
@goodboy goodboy deleted the dereg_on_channel_aclose branch August 13, 2020 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants