Fix for dereg failure on manual stream close #144

goodboy · 2020-08-07T03:58:39Z

There was code from the last de-registration fix PRs (#141, #143) that I had commented out (to do with shielding arbiter dereg steps in `Actor._async_main()`) because the block didn't seem to make a difference under infinite streaming tests.

Turns out it for sure is needed if you manually close the stream before cancelling the containing nursery.

Pushing up a test to demonstrate the failure, then I'll push up a fix to correct it.

UPDATE: this ended up being a much bigger change set; see the comment below.

Long story short, getting things to work reliably led to some reworking of cancellation machinery in terms of internal nursery composition.
I tried bringing code from newer feature branches to solve this de-registration issue but realized there was too much raciness at teardown to even reproduce the issue easily.
This rework should ready the core for even more deterministic actor cancellation soon. Ideally we can fully shield a call to Portal.cancel_actor() without a timeout in the future. See the commit messages for some commentary on this.
Overall the runtime startup and teardown logic should be much easier to read and understand now.

There was code from the last de-registration fix PR that I had commented out (to do with shielding arbiter dereg steps in `Actor._async_main()`) because the block didn't seem to make a difference under infinite streaming tests. Turns out it **for sure** is needed under certain conditions (likely if the actor's root nursery is cancelled prior to actor nursery exit). This was an attempt to simulate the failure mode if you manually close the stream **before** cancelling the containing **actor**. More tests to come I guess.

The real issue is if the root nursery gets cancelled prior to de-registration with the arbiter. This doesn't seem easy to reproduce as a side effect of a KBI (keyboard interrupt); however, that is how it was discovered in practice.
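As a rough illustration of why the de-registration step needs shielding, here is a minimal sketch that uses stdlib `asyncio` as a stand-in for trio (an assumption made so the snippet runs without extra dependencies; it is not tractor's code). `deregister()`, `actor_main()`, `cancel_twice()` and the `registry` set are hypothetical names. Note that trio cancellation is level-triggered, so a single cancel of the enclosing scope already interrupts unshielded cleanup there; in asyncio a second `cancel()` has to land mid-cleanup to show the same effect.

```python
import asyncio


async def deregister(registry: set, name: str) -> None:
    """Hypothetical stand-in for the round trip to the arbiter."""
    await asyncio.sleep(0.01)
    registry.discard(name)


async def actor_main(registry: set, name: str, shield: bool) -> None:
    registry.add(name)
    try:
        await asyncio.Event().wait()  # run until cancelled
    finally:
        if shield:
            # The inner task keeps running even if we get cancelled again.
            await asyncio.shield(deregister(registry, name))
        else:
            await deregister(registry, name)


async def cancel_twice(shield: bool) -> set:
    registry: set = set()
    task = asyncio.create_task(actor_main(registry, "streamer", shield))
    await asyncio.sleep(0)     # let the actor register itself
    task.cancel()              # first cancel: actor enters its finally block
    await asyncio.sleep(0)     # let teardown begin
    task.cancel()              # second cancel lands during de-registration
    await asyncio.gather(task, return_exceptions=True)
    await asyncio.sleep(0.05)  # give any shielded work time to finish
    return registry


if __name__ == "__main__":
    print("shielded:", asyncio.run(cancel_twice(shield=True)))
    print("unshielded:", asyncio.run(cancel_twice(shield=False)))
```

With the shield, the registry ends up empty; without it, the stale `"streamer"` entry is left behind, which is the kind of failure described above.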

Always shield waiting for the process and always run ``trio.Process.__aexit__()`` on teardown. This enforces that shutdown happens due to cancellation triggered inside the sub-actor instead of the process being killed externally by the parent.

In an effort to acquire more deterministic actor cancellation, this adds a clearer and more resilient (whilst possibly a bit slower) internal nursery structure with explicit semantics for clarifying the task-scope shutdown sequence. Namely, on cancellation, the explicit steps are now:

- cancel all currently running rpc tasks and wait for them to complete
- cancel the channel server and wait for it to complete
- cancel the msg loop for the channel with the immediate parent
- de-register with arbiter if possible
- wait on remaining connections to release
- exit process

To accomplish this, add a new nursery called the "service nursery" which spawns all rpc tasks **instead of using** the "root nursery". The root is now used solely for async launching the msg loop for the primary channel with the parent such that it is (nearly) the last thing torn down on cancellation.

In the future it should also be possible to have `self.cancel()` return a result to the parent once the runtime is sure that the rest of the shutdown is atomic; this would allow for a true unbounded shield in `Portal.cancel_actor()`. This will likely require that the error handling blocks in `Actor._async_main()` are moved "inside" the root nursery block such that the msg loop with the parent truly is the last thing to terminate.
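The shutdown order listed above can be sketched roughly as follows. This is only an illustration of the sequencing, again using stdlib `asyncio` tasks in place of trio nurseries (an assumption, not the actual runtime code); `run_until_cancelled()`, `cancel_and_wait()` and `actor_lifetime()` are hypothetical names, and the last two steps are placeholders recorded in a log.

```python
import asyncio


async def run_until_cancelled(name: str, log: list) -> None:
    """Block forever and record when this task gets cancelled."""
    try:
        await asyncio.Event().wait()
    except asyncio.CancelledError:
        log.append(f"{name} stopped")
        raise


async def cancel_and_wait(*tasks: asyncio.Task) -> None:
    """Cancel the given tasks and wait until all of them have exited."""
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


async def actor_lifetime() -> list:
    log: list = []

    def spawn(name: str) -> asyncio.Task:
        return asyncio.create_task(run_until_cancelled(name, log))

    # "root" level: only the msg loop for the channel with the parent.
    parent_msg_loop = spawn("parent msg loop")
    # "service" level: the channel server and all rpc tasks.
    channel_server = spawn("channel server")
    rpc_tasks = [spawn("rpc task 0"), spawn("rpc task 1")]
    await asyncio.sleep(0)  # let every task start running

    # Teardown, innermost work first, parent channel last.
    await cancel_and_wait(*rpc_tasks)
    await cancel_and_wait(channel_server)
    await cancel_and_wait(parent_msg_loop)
    log.append("de-registered with arbiter")      # placeholder step
    log.append("remaining connections released")  # placeholder step
    return log


if __name__ == "__main__":
    for step in asyncio.run(actor_lifetime()):
        print(step)
```

Each stage is awaited to completion before the next one is cancelled, so the channel to the parent stays up for nearly all of the teardown.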

goodboy · 2020-08-09T01:20:06Z

My apologies, this ended up turning into a much larger change set which hardens the teardown machinery quite a bit.

Basically I reorged the internal nursery scoping to keep the channel with the parent actor up as long as possible such that cancellation is more deterministic and we get way fewer race conditions. The CI should (hopefully) demonstrate this moving forward.

tests/test_discovery.py

goodboy · 2020-08-09T03:18:30Z

Oof definitely some grammar issues in some commit messages 😆

goodboy · 2020-08-12T21:39:01Z

tractor/_actor.py

+                f"Failed to connect to parent @ {parent_addr},"
+                " closing server")
+            await self.cancel()
+            # self._parent_chan = None


Missed this.

ryanhiebert

Yeah, I'm not a useful reviewer for this. 😆 And obviously the one comment I made isn't even about the changes under review here.

tests/test_discovery.py

goodboy requested review from ryanhiebert, guilledk and salotz August 7, 2020 03:58

goodboy changed the title ~~Add test for dereg failure on manual stream close~~ Fix for dereg failure on manual stream close Aug 7, 2020

goodboy force-pushed the dereg_on_channel_aclose branch 2 times, most recently from 43b0ee0 to 7b0e30c Compare August 7, 2020 05:22

goodboy force-pushed the dereg_on_channel_aclose branch from 7b0e30c to d2d8860 Compare August 7, 2020 13:19

goodboy added 12 commits August 7, 2020 11:34

Cancel root nursery to trigger failure

3a868fe

The real issue is if the root nursery gets cancelled prior to de-registration with the arbiter. This doesn't seem easy to reproduce as a side effect of a KBI (keyboard interrupt); however, that is how it was discovered in practice.

Always shield de-register step with arbiter

ae8488a

Allow opening a portal through an existing channel

fe45d99

Harden trio spawner process waiting

532429a

Always shield waiting for the process and always run ``trio.Process.__aexit__()`` on teardown. This enforces that shutdown happens due to cancellation triggered inside the sub-actor instead of the process being killed externally by the parent.

Allow shielding in open_portal()

90c7fa6

Never allow more then info logging in daemon; causes blocking

7f74182

Actor cancellation is now more latent; loosen timeing

c821690

Add close channel test with remote arbiter

acd5b80

Handle mp accept_addr

42be410

Appease the great mypy

b3eba00

Module define default accept addr

292513b

goodboy commented Aug 9, 2020

tests/test_discovery.py (resolved)

Docs fixes

8a995be

goodboy mentioned this pull request Aug 11, 2020

Hard kill tests #145

Open

goodboy commented Aug 12, 2020

ryanhiebert reviewed Aug 13, 2020

tests/test_discovery.py (outdated, resolved)

goodboy added 3 commits August 13, 2020 11:53

Make rpc_module_paths a list

1ae0efb

Use allocated arbiter port in local reg test

0c8dcd0

Update copyright date

863a4b7

goodboy added 2 commits August 13, 2020 11:55

Always log actor errors

ec5d443

Pass explicit kwargs to new discovery test funcs

451170b

goodboy merged commit 4da1632 into master Aug 13, 2020

goodboy deleted the dereg_on_channel_aclose branch August 13, 2020 17:56
