Cancelling a sub-sub-actor (2 levels down) that's in sync code when cancel msg arrives #156

Closed
goodboy opened this issue Oct 7, 2020 · 0 comments
Labels
bug Something isn't working

goodboy commented Oct 7, 2020

I found this while working through a test set in #129.
There's a branch bug_in_debug with a commit for a test that demonstrates the issue.

I'm not entirely sure what's going on here yet, but the gist is that if a child's child enters the debugger before a cancel message is sent from the root actor, then that subactor two levels down somehow isn't cancelled when it returns control to trio.

Right now the result is that the target child does not move on after pdb is disengaged.

The state leading up to it is:

python examples/debugging/root_cancelled_but_child_is_in_tty_lock.py
├─ python -m tractor._child --uid ('spawner1', 'ae3561a5 ...)
│  └─ python -m tractor._child --uid ('spawn_until_0', '103b6c23 ...)
└─ python -m tractor._child --uid ('spawner0', 'f1ed4832-2320 ...)
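
For reference, the tree above corresponds roughly to a script of this shape. This is only a hedged sketch of the scenario, not the actual example file; the run_in_actor() call signature, the debug_mode flag, and the name_error task are assumptions about the #129-era API and example:

```python
import tractor


async def name_error():
    # hypothetical erroring task for 'spawner0'; in the real example
    # something in that sub-tree errors, which is what kicks off the
    # root-side cancellation seen below
    doggypants  # noqa: F821 - undefined name -> NameError


async def breakpoint_forever():
    # the grandchild ('spawn_until_0'): pdb's REPL is synchronous, so
    # this actor sits in sync code while holding the root's tty lock
    while True:
        await tractor.breakpoint()


async def spawn_until():
    # 'spawner1' runs this: spawn the sub-sub-actor which then parks
    # itself in the debugger (the real example recurses on a depth arg)
    async with tractor.open_nursery() as n:
        await n.run_in_actor('spawn_until_0', breakpoint_forever)


async def main():
    # root actor: one child errors while the other's child is in pdb
    async with tractor.open_nursery() as n:
        await n.run_in_actor('spawner0', name_error)
        await n.run_in_actor('spawner1', spawn_until)


if __name__ == '__main__':
    tractor.run(main, debug_mode=True)
```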

The root at this point has received an error via Portal.result() from spawner0, a now-cancelled sibling of the spawner1 actor, and has cancelled the nursery containing both (a rough sketch of this pattern follows the log below):

Oct 07 06:09:04 (anonymous, 139223, tractor._main)) [WARNING] tractor.tractor._trionics _trionics.py:142 Cancelling nursery in ('anonymous', '2f2d27d5-f13e-4f5c-b110-be9059d0369b')
Oct 07 06:09:04 (anonymous, 139223, tractor._portal.Portal.cancel_actor)) [WARNING] tractor _portal.py:275 Sending actor cancel request to ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730') on <Channel fd=11, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 43629), raddr=('127.0.0.1', 50468)>
Oct 07 06:09:04 (anonymous, 139223, tractor._portal.Portal.cancel_actor)) [WARNING] tractor _portal.py:275 Sending actor cancel request to ('spawner1', 'b12b4b20-f582-449f-934c-a4b8216beb43') on <Channel fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('127.0.0.1', 43629), raddr=('127.0.0.1', 50472)>
Oct 07 06:09:04 (anonymous, 139223, tractor._spawn.cancel_on_completion)) [WARNING] tractor _spawn.py:139 Cancelling ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730') after error ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730')
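
That cancel-on-error behavior amounts to roughly the following hedged sketch of what the nursery's result-waiter machinery does (not the actual _spawn.cancel_on_completion implementation); Portal.result(), Portal.cancel_actor() and RemoteActorError are real API, while the helper and its wiring are illustrative:

```python
import tractor


async def collect_and_cancel(portals):
    # illustrative only: ``portals`` is the list of Portals returned by
    # run_in_actor(); awaiting a portal's final result re-raises the
    # child's error ('spawner0' here) in the root as a RemoteActorError
    try:
        for portal in portals:
            await portal.result()
    except tractor.RemoteActorError:
        # which in turn triggers a cancel request to every remaining
        # child - the "Sending actor cancel request" lines above
        for portal in portals:
            await portal.cancel_actor()
        raise
```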

But spawn_until_0 gets the tty lock just before each actor receives the cancel msg:

Oct 07 06:09:04 (anonymous, 139223, _hijack_stdin_relay_to_child)) [WARNING] tractor.tractor._debug _debug.py:166 Actor ('spawn_until_0', '5e16346a-accf-4e32-90ad-b7faec54e671') ACQUIRED stdin hijack lock
Oct 07 06:09:04 (spawner0, 139226, tractor._actor.Actor._process_messages)) [WARNING] tractor _actor.py:534 ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730') was remotely cancelled
Oct 07 06:09:04 (spawner1, 139230, tractor._actor.Actor._process_messages)) [WARNING] tractor _actor.py:534 ('spawner1', 'b12b4b20-f582-449f-934c-a4b8216beb43') was remotely cancelled

Passing c to pdb again results in the next prompt with the following output:

[7] > /home/goodboy/repos/tractor/tractor/_portal.py(213)_return_from_resptype()
-> raise unpack_error(msg, self.channel)
Oct 07 06:09:04 (spawner0, 139226, tractor._debug._breakpoint.<locals>.wait_for_parent_stdin_hijack)) [WARNING] tractor _portal.py:254 Cancelling all streams with ('anonymous', '2f2d27d5-f13e-4f5c-b110-be9059d0369b')
Oct 07 06:09:04 (spawner0, 139226, tractor._debug._breakpoint.<locals>.wait_for_parent_stdin_hijack)) [WARNING] tractor _portal.py:91 <tractor._portal.StreamReceiveChannel object at 0x7f47fc8a0070> is already closed
Oct 07 06:09:04 (spawner0, 139226, cancel)) [WARNING] tractor _actor.py:153 Task <bound method Actor.cancel of <tractor._actor.Actor object at 0x7f47fc6b6a90>> likely errored or cancelled before it started
Oct 07 06:09:04 (spawner0, 139226, tractor._actor.Actor._async_main)) [WARNING] tractor _actor.py:754 Closing all actor lifetime contexts
(Pdb++) Oct 07 06:09:04 (anonymous, 139223, trio._highlevel_serve_listeners._run_handler)) [ERROR] tractor.ipc _ipc.py:54 Stream connection ('127.0.0.1', 50468) broke
Oct 07 06:09:04 (anonymous, 139223, tractor._spawn.new_proc)) [WARNING] tractor _spawn.py:374 Cancelling existing result waiter task for ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730')
Oct 07 06:09:05 (anonymous, 139223, tractor._portal.Portal.cancel_actor)) [WARNING] tractor _portal.py:288 May have failed to cancel ('spawner1', 'b12b4b20-f582-449f-934c-a4b8216beb43')
Oct 07 06:09:05 (anonymous, 139223, tractor._portal.Portal.cancel_actor)) [WARNING] tractor _portal.py:288 May have failed to cancel ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730')
Oct 07 06:09:05 (anonymous, 139223, tractor._spawn.cancel_on_completion)) [WARNING] tractor _portal.py:288 May have failed to cancel ('spawner0', 'de22443b-7a7f-44b8-8315-bb6ad6d55730')

Continuing again results in the hang:

c
Oct 07 06:12:37 (spawn_until_0, 139235, tractor._debug._breakpoint.<locals>.wait_for_parent_stdin_hijack)) [WARNING] tractor _portal.py:96 Cancelling stream c3fb672d-4e0a-4d7b-99d4-9412c66c93de to ('anonymous', '2f2d27d5-f13e-4f5c-b110-be9059d0369b')
Oct 07 06:12:37 (anonymous, 139223, _hijack_stdin_relay_to_child)) [ERROR] tractor.tractor._debug _debug.py:132 TTY lock released by <Task '_hijack_stdin_relay_to_child' at 0x7faa22f29f40>:('spawn_until_0', '5e16346a-accf-4e32-90ad-b7faec54e671')
Oct 07 06:12:37 (spawn_until_0, 139235, tractor._debug._breakpoint.<locals>.wait_for_parent_stdin_hijack)) [WARNING] tractor _portal.py:254 Cancelling all streams with ('anonymous', '2f2d27d5-f13e-4f5c-b110-be9059d0369b')
Oct 07 06:12:37 (spawn_until_0, 139235, tractor._debug._breakpoint.<locals>.wait_for_parent_stdin_hijack)) [WARNING] tractor _portal.py:91 <tractor._portal.StreamReceiveChannel object at 0x7f0370c7be50> is already closed

Adding trace level logging doesn't reveal much other than that spawn_until_0 has left the debugger...

goodboy added the bug and debugging labels on Oct 7, 2020
goodboy added a commit that referenced this issue Oct 12, 2020
This appears to demonstrate the same bug found in #156. It looks like
cancelling a subactor with a child, while that child is running sync code,
can result in the child never getting cancelled due to some strange
condition where the internal nurseries aren't being torn down as
expected?
goodboy added a commit that referenced this issue Oct 13, 2020
This appears to demonstrate the same bug found in #156. It looks like
cancelling a subactor with a child, while that child is running sync code,
can result in the child never getting cancelled due to some strange
condition where the internal nurseries aren't being torn down as
expected when a `trio.Cancelled` is raised.