
Asynchronous client calls in a timer cause the timer to never execute a second time. #1018

Open · jwhtkr opened this issue Oct 3, 2022 · 10 comments · May be fixed by #1211

jwhtkr commented Oct 3, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • foxy (the 2022-09-28 release that was just pushed through the apt repos)
  • DDS implementation:
    • Fast-RTPS for sure
    • Any (I think)
  • Client library (if applicable):
    • rclpy

Steps to reproduce issue

Use the minimal service server from the examples:

from example_interfaces.srv import AddTwoInts

import rclpy
from rclpy.node import Node


class MinimalService(Node):

    def __init__(self):
        super().__init__('minimal_service')
        self.srv = self.create_service(AddTwoInts, 'add_two_ints', self.add_two_ints_callback)

    def add_two_ints_callback(self, request, response):
        response.sum = request.a + request.b
        self.get_logger().info('Incoming request\na: %d b: %d' % (request.a, request.b))

        return response


def main(args=None):
    rclpy.init(args=args)

    minimal_service = MinimalService()

    rclpy.spin(minimal_service)

    rclpy.shutdown()


if __name__ == '__main__':
    main()

and a slightly modified service client:

from example_interfaces.srv import AddTwoInts
import rclpy
from rclpy.node import Node

from rclpy.executors import MultiThreadedExecutor
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup


class MinimalClient(Node):
    def __init__(self):
        super().__init__("minimal_client_async")
        self.cb = None
        self.timer_cb = MutuallyExclusiveCallbackGroup()
        self.cli = self.create_client(
            AddTwoInts, "add_two_ints", callback_group=self.cb
        )
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info("service not available, waiting again...")
        self.req = AddTwoInts.Request()

        # Create a timer for the main loop execution
        self.timer = self.create_timer(
            1.0, self.main_loop, callback_group=self.timer_cb
        )

    def send_request(self, a, b):
        self.get_logger().info("send_request: enter")
        self.req.a = a
        self.req.b = b

        # Only works once, then never executes the timer again
        self.future = self.cli.call_async(self.req)
        rclpy.spin_until_future_complete(self, self.future)
        self.get_logger().info("send_request: exit")
        return self.future.result()

        # Used to work, but blocks now
        # return self.cli.call(self.req)

    def main_loop(self) -> None:
        response = self.send_request(4, 2)
        self.get_logger().info(
            "Result of add_two_ints: for %d + %d = %d" % (4, 2, response.sum)
        )


def main(args=None):
    rclpy.init(args=args)

    minimal_client = MinimalClient()

    executor = MultiThreadedExecutor()
    executor.add_node(minimal_client)
    executor.spin()

    rclpy.shutdown()


if __name__ == "__main__":
    main()

Run the service server in one terminal and the client in another.

Expected behavior

You repeatedly see the client call the server every second and get a response.

Actual behavior

The client runs correctly once, but then hangs (never executes the timer loop again).

Additional information

Prior to the 2022-09-28 patch release of Foxy, a workaround was to use a synchronous call. However, that no longer works (see #1016 for more info and workarounds on that issue). Since this pattern works in rclcpp on Foxy, and seems to be what the Foxy documentation recommends, it would be good for it to work in rclpy as well.

llapx (Contributor) commented Oct 13, 2022

The client runs correctly once, but then hangs (never executes the timer loop again)

The hang happens when you use rclpy.spin_until_future_complete(self, self.future):

def spin_until_future_complete(
    node: 'Node',
    future: Future,
    executor: 'Executor' = None,
    timeout_sec: float = None
) -> None:
    """
    Execute work until the future is complete.

    Callbacks and other work will be executed by the provided executor until ``future.done()``
    returns ``True`` or the context associated with the executor is shutdown.

    :param node: A node to add to the executor to check for work.
    :param future: The future object to wait on.
    :param executor: The executor to use, or the global executor if ``None``.
    :param timeout_sec: Seconds to wait. Block until the future is complete
        if ``None`` or negative. Don't wait if 0.
    """
    executor = get_global_executor() if executor is None else executor
    try:
        executor.add_node(node)
        executor.spin_until_future_complete(future, timeout_sec)
    finally:
        executor.remove_node(node)

  • When you call it without the executor parameter, rclpy.spin_until_future_complete uses the global executor, which is not the same executor as the one you defined.

  • When you call it with the executor parameter, rclpy.spin_until_future_complete removes the node in the finally block, so the node and its timer are never scheduled again.

So IMO the rclpy.spin_* functions can conflict with the executor.spin_* methods of the executor you defined; it's better to use executor.spin_until_future_complete directly.
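
Applied to the example above, that change might look roughly like this (just a sketch, assuming the node has already been added to the MultiThreadedExecutor in main(), so that self.executor refers to that executor):

...
    def send_request(self, a, b):
        self.get_logger().info("send_request: enter")
        self.req.a = a
        self.req.b = b

        self.future = self.cli.call_async(self.req)
        # Spin the executor this node is already attached to, instead of the
        # global executor that rclpy.spin_until_future_complete() would use
        # (which removes the node from its current executor).
        self.executor.spin_until_future_complete(self.future)
        self.get_logger().info("send_request: exit")
        return self.future.result()
...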

apockill (Contributor):

This may be a symptom of the cause found in #1016

llapx (Contributor) commented Oct 14, 2022

@apockill

I have tested with the latest rclpy, which already includes the commit from #1016, but it does not work.

llapx (Contributor) commented Oct 14, 2022

Here's my solution for the client:

...
    def send_request(self, a, b):
        self.get_logger().info("send_request: enter")
        self.req.a = a
        self.req.b = b

        self.future = self.cli.call_async(self.req)
        while rclpy.ok():
            if self.future.done() and self.future.result():
                self.get_logger().info("send_request: exit")
                return self.future.result()

        return None
...

fujitatomoya (Collaborator):

I think this is adding the same node to multiple, different executors, which does not work (this is because the GuardCondition does not work; rclcpp has the same behavior).

@llapx's suggestion works, but I would recommend using async/await instead, because you do not want to consume extra CPU waiting for the future to complete.

...<snip>
    async def send_request(self, a, b):
        self.get_logger().info("send_request: enter")
        self.req.a = a
        self.req.b = b
        return await self.cli.call_async(self.req)

    async def main_loop(self) -> None:
        response = await self.send_request(4, 2)
        self.get_logger().info(
            "Result of add_two_ints: for %d + %d = %d" % (4, 2, response.sum)
        )
...<snip>

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Oct 14, 2022
iuhilnehc-ynos (Contributor):

#1018 (comment) and #1018 (comment) do not seem right to me.
What if users want to make a request with a timeout?

I suggest using

+ self.executor.spin_until_future_complete(self.future)
- rclpy.spin_until_future_complete(self, self.future)

in the example.

Let me share my understanding of this issue.
In the main function,

    executor = MultiThreadedExecutor()
    executor.add_node(minimal_client)        

Please notice that there is a property named executor on the Node. executor.add_node(minimal_client) sets minimal_client.executor to the MultiThreadedExecutor() instance via the property setter (node.executor = self).

After rclpy.spin_until_future_complete is called without the executor parameter in send_request, the global executor runs

executor.add_node(node)

which sets node.executor to the global executor through the same property setter. Because the node's current executor (the MultiThreadedExecutor() instance) is different from the new one, the setter removes the node from it:

rclpy/rclpy/node.py, lines 279 to 283 in e18084f

current_executor = self.executor
if current_executor == new_executor:
    return
if current_executor is not None:
    current_executor.remove_node(self)

(This is why the MultiThreadedExecutor() no longer spins the events of minimal_client.)

rclcpp explicitly uses an exception to enforce that one node should not be added to two executors.
I think rclpy should also follow this strategy.
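
A hypothetical sketch of what such a guard could look like in rclpy's Executor.add_node (illustrative only; this is neither the actual rclcpp nor the actual rclpy code):

def add_node(self, node):
    # Sketch: refuse to silently re-parent a node that already belongs to a
    # different executor, mirroring rclcpp's behavior.
    with self._nodes_lock:
        if node.executor is not None and node.executor is not self:
            raise RuntimeError('Node has already been added to another executor')
        if node in self._nodes:
            return False
        self._nodes.add(node)
        node.executor = self  # simplified: the real property setter calls back into add_node()
        return True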

jwhtkr (Author) commented Oct 17, 2022

All, thanks for all the discussion and potential solutions. I would just like to note that the two solutions that @fujitatomoya and @iuhilnehc-ynos brought up (using async def/await, and using the node's executor directly), while functional, are pretty unintuitive to arrive at on one's own. I'm not sure whether that means the functionality needs to be adjusted to be more user-friendly, the documentation needs to explain these concepts better, or both. But as someone who is switching from ROS 1 (i.e., I'm quite familiar with general ROS concepts like pub/sub, services, topics, etc.), I have spent an inordinate amount of time trying to figure out how to do relatively straightforward things like this. For example, given the documentation (and some older questions/comments on sites like answers.ros.org that seemed to suggest it wouldn't work), I had no idea that rclpy worked with Python's standard async def and await syntax.

Again, I want to express how appreciative I am of all the suggestions, discussion, and work that people have put in to help with these issues I've run into, as well as to develop ROS 2, rclpy, etc. I just worry that the trees of features and improvements have distracted from the forest of overall usability that (at least for me) has been a strong point of ROS 1 for many years.

fujitatomoya (Collaborator):

@molasses11 Thank you so much for being constructive! Those are just suggestions; I am not saying that this is not a problem. There are some areas that we can refine:

  • more concrete and clear examples and documentation on how to use the node's executor.
  • ROS 1 to ROS 2 migration documentation to help avoid this problem.

Let's keep this open; I am not sure I will have the bandwidth for this soon, though.

808brick commented May 5, 2023

I can confirm I am experiencing the same issue.

I tried experimenting with Executors and MutuallyExclusiveCallbackGroup, like molasses11, with code that has timers/loops in it, and was able to get proper (non-blocking) functionality with services by creating another node (inside the main node, using rclpy.create_node()), having this new node send the request, and returning the response it gets back to my main node (via the class instance). A rough sketch of this pattern is shown below.
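
A rough, hypothetical sketch of that helper-node pattern (the class and names are illustrative, not the actual code from this report):

from example_interfaces.srv import AddTwoInts

import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node


class HelperNodeClient(Node):
    """Main node that delegates service calls to a private helper node."""

    def __init__(self):
        super().__init__('minimal_client_with_helper')
        # The helper node owns the client and is never added to the main
        # node's executor, so spinning it does not disturb the main node.
        self._helper = rclpy.create_node('minimal_client_helper')
        self._cli = self._helper.create_client(AddTwoInts, 'add_two_ints')
        self._cli.wait_for_service()
        self._timer = self.create_timer(1.0, self._main_loop)

    def _send_request(self, a, b):
        req = AddTwoInts.Request()
        req.a = a
        req.b = b
        future = self._cli.call_async(req)
        # Spin only the helper node until the response arrives; the main node
        # keeps being spun by its own executor in the meantime.
        rclpy.spin_until_future_complete(self._helper, future)
        return future.result()

    def _main_loop(self):
        response = self._send_request(4, 2)
        self.get_logger().info('4 + 2 = %d' % response.sum)


def main(args=None):
    rclpy.init(args=args)
    node = HelperNodeClient()
    executor = MultiThreadedExecutor()
    executor.add_node(node)
    executor.spin()
    rclpy.shutdown()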

This workaround, however, does not work with actions. With an action, all ROS functionality of the node freezes until the timer/loop finishes (e.g., I am unable to get parameters or make service requests from the same node). And depending on how you do the looping (a while loop with a rate object, rclpy.spin_once, executor.spin_once), I sometimes see the node refuse any new requests once the loop is finished (even if it accepted requests fine while looping).

I've found that using a MutuallyExclusiveCallbackGroup for the service callback, creating a custom executor for the node, and calling executor.spin_once() inside a while loop (make sure executor.spin_once() goes at the end of the loop; otherwise it causes issues) relieves the issue for services.

For now I am reverting my action to a service, since I can do without the feedback and cancellation features in my application, but I hope this issue gets resolved.

I wanted to mention all this in case it helps show a correlation between this issue and client calls blocking future requests of the node.

meyerj commented Jan 12, 2024

We experienced similar issues in different contexts. I think the problem is deeper, not only related to client calls and timers.

Essentially all implementations of the rclpy.spin_*() methods in rclpy/__init__.py are broken for the following reasons, similar to what @iuhilnehc-ynos already outlined above:

  • If they are called recursively from within a callback, which is not explicitly forbidden, the node instance has already been added to that executor. This results in False being returned from Executor.add_node(node), but the return value is not used in the spin functions.
  • Even though the node was not added by this spin call, it is removed from the executor unconditionally in the finally block.
  • So if any of those methods is used recursively, the node is no longer spun afterwards.

I do think that recursive spinning is a desirable feature, and apparently many users rely on it to implement timeouts without blocking all other callbacks. Before the recent #1188, calling a service with a timeout that works in any context was surprisingly hard, and the synchronous call() with a timeout currently does not allow other callbacks to run in between, as recursive spinning would. Even the most basic Python service client example in the official documentation uses rclpy.spin_until_future_complete(self, self.future), but in a method called directly from main() and not by an already running executor, as would be the case if a service were called from another callback.

Suggestion:

So probably the "best" solution would be to try adding the node to the executor for spinning, but keep the return value and only remove it in the finally block if it was actually added by this call? Something like

    try:
        was_added = executor.add_node(node)
        executor.spin_until_future_complete(future, timeout_sec)
    finally:
        if was_added:
            executor.remove_node(node)

or with a context manager that temporarily adds the node to the executor if it was not added before, instead of the repeated try ... finally construct.
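
For example, a minimal sketch of such a context manager (a hypothetical helper, not existing rclpy API):

from contextlib import contextmanager


@contextmanager
def temporarily_added(executor, node):
    # Executor.add_node() returns False when the node is already known to the
    # executor; only remove the node again if this call actually added it.
    was_added = executor.add_node(node)
    try:
        yield
    finally:
        if was_added:
            executor.remove_node(node)

# e.g. inside spin_until_future_complete():
# with temporarily_added(executor, node):
#     executor.spin_until_future_complete(future, timeout_sec)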

Alternatively, there must be a way to call executor.spin_until_future_complete(future, timeout_sec) directly, without a specific node instance, if the node was already added to the executor before, e.g. inside another callback of that node.

Something like

node.executor.spin_until_future_complete(future, timeout_sec)

instead of

rclpy.spin_until_future_complete(node, future, timeout_sec=timeout_sec)

should also solve the problem described in this issue, and similar ones.

When checking how to find the current executor of a node, I also found that Executor.add_node(node) is always called twice, recursively, because it invokes the setter for node.executor, which in turn calls back into executor.add_node(node). This only works because executor._nodes_lock is a recursive lock and add_node(node) checks whether the node was already added to self._nodes.

Afaik there is currently no way to get the executor of the current thread, if any, without a reference to the node? I think asyncio.get_event_loop() uses thread-local storage for that. So calls like rclpy.spin_until_future_complete(node, self.future) from within a callback would also fail, or spin the wrong executor, if used inside a callback that is executed by an executor other than the global one. If there were such a way, then using the executor of the current thread as the default in the global rclpy.spin_*() functions would be reasonable, falling back to the global executor only when the call comes from a thread context without any executor, like the main thread before any spin function was called. Of course that would be a quite severe change to the basics of rclpy...
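
As a thought experiment, a thread-local lookup along the lines of asyncio could look roughly like this (purely hypothetical; executors would have to register themselves here, and none of this exists in rclpy today):

import threading

import rclpy

# Hypothetical registry: each executor would store itself here for the
# threads it drives, before dispatching callbacks on them.
_current = threading.local()


def get_current_executor():
    """Return the executor driving the calling thread, if any.

    Falls back to the global executor, e.g. for the main thread before any
    spin function has been called.
    """
    executor = getattr(_current, 'executor', None)
    return executor if executor is not None else rclpy.get_global_executor()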

Interestingly, in rclcpp recursively spinning a node results in an exception being thrown in Executor::add_node(), so it behaves fundamentally differently (as of version 6f6b5f2). See also @wjwwood's related TODO comment in rclcpp::spin_node_until_future_complete(). In the long term it would be nice if the two most widely used client libraries behaved the same in this respect.

meyerj linked a pull request Jan 20, 2024 that will close this issue