Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronous service client call broken in timer callback #1016

Closed
jwhtkr opened this issue Sep 30, 2022 · 18 comments
Closed

Synchronous service client call broken in timer callback #1016

jwhtkr opened this issue Sep 30, 2022 · 18 comments

Comments

@jwhtkr
Copy link

jwhtkr commented Sep 30, 2022

Bug report

Required Info:

  • Operating System:
    • Ubuntu 20.04
  • Installation type:
    • binaries
  • Version or commit hash:
    • foxy (the 2022-09-28 release that was just pushed through the apt repos)
  • DDS implementation:
    • Fast-RTPS
  • Client library (if applicable):
    • rclpy (of course)

Steps to reproduce issue

Use the minimal service server from the examples

from example_interfaces.srv import AddTwoInts

import rclpy
from rclpy.node import Node


class MinimalService(Node):

    def __init__(self):
        super().__init__('minimal_service')
        self.srv = self.create_service(AddTwoInts, 'add_two_ints', self.add_two_ints_callback)

    def add_two_ints_callback(self, request, response):
        response.sum = request.a + request.b
        self.get_logger().info('Incoming request\na: %d b: %d' % (request.a, request.b))

        return response


def main(args=None):
    rclpy.init(args=args)

    minimal_service = MinimalService()

    rclpy.spin(minimal_service)

    rclpy.shutdown()


if __name__ == '__main__':
    main()

and a slightly modified service client:

from example_interfaces.srv import AddTwoInts
import rclpy
from rclpy.node import Node

from rclpy.executors import MultiThreadedExecutor
from rclpy.callback_groups import MutuallyExclusiveCallbackGroup


class MinimalClient(Node):
    def __init__(self):
        super().__init__("minimal_client_async")
        self.cb = None
        self.timer_cb = MutuallyExclusiveCallbackGroup()
        self.cli = self.create_client(
            AddTwoInts, "add_two_ints", callback_group=self.cb
        )
        while not self.cli.wait_for_service(timeout_sec=1.0):
            self.get_logger().info("service not available, waiting again...")
        self.req = AddTwoInts.Request()

        # Create a timer for the main loop execution
        self.timer = self.create_timer(
            1.0, self.main_loop, callback_group=self.timer_cb
        )

    def send_request(self, a, b):
        self.get_logger().info("send_request: enter")
        self.req.a = a
        self.req.b = b

        # Only works once, then never executes the time again
        # self.future = self.cli.call_async(self.req)
        # rclpy.spin_until_future_complete(self, self.future)
        # self.get_logger().info("send_request: exit")
        # return self.future.result()

        # Used to work, but blocks now
        return self.cli.call(self.req)

    def main_loop(self) -> None:
        response = self.send_request(4, 2)
        self.get_logger().info(
            "Result of add_two_ints: for %d + %d = %d" % (4, 2, response.sum)
        )


def main(args=None):
    rclpy.init(args=args)

    minimal_client = MinimalClient()

    executor = MultiThreadedExecutor()
    executor.add_node(minimal_client)
    executor.spin()

    rclpy.shutdown()


if __name__ == "__main__":
    main()

Run the service server in one terminal and the client in another.

Expected behavior

You repeatedly see the client call the server every second and get a response.

Actual behavior

The client hangs at the first call

Additional information

Previous to the 2022-09-28 patch release of foxy, the above minimal client was functional. However, the call_async(...) version that is currently commented out didn't work before in that it would run once and then never fire the timer again. This is despite the fact that in rclcpp this is exactly one of the recommended methods to implement this kind of behavior, and, from the documentation, would seem like the recommended way to do it inrclpy as well. It didn't work, so we found an alternative method: calling the synchronous call(...) with a multi-threaded executor. As stated, this then broke with the push of the 2022-09-28 release. This seriously hampers our ability to use ROS2 as there is essentially no way to perform the above logic, and is a breaking change.

@apockill
Copy link
Contributor

apockill commented Oct 2, 2022

Have you found any workarounds? I think this is affecting my codebase as well. It's pretty heavy in its use of the MultiThreadedExecutor and synchronous callbacks within services/timers.

@usako-lab
Copy link

usako-lab commented Oct 2, 2022

This commit affects.
It will work if you revert the following.

@@ -757,7 +759,11 @@ class MultiThreadedExecutor(Executor):
         except ConditionReachedException:
             pass
         else:
-            self._executor.submit(handler)
+            def handler_wrapper(handler):
+                handler()
+                if handler.exception() is not None:
+                    raise handler.exception()
+            self._executor.submit(handler_wrapper(handler))

about "handler_wrapper()"

  • if there is "handler()", it will work single thread. (may be block occurs)
  • missing return value("return handler")

@apockill
Copy link
Contributor

apockill commented Oct 2, 2022

I just found this as well. This breaks all usages of the MultiThreadedExecutor at the moment that rely on actual multithreading. I presume handle_wrapper(handler) could be replaced with partial(handle_wrapper, handler) to restore functionality.

@fujitatomoya
Copy link
Collaborator

@apockill @usako-lab really appreciate for the issue and contribution, that really helps. I will check the code and address the issue tomorrow.

@ramlab-jose
Copy link

We also ran into this issue and can confirm that the temporary solution proposed by @usako-lab fixes the problem. Unfortunately there seems to be convenient way to revert back to an older version via apt or rosdep, meaning the only option is to build from source.

It is rather inconvenient that this is the case, as it means any breaking change that is pushed officially will break our system when installing from the official sources (which always do the latest available version). If anyone has experienced the same, how did you deal with this problem?

@apockill
Copy link
Contributor

apockill commented Oct 3, 2022

@ramlab-jose it's a bit hacky, but I subclassed MultithreadedExecutor and monkeypatched the function with a corrected function.

This fit my workflow well because in our codebase we were already subclassing and adding some extra functionality.

@fujitatomoya
Copy link
Collaborator

fujitatomoya commented Oct 3, 2022

  • Confirmed this issue can be observed in rolling mainline.
  • The following patch will do fix the problem for sync call, but async.
diff --git a/rclpy/rclpy/executors.py b/rclpy/rclpy/executors.py
index 5c131e2..665338c 100644
--- a/rclpy/rclpy/executors.py
+++ b/rclpy/rclpy/executors.py
@@ -14,6 +14,7 @@

 from concurrent.futures import ThreadPoolExecutor
 from contextlib import ExitStack
+from functools import partial
 import inspect
 import multiprocessing
 from threading import Condition
@@ -769,7 +770,7 @@ class MultiThreadedExecutor(Executor):
                 handler()
                 if handler.exception() is not None:
                     raise handler.exception()
-            self._executor.submit(handler_wrapper(handler))
+            self._executor.submit(partial(handler_wrapper, handler))

     def spin_once(self, timeout_sec: float = None) -> None:
         self._spin_once_impl(timeout_sec)

@jwhtkr
Copy link
Author

jwhtkr commented Oct 3, 2022

@ramlab-jose Just to note, you don't actually have to build from source to fix this issue with the above patch. Using the binary install I edited the file /opt/ros/foxy/lib/python3.8/site-packages/rclpy/executors.py with the above patch and it worked (I re-built my workspace, but I'm not sure that's even necessary). At least I tested it with the MWE I posted in the original issue. However, that's still pretty hack-y, so it's obviously not a long-term solution. It's just a little easier short-term solution than building from source.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Oct 3, 2022
@fujitatomoya
Copy link
Collaborator

the call_async(...) version that is currently commented out didn't work before in that it would run once and then never fire the timer again. This is despite the fact that in rclcpp this is exactly one of the recommended methods to implement this kind of behavior, and, from the documentation, would seem like the recommended way to do it inrclpy as well.

this sounds reasonable to me, user expects that could work with rclpy as well. probably we can create another issue to address and discuss on this one.

@wjwwood
Copy link
Member

wjwwood commented Oct 4, 2022

So, while this should be fixed, I want to point out that relying on the MultiThreadedExecutor in this way is fragile. I think a better way which also works with single threaded executors would look like this (tested it on rolling):

diff --git a/test_client_orig.py b/test_client.py
index 323d585..c25bd96 100644
--- a/test_client_orig.py
+++ b/test_client.py
@@ -2,7 +2,6 @@ from example_interfaces.srv import AddTwoInts
 import rclpy
 from rclpy.node import Node
 
-from rclpy.executors import MultiThreadedExecutor
 from rclpy.callback_groups import MutuallyExclusiveCallbackGroup
 
 
@@ -23,7 +22,7 @@ class MinimalClient(Node):
             1.0, self.main_loop, callback_group=self.timer_cb
         )
 
-    def send_request(self, a, b):
+    async def send_request(self, a, b):
         self.get_logger().info("send_request: enter")
         self.req.a = a
         self.req.b = b
@@ -34,11 +33,11 @@ class MinimalClient(Node):
         # self.get_logger().info("send_request: exit")
         # return self.future.result()
 
-        # Used to work, but blocks now
-        return self.cli.call(self.req)
+        # Better way? using asyncio
+        return await self.cli.call_async(self.req)
 
-    def main_loop(self) -> None:
-        response = self.send_request(4, 2)
+    async def main_loop(self) -> None:
+        response = await self.send_request(4, 2)
         self.get_logger().info(
             "Result of add_two_ints: for %d + %d = %d" % (4, 2, response.sum)
         )
@@ -49,9 +48,7 @@ def main(args=None):
 
     minimal_client = MinimalClient()
 
-    executor = MultiThreadedExecutor()
-    executor.add_node(minimal_client)
-    executor.spin()
+    rclpy.spin(minimal_client)
 
     rclpy.shutdown()
 

That works fine, but it is still important in this situation that the timer and service client are in separate callback groups. Because while control is returned to the executor on the await lines, the callback for the timer is still not done yet and therefore it prevents other callbacks from the same callback group from being run, even if they are ready and the executor is idle.

@usako-lab
Copy link

it works without "partial", belated.
https://docs.python.org/ja/3/library/concurrent.futures.html#concurrent.futures.Executor.submit

@fujitatomoya
Copy link
Collaborator

it works without "partial", belated.

thanks, i have tried it already, it fixes this problem but regenerates #983

@JakubNejedly
Copy link

I can confirm replicating error on rolling mainline.

  • Operating System:
    Ubuntu 22.04
  • Installation type:
    binaries
  • Version or commit hash:
    rolling (the 2022-09-28 release that was just pushed through the apt repos)
    DDS implementation:
    Cyclone DDS
    Client library (if applicable):
    rclpy

@fujitatomoya
Copy link
Collaborator

folks, really appreciate the all of the information and iterating with us.

#1017 has been reverted including all distro foxy/galactic/humble, i will close this issue.

@matias-pavez-dx
Copy link

Other than just reverting in #1017, I would really appreciate adding a regression test to make sure this issue never happens again. For example, making sure nested callbacks with different callback groups work.

We lost several days because of this issue. Production code depends on this feature.

@clalancette
Copy link
Contributor

Other than just reverting in #1017, I would really appreciate adding a regression test to make sure this issue never happens again. For example, making sure nested callbacks with different callback groups work.

Yes, I agree. We wanted to get this fixed and out quickly, so we didn't stop to do that, but a test would be ideal.

Would you like to submit a pull request that adds one?

@JakubNejedly
Copy link

Is there any apt release with the fix ? If not, when it will be created, please ? The newest release, which i can found, is from 19. September.

@matias-pavez-dx
Copy link

@JakubNejedly
You can get the patched rclpy version using this:

apt install ros-foxy-rclpy>=1.0.11-*

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants