Conversation
mikaelarguedas left a comment
This looks pretty good. I had a few comments/questions inline, mostly nitpicks.
This could benefit from another set of eyes as well.
rclpy/rclpy/client.py
Outdated
| """ | ||
| Block until the service is available. | ||
|
|
||
| :param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0 |
docstring should be updated to reflect that negative values wait forever
rclpy/rclpy/graph_listener.py
Outdated
```python
# Assumes lock is already held
if self._thread is None:
    self._thread = threading.Thread(target=self._runner)
    self._thread.daemon = True
```
Nit: you can "daemonize" a thread by passing the keyword argument daemon=True to the Thread constructor (same below)
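For illustration, a minimal sketch of the suggested form (the `_runner` function here is a stand-in for the real graph listener loop):

```python
import threading
import time


def _runner():
    # stand-in for the graph listener loop in the real code
    while True:
        time.sleep(1)


# daemon=True at construction replaces the post-construction assignment
thread = threading.Thread(target=_runner, daemon=True)
thread.start()
```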
rclpy/rclpy/client.py
Outdated
```
:param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
:type timeout_sec: float or None
:rtype: bool true if the service is available
```
Nit: use `:returns:` to describe what the function returns and `:rtype:` to say what type is returned:

```
:returns: True if the service is available, otherwise False
:rtype: bool
```

(Same for all function docblocks)
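Applied to `wait_for_service`, the full docblock might then look something like this (a sketch, not the final wording):

```python
def wait_for_service(self, timeout_sec=None):
    """
    Block until the service is available.

    :param timeout_sec: Seconds to wait. Block forever if None. Don't wait if <= 0
    :type timeout_sec: float or None
    :returns: True if the service is available, otherwise False
    :rtype: bool
    """
    ...
```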
rclpy/rclpy/wait_set.py
Outdated
```python
del self._ready_pointers[subscription_pointer]
self._needs_building = True


def add_subscriptions(self, subscriptions):
```
Nit: move this function next to add_subscription as it's done below for timers
rclpy/rclpy/wait_set.py
Outdated
```python
def remove_subscription(self, subscription_pointer):
    del self._subscriptions[subscription_pointer]
    del self._ready_pointers[subscription_pointer]
```
Should this check that subscription_pointer is a key of _ready_pointers before trying to delete it? (same multiple times below)
Nvm, the key is always created when an entity is created.
Though now I'm not sure who is making sure that the ready_pointers entries not updated in https://github.com/ros2/rclpy/pull/127/files#diff-c7dc6d55f68505f4de26646b4541e2cfR151 (the ones that are not ready) are set to False in the existing ready_pointers dict.
Oops, fixed + test updated to catch that in a8e54fc
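For reference, a hypothetical sketch of the behavior being discussed: after a wait, the ready flag of every known entity should be refreshed, not only the ones that became ready (the names here are illustrative, not the actual rclpy code):

```python
def refresh_ready_pointers(ready_pointers, ready_after_wait):
    # Set entities that became ready to True and explicitly reset the
    # rest to False so stale True values don't linger between waits.
    for pointer in ready_pointers:
        ready_pointers[pointer] = pointer in ready_after_wait
```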
```
@@ -0,0 +1,61 @@
# Copyright 2017 Open Source Robotics Foundation, Inc.
```
We will need to extend the timeout of the nose test, as several of these could take some time and the total test time will likely take more than 90 seconds in non-ideal cases.
What are the non-ideal cases? How long do they take? I upped it by 30 seconds (+4ish seconds per new test) in 72e8ea9.
The worst case scenario really depends on the nature of the test, e.g. if a test has a wait_for_service(5s), worst case it will time out and take ~5 seconds. For reference, our service communication tests have a timeout of 30s. On my machine this takes between 90 and 100s, so I guess 120 is fine. We often revisit test timeouts later on given that not all nodes have the same behavior. A good way to estimate the duration would be to run jobs on all platforms with this single test repeated multiple times.
rclpy/test/test_client.py
Outdated
```python
@classmethod
def setUpClass(cls):
    rclpy.init()
    cls.node = rclpy.create_node('TestClient', namespace='/')
```
Question: Is the namespace necessary here? Or should an empty namespace expand to / automatically?
Not necessary. This was a copy/paste error: b9d015f
Looks like the referenced commit didn't modify this line.
rclpy/test/test_utilities.py
Outdated
```python
self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(None))
self.assertGreater(0, rclpy.utilities.timeout_sec_to_nsec(-1))
self.assertEqual(0, rclpy.utilities.timeout_sec_to_nsec(0))
self.assertEqual(1000000000, rclpy.utilities.timeout_sec_to_nsec(1))
```
Nit: Using the number in nanoseconds makes it harder to read. Could we use S_TO_NS in these tests?
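For example, the last assertion could read as follows inside the test (a sketch, assuming S_TO_NS is the seconds-to-nanoseconds constant defined in rclpy.utilities):

```python
from rclpy.utilities import S_TO_NS  # assumed to equal 1000 * 1000 * 1000

# 1 * S_TO_NS reads more clearly than the literal 1000000000
self.assertEqual(1 * S_TO_NS, rclpy.utilities.timeout_sec_to_nsec(1))
```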
```python
# See the License for the specific language governing permissions and
# limitations under the License.

import time
```
Nit: rcl timers are using steady time so I think the tests comparing times should use the same time source
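In Python the steady counterpart to time.time() is time.monotonic(); a sketch of the suggested measurement pattern (the sleep is a stand-in for the timer wait under test):

```python
import time

# time.monotonic() is a steady clock, unaffected by system clock adjustments,
# so it matches the steady time source used by rcl timers better than time.time()
start = time.monotonic()
time.sleep(0.1)  # stand-in for waiting on the timer under test
elapsed = time.monotonic() - start
```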
rclpy/test/test_wait_set.py
Outdated
```python
self.assertFalse(ws.is_ready(gc_pointer))

_rclpy.rclpy_trigger_guard_condition(gc_handle)
# TODO(sloretz) why does the next assertion fail with wait(0)?
```
hmmm good question, would be good to find out why
I've seen that there are some additional things that happen in rmw_fastrtps_cpp's wait when the requested timeout is 0, in particular: https://github.com/ros2/rmw_fastrtps/blob/a6ac02905c55118fb1da636f85c8197dde8ef58e/rmw_fastrtps_cpp/src/rmw_wait.cpp#L236
When I was pulling up that link I realised that it looks to be checking hasTriggered on the guard condition, while getHasTriggered, called at L229, will have cleared the flag already. So I suspect that rmw_wait is incorrectly returning that it timed out even though the guard condition had triggered.
Then, as a flow-on effect, it looks like rcl_wait will clear the guard conditions if rmw_wait timed out: https://github.com/ros2/rcl/blob/1fa4acac8ca42f61bd0f80c7d3ee5fba3d196e33/rcl/src/rcl/wait.c#L621, so this combination of things might explain the difference in behaviour with wait(0).
Have you looked into it already @sloretz? Otherwise I'll continue the investigation.
@dhood nice find! I hadn't looked into it
ros2/rmw_fastrtps#158 should let you change back to wait(0)
rclpy/rclpy/executors.py
Outdated
```python
if timeout_sec is not None and timeout_sec < 0:
    timeout_sec = None
# Wait for all work to complete
if timeout_sec is None or timeout_sec >= 0:
```
Isn't this always evaluating to True? (given the previous 2 lines)
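For reference, a sketch of why the check is always satisfied, given the two preceding lines shown in the diff:

```python
def normalize_timeout(timeout_sec):
    # Negative timeouts are mapped to None first ...
    if timeout_sec is not None and timeout_sec < 0:
        timeout_sec = None
    # ... so at this point timeout_sec is either None or >= 0, which means
    # `timeout_sec is None or timeout_sec >= 0` is always True and the
    # "wait for all work to complete" block could run unconditionally.
    assert timeout_sec is None or timeout_sec >= 0
    return timeout_sec
```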
rclpy/src/rclpy/_rclpy.c
Outdated
```c
  return NULL;
}
if (!PyCapsule_IsValid(pynode, NULL) || !PyCapsule_IsValid(pyclient, NULL)) {
  PyErr_Format(PyExc_ValueError, "Expected two PyCapsule as arguments");
```
This raises ValueError on invalid capsule while rclpy_get_graph_guard_condition raises TypeError
rclpy/rclpy/utilities.py
Outdated
```python
if timeout_sec is None:
    # Block forever
    return -1
elif timeout_sec > -1.0 / S_TO_NS and timeout_sec < 1.0 / S_TO_NS:
```
Could you clarify the reasoning behind the rounding here?
I think it's going to trip users up because of how the behaviour changes if you just slightly change timeout values around the edges of +/-1ns. E.g. if I'm calculating timeout_sec values through some algorithm of my own, I'd expect both a -1.1ns timeout and a -0.9ns timeout to be translated to an indefinite wait. I would suggest just converting to nanoseconds and then checking if the result is negative.
For the +0.9ns case, it could be argued that this should cause a blocking wait of 1ns, but I don't think we need to be too picky about this since in theory the difference between waiting 0ns and 1ns shouldn't be too drastic (in practice, it seems from https://github.com/ros2/rclpy/pull/127/files#diff-5464c403a7a6a4bd5dc7f394ce03bfdbR48 that that's not the case right now). We should at least clarify in the docblock that it doesn't wait if the timeout is <1ns.
The goal was only to avoid == on a floating point number. The bound can be smaller (gtest EXPECT_DOUBLE_EQ uses 4 ULP). I chose 1ns because it's the smallest amount of time rcl_wait supports. How about changing it to 1 ULP on the negative side (block forever if timeout_sec < 0), and 0.5ns on the positive side (don't wait if round(timeout_sec * S_TO_NS) == 0)?
Not sure what I was thinking before, but in 8b2008c any negative number blocks, and a timeout less than 1ns doesn't block due to the integer cast.
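For reference, a sketch of the resulting behavior (assuming S_TO_NS = 10**9; not necessarily the exact code in 8b2008c):

```python
S_TO_NS = 1000 * 1000 * 1000


def timeout_sec_to_nsec(timeout_sec):
    if timeout_sec is None or timeout_sec < 0:
        # rcl_wait treats a negative timeout as "block forever"
        return -1
    # int() truncates, so any timeout under 1ns becomes 0 (don't wait)
    return int(timeout_sec * S_TO_NS)
```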
```python
cli.wait_for_service()
cli.call(GetParameters.Request())

ws.wait(5 * S_TO_NS)
```
I don't think this will be long enough for connext, we tend to use 20s in other service tests
Given that in this case the client and the service server are part of the same participant, I would expect the discovery to be very quick and the 5 sec timeout to be enough.
Talking briefly with @dhood offline, it looks like the current value should be good enough. We would still benefit from a repeated job on this package before merging to convince ourselves and catch potential flakiness before it hits the nightlies.
@mikaelarguedas @dhood Repeated job in progress http://ci.ros2.org/job/ci_linux/3361/
It doesn't appear to be flaky on linux http://ci.ros2.org/job/ci_linux/3361/console#console-section-164. Do the tests take longer on OSX or Windows? I can start a repeated job on them too.
```python
import time
import unittest

from rcl_interfaces.srv import GetParameters
```
rclcpp defines mock interfaces to test this kind of thing rather than relying on existing rcl_interfaces. It may be worth doing the same here to avoid confusion, as an outsider would expect Parameters to be a thing that exists in rclpy when seeing this service being used.
Mind if this goes in a follow-up PR? test_callback_group.py, test_destruction.py, and test_node.py would also benefit from this.
rclpy/rclpy/executors.py
Outdated
```python
timers_ready = _rclpy.rclpy_get_ready_entities('timer', wait_set)
clients_ready = _rclpy.rclpy_get_ready_entities('client', wait_set)
services_ready = _rclpy.rclpy_get_ready_entities('service', wait_set)
wait_set = _WaitSet()
```
This can't rely on __del__ being called to destroy the allocated C wait set. That might just never happen. Instead the code should explicitly call a cleanup function when it doesn't need wait_set anymore.
AFAIK an explicit method cannot be relied on unless it is implemented using the with or try/finally statements. I can add __enter__() and __exit__() methods.
I think __del__ is OK here unless the C structure itself is very big. The cases where __del__ won't be called look acceptable to me.

- __del__ never called
  - the interpreter is shutting down
    - OS will free memory anyways when the interpreter terminates.
  - there is a chain of circular references, cyclic garbage collection is disabled, and one of the objects in the chain has a reference to an instance of WaitSet
    - It's not just the wait set, all the other objects in the chain are leaked too. The user must fix their circular reference bug if they don't want memory leaks.
- __del__ not called for a long time
  - User messed with garbage collector settings
    - They've got bigger problems. If this isn't getting collected then a lot of other stuff isn't either.
  - Something keeping the reference count from hitting zero (ex: an exception was raised and the WaitSet is referenced by a local variable in one of the stack frames in sys.last_traceback and another exception hasn't been raised for a while)
    - Explicit method or not, the class instance and member variables will use memory until they are garbage collected (at which point __del__ will be called unless the interpreter is shutting down). An explicit method can only free the C structure. I'm not sure how big the C structure is; if it's really big then it is worth freeing it sooner.
I am pretty sure that without fiddling with any settings the GC won't collect an instance "immediately" but might only do that after minutes. And if the user wants to ensure that the C structure is deleted there needs to be a way to do so (without triggering the GC). Therefore I think it is better to expect that the calling code invokes a cleanup function explicitly. If the caller fails to do so, the __del__ method can then do it for them to be on the safe side.
Resolved by adding WaitSet.destroy() and having the executor use it via with in d8521cd.
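For reference, a hypothetical sketch of that shape: an idempotent destroy(), context-manager support for the executor's with statement, and __del__ only as a safety net (the C-extension calls are stand-ins, not the real rclpy bindings):

```python
def _allocate_c_wait_set():
    return object()  # stand-in for the real C-extension allocation


def _destroy_c_wait_set(handle):
    pass  # stand-in for the real C-extension cleanup


class WaitSet:
    def __init__(self):
        self._handle = _allocate_c_wait_set()

    def destroy(self):
        # Explicit, idempotent cleanup so callers control when the C
        # structure is freed instead of waiting on garbage collection
        if self._handle is not None:
            _destroy_c_wait_set(self._handle)
            self._handle = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.destroy()

    def __del__(self):
        # Safety net only; callers are expected to use `with` or destroy()
        self.destroy()
```

The executor can then write `with WaitSet() as wait_set: ...` and the cleanup happens even if an exception is raised.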
```python
from rclpy.wait_set import WaitSet as _WaitSet


class GraphListenerSingleton:
```
How is this singleton being destroyed in case of e.g. shutting down rclpy? When is the thread being stopped?
In d10ee26 it checks if rclpy was shut down once per second.
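For reference, a minimal sketch of that polling pattern (rclpy.ok() stands in for the actual shutdown check; the real thread waits on graph events rather than sleeping):

```python
import threading
import time

import rclpy


class GraphListenerSingleton:
    def __init__(self):
        self._thread = threading.Thread(target=self._runner, daemon=True)
        self._thread.start()

    def _runner(self):
        # Wake at least once per second so the thread notices rclpy shutdown
        while rclpy.ok():
            time.sleep(1.0)  # stand-in for waiting on graph events with a 1s timeout
```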
Force-pushed from 8b2008c to be25673: ~5% less overhead in wait_for_ready_callbacks()
rclpy/rclpy/executors.py
Outdated
```python
# Process ready entities one node at a time
for node in nodes:
    for tmr in [t for t in timers if wait_set.is_ready(t.timer_pointer)]:
```
Shouldn't this list comprehension only consider the timers of the current node?
Same for subscriptions, clients and services below.
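In other words, something along these lines (a hypothetical sketch; the attribute names are illustrative):

```python
def ready_timers_by_node(nodes, wait_set):
    # Only consider the timers owned by the current node rather than
    # the combined list of timers from every node
    for node in nodes:
        yield node, [t for t in node.timers if wait_set.is_ready(t.timer_pointer)]
```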
@dirk-thomas replaced comprehensions with … Also refactored some code to a sub-generator.

Improvements result in ~9% less overhead in wait_for_ready_callbacks().

Edit: Still not fast enough :( Build with 7ee5ead

Another try: http://ci.ros2.org/job/ci_linux-aarch64/668/

Edit, more stuff: I don't expect this to pass, but it will give me more info in the morning.
Changed since review. Executor code moved to #140
Lots has been split off, changed, and merged since this PR. I'll close this expecting to open smaller PRs with these features in the future.
Better errors from codegen
This adds a blocking wait_for_service() to the client. It is implemented using a GraphListener which runs in its own thread, similar to rclcpp. This is needed to prevent wait_for_service() from deadlocking a single-threaded executor if someone calls it from a subscriber callback.
There is a small amount of refactoring in the executor for the purpose of reusing code in the graph listener. Also, after talking with @mikaelarguedas, negative timeouts now block forever instead of returning immediately.
connects to #58