Layered handling of node and (sub-)system errors #48

chcorbato · 2020-09-30T06:39:59Z

(from #47)

This is in the context of our example case, the laser_driver error. We want to elaborate on the layered approach we discussed in the last MROS meeting. This is how I interpret our desired design (please comment if something is not correct or clear):

1. First, the laser_driver's own error-handling code tries to recover from the error in the ErrorProcessing transition state.

(from here on, it is a related but different issue)

2. If it does not succeed (I guess that means the node does not transition to Active), the ModeManager tries to recover from the error using feature/rules. For this, @jginesclavero is adding a rule to the SystemModes file of our system.

3. If there is no rule, or there is one but applying it does not reach the alternative MODE(s) of the laser_driver either, the ModeManager reports to the MROS Metacontroller that the corresponding (sub)system(s) MODE(s) are not reachable.
(see issue for the continuation of the handling of errors at the higher layers)
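
A minimal sketch of the layered escalation described above, in plain Python with no ROS dependency. The layer names follow the text; `handle_error`, `Outcome`, and the lambda stand-ins are hypothetical and only illustrate the order in which the layers get a chance to recover.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Each layer is a (name, recovery function) pair. A recovery function
# returns True if the target MODE was reached after its attempt.
Layer = Tuple[str, Callable[[], bool]]


@dataclass
class Outcome:
    recovered_by: Optional[str]  # name of the layer that recovered, or None
    attempted: List[str]         # layers tried, in order


def handle_error(layers: List[Layer]) -> Outcome:
    """Try each layer in order and stop at the first one that recovers."""
    attempted: List[str] = []
    for name, recover in layers:
        attempted.append(name)
        if recover():
            return Outcome(recovered_by=name, attempted=attempted)
    # No layer succeeded: this is the point at which the ModeManager would
    # report the unreachable MODE(s) to the MROS Metacontroller.
    return Outcome(recovered_by=None, attempted=attempted)


# Hypothetical stand-ins for the real recovery attempts.
layers = [
    ("laser_driver ErrorProcessing", lambda: False),  # node cannot recover itself
    ("ModeManager rule", lambda: True),               # rule reaches an alternative MODE
]
result = handle_error(layers)
```

Here `result.recovered_by` names the layer that resolved the error; a value of `None` would correspond to the case that is escalated to the higher layers.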

continuation

Currently this will be implemented in a passive way, by offering that information (see #43).
But, since the current target MODE cannot be reached, we were wondering (in a discussion with TUD and URJC) whether the ModeManager should report this actively and system-wide, so that the operator or any supervisory system (e.g. the MROS Metacontroller) can handle it.

Proposal: Since not being able to reach the target MODE is a deviation from the expected and desired behaviour, we propose that the ModeManager uses diagnostics to report this. The MROS Metacontroller will subscribe to such diagnostic messages.
(@fmrico @jginesclavero @marioney please comment if I missed something or did not convey it correctly)
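
To make the proposal concrete, here is a rough sketch of what such a report could carry. It is a plain-Python stand-in, not the ROS message type: the numeric levels mirror the constants of `diagnostic_msgs/DiagnosticStatus`, while `ModeDiagnostic`, `mode_deviation_status`, the key names, and the mode name `NORMAL` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

# Severity levels as defined by diagnostic_msgs/DiagnosticStatus.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


@dataclass
class ModeDiagnostic:
    """Plain-Python stand-in for a DiagnosticStatus message (not the ROS type)."""
    level: int
    name: str
    message: str
    values: Dict[str, str] = field(default_factory=dict)


def mode_deviation_status(part: str, target: str, actual: str) -> ModeDiagnostic:
    """Build the status the ModeManager could report for one (sub)system."""
    if target == actual:
        return ModeDiagnostic(OK, part, "target mode reached")
    return ModeDiagnostic(
        ERROR,
        part,
        "target mode not reachable",
        values={"target_mode": target, "actual_mode": actual},
    )


# "NORMAL" is a hypothetical mode name used only for this example.
status = mode_deviation_status("laser_driver", "NORMAL", "DEGRADED")
```

In a real node the equivalent message would presumably be published as part of a `DiagnosticArray`, which a subscriber such as the MROS Metacontroller could filter by `name` and `level`.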

What do you think @norro ?


norro · 2020-09-30T07:30:38Z

What the mode manager will actually already sense is the deviation between the requested state/mode and the actual state/mode. This is not yet merged to master, but available in the feature/rules branch, because it is necessary in order to decide when to apply rules. See feature/rules:mode_inference.cpp.
Reporting these deviations to diagnostics is an interesting idea.

This is again a question of timing, though. When a state/mode transition is requested, there is always an immediate deviation, since systems/nodes take some time to perform the transition. So the mode manager has to decide when to report the deviation, i.e. when to assume that the transition is taking too long and the deviation can therefore be considered erroneous. Do you have an idea how/when to do this? After half a second? A second? ... @chcorbato

norro · 2020-09-30T07:34:05Z

Suggestion:

1. When a deviation is detected, wait a certain time t_0 before considering it an erroneous deviation.
2. After t_0, try to apply a rule if an appropriate rule exists. If no rule exists, try to recover the node/system.
3. Wait a certain time t_1 and, if nothing has happened, report the erroneous deviation, e.g., through diagnostics.

(t_0 and t_1 have to be configurable obviously)
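
The suggestion above can be sketched as a small classification of a deviation by its age. This is an illustration in plain Python, not code from system_modes: `Phase` and `classify_deviation` are hypothetical names, and the sketch assumes that t_1 starts counting when t_0 expires.

```python
from enum import Enum


class Phase(Enum):
    IN_TRANSITION = "in_transition"  # younger than t_0: expected deviation
    RECOVERING = "recovering"        # t_0 elapsed: apply a rule or recover
    REPORT = "report"                # t_0 + t_1 elapsed: report via diagnostics


def classify_deviation(elapsed: float, t_0: float, t_1: float) -> Phase:
    """Classify a deviation by how long it has lasted (all times in seconds).

    Assumes t_1 is counted from the moment t_0 expires.
    """
    if elapsed < t_0:
        return Phase.IN_TRANSITION
    if elapsed < t_0 + t_1:
        return Phase.RECOVERING
    return Phase.REPORT


# Placeholder values; both limits are meant to be configurable.
phase = classify_deviation(elapsed=0.7, t_0=0.5, t_1=1.0)
```

With these placeholder limits, a deviation that has lasted 0.7 s falls into the recovery phase, and anything lasting 1.5 s or longer would be reported.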
/cc @chcorbato @ralph-lange

chcorbato · 2020-09-30T08:11:08Z

I very much like your suggestion of a configurable time limit for each management layer!

Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?

norro · 2020-09-30T08:45:11Z

@chcorbato The feature/rules branch is merely a micro-ROS experiment for now, btw. For "2. If it does not succeed [...] tries to recover from the error using rules" I consider metacontrol (reconfiguration actions?) to be in charge.
We are even happy to drop the system modes rules feature completely once the metacontrol part for this task is integrated with system modes.

chcorbato · 2020-09-30T09:02:05Z

I see. Currently @jginesclavero is trying to get results with that feature this week, by adding such a rule to the Pilot-URJC system model.

I propose we keep this test for this week and analyse the result afterwards (usefulness, problems...), so that we can then make an informed decision on moving the feature to the metacontrol part.

What do you think @norro @jginesclavero ?
@norro are you available to keep supporting @jginesclavero on this today and tomorrow?

norro · 2020-09-30T09:08:50Z

Yes, I am available today and tomorrow to help with upcoming issues.

jginesclavero · 2020-09-30T09:13:07Z

Hi @norro @chcorbato !
I was testing the feature/rules branch and it works as we expected. In short, I have defined a rule that changes to DEGRADED mode (navigation with pointcloud_to_laser) if the laser_driver is not in the active state. The mode is changed immediately, and it works really nicely.
I have done some navigation tests where I force a laser failure: the mode changes correctly, the laser is replaced by the pointcloud, and the navigation continues.
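
The logic of the rule described above, written out as a plain Python function for illustration. This is not the rule syntax used by the feature/rules branch; `apply_laser_rule` and the mode name `NORMAL` are hypothetical, while `DEGRADED`, `laser_driver`, and the active state come from the description.

```python
from typing import Dict


def apply_laser_rule(node_states: Dict[str, str], current_mode: str) -> str:
    """Return the system mode after evaluating the laser_driver rule.

    node_states maps node names to lifecycle state labels. If the
    laser_driver is missing or not active, the rule selects DEGRADED
    (navigation with pointcloud_to_laser); otherwise the mode is kept.
    """
    if node_states.get("laser_driver") != "active":
        return "DEGRADED"
    return current_mode


# Forced laser failure: the driver has left the active state.
mode = apply_laser_rule({"laser_driver": "inactive"}, current_mode="NORMAL")
```

Treating a missing `laser_driver` entry the same as a non-active one is a design choice of this sketch, not something stated in the thread.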

marioney · 2020-09-30T09:21:34Z

> Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?

From the metacontroller's point of view, the reasoning cycle is very slow (about 2 s), so we're safe with half of that, I guess. I'm not sure how that time affects navigation2, but I'm guessing it does not.

norro · 2021-07-23T10:51:17Z

Closing this issue soon, as it has successfully been shown in the MROS pilots.

chcorbato mentioned this issue Sep 30, 2020

Unconfigured lifecycle state management #47

Closed

norro mentioned this issue Oct 16, 2020

Simple prototype of error handling rules #29

Merged
