Layered handling of node and (sub-)system errors #48

Open
chcorbato opened this issue Sep 30, 2020 · 9 comments

@chcorbato

chcorbato commented Sep 30, 2020

from (#47)

This is in the context of our example case of the laser_driver error. We want to elaborate on the layered approach we discussed in the last MROS meeting. This is how I interpret our desired design (please comment if something is not correct or clear):

  1. First the laser_driver code for handling errors tries to recover from the error in the ErrorProcessing transition state.

(from here it is a related but different issue)

  2. If it does not succeed (I guess that means the node does not transition to Active), the ModeManager tries to recover from the error using the rules feature (feature/rules). For this, @jginesclavero is adding a rule in the SystemModes file of our system.
  3. If there is no rule, or there is one but the alternative MODE(s) of the laser_driver are still not reached after applying it, the ModeManager reports to the MROS Metacontroller that the corresponding (sub)system(s) MODE(s) are not reachable.
    (see issue for the continuation of the handling of errors at the higher layers)
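
To make the layering concrete, here is a minimal sketch of the escalation order described in the list above. It is only an illustration, not code from system_modes or MROS: the `handle_error` function, the layer names, and the lambdas standing in for each layer's recovery attempt are all hypothetical.

```python
from typing import Callable, List, Tuple

# Each layer is a (name, handler) pair; a handler returns True if it recovered.
Layer = Tuple[str, Callable[[], bool]]


def handle_error(layers: List[Layer]) -> str:
    """Try each layer in order and return the name of the one that recovered.

    Returns "unhandled" if every layer fails, i.e. the error has to be
    reported further up (operator or another supervisory system).
    """
    for name, try_recover in layers:
        if try_recover():
            return name
    return "unhandled"


# Hypothetical stand-ins for the three layers in the list above.
layers = [
    ("node: ErrorProcessing", lambda: False),  # laser_driver cannot recover itself
    ("ModeManager: rule", lambda: False),      # no rule, or alternative MODE not reached
    ("MROS Metacontroller", lambda: True),     # higher layer handles the error
]

handled_by = handle_error(layers)
print(handled_by)
```

Each layer only gets involved once the one below it has given up, which is the property the list is after.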

continuation

Currently this will be implemented in a passive way, by offering that information (see #43)
But since the current target MODE cannot be reached, we were wondering (in a discussion with TUD and URJC) whether the ModeManager should report this actively and system-wide, so that the operator or any supervisory system (e.g. the MROS Metacontroller) can handle it.

Proposal: Since not being able to reach the target MODE is a deviation from the expected and desired behaviour, we propose that the ModeManager uses diagnostics to report this. The MROS Metacontroller will subscribe to such diagnostic messages.
(@fmrico @jginesclavero @marioney please comment if I missed something or did not convey it correctly)
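
As a rough sketch of what such a report could carry: the dataclass below is a plain-Python stand-in that mirrors the fields of `diagnostic_msgs/DiagnosticStatus` (`level`, `name`, `message`, `hardware_id`, `values`), so it runs without ROS. Using a dict for `values` (the real message uses a list of key/value pairs), the `mode_deviation_status` helper, the choice of `ERROR` as level, and the mode name `NORMAL` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict

# Level constants as defined in diagnostic_msgs/DiagnosticStatus.
OK, WARN, ERROR, STALE = 0, 1, 2, 3


@dataclass
class DiagnosticStatus:
    """Plain-Python stand-in for the fields of diagnostic_msgs/DiagnosticStatus."""
    level: int
    name: str
    message: str
    hardware_id: str = ""
    values: Dict[str, str] = field(default_factory=dict)


def mode_deviation_status(part: str, target_mode: str, actual_mode: str) -> DiagnosticStatus:
    """Build the status a ModeManager could publish for one (sub)system."""
    if target_mode == actual_mode:
        return DiagnosticStatus(OK, part, "target mode reached")
    return DiagnosticStatus(
        ERROR,  # assumption: an unreachable target mode is reported as ERROR
        part,
        "target mode not reachable",
        values={"target_mode": target_mode, "actual_mode": actual_mode},
    )


# Hypothetical example: laser_driver should be in NORMAL but is stuck in DEGRADED.
status = mode_deviation_status("laser_driver", "NORMAL", "DEGRADED")
print(status.level, status.message, status.values)
```

A subscriber such as the Metacontroller could then filter on `level` and read the target and actual modes from `values`.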

What do you think @norro ?

@norro
Collaborator

norro commented Sep 30, 2020

What the mode manager will actually already sense is the deviation between the requested state/mode and the actual state/mode. This is not yet merged to master, but available in the feature/rules branch, because it is necessary in order to decide when to apply rules. See feature/rules:mode_inference.cpp.
Reporting these deviations to diagnostics is an interesting idea.

This is again a question of timing, though. When a state/mode transition is requested, there is always and immediately a deviation, since systems/nodes take some time to perform the transition. So the mode manager will have to decide when to report the deviation, i.e. when to assume that the transition is taking too long and the deviation can therefore be considered erroneous. Do you have an idea how/when to do this? After half a second? A second? ... @chcorbato

@norro
Collaborator

norro commented Sep 30, 2020

Suggestion:

  1. When a deviation is detected, wait a certain time t_0 before considering it an erroneous deviation
  2. After t_0, try to apply a rule, if an appropriate rule exists. If no rule exists, try to recover the node/system
  3. Wait a certain time t_1 and if nothing happened, report the erroneous deviation, e.g., through diagnostics

(t_0 and t_1 have to be configurable obviously)
/cc @chcorbato @ralph-lange
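
The three steps above can be sketched as a small decision function. This is a hedged illustration rather than the actual mode manager logic: the `Action` enum and `next_action` are hypothetical names, and it assumes t_1 is counted from the moment t_0 expires.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"        # requested and actual state/mode agree
    WAIT = "wait"        # transition may still be in progress
    RECOVER = "recover"  # recovery window: apply a rule, or recover the node/system
    REPORT = "report"    # report the erroneous deviation, e.g. through diagnostics


def next_action(requested: str, actual: str, elapsed: float,
                t_0: float, t_1: float) -> Action:
    """Decide what to do `elapsed` seconds after the deviation was first detected."""
    if requested == actual:
        return Action.NONE
    if elapsed < t_0:
        return Action.WAIT
    if elapsed < t_0 + t_1:
        return Action.RECOVER
    return Action.REPORT


# Example with t_0 = t_1 = 0.5 s (illustrative values only).
for elapsed in (0.2, 0.7, 1.2):
    print(elapsed, next_action("active", "inactive", elapsed, 0.5, 0.5))
```

A caller would poll this while the deviation persists and start the recovery attempt once, when the result first becomes `RECOVER`.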

@chcorbato
Author

chcorbato commented Sep 30, 2020

I very much like your suggestion of a configurable time limit for each management layer!

Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?

@norro
Collaborator

norro commented Sep 30, 2020

@chcorbato The feature/rules branch is merely a micro-ROS experiment so far, btw. For "2. If it does not succeed [...] tries to recover from the error using rules" I consider metacontrol (reconfiguration actions?) in charge.
We are even happy to drop the system modes rules feature completely once the metacontrol part for this task is integrated with system modes.

@chcorbato
Author

I see. Currently @jginesclavero is trying to get results with that feature this week, by adding such a rule in the Pilot-URJC system model.

I propose we keep this test for this week and analyse the result afterwards (usefulness, problems...) to then make an informed decision to move the feature to the metacontrol part.

What do you think @norro @jginesclavero ?
@norro are you available to keep supporting @jginesclavero on this today and tomorrow?

@norro
Collaborator

norro commented Sep 30, 2020

Yes, I am available today and tomorrow to help with upcoming issues.

@jginesclavero

Hi @norro @chcorbato !
I was testing the feature/rules branch and it works as we expected. In short, I have defined a rule that changes to DEGRADED mode (navigation with pointcloud_to_laser) if the laser_driver is not in the active state. The mode is changed immediately; it works really nicely.
I have done some navigation tests where I force a laser failure: the mode changes correctly, the laser is replaced by the pointcloud, and the navigation continues.
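
For readers who have not seen the rules feature, the behaviour of such a rule can be illustrated roughly as follows. This is not the system_modes rule syntax or implementation; the `Rule` class, `apply_rules`, and the mode name `NORMAL` are hypothetical and only mimic the "laser_driver not active, so switch to DEGRADED" logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rule:
    """Hypothetical rule: if `part` is not in `expected_state`, switch to `new_mode`."""
    part: str
    expected_state: str
    new_mode: str


def apply_rules(rules: List[Rule], states: Dict[str, str], current_mode: str) -> str:
    """Return the system mode after applying the first rule whose condition is violated."""
    for rule in rules:
        if states.get(rule.part) != rule.expected_state:
            return rule.new_mode
    return current_mode


rules = [Rule(part="laser_driver", expected_state="active", new_mode="DEGRADED")]

healthy = apply_rules(rules, {"laser_driver": "active"}, "NORMAL")
failed = apply_rules(rules, {"laser_driver": "inactive"}, "NORMAL")
print(healthy, failed)
```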

@marioney

> Do you have suggestions for these times in the case of navigation2 @fmrico @marioney @jginesclavero @lbajo ?

From the metacontroller point of view, the reasoning cycle is very slow (about 2 sec), so we're safe with half of that, I guess. I'm not sure how that time affects navigation2, but I'm guessing it does not.

@norro
Copy link
Collaborator

norro commented Jul 23, 2021

Closing this issue soon as it has successfully been shown in the MROS pilots.
