Analysis of FW of mc4plus to solve board's disappearance from the ETH network #174
An idea for the simplified setup is:
|
Here is some news: I can now systematically reproduce the bug on a dedicated setup. I have prepared the simplified setup which is able to run 4 possible tests: either the For the The Linux PC sends a series of commands via UDP which every 5 seconds trigger the cycle of { This is the standard sequence of commands which Here are the results of the first run of tests.
Question: could the lack of sensors and actuators cause the problem? Now I can start a deeper analysis of what happens. For this I can use the debugger which I have fitted to each board. |
Yesterday, I did some further tests w/ different optimization levels. With a reduced optimization level I spotted a diagnostic message over the trace port which tells about a failure of memory allocation. I will now look at possible causes. |
I think that I found the problem. There is surely memory erosion due to reallocation of RAM at each start of the iteration (and hence at every relaunch of I am now extensively testing a solution which has already proven to successfully run 100+ cycles of My goal is to issue a |
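The reallocation problem described above can be illustrated with a minimal sketch. It is not the fix that went into the firmware: `service_t`, `service_start`, `service_stop` and `run_cycles` are hypothetical names, and the sketch only assumes that a service can keep its RAM across stop/start cycles instead of requesting new memory at every start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical service object: names are illustrative, not taken from icub-firmware. */
typedef struct
{
    unsigned char *buffer;   /* working RAM, requested from the heap only when needed */
    size_t         capacity; /* size of buffer in bytes */
    int            running;  /* 1 while the service is active */
} service_t;

/* Start the service. RAM is requested only if there is none yet or it is too small,
   so repeated start/stop cycles do not keep eating the heap. */
static int service_start(service_t *s, size_t needed, unsigned *allocations)
{
    assert(NULL != s);
    if ((NULL == s->buffer) || (s->capacity < needed))
    {
        unsigned char *p = realloc(s->buffer, needed);
        if (NULL == p)
        {
            return -1; /* out of memory: the caller must report it */
        }
        s->buffer = p;
        s->capacity = needed;
        (*allocations)++;
    }
    s->running = 1;
    return 0;
}

/* Stop the service but keep its RAM for the next start. */
static void service_stop(service_t *s)
{
    assert(NULL != s);
    s->running = 0;
}

/* Run a number of start/stop cycles and return how many heap requests were made. */
static unsigned run_cycles(unsigned cycles)
{
    service_t s = {0};
    unsigned allocations = 0;
    for (unsigned i = 0; i < cycles; i++)
    {
        if (0 != service_start(&s, 256, &allocations))
        {
            break;
        }
        service_stop(&s);
    }
    free(s.buffer);
    return allocations;
}
```

With this pattern, `run_cycles(100)` makes a single heap request however many cycles are run, whereas allocating at every start without releasing would grow with the number of cycles.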
Is it something that may happen also without restarting the That's great news anyways 💪 |
It may. In #174 (comment) I wrote: |
It would be interesting to see what happens if you manually restart |
I can give it a try |
I just tried running the
I don't know if it is related. I tried to reboot the motors, run it for 4 times, and the 5th again the same error. Here the log: Here the application file I used: https://github.com/dic-iit/robots-configuration/blob/1849cb4b3fe4d433344242e48c3675729f92dc99/iCubGenova04/icub_test.xml |
Very good. It is as I expected. From |
Hi @S-Dafarra, I have produced two binaries for the ems-test-v103.36.zip They are compatible w/ the latest devel / master release. I have tested the solutions added to these binaries w/ 100 cycles of mc service and the boards can still be pinged. |
Hi @marcoaccame! That's great! Unfortunately I am not in the lab at the moment, but the robot should be free if you want to test it. Otherwise, I will flash it in the following days! |
@S-Dafarra, I will be in the lab next week. In the meantime I will document the changes. |
Sounds good. Indeed, since the problem on the wrists might be difficult to check, I think it would be definitely worthwhile to proceed anyway. Then, with time we can check if the wrist problem appears again. |
I am sure that the changes I have done solve some problems (the ones verified on my setup) and we can surely add into robotology. However, before approving the PR it is our habit to test it on the robot. I will have a go at it next week. Maybe we can do it together so that we test in the same way the robot is effectively used. |
Hi, here is the code used to compile the two test binaries of #174 (comment) I will produce a PR into robotology that will be merged after the tests on the robot. |
What happened?
I have also spotted a memory leak in the |
Results of the tests of iCubGenova04: OK Did extensive tests on iCubGenova04 during the whole day of 19 mar 2021. Used binaries for Launched The boards did not disappear. The current release of master instead caused the board to disappear after 2 or 3 start / stop cycles. The used binaries work fine. |
What now: It may be worth a new quick test on the robot. |
Nice @marcoaccame 👍🏻 cc @Nicogene |
Tested once more on 24 mar 2021 after the rebase of icub-firmware; I rebuilt the binaries to be sure the binaries in here were correct. The tests were OK, hence I merged the two PRs: robotology/icub-firmware-build#25 and #176 |
Thanks @marcoaccame ! Closing, feel free to open it up again if needed. |
reopened following robotology/icub-tech-support#673 (comment) |
Tested a mechanism which does the following:
The log area cannot hold more than 15 bytes, so I will log:
I did a basic test for this mechanism, now I need to fill all the fatal error handlers and test some cases. Then, I will release a new version for the |
Tested w/ code such as this:
typedef struct
{
    uint32_t millisecondsfromstart;
    uint8_t handlertype;
    uint8_t handlererrorcode;
    uint8_t calledbyanirqhandler;
    uint8_t idofthelastscheduledthread;
    uint8_t forfutureuse0;
    uint8_t forfutureuse1;
    uint16_t signature;
} fatal_error_message_t;
EO_VERIFYsizeof(fatal_error_message_t, 12)
typedef struct
{
    uint64_t par64;
    uint16_t par16;
} fatal_error_params_t;
EO_VERIFYsizeof(fatal_error_params_t, 16)
typedef union
{
    fatal_error_message_t message;
    fatal_error_params_t params;
} fatal_error_t;
EO_VERIFYsizeof(fatal_error_t, 16)
Code Listing: Data structures exchanged across a restart of the
#include "eEsharedServices.h"
static volatile fatal_error_t detectedfatalerror = {0};
static uint8_t detectedsize = 0;
// read ipc memory
if(ee_res_OK == ee_sharserv_ipc_userdefdata_get((uint8_t*)&detectedfatalerror, &detectedsize, sizeof(fatal_error_t)))
{
    // there is something.
    // i clear ipc ram
    ee_sharserv_ipc_userdefdata_clr();
    if((detectedsize <= sharserv_base_ipc_userdefdata_maxsize) && (0x1234 == detectedfatalerror.message.signature))
    {
        // it is of the correct size. i use it to send a diagnostic message
        uint16_t par16 = detectedfatalerror.params.par16;
        uint64_t par64 = detectedfatalerror.params.par64;
        eOerrmanDescriptor_t errdes = {0};
        errdes.code = eoerror_code_get(eoerror_category_Debug, eoerror_value_DEB_tag06);
        errdes.sourcedevice = eo_errman_sourcedevice_localboard;
        errdes.sourceaddress = 0;
        errdes.par16 = par16;
        errdes.par64 = par64;
        eo_errman_Error(eo_errman_GetHandle(), eo_errortype_error, "RESTARTED after FATAL error", NULL, &errdes);
    }
}
Code Listing: Code executed by the
// in here there is just a test of emission of ipc data w/ a restart
static fatal_error_t tobewritten = {0};
tobewritten.message.millisecondsfromstart = osal_system_ticks_abstime_get() / 1000; // better using another way so that we avoid calling an svc
tobewritten.message.handlertype = 1;
tobewritten.message.handlererrorcode = 3;
tobewritten.message.calledbyanirqhandler = 0;
tobewritten.message.idofthelastscheduledthread = 9;
tobewritten.message.forfutureuse0 = 1;
tobewritten.message.forfutureuse1 = 2;
tobewritten.message.signature = 0x1234;
// write in ipc memory and ... restart
ee_sharserv_ipc_userdefdata_set((uint8_t*)&tobewritten, sizeof(tobewritten));
ee_sharserv_sys_restart();
Code Listing: Code to be included in the fatal error handlers. It writes some info in the IPC memory and then forces a restart.
This code was tested on an
[ERROR] (EO? tsk2 @S5:m632:u723)-> {0x4000006 p16 0x0201, p64 0x0900030100001600, dev 0, adr 0}:
DEBUG: tag06. INFO = RESTARTED after FATAL error
Sadly, the IPC messaging was designed years ago to carry up to 15 bytes. So far we use only 8 bytes, so we can maybe add some more without changing the IPC implementation. Extending it to more bytes is surely feasible, but it requires reflashing eLoader, eUpdater and eApplication. |
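The `p64` and `p16` values in the log above can be read back into the fields of `fatal_error_message_t`. The following is a minimal sketch of such a decoder for the receiving side. It assumes the little-endian byte layout of the union shown earlier, which is consistent with the logged values; `decoded_fatal_error_t` and `decode_fatal_error` are hypothetical names, not part of icub-firmware.

```c
#include <stdint.h>

/* Decoded view of the two diagnostic parameters.
   Field names follow fatal_error_message_t; the signature is not carried by par64/par16. */
typedef struct
{
    uint32_t millisecondsfromstart;
    uint8_t  handlertype;
    uint8_t  handlererrorcode;
    uint8_t  calledbyanirqhandler;
    uint8_t  idofthelastscheduledthread;
    uint8_t  forfutureuse0;
    uint8_t  forfutureuse1;
} decoded_fatal_error_t;

/* Hypothetical helper. It uses shifts instead of a union overlay, so the result
   does not depend on the endianness of the machine that decodes the log.
   Assumption: the board stored the message in little-endian order. */
static decoded_fatal_error_t decode_fatal_error(uint64_t par64, uint16_t par16)
{
    decoded_fatal_error_t d;
    d.millisecondsfromstart      = (uint32_t)(par64 & 0xFFFFFFFFu); /* bytes 0..3 */
    d.handlertype                = (uint8_t)((par64 >> 32) & 0xFFu); /* byte 4 */
    d.handlererrorcode           = (uint8_t)((par64 >> 40) & 0xFFu); /* byte 5 */
    d.calledbyanirqhandler       = (uint8_t)((par64 >> 48) & 0xFFu); /* byte 6 */
    d.idofthelastscheduledthread = (uint8_t)((par64 >> 56) & 0xFFu); /* byte 7 */
    d.forfutureuse0              = (uint8_t)(par16 & 0xFFu);         /* byte 8 */
    d.forfutureuse1              = (uint8_t)((par16 >> 8) & 0xFFu);  /* byte 9 */
    return d;
}
```

Applied to the logged pair (`p64 0x0900030100001600`, `p16 0x0201`), this gives 5632 ms from start, handler type 1, error code 3, not called by an IRQ handler, last scheduled thread 9, and the two spare bytes 1 and 2, which are the values written by the test code.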
We have just successfully tested a new FW for the mc4plus board which is able to send diagnostics info to yarprobotinterface in case of fatal errors. In such a case, it forces a restart and prints info such as the following (where we artificially forced a stack overflow from thread runDO 20 sec after the start of the board):
[**INFO**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 953m 401u:
(code 0x0000003b, par16 0x0000 par64 0x0000000000000000) -> SYS: the board is bootstrapping + .
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 7u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + RESTARTED after FATAL error
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 117u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + @ 20000 ms
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 235u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + handler OSAL, code 0xe5
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 348u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + type osal_stackovf
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 467u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + IRQHan SVCall Thread runDO
[**ERROR**] from BOARD 10.0.1.1 (l-hv3-hand), src LOCAL, adr 0, time 1s 955m 581u:
(code 0x04000000, par16 0x0000 par64 0x0b0be50300004e20) -> DEBUG: tag00 + ipsr 11, tid 11
List. Board 10.0.1.1 has detected a fatal error (first message of type
In case of no error, the FW works just as well as the latest devel version. Tests were done today on iCubGenova09. |
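The `ipsr 11` entry in the list above corresponds to the `SVCall` label printed one message earlier. As a hedged sketch, assuming the board's MCU is an ARM Cortex-M and that byte 6 of `par64` carries the IPSR value (as the logged `0x0b` suggests), the number can be turned into a name as follows. `ipsr_name` and `ipsr_from_par64` are hypothetical helpers, not firmware functions.

```c
#include <stdint.h>

/* Name of an ARM Cortex-M exception given the exception number read from IPSR.
   0 means thread mode (no exception active); 16 and above are external interrupts. */
static const char *ipsr_name(uint8_t ipsr)
{
    switch (ipsr)
    {
        case 0:  return "Thread mode";
        case 1:  return "Reset";
        case 2:  return "NMI";
        case 3:  return "HardFault";
        case 4:  return "MemManage";
        case 5:  return "BusFault";
        case 6:  return "UsageFault";
        case 11: return "SVCall";
        case 12: return "DebugMonitor";
        case 14: return "PendSV";
        case 15: return "SysTick";
        default: return (ipsr >= 16) ? "external IRQ" : "reserved";
    }
}

/* Assumption: byte 6 of par64 (little-endian) holds the IPSR value at the time of the fault. */
static uint8_t ipsr_from_par64(uint64_t par64)
{
    return (uint8_t)((par64 >> 48) & 0xFFu);
}
```

For the logged `par64 0x0b0be50300004e20`, `ipsr_from_par64` returns 11, which `ipsr_name` maps to `SVCall`.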
I shall soon produce a PR. |
Hi @marcoaccame these are two logs containing the failures you're investigating in this issue:
Related issue: robotology/icub-tech-support#673 |
Cool, thanks @GiulioRomualdi for reporting on this. |
See robotology/icub-tech-support#673 (comment) for the looming outdoor demos that would strongly benefit from a definite solution to this issue |
Hi @GiulioRomualdi, the logs are obtained w/ a FW version of the mc4plus which is older than the one described in here, so the logs don't contain any useful (new) information. so, now we need to:
|
Just for general alignment, that version of the firmware was released as part of the v2021.08.0 distro, so just using icub-firmware-build v1.21.0 is probably enough. |
Following the issue in here, we have identified a critical behaviour that should be investigated and solved: some mc4plus boards disappear from the ETH network when running yarprobotinterface.
What happens is that an mc4plus board which offers the motion control service may stop transmitting and receiving UDP frames. The board maintains PWM actuation and seems to be stuck: it does not respond to ping and is not seen by FirmwareUpdater.
We have observed the board 10.0.1.24 to have such a misbehavior but also the board 10.0.1.20.
A careful analysis is required using multiple investigation paths. Amongst them:
mc4plus and ems in an attempt to spot bugs in the differences;
mc4plus board so that we can trace exact behaviour;
ems board;
cc: @S-Dafarra @DanielePucci @maggia80