-
Notifications
You must be signed in to change notification settings - Fork 3k
Design document for Crash Reporting feature in MbedOS #8561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Nice! Glad to see this taking shape. Few small questions from my end: The config option to prevent an endless reboot cycle is a solid idea, but would be nice if there was a time based mechanism to it as well. The current design is great for a target that is rebooting every 30 seconds to not clog whatever delivery mechanism they have for retrieving the error state, but is weaker if someone wants a deployed device to always reboot on a failure in the case of a potential crash every few weeks but not spam their network with crash logs if an update causes the device to get stuck in a reboot loop. I don't have a particular implementation in mind, but something like a watchdog that would clear the reset counter at some configurable time period could be helpful. I'm not a huge fan of having the reboot error detection logic be something a user places in their main program, it feels a little intrusive. Was there a reason the logic in the main program wasn't handled in the reboot callback? |
|
@kegilbert - Thanks for your review and please see my comments below.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why specific enum in return ? We can have int32_t as return like other API's with 0 - Success (Error found) -1 (Error not found). Will there be any other special case in mbed_error_status_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, there could be Invalid argument(fault == NULL), Item not found because there is no context or MBED_SUCCESS. In addition we can capture the module reporting the error using mbed_error_status_t(in this case its MBED_MODULE_PLATFORM).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even when we return error codes, it is good to have return type as int to avoid type casting have the flexible API. Since error codes are used extensively in storage picked up one example here
https://github.com/ARMmbed/mbed-os/blob/master/features/storage/filesystem/littlefs/LittleFileSystem.cpp#L287
0xc0170
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just styling issues
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excetion Handler -> Exception typo (second blue rectangle from the top here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exmaple -> example - yellow rectangle about WEAK attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would specify that reboot is specifically a warm-reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coding style update (run it through astyle) - same for other code examples here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly lost on the twin callback + read APIs here. Not seeing why this callback is necessary. Can't the application just call mbed_get_reboot_fault_context if they want to find out if it was a reboot-due-to-error? One example suggests that, but another example shows it taking a copy in the callback, as if it's going to be lost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The callback is necessary for few reasons. It provides opportunity for the application to clear any error situation based on error codes before the main starts. Its also required in the case where a reboot-max is configured where in the system will halt at the maximum reboot count, but even in this case callback will be invoked, so the application side can know the cause of error. Also making calls to retrieve the error context and fault context could be expensive as well as they take time during main() initialization and also the caller needs to allocate memory. So having this callback to set a flag(or something like that) can bring some optimization as well. And, if the application design doesn't require the callback, they don't have to override that, so its flexible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the callback is a good idea as it's disconnect error reporting from main application flow.
4baeb93 to
2682904
Compare
|
@0xc0170 - I have fixed the review comments, please review. |
bulislaw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Some queries below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are we going to achieve that? I would say that modifying all the linkerscripts is not feasible and asking all the HW vendors to the the change even less.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the callback is a good idea as it's disconnect error reporting from main application flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How will we handle mix of errors and successfull reboots over a time? Eg. The device is crashing once a day due to memory leak, but between the runs it works fine and the crashes are not fatal from the functionality point of view. As a developer I'd like to avoid tight loop of consecutive reboots, but once in a while crash is not something that should, eventually, trigger device halt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bulislaw If I'm understanding your point correctly, it sounds like my first concern (#8561 (comment)). A method to reset the reboot count that could be periodically called would prevent an eventual device halt unless the device is rebooting faster than the reset rate which could clog whatever delivery mechanism is being used to report the errors logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who will be "periodically calling" this reset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking we can either leave the implementation to the user and have them call the reset function at whatever frequency they'd like, or have the reboot handler have its own timer/watchdog that is configured through the config settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bulislaw and @kegilbert - These are great questions and I was thinking about adding the functionality for periodic reset initially but I also thought each application might have different policies and rules around when to do this periodic reset and we also have to lock out resources like memory, CPU if we are use evt_q, timer etc(we should also think about sleep behavior). So, I think its better for application design to implement it if they need as they know what resources they have at their disposal.
476bd50 to
60677d2
Compare
|
@0xc0170 @kegilbert @bulislaw @deepikabhavnani @kjbracey-arm - I think I have addressed/answered all of the queries. Please review/approve if you are ok with this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allocated where? When? How?
A single line for requirements and assumptions feels really light to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmonr - This is captured in detail in Detailed Design section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presumedly by stopping the invocation of the main application firmware?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, thats correct, it. Once the limit is reached we would be stopping the system from entering main(). Let me clarify that in the doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open-ended question. How do we expect this to be tested?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, we are going to test this using Greentea similar to how system_test feature is tested. But note that, its still not possible to test the reboot limit mechanism using Greentea, for that I have a test application which I'm using and is bench tested - The app is at - https://github.com/ARMmbed/mbed-os-example-crash-reporting
This app will also serve as the example application for this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would specify that reboot is specifically a warm-reboot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At what point is the section of RAM zero'd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be explicitly zero-ed by calling the reboot-error reset APIs described further down in the document or if the system goes through a cold-reset it will be left in un-initialized state. That's why we have the crc as part of stored data to find the integrity of the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another silly question. Is this able to live side by side with mbed-trace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, of course, this doesn't have any impact or conflict with mbed-trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stored CRC? Of what? Generated by what?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is a CRC stored as part of error context. This is calculated in mbed_error() function as explained above, but I'll update it to capture more details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question. Is there a specific reason that this feature is being referred to as "crash reporting"?
Imo, "crash" is a harsh word. Mainly wondering why something liike "exception reporting" wouldn't also work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good question, the reason is "exception" is an ARM terminology specifically used for processor exceptions and we also have fatal errors(which are different from fatal exceptions) and the current implementation works for both, thats why we have the word "crash" which comprises both. Also the original requirement is also written with crash terminology.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's just me, but across the document, exception, crash, and error are used interchangably. Would prefer if only one were used, otherwise when we look back at the config options outside of the context of this PR, the config options will appear disparate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the point, the crash word comes from requirement which comprises both fatal exceptions and fatal errors. Exceptions are specifically referring to processor exceptions and errors refer to fatal errors in the system which ends up calling mbed_error() interface. Let me clarify these terminologies in assumption section. Hope that helps.
60677d2 to
12f1894
Compare
|
@cmonr - I have updated the doc with your review comments fixes, please review. |
…onflicts with current implementation of mbed_error_printf
12f1894 to
a0e42fa
Compare
|
@cmonr Please review. Once approved, this shall go to rollup (I'll label it now) |
|
Entering CI (rollup inclusion) Info: This PR has been re-bundled into a new rollup PR (#8800). No further work is needed here, as once that PR is merged, this PR will also be closed and marked as merged. |
|
CI started. |
Description
Design document for Crash Reporting feature in MbedOS. The idea here is to auto-reboot the system after a fatal error has occurred to bring the system back in stable state, without losing the RAM contents where we have the error information collected, and we can then save this information reliably for example to file system or to be send to ARM Pelion cloud.
Pull request type