Skip to content

Conversation

@SenRamakri
Copy link
Contributor

Description

Design document for Crash Reporting feature in MbedOS. The idea here is to auto-reboot the system after a fatal error has occurred to bring the system back in stable state, without losing the RAM contents where we have the error information collected, and we can then save this information reliably for example to file system or to be send to ARM Pelion cloud.

Pull request type

[ ] Fix
[ ] Refactor
[ ] Target update
[x] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

@SenRamakri SenRamakri requested review from a team and kjbracey October 26, 2018 18:14
@kegilbert
Copy link
Contributor

kegilbert commented Oct 26, 2018

Nice! Glad to see this taking shape. Few small questions from my end:

The config option to prevent an endless reboot cycle is a solid idea, but would be nice if there was a time based mechanism to it as well.

The current design is great for a target that is rebooting every 30 seconds to not clog whatever delivery mechanism they have for retrieving the error state, but is weaker if someone wants a deployed device to always reboot on a failure in the case of a potential crash every few weeks but not spam their network with crash logs if an update causes the device to get stuck in a reboot loop.

I don't have a particular implementation in mind, but something like a watchdog that would clear the reset counter at some configurable time period could be helpful.

I'm not a huge fan of having the reboot error detection logic be something a user places in their main program, it feels a little intrusive. Was there a reason the logic in the main program wasn't handled in the reboot callback?

@SenRamakri
Copy link
Contributor Author

@kegilbert - Thanks for your review and please see my comments below.

  1. Yes your opinion is valid, there may be a requirement to reset the reboot count. So the current implementation provides this API to reset the count, old crash info using mbed_reset_reboot_error_info().
    So, the idea here is to provide a mechanism to reset that, but the actual policy on when to reset that is application dependent. So we provide that API and application implementation can clear it as required, this may be a day, week or month(or based on something else) and that would depend on application and is outside scope of MbedOS error handling. One thing to note here is there is no specific API to reset the reboot count alone. Do you think that's valuable?

  2. Actually the reboot error logic is completely done by Mbed-OS error handling implementation. The only thing the user get is a callback with the error-info during reboot(please see the flowchart). This is to provide the application an opportunity to record the fact that a reboot happened due to fatal error and the app may record the context as well. No reboot handling mechanism is part of application. Once the callback is completed the MbedOS error handling implementation will continue to process the captured error according to the flowchart.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why specific enum in return ? We can have int32_t as return like other API's with 0 - Success (Error found) -1 (Error not found). Will there be any other special case in mbed_error_status_t

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, there could be Invalid argument(fault == NULL), Item not found because there is no context or MBED_SUCCESS. In addition we can capture the module reporting the error using mbed_error_status_t(in this case its MBED_MODULE_PLATFORM).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even when we return error codes, it is good to have return type as int to avoid type casting have the flexible API. Since error codes are used extensively in storage picked up one example here
https://github.com/ARMmbed/mbed-os/blob/master/features/storage/filesystem/littlefs/LittleFileSystem.cpp#L287

Copy link
Contributor

@0xc0170 0xc0170 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just styling issues

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excetion Handler -> Exception typo (second blue rectangle from the top here)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exmaple -> example - yellow rectangle about WEAK attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would specify that reboot is specifically a warm-reboot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding style update (run it through astyle) - same for other code examples here

@0xc0170 0xc0170 requested a review from bulislaw October 30, 2018 13:54
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly lost on the twin callback + read APIs here. Not seeing why this callback is necessary. Can't the application just call mbed_get_reboot_fault_context if they want to find out if it was a reboot-due-to-error? One example suggests that, but another example shows it taking a copy in the callback, as if it's going to be lost.

Copy link
Contributor Author

@SenRamakri SenRamakri Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The callback is necessary for few reasons. It provides opportunity for the application to clear any error situation based on error codes before the main starts. Its also required in the case where a reboot-max is configured where in the system will halt at the maximum reboot count, but even in this case callback will be invoked, so the application side can know the cause of error. Also making calls to retrieve the error context and fault context could be expensive as well as they take time during main() initialization and also the caller needs to allocate memory. So having this callback to set a flag(or something like that) can bring some optimization as well. And, if the application design doesn't require the callback, they don't have to override that, so its flexible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the callback is a good idea as it's disconnect error reporting from main application flow.

@SenRamakri SenRamakri force-pushed the sen_CrashReportingDesign branch from 4baeb93 to 2682904 Compare October 30, 2018 22:31
@SenRamakri
Copy link
Contributor Author

@0xc0170 - I have fixed the review comments, please review.

@0xc0170 0xc0170 dismissed their stale review October 31, 2018 12:48

Will review

Copy link
Member

@bulislaw bulislaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some queries below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we going to achieve that? I would say that modifying all the linkerscripts is not feasible and asking all the HW vendors to the the change even less.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the callback is a good idea as it's disconnect error reporting from main application flow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will we handle mix of errors and successfull reboots over a time? Eg. The device is crashing once a day due to memory leak, but between the runs it works fine and the crashes are not fatal from the functionality point of view. As a developer I'd like to avoid tight loop of consecutive reboots, but once in a while crash is not something that should, eventually, trigger device halt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bulislaw If I'm understanding your point correctly, it sounds like my first concern (#8561 (comment)). A method to reset the reboot count that could be periodically called would prevent an eventual device halt unless the device is rebooting faster than the reset rate which could clog whatever delivery mechanism is being used to report the errors logs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who will be "periodically calling" this reset?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SenRamakri ?

I was thinking we can either leave the implementation to the user and have them call the reset function at whatever frequency they'd like, or have the reboot handler have its own timer/watchdog that is configured through the config settings.

Copy link
Contributor Author

@SenRamakri SenRamakri Nov 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bulislaw and @kegilbert - These are great questions and I was thinking about adding the functionality for periodic reset initially but I also thought each application might have different policies and rules around when to do this periodic reset and we also have to lock out resources like memory, CPU if we are use evt_q, timer etc(we should also think about sleep behavior). So, I think its better for application design to implement it if they need as they know what resources they have at their disposal.

@SenRamakri SenRamakri force-pushed the sen_CrashReportingDesign branch from 476bd50 to 60677d2 Compare November 7, 2018 16:30
@SenRamakri
Copy link
Contributor Author

@0xc0170 @kegilbert @bulislaw @deepikabhavnani @kjbracey-arm - I think I have addressed/answered all of the queries. Please review/approve if you are ok with this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocated where? When? How?
A single line for requirements and assumptions feels really light to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmonr - This is captured in detail in Detailed Design section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumedly by stopping the invocation of the main application firmware?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats correct, it. Once the limit is reached we would be stopping the system from entering main(). Let me clarify that in the doc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open-ended question. How do we expect this to be tested?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, we are going to test this using Greentea similar to how system_test feature is tested. But note that, its still not possible to test the reboot limit mechanism using Greentea, for that I have a test application which I'm using and is bench tested - The app is at - https://github.com/ARMmbed/mbed-os-example-crash-reporting
This app will also serve as the example application for this feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would specify that reboot is specifically a warm-reboot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At what point is the section of RAM zero'd?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be explicitly zero-ed by calling the reboot-error reset APIs described further down in the document or if the system goes through a cold-reset it will be left in un-initialized state. That's why we have the crc as part of stored data to find the integrity of the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another silly question. Is this able to live side by side with mbed-trace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, of course, this doesn't have any impact or conflict with mbed-trace.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stored CRC? Of what? Generated by what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is a CRC stored as part of error context. This is calculated in mbed_error() function as explained above, but I'll update it to capture more details.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General question. Is there a specific reason that this feature is being referred to as "crash reporting"?
Imo, "crash" is a harsh word. Mainly wondering why something liike "exception reporting" wouldn't also work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's good question, the reason is "exception" is an ARM terminology specifically used for processor exceptions and we also have fatal errors(which are different from fatal exceptions) and the current implementation works for both, thats why we have the word "crash" which comprises both. Also the original requirement is also written with crash terminology.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's just me, but across the document, exception, crash, and error are used interchangably. Would prefer if only one were used, otherwise when we look back at the config options outside of the context of this PR, the config options will appear disparate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point, the crash word comes from requirement which comprises both fatal exceptions and fatal errors. Exceptions are specifically referring to processor exceptions and errors refer to fatal errors in the system which ends up calling mbed_error() interface. Let me clarify these terminologies in assumption section. Hope that helps.

@SenRamakri
Copy link
Contributor Author

@cmonr - I have updated the doc with your review comments fixes, please review.

@SenRamakri SenRamakri force-pushed the sen_CrashReportingDesign branch from 12f1894 to a0e42fa Compare November 19, 2018 02:46
@0xc0170
Copy link
Contributor

0xc0170 commented Nov 19, 2018

@cmonr Please review. Once approved, this shall go to rollup (I'll label it now)

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 19, 2018

Entering CI (rollup inclusion)

Info: This PR has been re-bundled into a new rollup PR (#8800).

No further work is needed here, as once that PR is merged, this PR will also be closed and marked as merged.
If any more commits are made in this PR, this PR will remain open and have to go through CI on its own.

@cmonr
Copy link
Contributor

cmonr commented Nov 22, 2018

CI started.

@0xc0170 0xc0170 merged commit fadaa65 into ARMmbed:master Nov 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants