Core Dump support via CrashCatcher and heap storage #210

Merged: salkinium merged 9 commits into modm-io:develop from feature/fault_reporter on May 14, 2019

salkinium
Member

@salkinium salkinium commented May 10, 2019

This uses CrashCatcher to generate the core dump, stores it in volatile memory and reboots, so that the report is available to the application under normal operating conditions (i.e. not in the HardFault handler).

This works as follows:

  1. A hardfault is generated, CrashCatcher performs its job.
  2. CrashCatcher calls into the :platform:fault module, which stores the report in the memory sections designated for the heap, as defined by the linker script's .table.heap section, effectively overwriting the heap.
  3. CrashCatcher reboots the device.
  4. modm's heap initialization does not overwrite the report stored in the heap, and initializes the remaining heap memory.
  5. The application's boot process continues normally and may use the heap, even if it is smaller now.
  6. The application can access the report and decide what to do with it (for example send it out via UART).
  7. To clear the report and regain the full heap memory, the application must reboot the device again.

This gives significantly more freedom to the application for responding to a non-recoverable fault event, even allowing the use of dynamic memory for more complex communication protocols like XPCC over CAN.
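
The store-then-truncate idea from steps 2 to 5 can be sketched roughly as follows. This is only an illustration under assumptions: the marker value, the `ReportHeader` layout and the names `store_report` and `usable_heap` are hypothetical and are not taken from modm or CrashCatcher.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical marker and header layout, chosen for this sketch only.
constexpr std::uint32_t kReportMagic = 0xC0DEDA7A;

struct ReportHeader {
    std::uint32_t magic;  // kReportMagic if a report is present
    std::uint32_t size;   // payload size in bytes, excluding this header
};

struct HeapRegion {
    std::uint8_t* start;
    std::uint8_t* end;    // one past the last byte
};

// Fault path (step 2): overwrite the start of the heap with the report.
// Returns false if the report does not fit into the region.
inline bool store_report(HeapRegion heap, const std::uint8_t* data, std::uint32_t size)
{
    const std::size_t capacity = static_cast<std::size_t>(heap.end - heap.start);
    if (capacity < sizeof(ReportHeader) || size > capacity - sizeof(ReportHeader)) {
        return false;
    }
    const ReportHeader header{kReportMagic, size};
    std::memcpy(heap.start, &header, sizeof(header));
    std::memcpy(heap.start + sizeof(header), data, size);
    return true;
}

// Boot path (step 4): return the part of the heap the allocator may still use.
// Assumes heap.start is 8-byte aligned, so the returned start is aligned too.
inline HeapRegion usable_heap(HeapRegion heap)
{
    const std::size_t capacity = static_cast<std::size_t>(heap.end - heap.start);
    if (capacity < sizeof(ReportHeader)) {
        return heap;
    }
    ReportHeader header;
    std::memcpy(&header, heap.start, sizeof(header));
    if (header.magic != kReportMagic || header.size > capacity - sizeof(header)) {
        return heap;  // no valid report: the full heap is available
    }
    std::size_t used = sizeof(header) + header.size;
    used = (used + 7u) & ~std::size_t{7};  // round up to 8 bytes
    if (used > capacity) {
        used = capacity;
    }
    return HeapRegion{heap.start + used, heap.end};
}
```

In this sketch the report survives the reboot because the allocator is only ever handed the region returned by `usable_heap`; how the marker gets cleared again (step 7) is left out.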

The :platform:fault module may be configured to output just the core; the core and stack; or the core, stack and static data memories (.data and .bss sections), depending on how much heap memory is available.
The resulting report is meant to be used with the CrashDebug program, which supports post-mortem debugging.
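
To reason about which of the three levels fits a given heap, a small size calculation like the one below may help. It is a sketch only: `DumpLevel`, `RegionSizes`, `report_size` and `fits` are hypothetical names, the sizes are whatever your linker map reports, and any per-region overhead of the report format is ignored.

```cpp
#include <cstddef>

// Hypothetical names mirroring the three configurations described above.
enum class DumpLevel { Core, CoreAndStack, CoreStackAndData };

struct RegionSizes {
    std::size_t core;   // registers and fault status
    std::size_t stack;  // main stack
    std::size_t data;   // .data plus .bss
};

// Payload size of a report at the given level (format overhead not included).
constexpr std::size_t report_size(DumpLevel level, RegionSizes s)
{
    switch (level) {
        case DumpLevel::Core:             return s.core;
        case DumpLevel::CoreAndStack:     return s.core + s.stack;
        case DumpLevel::CoreStackAndData: return s.core + s.stack + s.data;
    }
    return 0;
}

// True if a report at this level would fit into heap_bytes of heap memory.
constexpr bool fits(DumpLevel level, RegionSizes s, std::size_t heap_bytes)
{
    return report_size(level, s) <= heap_bytes;
}
```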

TODO:

  • Implement Core functionality
  • Module documentation
  • Implement heap truncation for all allocators
  • Upgrade examples
  • Better (stateless) reporting API: C++ Iterator design
  • SCons tool for best-effort post-mortem debugging
  • Extend SCons postmortem tool with artifact caching based on firmware hash
  • Pre-Report hook to put hardware in safe mode
  • Use Assembly instructions for accessing memory
  • Move CrashCatcher stack to top of heap instead of static memory. 500B is a lot.

The current user-facing reporting API is only FaultReporter::report(lambda), a synchronous API that forces sequential reading and a reboot afterwards. It is convenient for synchronous UART reporting, but doesn't support sending the report in smaller chunks over an asynchronous channel like XPCC over CAN.
I added an InputIterator so the data can be read however you want.
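
To illustrate why an input iterator helps with chunked, asynchronous sending, here is a minimal sketch. `ReportView` and `next_chunk` are hypothetical stand-ins and not the actual FaultReporter interface; the point is only that the caller keeps the iterator and pulls a few bytes whenever the channel is ready.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>

// Hypothetical stand-in for the stored report: a read-only byte range.
class ReportView {
public:
    class Iterator {
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type = std::uint8_t;
        using difference_type = std::ptrdiff_t;
        using pointer = const std::uint8_t*;
        using reference = std::uint8_t;

        explicit Iterator(const std::uint8_t* p) : p_(p) {}
        std::uint8_t operator*() const { return *p_; }
        Iterator& operator++() { ++p_; return *this; }
        Iterator operator++(int) { Iterator old = *this; ++p_; return old; }
        bool operator==(const Iterator& o) const { return p_ == o.p_; }
        bool operator!=(const Iterator& o) const { return p_ != o.p_; }
    private:
        const std::uint8_t* p_;
    };

    ReportView(const std::uint8_t* data, std::size_t size) : data_(data), size_(size) {}
    Iterator begin() const { return Iterator(data_); }
    Iterator end() const { return Iterator(data_ + size_); }
private:
    const std::uint8_t* data_;
    std::size_t size_;
};

// Copies up to max_len bytes into out and advances the iterator.
// Returns the number of bytes copied; 0 means the report is exhausted.
template <typename It>
std::size_t next_chunk(It& it, It end, std::uint8_t* out, std::size_t max_len)
{
    std::size_t n = 0;
    while (n < max_len && it != end) {
        out[n++] = *it;
        ++it;
    }
    return n;
}
```

With a channel that carries 8 bytes per frame, the application would call `next_chunk(it, view.end(), frame, 8)` each time a frame can be queued, and stop when it returns 0.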

cc @rleh @dergraaf @chris-durand

@salkinium
Member Author

The module no longer implements any GPIO blink behavior, since that is out of scope.
The application can override the extern "C" void HardFault_Handler itself to implement just LED blinking, or it can blink after the reboot once the report has been detected.

@salkinium salkinium force-pushed the feature/fault_reporter branch 2 times, most recently from 1a539ee to 47ae3ef Compare May 11, 2019 01:21
@salkinium
Member Author

I added a HeapTable input iterator, which completely hides the fault storage from the heap initializations, deduplicates some code and makes it more readable.
I will add a FaultReporter input iterator as well.
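
The idea of hiding the fault storage from the heap initialization can be sketched with a helper that walks a table of heap regions and only hands out what lies behind the reserved bytes. This is a callback-based stand-in rather than an input iterator, and `Region` and `for_each_free_region` are hypothetical names; the sketch also assumes the report occupies the table from the front, in table order.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical entry of a heap table: one contiguous memory region.
struct Region {
    std::uintptr_t start;
    std::uintptr_t end;  // one past the last byte
};

// Calls visit(start, end) for every region that still has free bytes after
// the first `reserved` bytes of the table have been set aside for the report.
template <std::size_t N, typename Visitor>
void for_each_free_region(const Region (&table)[N], std::size_t reserved, Visitor visit)
{
    for (const Region& r : table) {
        const std::size_t size = r.end - r.start;
        if (reserved >= size) {
            reserved -= size;  // region is fully occupied by the report
            continue;
        }
        visit(r.start + reserved, r.end);
        reserved = 0;
    }
}
```

An allocator's init code written against such a helper never sees the report at all, which is what lets the same logic serve every allocator.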

@salkinium
Member Author

salkinium commented May 11, 2019

The F469 example correctly allows heap usage after hardfault reboot and before reporting:

Can I allocate 20kB? answer: 0x2000109C
Hold Button to cause a Hardfault!

Can I allocate 20kB? answer: 0x2000272C

=== CrashCatcher === HardFault === CoreDump ===

6343030001000000170000002B00000010000... lots of data

Can I allocate 20kB? answer: 0x2000109C
Hold Button to cause a Hardfault!

You can see the first allocation, then a hardfault is triggered, the device reboots and allocates in a different spot, dumps the core, then reboots again and can use the full heap again.

Works for all allocators transparently: newlib

Can I allocate 20kB? answer: 0xC0000008
Hold Button to cause a Hardfault!

Can I allocate 20kB? answer: 0xC0001658

=== CrashCatcher === HardFault === CoreDump ===

6343030001000000170000002B0000000000...

Can I allocate 20kB? answer: 0xC0000008
Hold Button to cause a Hardfault!

and block allocator:

Can I allocate 20kB? answer: 0xC0000004
Hold Button to cause a Hardfault!

Can I allocate 20kB? answer: 0xC0001654

=== CrashCatcher === HardFault === CoreDump ===

6343030001000000170000002B000000000000000...

Can I allocate 20kB? answer: 0xC0000004
Hold Button to cause a Hardfault!

@salkinium salkinium force-pushed the feature/fault_reporter branch 3 times, most recently from 80a34fc to 64d885b Compare May 11, 2019 21:44
@salkinium salkinium force-pushed the feature/fault_reporter branch 4 times, most recently from 1641b04 to 29d541e Compare May 13, 2019 13:40
pass
source = str(source[0])
binary = os.path.splitext(source)[0]+".bin"
subprocess.call(env.subst("$OBJCOPY -O binary {} {}".format(source, binary)), shell=True)
Member Author

@dergraaf I need the binary to compute the firmware hash, but I couldn’t figure out how to depend on env.Bin(source) properly. Any ideas?

Member

In general that's what env.Depends() is for. I can look into it in more detail tomorrow, at the moment I don't have my laptop.

Member Author

Yes, I got to env.Depends too, but I just don't understand SCons well enough to know when to call it. I tried something like env.Depends(action, "path/to/artifact.bin") in the store_artifact function, but I didn't get that to work. SCons is super undocumented.

@salkinium salkinium force-pushed the feature/fault_reporter branch from 29d541e to 8e778fc Compare May 13, 2019 20:32
@salkinium salkinium marked this pull request as ready for review May 13, 2019 20:32
@salkinium
Member Author

I've added SCons support for caching the uploaded firmware ELF and binary files in the build directory, so that you can retrieve them using the firmware's CRC32 sum: scons postmortem firmware={hash}.
This way you don't have to remember what commit the firmware is running on (if you even have a commit for that!!) and the whole thing is automated so you can't forget.
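
For reference, a CRC32 over the firmware binary can be computed as in the sketch below. It assumes the common CRC-32 variant (reflected polynomial 0xEDB88320, as used by zlib); which variant and which byte range the SCons tool actually hashes is not stated here, and `firmware_crc32` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>

// CRC-32 (reflected polynomial 0xEDB88320), computed bit by bit without a
// lookup table. Assumption: this matches the hash used for the cache key.
inline std::uint32_t firmware_crc32(const std::uint8_t* data, std::size_t len)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            // Shift right; XOR in the polynomial if the bit shifted out was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Under that assumption, a device could print the same value alongside its core dump, so the hash to pass to scons postmortem is known without guessing.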

Of course you still have to manually copy the coredump data into the coredump.txt file, but then again this is post-mortem debugging, so you're unlikely to have your computer connected to the device at the time of the fault anyway.

I'm quite happy with this solution, it's also nicely documented I think.

@salkinium salkinium requested a review from dergraaf May 13, 2019 20:38
@salkinium salkinium force-pushed the feature/fault_reporter branch 2 times, most recently from c44241e to 54e879b Compare May 14, 2019 16:53
@salkinium salkinium force-pushed the feature/fault_reporter branch from 54e879b to dfb7e34 Compare May 14, 2019 21:29
@salkinium salkinium merged commit dfb7e34 into modm-io:develop May 14, 2019
@salkinium salkinium deleted the feature/fault_reporter branch May 14, 2019 22:15
2 participants