Core Dump support via CrashCatcher and heap storage #210

salkinium · 2019-05-10T18:28:53Z

This uses CrashCatcher to generate the core dump, stores it in volatile memory, and reboots the device so that the report is available to the application under normal operating conditions (i.e. not inside the HardFault handler).

This works as follows:

1. A hardfault is generated, CrashCatcher performs its job.
2. CrashCatcher calls into the :platform:fault module which stores the report in the memory sections designated for the heap as defined by the linkerscript's .table.heap section, effectively overwriting the heap.
3. CrashCatcher reboots the device.
4. modm's heap initialization does not overwrite the report stored in the heap, and initializes the remaining heap memory.
5. The application's boot process continues normally and may use the heap, even if it is smaller now.
6. The application can access the report and decide what to do with it (for example send it out via UART).
7. To clear the report and regain the full heap memory, the application must reboot the device again.
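
The heap handling in the steps above can be modelled on the host as follows. This is only a sketch, not modm's implementation: `Region` and `usable_heap` are hypothetical names, and it assumes the report occupies the lowest addresses of the heap table and that the heap is initialized above it.

```python
from dataclasses import dataclass


@dataclass
class Region:
    start: int  # first address of the region
    end: int    # one past the last address of the region

    @property
    def size(self):
        return self.end - self.start


def usable_heap(table, report_size):
    """Return the heap regions left after reserving `report_size` bytes.

    Assumption: the report is stored from the start of the first region
    and may spill over into the following regions, in table order.
    """
    remaining = report_size
    usable = []
    for region in table:
        if remaining >= region.size:
            # Region is fully occupied by the report, nothing left for the heap.
            remaining -= region.size
            continue
        # Region is partially (or not at all) occupied: keep the upper part.
        usable.append(Region(region.start + remaining, region.end))
        remaining = 0
    return usable
```

With an empty report (`report_size == 0`) the function returns the table unchanged, which corresponds to the normal boot with the full heap available.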

This gives significantly more freedom to the application for responding to a non-recoverable fault event, even allowing the use of dynamic memory for more complex communication protocols like XPCC over CAN.

The :platform:fault module may be configured to output just the core; the core and stack; or the core, stack and static data memories (.data and .bss sections), depending on how much heap memory is available.
The resulting report is meant to be used with the CrashDebug program, which supports post-mortem debugging.
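
To get a feeling for which of the three levels fits a given target, a back-of-the-envelope helper like the one below may be useful. It is an illustration only: the level names are made up for this sketch and are not the module's option values, and the sizes have to be estimated by hand (for example from the linker map).

```python
def pick_report_level(heap_size, core_size, stack_size, static_size):
    """Pick the most detailed report level whose estimated size fits the heap.

    All sizes are in bytes. Returns a hypothetical level name, or None if
    not even the core registers fit into the heap.
    """
    levels = [
        ("core+stack+data", core_size + stack_size + static_size),
        ("core+stack", core_size + stack_size),
        ("core", core_size),
    ]
    # Levels are ordered from most to least detailed, so the first fit wins.
    for name, needed in levels:
        if needed <= heap_size:
            return name
    return None
```

Note that a report which fills the whole heap leaves nothing for the application after the reboot, so in practice some headroom is needed on top of the estimate.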

TODO:

The current user-facing reporting API is only FaultReporter::report(lambda), a synchronous API that forces sequential reading and a reboot afterwards. This is a convenient API for synchronous UART reporting, but it doesn't support sending the report in smaller chunks over an asynchronous channel like XPCC over CAN.
Added an InputIterator for reading the data however you want it.
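
The benefit of an iterator over a callback can be sketched on the host: the consumer pulls as many bytes as fit into one message and resumes later. The `chunks` helper below is hypothetical and written in Python for illustration; it is not the C++ iterator added in this PR.

```python
from itertools import islice


def chunks(report_bytes, chunk_size):
    """Yield successive `bytes` chunks from any iterable of byte values.

    The iterator keeps its position between chunks, so the caller can send
    one chunk, return to its event loop, and fetch the next chunk later.
    """
    it = iter(report_bytes)
    while True:
        chunk = bytes(islice(it, chunk_size))
        if not chunk:
            return  # iterator exhausted
        yield chunk
```

For example, `chunks(report, 8)` would yield pieces that fit the 8-byte payload of a classic CAN frame, with the last piece possibly shorter.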

cc @rleh @dergraaf @chris-durand

salkinium · 2019-05-10T18:54:01Z

The module does not implement any GPIO blink behavior anymore, since that is out of scope.
The application can override the extern "C" void HardFault_Handler itself to implement just LED blinking, or it can blink after the reboot once the report has been detected.

salkinium · 2019-05-11T01:23:34Z

I added a HeapTable input iterator, which completely hides the fault storage from the heap initializations, deduplicates some code and makes it more readable.
I will add a FaultReporter input iterator as well.

salkinium · 2019-05-11T01:26:25Z

The F469 example correctly allows heap usage after hardfault reboot and before reporting:

Can I allocate 20kB? answer: 0x2000109C
Hold Button to cause a Hardfault!

Can I allocate 20kB? answer: 0x2000272C

=== CrashCatcher === HardFault === CoreDump ===

6343030001000000170000002B00000010000... lots of data

Can I allocate 20kB? answer: 0x2000109C
Hold Button to cause a Hardfault!

You can see the first allocation, then a hardfault is triggered, the device reboots, allocates in a different spot, dumps the core, then reboots again, and can use the full heap again.

Works for all allocators transparently: newlib

Can I allocate 20kB? answer: 0xC0000008
Hold Button to cause a Hardfault!

Can I allocate 20kB? answer: 0xC0001658

=== CrashCatcher === HardFault === CoreDump ===

6343030001000000170000002B0000000000...

Can I allocate 20kB? answer: 0xC0000008
Hold Button to cause a Hardfault!

and block allocator:

Can I allocate 20kB? answer: 0xC0000004
Hold Button to cause a Hardfault!

Can I allocate 20kB? answer: 0xC0001654

=== CrashCatcher === HardFault === CoreDump ===

6343030001000000170000002B000000000000000...

Can I allocate 20kB? answer: 0xC0000004
Hold Button to cause a Hardfault!

salkinium · 2019-05-13T17:42:42Z

tools/build_script_generator/scons/site_tools/artifact.py

+		pass
+	source = str(source[0])
+	binary = os.path.splitext(source)[0]+".bin"
+	subprocess.call(env.subst("$OBJCOPY -O binary {} {}".format(source, binary)), shell=True)


@dergraaf I need the binary to compute the firmware hash, but I couldn’t figure out how to depend on env.Bin(source) properly. Any ideas?

dergraaf: In general that's what env.Depends() is for. I can look into it in more detail tomorrow, at the moment I don't have my laptop.

salkinium: Yes, I got to env.Depends too, but I just didn't understand SCons well enough to know when to call it. I tried something like env.Depends(action, "path/to/artifact.bin") in the store_artifact function but I didn't get that to work. SCons is super undocumented.

salkinium · 2019-05-13T20:38:27Z

I've added SCons support for caching the uploaded firmware ELF and binary files in the build directory, which allows you to retrieve them using the firmware CRC32 sum: scons postmortem firmware={hash}.
This way you won't have to remember what commit the firmware is running on (if you even have a commit for that!!) and the whole thing is automated so you can't forget.
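
The idea of a hash-keyed artifact cache can be sketched as below. This is not the code of the SCons tool in this PR: the function names, the directory layout and the file names are assumptions, and it uses the standard CRC-32 from Python's zlib, which only helps if the device computes its firmware hash with the same algorithm over the same bytes.

```python
import shutil
import zlib
from pathlib import Path


def firmware_hash(binary_path):
    """Return the CRC-32 of the binary image as 8 lowercase hex digits."""
    data = Path(binary_path).read_bytes()
    return "{:08x}".format(zlib.crc32(data) & 0xFFFFFFFF)


def store_artifacts(elf_path, binary_path, cache_dir):
    """Copy ELF and binary into `cache_dir/<hash>/` and return the hash."""
    key = firmware_hash(binary_path)
    target = Path(cache_dir) / key
    target.mkdir(parents=True, exist_ok=True)
    # File names inside the cache are hypothetical.
    shutil.copy2(elf_path, target / "firmware.elf")
    shutil.copy2(binary_path, target / "firmware.bin")
    return key


def find_artifacts(cache_dir, key):
    """Return the cache directory for `key`, or None if it was never stored."""
    target = Path(cache_dir) / key.lower()
    return target if target.is_dir() else None
```

Storing on every upload and looking up by the hash printed in the fault report is what removes the need to remember which commit is running on the device.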

Of course you still have to manually copy the coredump data into the coredump.txt file, but then again this is post-mortem debugging, so you're unlikely to have your computer connected to the device at the time of the fault anyway.
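
Since the copy step is manual, a quick sanity check of the pasted text can catch truncated or mangled dumps before starting the debugger. The sketch below is hypothetical and not part of this PR; the only thing it relies on is that the dumps shown in this thread are plain hex and start with the bytes 63 43.

```python
def parse_hex_dump(text):
    """Convert pasted hex text into bytes and check its first two bytes.

    Whitespace and line breaks are ignored. Raises ValueError if the text
    is not valid hex or does not start with the 63 43 prefix seen in the
    example dumps of this thread.
    """
    hexdigits = "".join(text.split())
    data = bytes.fromhex(hexdigits)  # ValueError on odd length or non-hex
    if data[:2] != b"\x63\x43":
        raise ValueError("dump does not start with the expected 63 43 prefix")
    return data
```

Running this over the pasted text before saving it as coredump.txt turns a silent copy-paste mistake into an immediate error.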

I'm quite happy with this solution, it's also nicely documented I think.

salkinium added advanced 🤯 feature 🚧 labels May 10, 2019

salkinium mentioned this pull request May 10, 2019

Adds core dump support via CrashCatcher #52

Closed

2 tasks

salkinium force-pushed the feature/fault_reporter branch 2 times, most recently from 1a539ee to 47ae3ef Compare May 11, 2019 01:21

salkinium force-pushed the feature/fault_reporter branch 3 times, most recently from 80a34fc to 64d885b Compare May 11, 2019 21:44

salkinium mentioned this pull request May 11, 2019

Fix compilation for ARM Cortex-M3 adamgreen/CrashCatcher#7

Merged

salkinium force-pushed the feature/fault_reporter branch 4 times, most recently from 1641b04 to 29d541e Compare May 13, 2019 13:40

salkinium force-pushed the feature/fault_reporter branch from 29d541e to 8e778fc Compare May 13, 2019 20:32

salkinium marked this pull request as ready for review May 13, 2019 20:32

salkinium requested a review from dergraaf May 13, 2019 20:38

salkinium force-pushed the feature/fault_reporter branch 2 times, most recently from c44241e to 54e879b Compare May 14, 2019 16:53

salkinium added 7 commits May 14, 2019 23:29

[scons] Add missing dependency on linkerscript

840ef29

[ext] Add adamgreen/CrashCatcher submodule

cf5dfde

[stm32] Move CrashCatcher stack into top of heap

a6feb37

[fault] Implement FaultStorage for CrashCatcher report

8f64983

[core] Initialize heap above fault storage

5bf76a1

[core] Add firmware image hashing

4ba7d72

[scons] Add artifact caching tool

fd97df8

salkinium added 2 commits May 14, 2019 23:29

[scons] Add post-mortem debugging with GDB

22c1119

[examples] Adapt for new fault reporter module

dfb7e34

salkinium force-pushed the feature/fault_reporter branch from 54e879b to dfb7e34 Compare May 14, 2019 21:29

salkinium merged commit dfb7e34 into modm-io:develop May 14, 2019

salkinium deleted the feature/fault_reporter branch May 14, 2019 22:15
