Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IDF monitor fails to parse coredumps from uart (IDFGH-6439) #8099

Closed
bugadani opened this issue Dec 16, 2021 · 8 comments
Closed

IDF monitor fails to parse coredumps from uart (IDFGH-6439) #8099

bugadani opened this issue Dec 16, 2021 · 8 comments
Assignees
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally

Comments

@bugadani
Copy link
Contributor

bugadani commented Dec 16, 2021

In the current implementation, it is possible for some bytes to be missing from coredump data received over UART. The branch that processed incomplete lines (referenced below) does not update the coredump processor at all:

if self._last_line_part != b'' and force_print_or_matched:
self._force_line_print = True
self.logger.print(self._last_line_part)
self.logger.handle_possible_pc_address_in_line(self._last_line_part)
check_gdb_stub_and_run(self._last_line_part)
# It is possible that the incomplete line cuts in half the PC
# address. A small buffer is kept and will be used the next time
# handle_possible_pc_address_in_line is invoked to avoid this problem.
# MATCH_PCADDR matches 10 character long addresses. Therefore, we
# keep the last 9 characters.
self.logger.pc_address_buffer = self._last_line_part[-9:]
# GDB sequence can be cut in half also. GDB sequence is 7
# characters long, therefore, we save the last 6 characters.
if gdb_helper:
gdb_helper.gdb_buffer = self._last_line_part[-6:]
self._last_line_part = b''

The conditions where this may happen:

  • Data is received that contains an incomplete line
  • AND timer flushes the serial input buffer by calling handle_serial_input with finalize_line=True

Effect: monitor dumps a bunch of base64 encoded coredump data to the console, failing to process it due to missing bytes - either the base64 becomes invalid, or the data becomes shorter than expected.

I'm experiencing the issue with esp-idf 4.3.1, but looking at the implementation, master should still be affected.

@espressif-bot espressif-bot added the Status: Opened Issue is new label Dec 16, 2021
@github-actions github-actions bot changed the title IDF monitor fails to parse coredumps from uart IDF monitor fails to parse coredumps from uart (IDFGH-6439) Dec 16, 2021
@sio13
Copy link
Contributor

sio13 commented Dec 29, 2021

Hi, @bugadani. Thank you for reporting the issue. We'll take a look soon.

bugadani added a commit to bugadani/esp-idf that referenced this issue Feb 4, 2022
This PR fixed processing coredumps that contain partially received lines.
@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Feb 8, 2022
@sio13
Copy link
Contributor

sio13 commented Feb 8, 2022

@bugadani Could you please provide more detailed steps to reproduce the issue? E.g. example of C code to be monitored in the device.

@bugadani
Copy link
Contributor Author

bugadani commented Feb 8, 2022

I can't, unfortunately, I have a pretty complicated rust project only. But let me try to provide more details, let's hope I don't get the facts terribly wrong.

Before that, one thing to note: I opened this issue for esp-idf v4.3, where I made a similar patch to the PR I closed, that solved this issue for me. Right now, I'm experiencing this issue with esp-idf v4.4. For the previous version, the patch looked like this:

diff --git a/tools/idf_monitor.py b/tools/idf_monitor.py
index 490178e53a..7f8dc1ddd1 100755
--- a/tools/idf_monitor.py
+++ b/tools/idf_monitor.py
@@ -640,8 +640,14 @@ class Monitor(object):
         if self._last_line_part != b'':
             if self._force_line_print or (finalize_line and self._line_matcher.match(self._last_line_part.decode(errors='ignore'))):
                 self._force_line_print = True
+                
+                if self._decode_coredumps != COREDUMP_DECODE_DISABLE:
+                    if self._reading_coredump == COREDUMP_READING:
+                        self._coredump_buffer += self._last_line_part.replace(b'\r', b'')
+                
                 self._print(self._last_line_part)
                 self.handle_possible_pc_address_in_line(self._last_line_part)
+                self.check_coredump_trigger_after_print(self._last_line_part)
                 self.check_gdbstub_trigger(self._last_line_part)
                 # It is possible that the incomplete line cuts in half the PC
                 # address. A small buffer is kept and will be used the next time

Symptoms

Most of the time when my program panics and outputs a coredump, esp-idf fails to parse this coredump and prints it's base64 raw representation instead. This is usually a 20-30kB dump.

There are two distinct error messages I get:

  • Invalid base64-encoded string: number of data characters (53) cannot be 1 more than a multiple of 4, the exact numbers are random. A variation of this is Error: Incorrect padding
  • and the other is a data length mismatch (I don't have the exact error message at hand).

Trying to decode this raw, printed buffer manually, in some of the cases (corresponding to error message 1), some of the lines are invalid base64, because data is missing from them. I suspect, that, in the other cases, the buffer was rejected with a data length mismatch, because in those cases, bytes were missing in such a way, that the buffer still represented a valid base64 string.

I know that data is missing, because when I debugged this issue, I suspected the C code to generate invalid buffers and I printed the raw coredump buffer in a different way, and observed, that what I printed and what esp-idf read from the serial input, were slightly different.

Possible causes

The error is probablilistic, but because of the size of my coredump, I hit it more often than not.

Examining the code paths that process coredump info, we can see that in the code branch in the issue description, data can be lost without adding it to the coredump buffer. I suspect this is the root cause of this issue.

This code is part of the handle_serial_input function, which has two call sites: it is used to handle the following events in idf_monitor.py _main_loop:

  • TAG_SERIAL
  • TAG_SERIAL_FLUSH

I believe the problematic code is hit, when a TAG_SERIAL_FLUSH event is generated, but it does not correspond to the actual last line of the serial data. I know this can happen, because, if I place a print like below, I get the following:

        elif event_tag == TAG_SERIAL_FLUSH:
            print("Flush")
            self.serial_handler.handle_serial_input(data, self.console_parser, self.coredump,
                                                    self.gdb_helper, self._line_matcher,
                                                    self.check_gdb_stub_and_run, finalize_line=True)

image

@sio13
Copy link
Contributor

sio13 commented Feb 22, 2022

Hi @bugadani, thank you for a detailed and clear explanation of the problem. You are right. The issue is caused by invoking the wrong branch of the code (TAG_SERIAL_FLUSH ). The problem is the longer time required to receive serial data. We reproduced the issue, and our temporary solution is to increase the last line time interval - see the line:

LAST_LINE_THREAD_INTERVAL = 0.1

Please, consider increasing the time interval LAST_LINE_THREAD_INTERVAL (maybe even to 1 second), and please, let me know if the temporary solution works for you as well.

@bugadani
Copy link
Contributor Author

Hi @sio13 indeed, increasing the timeout works well enough, thanks!.

@sio13
Copy link
Contributor

sio13 commented Feb 22, 2022

@bugadani, thank you for verifying the solution. We also found out that increasing the UART console baud rate using idf.py menuconfig resolved the issue. We needed to increase it to 2M. That might be more convenient than modifying the code.

@bugadani
Copy link
Contributor Author

I hope both of these solutions are treated as a workaround only :)

Well this isn't important, but modifying the code isn't an issue on my end, as I have other patches I need to apply for small fixes that aren't released, or haven't been backported yet. The baud rate increase is interesting, I need to verify if 2M works with my physical setup. It has the added benefit of making my log calls faster, as I've been able to verify that in my system, logging is actually IO bound, even using 460800.

Regardless, thank you for providing workarounds, being able to see coredumps is certainly more helpful than not seeing them. :)

@espressif-bot espressif-bot added Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: In Progress Work is in progress Resolution: NA Issue resolution is unavailable labels Feb 24, 2022
@sio13
Copy link
Contributor

sio13 commented Feb 24, 2022

@bugadani sure, the root cause was adding line breaks into the core dump data that led to issues in decoding. The fix should be merged and backported to the older versions soon. Thank you for reporting and helping with this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Done Issue is done internally Status: Done Issue is done internally
Projects
None yet
Development

No branches or pull requests

3 participants