Debugging instability of Lilygo_rtl_433_ESP #2043

Closed

puterboy opened this issue Sep 2, 2024 · 34 comments

@puterboy
Contributor

puterboy commented Sep 2, 2024

I was doing some analysis of 'uptimes' by looking through old backups of home_assistant_v2.db and extracting the times between reboots of the Lilygo_rtl_433_ESP device (running on a Lilygo LoRa ESP32).

I noticed that the device typically reboots every 10-60 hours looking back over the past 3 months.

  • There doesn't seem to be any clear pattern in the times between reboots, and there are no changes that I know of on my network or in the rtl_433 devices being monitored
  • Also, the Lilygo device has a LiPo battery backup attached, so it shouldn't be a power-failure issue.
  • It doesn't seem to be an out-of-memory issue: for example, the min/max/average memory usage reported (every 2 min) via MQTT over the past 2 months is 80KB/109KB/116KB respectively, so it seems unlikely that it suddenly runs out of memory
  • MQTT doesn't show any reboot reason, presumably because it crashes before it can publish one

Note:

  • During most of the past 3 months (other than the last couple of weeks), discovery was turned off
  • Up until about 1 week ago, I was running the dev version from March so it doesn't seem to be anything "new" (just that I never tracked reboots until now)

Any suggestions on possible causes and how to investigate?

@puterboy
Contributor Author

puterboy commented Sep 2, 2024

Attached is a graph of free memory over the past 24 hours. There were 2 reboots, at about 5:36 AM and 5:16 PM, as you can see from the temporary positive memory spikes at each reboot:
[graph: free memory over the past 24 hours]

And here is a graph showing memory usage and uptime before I upgraded and started playing with the code:
[graphs: memory usage and uptime before the upgrade]

Really hard to see an obvious memory problem unless, suddenly out of the blue, it got swamped with simultaneous messages -- note that based on the "quantum"-like pattern in memory usage, it seems like a single message uses up about 656 bytes of memory.

@1technophile
Owner

Ideally, a serial monitor with the exception decoder activated would help, along with the logs associated with the reboot.
I know that it requires having a computer connected, but for now this is the best solution I have.

@puterboy
Contributor Author

puterboy commented Sep 3, 2024

I ordered a spare Lilygo LORA, so I can hook that up to a serial monitor when it arrives.
Also, how do I activate the "exception decoder"?

@1technophile
Owner

By adding this line into your environment:

monitor_filters = esp32_exception_decoder
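
For context, a minimal sketch of where that line goes in a platformio.ini environment (the environment name, board, and monitor speed below are placeholder assumptions, not taken from this project):

```ini
; Hypothetical environment; the relevant addition is the monitor_filters line.
[env:lilygo-rtl433-debug]
platform = espressif32
board = ttgo-lora32-v21
framework = arduino
monitor_speed = 115200
monitor_filters = esp32_exception_decoder
```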

@piechade

piechade commented Sep 17, 2024

I ordered a spare Lilygo LORA, so I can hook that up to a serial monitor when it arrives. Also, how do I activate the "exception decoder"?

Any news about the issue? My device arrived today and I may have the same issue. My 5-in-1 Bresser is detected as a 6-in-1, which is also wrong. Not sure if this is the problem or if something is more generally wrong.

@puterboy
Contributor Author

By adding this line into your environment:

monitor_filters = esp32_exception_decoder

OK - I got my new board and it too reboots -- even more often, every few hours actually!!

I will recompile with monitor_filters = esp32_exception_decoder
@1technophile Am I correct in assuming that the output goes to the serial monitor (via the USB connector on the ESP32 board)?

@1technophile
Owner

Am I correct in assuming that the output goes to the serial monitor (via the USB connector on the ESP32 board)

This is correct

@Forge36

Forge36 commented Sep 23, 2024

Is yours fully rebooting? I'm finding that after the restart it connects to WiFi but has no 433 radio or MQTT (just some sort of limited web panel); a second restart works.

@puterboy
Contributor Author

It reboots automatically and fully every so many hours - highly variable - and it seems to reboot even more often on my new one...

@puterboy
Contributor Author

puterboy commented Sep 25, 2024

OK - I have been monitoring serial messages with the 'monitor_filters' exception decoder enabled for the past day and have seen the following:

  1. The error message around a crash is:
Guru Meditation Error: Core  1 panic'ed (Unhandled debug exception).
Debug exception reason: Stack canary watchpoint triggered (rtl_433_Decoder)
Core  1 register dump:
PC      : 0x4008df3b  PS      : 0x00060a36  A0      : 0x4008dd4a  A1      : 0x3ffe12f0
A2      : 0x00000000  A3      : 0x4008abb1  A4      : 0x00000000  A5      : 0x3ffe13c0
A6      : 0x0000014c  A7      : 0x0000000c  A8      : 0x00000009  A9      : 0x00000024
A10     : 0x0000005a  A11     : 0x0000002a  A12     : 0x40084f96  A13     : 0x3f442100
A14     : 0x00ff0000  A15     : 0xff000000  SAR     : 0x00000018  EXCCAUSE: 0x00000001
EXCVADDR: 0x00000000  LBEG    : 0x4008abb1  LEND    : 0x4008abc1  LCOUNT  : 0xfffffffc


Backtrace: 0x4008df38:0x3ffe12f0 0x4008dd47:0x3ffe16c0 0x400d6360:0x3ffe1780 0x400d6635:0x3ffe1870 0x400e7ec5:0x3ffe1890 0x401dab3d:0x3ffe18b0 0x401dc849:0x3ffe18e0 0x400f11ab:0x3ffe1900 0x400d67fe:0x3ffe1930 0x400d6a9e:0x3ffe1980 0x400e230e:0x3ffe19a0 0x4012bfc5:0x3ffe1de0 0x401dc58a:0x3ffe1e20 0x401200af:0x3ffe1e40 0x40120142:0x3ffe3890 0x4012d597:0x3ffe38c0 0x4012dc15:0x3ffe38e0 0x4012c300:0x3ffe3970 0x4010121f:0x3ffe39a0
  2. A couple of the crashes were missing the "Debug exception reason" line, so I'm not sure what caused them -- could it be that the error was the same but it wasn't caught by the debugger?

  3. I also see occasional error messages of the form:

bitbuffer_add_bit: Warning: row count limit (50 rows) reached

But those messages don't seem to be temporally related to the actual crashes.

Is this helpful?

@puterboy
Contributor Author

Some additional observations:
1. From Googling, it seems like this is due to a stack overflow (which was also mentioned in #1655), but it's not clear what the root cause of this stack overflow is
2. There doesn't seem to be any memory hole per se, as the free memory reported every 2 minutes never seems to dip below about 100K -- and one crash even occurred just after such a memory report showing 110K free
3. This bug may be similar to #1836, where some posters mention a similar "Stack canary watchpoint triggered (rtl_433_Decoder)" stack trace (though that bug seems to have been fixed for at least some users by removing the JSON MQTT stack, which is seemingly unrelated to rtl_433_Decoder; also, I think my earlier PR fixed the root cause of the MQTT stack issue, which was due to running out of memory, and in this case free memory seems to be fine)

@NorthernMan54
Collaborator

The rtl_433_Decoder stack size is not externally visible; try increasing it by about 1000.

@puterboy
Contributor Author

OK, I increased rtl_433_Decoder_Stack (under OOK_MODULATION) from 10000 to 11000.
Will watch and see...

Any idea what the root cause of this is?
Or is the stack truly (marginally) not big enough, so that every once in a while it overflows and crashes?
Also, is there any simple way to add a log 'warning' when the stack is full or nearly full, to help future debugging?

@NorthernMan54
Collaborator

@puterboy I had tuned the stack size based on my real-world results in an attempt to balance free memory versus not using too many resources. Within the OMG environment we have multiple modules sharing resources, so if one gets greedy, the others suffer.

@NorthernMan54
Collaborator

@puterboy PS: If you have a value that works for you, please submit a PR or issue against rtl_433_ESP, and I will try to include it in the next release

@1technophile
Owner

1technophile commented Sep 25, 2024

Also, is there any simple way to add a log 'warning' when the stack is full or nearly full, to help future debugging?

Yes, you can get the free stack at the task level by using uxTaskGetStackHighWaterMark(NULL); if you want to use this function outside of the task, replace the NULL with the task handle rtl_433_DecoderHandle.
My advice: add this to stateRFMeasures and follow the value with MQTT Explorer.
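
A minimal sketch of what that could look like (the function name, JSON key, and JsonObject usage here are illustrative assumptions, not the actual OpenMQTTGateway code):

```cpp
// Sketch only: report the decoder task's remaining stack headroom.
// rtl_433_DecoderHandle is assumed to be the FreeRTOS handle of the decoder task,
// and the JSON key name is illustrative.
#include <ArduinoJson.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

extern TaskHandle_t rtl_433_DecoderHandle;

void addDecoderStackHeadroom(JsonObject RFdata) {
  // uxTaskGetStackHighWaterMark returns the minimum free stack observed since
  // the task started (in bytes on the ESP32 port); it only ever decreases.
  UBaseType_t freeStack = uxTaskGetStackHighWaterMark(rtl_433_DecoderHandle);
  RFdata["rtl433stack"] = (unsigned long)freeStack;
}
```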

A PR will be welcome to add this indicator.

@puterboy
Contributor Author

puterboy commented Sep 26, 2024

Sure. I added the code to SYStoMQTT based on your original comment. That seemed to make sense since stack memory is similar to free memory. I also added a sensor for it in ZmqttDiscovery.ino, but that is probably overkill...
Inclusion is conditional on #ifdef ZgatewayRTL_433

I see you edited your comment and are now suggesting stateRFMeasures -- would you prefer it there?
(It's obviously easy to change :) )

@1technophile
Owner

1technophile commented Sep 26, 2024

I see you edited your comment and are now suggesting stateRFMeasures -- would you prefer it there?

Yes please, this way it would be with the other RF info

@puterboy
Contributor Author

Stack overflow is definitely at least the primary issue here.
I increased the stack size from 10K to 11K, and after a few hours the high-water mark dipped below 1000, which would have caused a crash if the size were still 10K.

I will continue to monitor to see if increasing by 1K is sufficient.

Is there any way to dynamically change the stack size if the high-water mark dips too low?
Say, by redeclaring the function?
I ask because every situation may be different. Other people may have more or different devices, resulting in different stack consumption.

@puterboy
Contributor Author

I see you edited your comment and are now suggesting stateRFMeasures -- would you prefer it there?

Yes please, this way it would be with the other RF info

I assume that if it's in this section, for consistency, there's no need for sensor discovery logic to be added.

@1technophile
Owner

I assume that if it's in this section, for consistency, there's no need for sensor discovery logic to be added.

Indeed, as we do with BT, we don't autodiscover the modules' task free stack

@1technophile
Owner

Is there any way to dynamically change the stack size if the high-water mark dips too low?
Say, by redeclaring the function?

We could delete it and recreate it with a different size but I'm not sure if this approach is relevant.
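
If someone did want to experiment with deleting and recreating the task, a rough sketch of the idea (the task function name, priority, and core pinning below are assumptions, not the actual rtl_433_ESP code):

```cpp
// Sketch only: tear down the decoder task and recreate it with a larger stack.
// Nothing on the old task's stack is preserved, and the task must not hold any
// resources (mutexes, queued items) at the moment it is deleted.
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

extern TaskHandle_t rtl_433_DecoderHandle;
extern void rtl_433_DecoderTask(void* pvParameters); // assumed task function name

void recreateDecoderTask(uint32_t newStackBytes) {
  if (rtl_433_DecoderHandle != nullptr) {
    vTaskDelete(rtl_433_DecoderHandle);
    rtl_433_DecoderHandle = nullptr;
  }
  xTaskCreatePinnedToCore(rtl_433_DecoderTask, "rtl_433_Decoder", newStackBytes,
                          nullptr, 2 /* priority, assumed */,
                          &rtl_433_DecoderHandle, 1 /* core, assumed */);
}
```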

@puterboy
Contributor Author

Comparing the two Lilygo rtl_433s I have, I noticed that the one upstairs resets more often than the one downstairs -- and correspondingly, the one upstairs is receiving signals from more 433MHz devices.

Presumably, the more devices transmitting, the more the queue grows, leading to more stack consumption if multiple signals are received in close succession...
Assuming that is correct, it would seem that there is no "right" value for the stack size, as some people may live in an isolated area with only a few of their own 433MHz devices and no neighbors, while others may live in a city with many of their own devices plus their neighbors'.

If true, then it would seem that one should implement a method to dynamically resize the stack if it dips too low. One would presumably need to copy over the stack so that nothing is lost before setting up the task again...

Otherwise, the stack size seems quite arbitrary, and one either needs to set it so large that nobody ever overflows it, in which case memory is wasted, or set it at an "average" user's size, in which case some people will get crashes.

Perhaps it would be helpful to get @NorthernMan54's input...

@puterboy
Contributor Author

Also, since arrival times of different devices are generally stochastic and statistically independent, after enough time you will have an instance where all your devices send a signal at approximately the same time... so the worst case eventually happens...

@NorthernMan54
Collaborator

@puterboy Please keep in mind that the majority of the rtl_433_ESP code base is a direct clone of rtl_433, which runs on machines with a large memory footprint, so the stack usage is high. If you look, a couple of features from rtl_433 require a huge stack (almost all the memory on an ESP32), so the current stack size is a bit of a compromise.

The code base takes each received signal, allocates a spot on the heap (not the stack), and queues it for processing through the rtl_433 code base (the 100+ device decoders). I am thinking that one of the decoders, when it sees a particular signal, is consuming the stack. The processing queue has a max size of 5, so the memory hit from a large number of signals is managed and hits the heap (so this queuing is not the source of the stack issue).
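
As a rough illustration of that pattern (the struct contents, queue name, and sizes below are placeholders, not the rtl_433_ESP source):

```cpp
// Sketch only: received signals are copied to the heap and a pointer is queued,
// so a burst of signals grows heap usage rather than the decoder task's stack.
#include <freertos/FreeRTOS.h>
#include <freertos/queue.h>
#include <cstdint>
#include <cstdlib>

struct PulseTrain {       // placeholder for the captured signal data
  int      numPulses;
  uint16_t pulses[512];
};

static QueueHandle_t signalQueue;

void setupSignalQueue() {
  // The queue holds at most 5 pending signals, mirroring the limit mentioned above.
  signalQueue = xQueueCreate(5, sizeof(PulseTrain*));
}

void onSignalReceived(const PulseTrain& captured) {
  PulseTrain* copy = (PulseTrain*)malloc(sizeof(PulseTrain)); // heap, not stack
  if (copy == nullptr) return;                                // drop if heap is exhausted
  *copy = captured;
  if (xQueueSend(signalQueue, &copy, 0) != pdTRUE) {
    free(copy); // queue full: drop the signal instead of blocking
  }
}
```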

After working recently with an ESP32-S3 with 8MB RAM, I am wondering whether running with a larger memory footprint may be beneficial. Hopefully a board with a display and an SX127x chip will be built by a manufacturer, keeping it simple like the Lilygo.

Free Heap: 165,756
Free PSRAM: 8,078,267
Free Stack: 14,612

@puterboy
Contributor Author

puterboy commented Sep 29, 2024

What you say about one of the decoders consuming the stack is perhaps consistent with the following:
After running for about 36 hours, one of my LilyGo devices suddenly dropped from 3176 bytes free on the stack to 856 bytes (this would have caused a crash had I not increased the size by 1000 bytes).
This sudden drop suggests that there is some device that issues a signal resulting in a disproportionate amount of stack usage.

@puterboy
Contributor Author

@NorthernMan54 is there any debug logging that could report how much stack space is being consumed by any given decoder, along with the actual device that ended up triggering it?
It would even be interesting to know whether the max stack consumption is being triggered by a device that gets decoded or by an unknown one that goes through all the candidate decoders and fails to get decoded.

@NorthernMan54
Collaborator

@puterboy That degree of logging does not exist within the code base. The trick would be to log the stack size between each decoder run, and then review the logs and determine which decoder was invoked that increased stack space.
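
A sketch of what that instrumentation could look like inside the decode loop (the wrapper function and decoder-array names are assumptions, not the actual rtl_433_ESP code):

```cpp
// Sketch only: log the remaining stack headroom around each decoder run so the
// greedy decoder shows up in the serial log.
#include <Arduino.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>

struct r_device;                                              // rtl_433 decoder descriptor, opaque here
extern bool runDecoder(r_device* dev, const void* bitbuffer); // assumed wrapper

void decodeWithStackLogging(r_device* decoders[], size_t count, const void* bits) {
  for (size_t i = 0; i < count; i++) {
    UBaseType_t before = uxTaskGetStackHighWaterMark(NULL);
    runDecoder(decoders[i], bits);
    UBaseType_t after = uxTaskGetStackHighWaterMark(NULL);
    // The high-water mark only ever decreases; a drop means this decoder reached
    // deeper into the stack than anything that ran before it.
    if (after < before) {
      Serial.printf("decoder %u lowered stack headroom: %u -> %u bytes\n",
                    (unsigned)i, (unsigned)before, (unsigned)after);
    }
  }
}
```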

@puterboy
Contributor Author

That's exactly what I was thinking. It seems like it could be worthwhile to add such logging, given the persistent and difficult-to-debug challenges with stack consumption.

puterboy added a commit to puterboy/rtl_433_ESP that referenced this issue Oct 6, 2024
…ck to small on Lilygo Lora device

This fixes the crash caused by OOM when the low water mark on rtl_433_Decoder_Stack drops below 0.
(See: 1technophile/OpenMQTTGateway#2043)
I increased the memory size by 1500 which after running for a week on 2 different Lilygo Lora ESP32 devices leaves the water mark at just over 1KB -- I want to leave a little spare in case there are other sensor configurations and edge cases that would dip further into the stack.

I also wrapped the definitions of `rtl_433_Decoder_stack` with `ifndef rtl_433_Decoder_Stack` so that users can easily manually tweak the allocated stack size for their own particular situations.
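
The `#ifndef` wrapping described there would look roughly like this (a sketch; the default value shown is the one used in the later PR and the build-flag example is hypothetical):

```cpp
// Only apply the default when the user has not already defined the size,
// e.g. via build_flags = -Drtl_433_Decoder_Stack=12000 in platformio.ini.
#ifndef rtl_433_Decoder_Stack
#  define rtl_433_Decoder_Stack 11500
#endif
```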
@puterboy
Contributor Author

puterboy commented Oct 6, 2024

I have been running with rtl_433_Decoder_Stack size 10000 for 10 days on 2 different Lilygo Lora ESP32 devices, and the low water mark after 10 days is just over 600 bytes. To give a little more margin of error for alternative sensor environments and/or worst cases, I created the following PR with stack size 11,500.

NorthernMan54/rtl_433_ESP#156

@puterboy
Contributor Author

puterboy commented Oct 6, 2024

The two PRs referenced above (#2081 and NorthernMan54/rtl_433_ESP#156) should close out this bug...

NorthernMan54 pushed a commit to NorthernMan54/rtl_433_ESP that referenced this issue Oct 6, 2024
…ck to small on Lilygo Lora device (#156)

This fixes the crash caused by OOM when the low water mark on rtl_433_Decoder_Stack drops below 0.
(See: 1technophile/OpenMQTTGateway#2043)
I increased the memory size by 1500 which after running for a week on 2 different Lilygo Lora ESP32 devices leaves the water mark at just over 1KB -- I want to leave a little spare in case there are other sensor configurations and edge cases that would dip further into the stack.

I also wrapped the definitions of `rtl_433_Decoder_stack` with `ifndef rtl_433_Decoder_Stack` so that users can easily manually tweak the allocated stack size for their own particular situations.
@puterboy
Contributor Author

puterboy commented Oct 6, 2024

@puterboy That degree of logging does not exist within the code base. The trick would be to log the stack size between each decoder run, and then review the logs and determine which decoder was invoked that increased stack space.

@NorthernMan54 are you planning on implementing this?
If you do, I would be happy to enable the debugging on my end and run my spare Lilygo Lora ESP32 to gather data (assuming it would be via serial report logging).

@puterboy
Contributor Author

puterboy commented Oct 6, 2024

If you accept PR #2082, it will bump rtl_433_ESP to v0.3.3, which includes my patch to signalDecoder.cpp to increase the size of rtl_433_Decoder_Stack as well as expose it as a configuration variable.

@puterboy
Contributor Author

puterboy commented Oct 7, 2024

OK - closing this now that both PRs have been accepted and the bug is solved!
