Skip to content

Conversation

@TD-er
Copy link
Member

@TD-er TD-er commented May 29, 2021

Make sure all flash strings are served by not copying it.

Mainly to deal with the memory issues reported by @ghtester and @clumsy-stefan since #3655
The main change of that PR is to serve flash strings where possible, so low free memory may be related to this change.

@TD-er
Copy link
Member Author

TD-er commented May 29, 2021

Fixing one funky bug too where pages would not be served full.
This could happen occasionally when flush() was called exactly after 400 bytes were just sent, thus sending 0 bytes, effectively terminating the chunked transfer.

Not sure if we can make the page serving much faster, as the sysinfo page now can be served over WiFi in < 140 msec.
It just feels quite a bit faster compared to how it was before.

The needed amount of free memory is now even less, so maybe we can lower the "workable" threshold below which the webserver refuses to serve pages.

No need to wrap a single char into a String object for serving it as we can also append it directly to the buffer.
@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

OK, now I will stop and get some sleep.
You know you're done if you cannot keep up clicking links in the browser tapping along with Thunderstruck on the head phones and the ESP keeps serving the pages over WiFi at a rate of 3 - 4 pages per second without breaking a sweat or getting any lower on minimal free memory while also serving an OLED display and some sensors.

@clumsy-stefan
Copy link
Contributor

I updated a few test units this morning with this PR. If you look at the graphs below, the free mem got even less than before this PR. Interestingly it seems directly related to the RSSI, but I can't see any reason why this could be...
Units 2&4 were updated this morning around 9 o clock.
image
image

Other Units didn't seem to be affected and stayed with similar free memory than before...

@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

Hmm that's really strange.. By using less memory allocations for serving pages, the amount of free memory gets less....

@ghtester
Copy link

I have also updated one ESP node with the latest custom firmware with PR 3660.
There was only 1 plugin configured with 2x DS18b20 sensors, MQTT controller and Email notification configured (notification not used).
Before (a week old custom build) the Free RAM was about 16000. With the current build / PR 3660 the Free RAM was reduced to about 13000. When I configured another plugin (Communication - TSOP4838), the Free RAM fell into about 7000 and several exceptions happened when browsing the GUI.

Boot: Exception (4)
Reset Reason: Exception
Last Action before Reboot: Background Task

@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

Yep, testing here too with the "Custom" build using the default pre_custom_esp82xx.py and it also is using quite a lot more memory than before.
So I'm testing to see what happens with the serving of flash strings. It seems like data does not get de-allocated somewhere, but I've no clue yet where.

@ghtester
Copy link

Well, the serving the web GUI is a quite fast, unfortunately it's still easy to get an Exception with pushing F5 quickly several times to refresh the web page. I would prefer a slower web page serving but a more rock stability.

@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

Well, the serving the web GUI is a quite fast, unfortunately it's still easy to get an Exception with pushing F5 quickly several times to refresh the web page. I would prefer a slower web page serving but a more rock stability.

That;s what I'm now doing.
I now try to optimize between speed when enough free memory or trying to copy safely when low on memory.

@clumsy-stefan
Copy link
Contributor

But what in the previous change (#3655) before optimizing for speed took so much memory? One of the major memory reductions was already in that PR...

@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

But what in the previous change (#3655) before optimizing for speed took so much memory? One of the major memory reductions was already in that PR...

That PR was mainly about keeping flash strings as long as possible in that format without converting them to String objects.
A flash string only needs a single pointer in memory, where a String object allocates some memory on the stack and if it needs to hold > 11 bytes it does allocate it on the heap.
So String(F("This is a longer flash string")) does need roughly 30 bytes of binary code + to allocate the full string in RAM, where F("This is a longer flash string") just is a pointer. (4 bytes)

So it does not make sense that this optimization does require extra RAM.
The only reason I can imagine is that the IP-stack or the webserver code does perform some caching, allocating new blocks based on the largest stream once sent.

I'm now experimenting with calling several flush statements, to make sure data is not pending.
This has a huge negative impact on the performance, and not really a bit improvement on memory consumption.

Still searching....

TD-er added 3 commits May 30, 2021 16:50
Large chunks of the template were copied to stream.
Now they are directly streamed to the client instead of copied first.
Page serving will be slower when low on memory, but this will improve stability when low on ram.
@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

Added a few more optimizations. The most important ones are to serve chunks of the web template instead of copying them first.

The other one is about the speed vs stability aspect of the web server.
It will now perform more checks on allocation of the needed buffer of the web server and forcing a flush when done.
This may take longer to process, but will make it more stable when running low on memory.

#ifdef USES_P092
#undef USES_P092 // DL-Bus
#endif
#ifdef USES_P093
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the same issue will apply to P076, P093 and P100, missing from their TESTING sets 🤔

I've been testing this with a wrapper #ifndef PLUGIN_SET_TESTING around those #ifdef/#undef's, though I've not yet been able to build all variations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P076 & P093 are part of the energy build.

Copy link
Contributor

@tonhuisman tonhuisman May 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, but P100 is in TESTING set B (and that one was already quite challenging to fit, but maybe the optimizations make it fit now)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P100 should get the same treatment, seems to fit now in even the (previously) most cramped builds.

@clumsy-stefan
Copy link
Contributor

Did a quick test with a build up to commit 2bb7ebd which seems to have even less free mem (units 2 & 4)

image

really strange... also could quite easily force an exception by reloading the web interface a number of times..

@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

Which plugins do those builds with low memory have, that others don't have?
Or if the build is the same over the units, what do those units have enabled that others haven't?

@TD-er
Copy link
Member Author

TD-er commented May 30, 2021

Also... how do you build it?
Using Custom.h, PlatformIO Python script or Arduino IDE?

@clumsy-stefan
Copy link
Contributor

I buld them with custom.h and ArduinoIDE on Mac OS X.

All units have different plugins enabled. the two test units 2&4 are quite different, 2 has a lot of varioous plugins/tasks (incl. display, etc.) nr 2 only has a GPIO and a BME.

But I always compare before and after updating on the same node with the same plugiins, as it seems the available memory is somehow related to RSSI of that node (bad RSSI => lower memory), it doesn't really make sense to compare 2 differeent units... I have two "empty" units sitting next to each other, and also they show diffrences in memory allocation..

@TD-er
Copy link
Member Author

TD-er commented May 31, 2021

Ah, would be nice if it can be compared with the same node moved around to test this hypothesis.

@clumsy-stefan
Copy link
Contributor

Ah, would be nice if it can be compared with the same node moved around to test this hypothesis.

Sure, I can try this, as I have some battery powered units, they hold around a day if not using deep sleep.

@TD-er
Copy link
Member Author

TD-er commented May 31, 2021

I'm now diving into optimizing memory usage.
So hopefully I can also give some relief there.

Another thing to watch for... I have seen some reports of I2C somewhat crashing, not responding etc.
On my breadboard I've seen it a few times now (without having the serial console attached :( )
But it does seem to be an issue somehow. Can you also keep an eye out for it to see if I2C stuff may become unresponsive. Power cycle seems to be the only way to fix it right now.

@clumsy-stefan
Copy link
Contributor

I had the I2C issue some time ago, but I think since a few months this didn't happen anymore. However I'm always using latest esp8266 core commits from GIT, so probably something has been fixed there...

Still I'll watch out for any issues around the connected peripherials..

@TD-er
Copy link
Member Author

TD-er commented May 31, 2021

Just pushed some debug code and also made some small improvements which should give you roughly 500 - 1000 bytes more free memory, depending on the build.

I will later have a closer look at the used buffers for logs, as that's something with room for improvement.
With serial log disabled, you should have the most free memory.

Some of the (now turned in to debug log level) logs I added are like this:

259 : Info   : After PLUGIN_INIT  task:  1   Free mem after: 21344  plugin: 3528  [ena] Display - OLED SSD1306/SH1106 Framed
261 : Info   : After PLUGIN_INIT  task:  2   Free mem after: 21280  plugin: 0     [dis] Position - GPS [TESTING]
267 : Info   : After PLUGIN_INIT  task:  3   Free mem after: 21144  plugin: 104   [ena] Generic - System Info
269 : Info   : After PLUGIN_INIT  task:  4   Free mem after: 20608  plugin: 0     [dis]
270 : Info   : After PLUGIN_INIT  task:  5   Free mem after: 20608  plugin: 0     [dis]
275 : Info   : After PLUGIN_INIT  task:  6   Free mem after: 20488  plugin: 104   [ena] Switch input - Switch
276 : Info   : After PLUGIN_INIT  task:  7   Free mem after: 20944  plugin: 0     [dis]
354 : Info   : After PLUGIN_INIT  task:  8   Free mem after: 20592  plugin: 264   [ena] Environment - DS18b20
361 : Info   : After PLUGIN_INIT  task:  9   Free mem after: 19936  plugin: 104   [ena] Generic - System Info
440 : Info   : After PLUGIN_INIT  task: 10   Free mem after: 19672  plugin: 264   [ena] Environment - DS18b20
441 : Info   : After PLUGIN_INIT  task: 11   Free mem after: 19656  plugin: 0     [dis]
442 : Info   : After PLUGIN_INIT  task: 12   Free mem after: 19656  plugin: 0     [dis] Environment - DS18b20

This allows to get some more insight in what amount of RAM some plugins/tasks may require.

Will merge this soon, as it is working fine, so I can also focus on the pulse plugin code (will also merge that one)

@TD-er TD-er merged commit 761e58b into letscontrolit:mega May 31, 2021
@TD-er TD-er deleted the bugfix/WebServer_flashStrings branch May 31, 2021 18:00
@clumsy-stefan
Copy link
Contributor

I'm sorry, I'm off for to days and can only look into it again after tomorrow. I'll test the new commits then...

@clumsy-stefan
Copy link
Contributor

I tested the latest mega commit and installed it on my units yesterday evening. Interestingly my test-units have are a little lower on free mem, all other units are higher, see charts.

The postitive thing is, that I didn't have a single reboot on any unit since installing the new version! So there's definitevely an enhancment / bug fix in it!

image

@TD-er
Copy link
Member Author

TD-er commented Jun 2, 2021

Memory usage seems more consistent too.

@clumsy-stefan
Copy link
Contributor

Still not a single reboot, that's a long time ago since I had a version that stable! Well done!!

@ghtester
Copy link

ghtester commented Jun 2, 2021

On my test node the RAM is still quite low (5000-7000 with IRRX plugin enabled) and I am able to crash the node with quickly refreshing the web pages but overall it looks more stable than previous build which sometimes generated Exception without obvious reason.

@TD-er
Copy link
Member Author

TD-er commented Jun 2, 2021

Quickly refreshing the web pages is indeed still a good receipe for a crash, especially since I had to "revert" some of the speed improvements when running low on memory.
N.B. with enough free memory, most of the speed-up improvements are still present, but I had to limit the flash transfers as the IP-stack and/or the webserver part tends to claim lots of memory if presented quickly enough.

So there is a trade-off to me made between speed and stability. Max. responsiveness is also quite stable when enough memory is available. However as soon as you get below some threshold, it immediately turns into highly unstable. So I have changed it to some compromise level.
It is however possible to serve almost any page in under 140 msec on an ESP8266 via WiFi, given you have enough free memory.
As long as you keep the number of requests per second such that on average the number of pending requests does not get > 1, the ESP will not crash.
But lower free memory does increase page load times. Sometimes even to multiple seconds if running really low on memory.

@ghtester
Copy link

ghtester commented Jun 2, 2021

Yeah, it's a pity the free RAM fell down with latest updates despite all that improvements you made in this matter but perhaps it will be enough for node to work stable. I'll keep testing with most of plugins enabled and we'll see... Thanks again for your great work!

@clumsy-stefan
Copy link
Contributor

With the latest commits from yesterday (f93e035) heavier loaded units run into a hardware watchdog every hour or so... others run stable.

Acutually the one node in question uses the Counter plugin heavily, so the latest changes around that plugin could be an issue.

@clumsy-stefan
Copy link
Contributor

Update: I did a new version where I included your "feature/build_core_3_0_0" branch. This seems to make it stable again. So probably an issue between me using the 3.0.0 git core and some changes not yet included in the mega branch for this...

@TD-er
Copy link
Member Author

TD-er commented Jun 8, 2021

The rename of the IRAM_ATTR I guess.

@clumsy-stefan
Copy link
Contributor

I already had to change that some weeks ago, as I used the latest build. but it could be that I missed something somewhere..

in any case, the latest version runs rock-stable again... no reboots until now..

@ghtester
Copy link

ghtester commented Jun 10, 2021

I would like to know how to build the custom rock-stable ESP Easy mega firmware from current sources. My ESP8266 nodes are able to survive only several hours between Exceptions... :-( And one of them sometimes can't reconnect to WiFi.

Unit Number: 11
Local Time: 2021-06-10 21:40:01
Uptime: 0 days 10 hours 52 minutes
Load: 21.07% (LC=928)
CPU Eco Mode: false
Boot: Exception (4)
Reset Reason: Exception
Last Action before Reboot: 3 timer, id: 7
SW WD count: 0

Firmware

Build:⋄ | 20114 - Mega
System Libraries:⋄ | ESP82xx Core 2843a5ac, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support
Git Build:⋄ | My Build: Jun 9 2021 12:05:13
Plugin Count:⋄ | 35
Build Origin: | Vagrant
Build Time:⋄ | Jun 9 2021 12:04:14
Binary Filename:⋄ | ESP_Easy_mega_20210609_custom_IR_ESP8266_4M1M
Build Platform:⋄ | Linux-4.15.0-51-generic-x86_64-with-glibc2.27
Git HEAD:⋄ | mega_f93e035

The most stable recent firmware in my case is this one, which is running for 18 days from Exception :

Firmware

Build:⋄ | 20113 - Mega
System Libraries:⋄ | ESP82xx Core 2843a5ac, NONOS SDK 2.2.2-dev(38a443e), LWIP: 2.1.2 PUYA support
Git Build:⋄ | My Build: May 23 2021 08:56:56
Plugin Count:⋄ | 35
Build Origin: | Self built
Build Time:⋄ | May 23 2021 08:55:56
Binary Filename:⋄ | ESP_Easy_mega_20210523_custom_IR_ESP8266_4M1M
Build Platform:⋄ | Linux-4.15.0-51-generic-x86_64-with-glibc2.27
Git HEAD:⋄ | HEAD_71c6ffa

Unit Number: 12
Local Time: 2021-06-10 22:08:26
Uptime: 18 days 3 hours 45 minutes
Load: 12.45% (LC=915)
CPU Eco Mode: false
Boot: Exception (1)
Reset Reason: Exception
Last Action before Reboot: 1 timer, id: 0
SW WD count: 0

@clumsy-stefan
Copy link
Contributor

clumsy-stefan commented Jun 11, 2021

I built it with ArduinoIDE on a Mac using the latest esp8266 core from GIT and the 3.0.0 branch from ESPeasy (see above). Important is to define -DCORE_POST_3_0_0 at compile time so the relevant definitions are included....

Good luck!

EDIT:
my current uptimes:

fhem> list .*sysinfo uptime
sonoff_4ch_01_sysinfo 2021-06-11 22:44:46    4975
sonoff_basic_08_sysinfo 2021-06-11 22:40:31    4970
sonoff_dual_01_sysinfo 2021-06-11 22:41:53    4980
sonoff_dual_02_sysinfo 2021-06-11 22:45:09    4975
sonoff_pow_01_sysinfo 2021-06-11 22:45:17    4980
sonoff_pow_02_sysinfo 2021-06-11 22:42:42    4975
sonoff_pow_03_sysinfo 2021-06-11 22:40:59    4970
sonoff_pow_04_sysinfo 2021-06-11 22:43:08    4980
wemos_mini_01_sysinfo 2021-06-11 22:45:07    4980
wemos_mini_02_sysinfo 2021-06-11 22:45:42    4985
wemos_mini_03_sysinfo 2021-06-11 22:41:17    4975
wemos_mini_04_sysinfo 2021-06-11 22:45:13    4985
wemos_mini_05_sysinfo 2021-06-11 22:45:21    5
wemos_mini_06_sysinfo 2021-06-11 22:41:49    4975
wemos_mini_07_sysinfo 2021-06-11 22:41:31    4975
wemos_mini_08_sysinfo 2021-06-11 22:37:06    0
wemos_mini_09_sysinfo 2021-06-11 22:42:03    385
wemos_mini_10_sysinfo 2021-06-11 22:44:11    4370
wemos_mini_11_sysinfo 2021-06-11 22:44:24    3585
wemos_mini_12_sysinfo 2021-06-11 22:44:11    4980
wemos_mini_13_sysinfo 2021-06-11 22:44:42    4980
wemos_mini_14_sysinfo 2021-06-11 22:41:05    4975
wemos_mini_15_sysinfo 2021-06-11 22:40:52    4975
wemos_mini_16_sysinfo 2021-06-11 22:42:01    0
wemos_mini_17_sysinfo 2021-06-11 22:41:14    4980
wemos_mini_18_sysinfo 2021-06-11 22:40:59    4980
wemos_mini_19_sysinfo 2021-06-11 22:44:31    4980
wemos_mini_tcst_01_sysinfo 2021-06-11 22:44:09    14405
wemos_pro_04_sysinfo 2021-06-11 22:44:42    4985
wemos_pro_05_sysinfo 2021-06-11 21:43:36    0
wemos_pro_06_sysinfo 2021-06-11 22:43:16    4975
wemos_pro_07_sysinfo 2021-06-11 22:13:32    4945

Note: the 0's are deep-sleep units...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants