New beta (published April 2, 2024), please add problems here #1020
Thanks for creating this issue! The current firmware (from the alpha directory and compiled from the current beta branch) is very unstable on my sensors (nodemcu v2, one with SDS011 and BME280, the other just for testing firmware without any sensor at the moment). They frequently crash with a 'Software Watchdog' or 'Exception' error. The longest uptime is less than 3 hours. The nodes do recover but show lots of gaps in the measurements. This is most likely caused by the combination of parts of #1017 and #1019. I suspect that with the added SEN5X code, memory is sometimes exhausted. Reverting #1019 results in stable firmware capable of OTA upgrades again, but without SEN5X. Reverting the relevant parts of #1017 also gives a relatively stable firmware with SEN5X, but that re-introduces the OTA problem (#915): the node completely freezes when trying to download an update and can only be restored by re-flashing. I've created a few branches to test the above-mentioned reverts: https://github.com/Phaze-III/sensors-software/tree/feature/revert-beta-sen5x-part-of-pr1019 I also looked into isolating the SEN5X code to make it possible to (de)activate the code with a compile-time define. That code (created with some diff -D magic) is available at commit d268378. We might do something like that for other sensors and perhaps provide pre-compiled binaries for different combinations of sensors. |
I have a node that has been running since yesterday evening: https://api-rrd.madavi.de:3000/grafana/d/GUaL5aZMz/pm-sensors?var-chipID=esp8266-13597771&orgId=1 |
I do see some gaps which could also be upload errors. The last one around 18:06. What's the uptime on that node? |
Okay uptime is 4 hours, 37 minutes (at 22:47 CEST). Reset reason was an exception. |
That looks like a crash around the time of the gap in the graph. My guess would be that if you regularly check the uptime, it will be a few hours at most. |
My SCD30 stopped because a small insect laid eggs... After disassembly of the SCD30, cleaning and reassembly, everything is now OK. |
Agreed. I created #1022 to do that. If needed I can also create a PR to sync the current beta-sen5x with beta so people can still use that branch to compile SEN5X firmware. |
I've also been looking at other potential stability issues. There are two situations where one might lose control of the sensor. Both happen when selecting a sensor in the configuration without actually having that type of sensor connected and most likely if there are wiring problems with the sensor.
Both situations can only be fixed by reflashing the FS with the config. All other sensors appear to have some check that the sensor can be read or otherwise don't result in a lost sensor. |
I will upload the actual version with PR #1022 to the alpha folder. |
Regarding the newer sensor types Next PM and IPS-710 I think the code needs some clean-up. I haven't had the time to check this thoroughly before merging ... |
The firmware from the alpha folder is running fine on my nodes, no crashes. I also tested some other languages than the usual ones and they are OK too. OTA upgrades from the alpha-firmware also work, they will fetch the currently published beta (or release update) and loader and restart with the downloaded firmware. So no objections from me to copy them to the beta tree :-) |
Okay. My test sensor has been running for 23 and a half hours now. |
Good news, thanks! I just did a successful OTA upgrade from the published stable firmware (NRZ-2020-133/NL (Nov 29 2020)) to beta. So hopefully more people can now help in testing the beta. |
My test device has been running for more than 43 hours now. Time to move the beta to stable? |
I've checked the size of the binaries. We may get a problem with updates in some languages where the total of the loader and the firmware is larger than the available flash memory (1MB) ... I will try the language with the largest binary now. |
Okay, Bulgarian firmware as the largest is working. So we should be ready to go live with the actual beta and move it to the stable branch. (The stable version will be named NRZ-2024-135) |
Actually I tested an OTA update of all languages yesterday :) They all went fine. Note that the spiffs fs where the binaries and config are stored is actually 3MB in size, which leaves room for both the old and new firmware and the loader. I'm still a bit worried about the RAM usage of the extra sensors (especially when new stuff is added in the future) but merging beta to stable should be OK (the few conflicts are easy to resolve). I would suggest however to first copy the NRZ-2024-135 builds to the 'alpha' directory for a first preview and a few test runs of the builds. Any build can still show worse performance, and compilation might need some tweaking. I don't expect that to be the case, but better safe than sorry before updating 10K+ sensors :-) |
For the update process both the loader and the new firmware need to be copied to the 1MB system flash. |
When I was testing the OTA-problem last year I used some extra code to list the size and contents of the spiffs partition. That shows something like this after an OTA update:
A next update will first store the new firmware binary there before flashing it to the 1MB program partition. So even with the larger binaries there still will be enough room. |
Ah, okay. So we're close to exhausting the 1M space. Is the current code actually using the described OTA Update process? |
This is the reason we use the two step update. Otherwise we would have only 0.5 MB for the firmware. And we haven't found a way to 'resize' the file system without 'killing' most of the devices. |
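The space argument in the comments above can be sketched as a small calculation. This is only a rough model, assuming the partition sizes quoted in this thread (1MB program partition, 3MB SPIFFS). The function names and the example binary sizes are hypothetical and are not taken from the firmware.

```python
MB = 1024 * 1024
PROGRAM_PARTITION = 1 * MB  # program partition size quoted in the thread
FILESYSTEM = 3 * MB         # SPIFFS size quoted in the thread


def fits_direct_update(new_firmware: int) -> bool:
    """Direct OTA: running and new firmware share the program partition,
    so the new image may use only half of it (the 0.5 MB mentioned above)."""
    return new_firmware <= PROGRAM_PARTITION // 2


def fits_two_step_update(new_firmware: int, loader: int, already_stored: int = 0) -> bool:
    """Two-step OTA (assumed model): firmware and loader are first stored in
    the file system, then the firmware is flashed to the program partition."""
    staged = already_stored + new_firmware + loader
    return staged <= FILESYSTEM and new_firmware <= PROGRAM_PARTITION


# Hypothetical sizes in bytes, for illustration only.
firmware = 700 * 1024
loader = 300 * 1024
print("direct update fits:", fits_direct_update(firmware))
print("two-step update fits:",
      fits_two_step_update(firmware, loader, already_stored=firmware))
```

In this model a binary larger than 0.5 MB only fits with the two-step process, which matches the reasoning given for keeping it.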
The master build with the actual version is uploaded to the alpha folder. |
Thanks. The firmware from the alpha folder has been running fine overnight on my test node with the same performance characteristics (sample rate, memory usage and heap fragmentation, web ui response time, upload time and working auto update) as the beta. So the 'alpha' builds are OK with one minor intl detail for the CZ version. I used a local compile of the master branch on my 'production' node with the CZ version and noticed (that is, my script to collect the statistics did) that the definitions for
Google translate suggests All in all I think we're good to go :-) |
ChaCha20 should be less CPU-intensive than AES-GCM based ciphers. Not sure about how that works out on an ESP-board. Furthermore it is a very common cipher used on strict HTTPS servers. For example on the forum there is someone using his own API-server using the following setup server-side:
Uploading over HTTPS to the custom API on that server only worked with the beta. I would keep the ChaCha20. Will it save much by disabling it? |
I will go through the ciphers list. A quick comparison: actual cipher list in beta BearSSL Basic (actual cipher list in latest published stable) |
I've seen some parts where we may be able to save memory. |
The new stable (master) version NRZ-2024-135 is now live. |
Thanks Rajko! I've seen already a bunch of sensors auto-updating and still running fine. Do you monitor the distribution of installed versions (i.e. based on uploads or firmware requests) ? |
One of the nice things about this release is that more people can now try the Power Save feature. I've been running my sensor in power save mode most of the time for almost a year now. The web interface feels a little more sluggish (response time ~200ms instead of ~30ms, but still very acceptable) but otherwise there are no real issues. In my case the power consumption is about half of the normal power consumption (~450 mW instead of ~900 mW). Now if all 10K+ sensors ran in Power Save mode, that would save the equivalent of the power consumption of 15 average Dutch households :-) |
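The household comparison can be checked with a short calculation. This sketch uses only the figures quoted in the comment above and takes "10K+" as exactly 10,000 sensors, which is an assumption. The per-household figure it prints is what those numbers imply, not an official statistic.

```python
# Figures quoted in the comment above.
NORMAL_MW = 900       # approximate normal power draw per sensor
POWER_SAVE_MW = 450   # approximate draw in Power Save mode
SENSORS = 10_000      # "10K+" sensors, taken as a round lower bound (assumption)
HOUSEHOLDS = 15       # number of households in the comparison

HOURS_PER_YEAR = 24 * 365


def yearly_saving_kwh(normal_mw: float, power_save_mw: float, sensors: int) -> float:
    """Energy saved per year, in kWh, if every sensor ran in Power Save mode."""
    saved_watts = (normal_mw - power_save_mw) / 1000 * sensors
    return saved_watts * HOURS_PER_YEAR / 1000


total_kwh = yearly_saving_kwh(NORMAL_MW, POWER_SAVE_MW, SENSORS)
per_household_kwh = total_kwh / HOUSEHOLDS
print(f"{total_kwh:.0f} kWh saved per year")
print(f"{per_household_kwh:.0f} kWh per household per year implied")
```

With these inputs the continuous saving is 4.5 kW, or about 39,420 kWh per year, which corresponds to roughly 2,628 kWh per household per year for 15 households.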
Two stations have been working fine for 24 hours with the new 135 version. One of them, running the previous beta version, had a problem: it would not update to the new one. After a re-flash with the non-beta version, the OTA update went through. |
That is expected behaviour. The Auto-Update of the previous beta (NRZ-2021-134-B4) is broken. People running NRZ-2021-134-B4 can only update with a re-flash. |
Hi, long time no see. My sensor did an update as expected during the night. I have both a DHT22 and a BME280. |
Congratulations! Mine updated automatically about 4 hours ago. |
@dokape I don't have a DHT22 sensor so can't check but did you check what happens if you only activate BME280 and let it run for a while after a save/reboot? |
Yes, was not recognized. I installed the airrohr after moving to another house 4 years ago and had no problems, except that it stopped twice. A power reset restarted the sensor and it worked fine again. The last stable was really a very stable version! Edit: |
Is it possible for you to capture the debug information just after the restart? It should show something like:
|
Ah, I think I got something. It looks like it is related to the translations. |
I can't really explain that from the code. Somehow it looks like in your case the German version detects a BMP280 instead of the actual BME280. On my sensor the German version (and all other versions) correctly detects and shows a BME280. This is what I get when selecting both sensors (but as said I don't have a DHT22 connected): Do you get both sensors again when also selecting DHT22 using the English version? |
Did you or could you try switching back to the German version again? I wouldn't be surprised that it will detect the BME280 again properly. |
Most likely some issue with registers not cleared correctly during the update. Might also have been solved with removing power for a few minutes and powering back up. Thanks for reporting! |
The power reset did not work. |
@Phaze-III You've asked for some stats: |
@ricki-z Nice statistics. Good to see that the update went/is going smoothly! |
Just to let you know, the update killed my connection to the sensor for about an hour. I was no longer able to get on the web interface and didn't receive any data via API (madavi and my own in the local network). Is there any need for debugging the issue? Or is this behaviour acceptable as it solves itself within an hour? |
It happens sometimes. When a sensor detects new firmware on the server it starts downloading the firmware (the actual .bin file, the .md5 for checking for corruption, the loader and its md5). Each of these downloads can fail, and normally the sensor then stops the update process, continues to operate normally and tries again after 24 hours. It can however also crash during the downloads, after which it will start the update process again after the reboot. Such a crash/restart/try-again cycle can happen a few times in a row. Most of the time one of those cycles will succeed. I haven't been able to pinpoint a cause, but it tends to happen under less than optimal WiFi conditions (say below -80 dBm with a lot of noise) or with older and perhaps more worn-out hardware. Another scenario is that during the first start after the update the sensor crashes with a Software Watchdog or Exception, and it might do that a few times in a row. Most of the time it also recovers after a while, but sometimes only a power reset (keep the power off for a few minutes) helps. I would say it is acceptable. |
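The download-and-verify step described above can be illustrated with a minimal sketch. It is not the firmware's actual code: the function names, the dictionary layout and the returned messages are hypothetical, and it assumes each .md5 file holds the hex digest as its first whitespace-separated token.

```python
import hashlib

RETRY_AFTER_HOURS = 24  # from the thread: a failed update is retried after 24 hours


def md5_matches(payload: bytes, md5_file_text: str) -> bool:
    """Compare a downloaded binary with the digest from its .md5 file."""
    parts = md5_file_text.split()
    if not parts:
        return False  # empty or truncated .md5 download
    return hashlib.md5(payload).hexdigest() == parts[0].lower()


def check_update(downloads: dict) -> str:
    """downloads maps 'firmware' and 'loader' to (binary, md5_text);
    either element is None when that download failed."""
    for name in ("firmware", "loader"):
        payload, md5_text = downloads.get(name, (None, None))
        if payload is None or md5_text is None:
            return f"abort: {name} download failed, retry in {RETRY_AFTER_HOURS}h"
        if not md5_matches(payload, md5_text):
            return f"abort: {name} is corrupt, retry in {RETRY_AFTER_HOURS}h"
    return "proceed"


# Usage with made-up payloads standing in for the real binaries.
fw, ld = b"firmware image", b"loader image"
ok = {
    "firmware": (fw, hashlib.md5(fw).hexdigest()),
    "loader": (ld, hashlib.md5(ld).hexdigest() + "\n"),
}
print(check_update(ok))
```

Any single failed or corrupt download aborts the whole update, so the sensor keeps running its current firmware until the next attempt.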
Sorry for the noise, overlooked #1020 (comment) |
@ricki-z Thank you for the new FW. My existing sensor is running fine after a manual update and now also shows up in your stats above. (NRZ-2024-136-B1 (Beta)) To have a backup I wanted to set up a new device. The flashing works fine with the airrohr-flasher, although I cannot initially set a WiFi password, as the new screenshots indicate. Secondly, the sensor comes up in AP mode, with Found it in some other issue: #916 (comment) Thank you. |
Yes, we have to update our instructions. For all looking for the AP mode password here, it's 'airrohrcfg'. |
Mine has developed an OTA loop. It boots, connects to WiFi, downloads the three files, goes through stage 2, reboots, displays the 'new' version number, then simply downloads it again and reinstalls it again repeatedly. The two displayed MD5s always differ. I can break the loop by disconnecting its wifi association on the AP, which fails the download, so it just runs normally. I have tried reverting back to the non-beta version EDIT: |
At the moment the SCD30 seems to stop after some time. It looks like an I2C issue; the bus speed might need to be decreased. But this may cause problems with other devices.