New beta (published April 2, 2024), please add problems here #1020

Open
ricki-z opened this issue Apr 1, 2024 · 52 comments

Comments

@ricki-z
Member

ricki-z commented Apr 1, 2024

At the moment the SCD30 seems to stop after some time. Looks like an I2C issue, the bus speed might need to be decreased. But this may cause problems with other devices.

@Phaze-III
Contributor

Thanks for creating this issue! The current firmware (from the alpha directory and compiled from the current beta branch) is very unstable on my sensors (NodeMCU v2, one with SDS011 and BME280, the other just for testing firmware without any sensor at the moment).

They frequently crash with a 'Software Watchdog' or 'Exception' error. The longest uptime is less than 3 hours. The nodes do recover but will show lots of gaps in the measurements. This is most likely caused by the combination of parts of #1017 and #1019. I suspect that with the added SEN5X code, memory is sometimes exhausted.

Reverting #1019 results in stable firmware capable of OTA-upgrades again but no SEN5X.

Reverting relevant parts of #1017 also gives a relatively stable firmware with SEN5X, but that re-introduces the OTA problem (#915): the node completely freezes when trying to download an update and can only be restored by re-flashing.

I've created a few branches to test the above mentioned reverts:

https://github.com/Phaze-III/sensors-software/tree/feature/revert-beta-sen5x-part-of-pr1019
https://github.com/Phaze-III/sensors-software/tree/feature/revert-beta-ota-workaround-part-of-pr1017

I also looked into isolating the SEN5X code to make it possible to (de)activate the code with a compile time define. That code (created with some diff -D magic) is available at commit d268378
(branch https://github.com/Phaze-III/sensors-software/tree/feature/beta-ifdef-sen5x-enclosing )

We might do something like that for other sensors and perhaps provide pre-compiled binaries for different combinations of sensors.
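
As a rough illustration of the compile-time switch idea, here is a minimal C++ sketch. The macro name `ENABLE_SEN5X` and the function `compiledSensors` are hypothetical and not taken from the branch above; the point is only that anything inside the `#if` block is left out of the binary when the flag is off.

```cpp
#include <string>
#include <vector>

// Hypothetical feature flag: build with -DENABLE_SEN5X=1 to compile the
// SEN5X support in. The default build leaves it out.
#ifndef ENABLE_SEN5X
#define ENABLE_SEN5X 0
#endif

// Names of the optional sensor drivers that were compiled into this binary.
std::vector<std::string> compiledSensors() {
    std::vector<std::string> sensors;
#if ENABLE_SEN5X
    // Only present in the binary when the flag is set at compile time.
    sensors.push_back("SEN5X");
#endif
    return sensors;
}
```

The same pattern could be repeated per sensor type, which is what would make pre-compiled binaries for different sensor combinations possible.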

@ricki-z
Member Author

ricki-z commented Apr 2, 2024

I have a node that has been running since yesterday evening: https://api-rrd.madavi.de:3000/grafana/d/GUaL5aZMz/pm-sensors?var-chipID=esp8266-13597771&orgId=1
This node seems to run without the described problems...

@Phaze-III
Contributor

I do see some gaps which could also be upload errors. The last one around 18:06. What's the uptime on that node?

@ricki-z
Member Author

ricki-z commented Apr 2, 2024

Okay uptime is 4 hours, 37 minutes (at 22:47 CEST). Reset reason was an exception.
I think we should remove the SEN5x code for the moment. Priority should be a stable version for the new certs.

@Phaze-III
Contributor

That looks like a crash around the time of the gap in the graph. My guess would be that if you regularly check the uptime, it will be a few hours at most.

@jmparatte

> At the moment the SCD30 seems to stop after some time. Looks like an I2C issue, the bus speed might need to be decreased. But this may cause problems with other devices.

My SCD30 stopped because a small insect laid eggs... After disassembly of the SCD30, cleaning and reassembly, everything is now OK.

@Phaze-III
Contributor

> I think we should remove the SEN5x code for the moment. Priority should be a stable version for the new certs.

Agreed. I created #1022 to do that. If needed I can also create a PR to sync the current beta-sen5x with beta so people can still use that branch to compile SEN5X firmware.

@Phaze-III
Contributor

I've also been looking at other potential stability issues. There are two situations where one might lose control of the sensor. Both happen when selecting a sensor in the configuration without actually having that type of sensor connected, and most likely also if there are wiring problems with the sensor.

  1. Select 'Tera Sensor Next PM' and save/restart. After restarting, the sensor goes into a really fast "Wait for Serial" loop, and getting into the web interface to fix it is no longer possible.

  2. Select 'Piera Systems IPS-7100' and save/restart. After restarting the firmware immediately crashes and goes into a reboot/crash loop.

Both situations can only be fixed by reflashing the FS with the config.

All other sensors appear to have some check that the sensor can be read or otherwise don't result in a lost sensor.
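
One way to avoid such lockups is to bound the number of probe attempts and fall back to "sensor not found", so the rest of the firmware (including the web interface) keeps running. A minimal sketch, assuming a hypothetical `probe` callback that returns true once the sensor answers; this is not the firmware's actual driver API:

```cpp
#include <functional>

// Result of trying to bring up a configured sensor.
enum class ProbeResult { Ready, NotFound };

// Calls `probe` up to `maxAttempts` times and then gives up, so a missing
// or miswired sensor cannot block the boot sequence forever.
ProbeResult probeSensor(const std::function<bool()>& probe, int maxAttempts) {
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (probe()) {
            return ProbeResult::Ready;
        }
    }
    return ProbeResult::NotFound;
}
```

On `NotFound` the caller could skip that sensor for the current run instead of waiting or crashing.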

@ricki-z
Member Author

ricki-z commented Apr 3, 2024

I will upload the current version with PR #1022 to the alpha folder.
If this is working I will copy it to the beta folder tomorrow.

@ricki-z
Member Author

ricki-z commented Apr 3, 2024

Regarding the newer sensor types Next PM and IPS-7100, I think the code needs some clean-up. I haven't had the time to check this thoroughly before merging ...

@Phaze-III
Contributor

> If this is working I will copy it to the beta tomorrow.

The firmware from the alpha folder is running fine on my nodes, no crashes. I also tested some other languages than the usual ones and they are OK too.

OTA upgrades from the alpha-firmware also work, they will fetch the currently published beta (or release update) and loader and restart with the downloaded firmware.

So no objections from me to copy them to the beta tree :-)

@ricki-z
Member Author

ricki-z commented Apr 4, 2024

Okay. My test sensor has been running for 23 and a half hours now.
The alpha will become the new beta in a few minutes.

@Phaze-III
Contributor

Good news, thanks! I just did a successful OTA upgrade from the published stable firmware (NRZ-2020-133/NL (Nov 29 2020)) to beta. So hopefully more people can now help in testing the beta.

@ricki-z
Member Author

ricki-z commented Apr 5, 2024

My test device has been running for more than 43 hours now. Time to move the beta to stable?
I will then make some cosmetic changes and increase the version numbers before publishing the stable version.

@ricki-z
Member Author

ricki-z commented Apr 6, 2024

I've checked the size of the binaries. We may get a problem with updates in some languages where the total of the loader and the firmware is larger than the available flash memory (1 MB) ... I will try the language with the largest binary now.
For future firmware releases we may need to 'optimize' the firmware again (e.g. by removing some of the ciphers again; do we really need ChaCha20?).

@ricki-z
Member Author

ricki-z commented Apr 6, 2024

Okay, the Bulgarian firmware, as the largest, is working. So we should be ready to go live with the current beta and move it to the stable branch. (The stable version will be named NRZ-2024-135)

@Phaze-III
Contributor

Phaze-III commented Apr 6, 2024

Actually I tested an OTA update of all languages yesterday :) They all went fine. Note that the spiffs fs size where the binaries and config are stored is actually 3MB which leaves room for both the old and new firmware and the loader.

I'm still a bit worried about the RAM usage of the extra sensors (especially when new stuff is added in the future) but merging beta to stable should be OK (the few conflicts are easy to resolve).

I would suggest however to first copy the NRZ-2024-135 builds to the 'alpha' directory for a first preview and a few test runs of the builds. Any build can still show worse performance, and compilation might need some tweaking. I don't expect that to be the case, but better safe than sorry before updating 10K+ sensors :-)

@ricki-z
Member Author

ricki-z commented Apr 6, 2024

For the update process both the loader and the new firmware need to be copied to the 1 MB system flash.
The memory usage during the update process is described here https://arduino-esp8266.readthedocs.io/en/latest/ota_updates/readme.html#update-process-memory-view

@Phaze-III
Contributor

When I was testing the OTA-problem last year I used some extra code to list the size and contents of the spiffs partition. That shows something like this after an OTA update:

airRohr: NRZ-2023-134-B5/EN
mounting FS...
opened config file...
parsed json...
File system info:
Total space:      2949250 bytes
Total space used: 950537 bytes
Block size:       8192 bytes
Page size:        2949250 bytes
Max open files:   5
Max path lenght:  32

Files found:
/loader.bin - 311632
/firmware.old - 626736
/config.json.old - 1510
/config.json - 1510

The next update will first store the new firmware binary there before flashing it to the 1MB program partition. So even with the larger binaries there will still be enough room.
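
The headroom can be checked with a little arithmetic on the numbers from the listing above. This is only a sketch: the constant and function names are made up for illustration, and it conservatively assumes that a new firmware image and a new loader are stored next to the files that are already there.

```cpp
#include <cstdint>

// Values copied from the file system listing above.
constexpr std::uint32_t kFsTotalBytes = 2949250;
constexpr std::uint32_t kFsUsedBytes  = 950537;

// Free space left on the file system after the last update.
constexpr std::uint32_t fsFreeBytes() {
    return kFsTotalBytes - kFsUsedBytes;
}

// True if a freshly downloaded firmware image and loader of the given
// sizes fit into the remaining free space.
constexpr bool downloadFits(std::uint32_t firmwareBytes, std::uint32_t loaderBytes) {
    return firmwareBytes + loaderBytes <= fsFreeBytes();
}
```

With 1998713 bytes free, another pair the size of the listed files (626736 + 311632 bytes) fits comfortably.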

@Phaze-III
Contributor

> For the update process both the loader and the new firmware need to be copied to the 1 MB system flash.
> The memory usage during the update process is described here https://arduino-esp8266.readthedocs.io/en/latest/ota_updates/readme.html#update-process-memory-view

Ah, okay. So we're close to exhausting the 1M space. Is the current code actually using the described OTA Update process?

@ricki-z
Member Author

ricki-z commented Apr 6, 2024

This is the reason we use the two step update. Otherwise we would have only 0.5 MB for the firmware. And we haven't found a way to 'resize' the file system without 'killing' most of the devices.
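
The difference between the two approaches can be sketched as two size checks. This is an illustration under assumptions, not the firmware's code: the program area is taken as exactly 1024 * 1024 bytes, the single-step rule follows the linked Arduino OTA description (running image and new image share the program area), and the two-step rule follows the description in this thread (loader plus new firmware must fit). All names are hypothetical.

```cpp
#include <cstdint>

// Assumption: the "1 MB" program area is taken as 1024 * 1024 bytes.
constexpr std::uint32_t kProgramBytes = 1024u * 1024u;

// Single-step OTA: the running image and the new image are in the program
// area at the same time, so both have to fit together.
constexpr bool fitsSingleStep(std::uint32_t runningBytes, std::uint32_t newBytes) {
    return runningBytes + newBytes <= kProgramBytes;
}

// Two-step update as described in this thread: the loader and the new
// firmware have to fit into the program area together.
constexpr bool fitsTwoStep(std::uint32_t loaderBytes, std::uint32_t newBytes) {
    return loaderBytes + newBytes <= kProgramBytes;
}
```

For a firmware of 626736 bytes and a loader of 311632 bytes (the sizes from the file listing earlier in the thread), the single-step check fails while the two-step check passes.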

@ricki-z
Member Author

ricki-z commented Apr 6, 2024

The master build with the current version is uploaded to the alpha folder.

@Phaze-III
Contributor

Thanks. The firmware from the alpha folder has been running fine overnight on my test node with the same performance characteristics (sample rate, memory usage and heap fragmentation, web ui response time, upload time and working auto update) as the beta. So the 'alpha' builds are OK with one minor intl detail for the CZ version.

I used a local compile of the master branch on my 'production' node with the CZ version and noticed (that is, my script to collect the statistics did) that the definitions for INTL_NUMBER_OF_MEASUREMENTS and INTL_TIME_SENDING_MS are now identical:

intl_cz.h 
#define INTL_NUMBER_OF_MEASUREMENTS "Počet měření"
#define INTL_TIME_SENDING_MS "Počet měření"

Google translate suggests #define INTL_TIME_SENDING_MS "Trvání odesílání dat" . Perhaps you could change that before publishing.
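
A check like this can be automated. The following is a minimal sketch of such a check and not the statistics script mentioned above: the function name `findDuplicateTranslations` is hypothetical, and it assumes each translation is a single-line `#define INTL_... "..."` entry.

```cpp
#include <map>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Returns groups of INTL_* macro names that share the same string value.
// `header` is the text of a translation header such as intl_cz.h.
std::vector<std::vector<std::string>> findDuplicateTranslations(const std::string& header) {
    static const std::regex defineLine(R"rx(^\s*#define\s+(INTL_\w+)\s+"(.*)"\s*$)rx");
    std::map<std::string, std::vector<std::string>> namesByValue;

    std::istringstream in(header);
    std::string line;
    while (std::getline(in, line)) {
        std::smatch m;
        if (std::regex_match(line, m, defineLine)) {
            // Group macro names by their translated text.
            namesByValue[m[2].str()].push_back(m[1].str());
        }
    }

    std::vector<std::vector<std::string>> duplicates;
    for (const auto& entry : namesByValue) {
        if (entry.second.size() > 1) {
            duplicates.push_back(entry.second);
        }
    }
    return duplicates;
}
```

Run over the two lines quoted above, it would report `INTL_NUMBER_OF_MEASUREMENTS` and `INTL_TIME_SENDING_MS` as one group. Identical texts are not always a mistake, so the result is a list of candidates to review rather than a list of errors.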

All in all I think we're good to go :-)

@Phaze-III
Contributor

Phaze-III commented Apr 7, 2024

> For future firmware releases we may need to 'optimize' the firmware again (e.g. by removing some of the ciphers again; do we really need ChaCha20?).

ChaCha20 should be less CPU-intensive than AES-GCM based ciphers. Not sure about how that works out on an ESP-board. Furthermore it is a very common cipher used on strict HTTPS servers. For example on the forum there is someone using his own API-server using the following setup server-side:

PORT    STATE SERVICE
443/tcp open  https
| ssl-enum-ciphers: 
|   TLSv1.2: 
|     ciphers: 
|       TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (secp256r1) - A
|       TLS_DHE_RSA_WITH_AES_256_GCM_SHA384 (dh 2048) - A
|     compressors: 
|       NULL
|     cipher preference: server
|   TLSv1.3: 
|     ciphers: 
|       TLS_AKE_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
|       TLS_AKE_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
|       TLS_AKE_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
|     cipher preference: server
|_  least strength: A

Uploading over HTTPS to the custom API on that server only worked with the beta. I would keep ChaCha20. Would disabling it save much?

@ricki-z
Member Author

ricki-z commented Apr 7, 2024

I will go through the ciphers list.

A quick comparison:

current cipher list in beta
RAM: [==== ] 41.8% (used 34220 bytes from 81920 bytes)
Flash: [======= ] 67.1% (used 701191 bytes from 1044464 bytes)

BearSSL Basic (current cipher list in latest published stable)
RAM: [==== ] 41.8% (used 34220 bytes from 81920 bytes)
Flash: [======= ] 65.1% (used 680247 bytes from 1044464 bytes)
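
To put the two builds side by side, the difference can be worked out directly from the numbers above. This is plain arithmetic on the reported sizes; the constant and function names are only illustrative.

```cpp
#include <cstdint>

// Sizes in bytes, copied from the two build reports above.
constexpr std::uint64_t kFlashTotal = 1044464;
constexpr std::uint64_t kFlashBeta  = 701191;  // current beta cipher list
constexpr std::uint64_t kFlashBasic = 680247;  // BearSSL Basic cipher list

// Flash cost of the extended cipher list.
constexpr std::uint64_t extraCipherBytes() {
    return kFlashBeta - kFlashBasic;
}

// Flash usage in tenths of a percent, rounded down.
constexpr std::uint64_t usagePerMille(std::uint64_t usedBytes) {
    return usedBytes * 1000 / kFlashTotal;
}
```

The extended cipher list costs 20944 bytes of flash, about 2% of the available program space, while the reported RAM usage is identical in both builds.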

@ricki-z
Member Author

ricki-z commented Apr 7, 2024

I've seen some parts where we may be able to save memory,
for example the display part or the newly added sensors (NextPM, IPS-7100).

@ricki-z
Member Author

ricki-z commented Apr 7, 2024

The new stable (master) version NRZ-2024-135 is now live.

@Phaze-III
Contributor

Thanks Rajko! I've seen already a bunch of sensors auto-updating and still running fine. Do you monitor the distribution of installed versions (i.e. based on uploads or firmware requests) ?

@Phaze-III
Contributor

Phaze-III commented Apr 8, 2024

One of the nice things about this release is that now more people can try the Power Save feature. I've been running my sensor in power save mode most of the time for almost a year now. The web-interface feels a little bit more sluggish (response time ~200ms instead of ~30ms but still very acceptable) but otherwise no real issues.

In my case the power consumption is almost half of the normal power consumption (~450 mW instead of ~900 mW):

Screenshot 2024-04-07 at 22 18 07

Now if all 10K+ sensors would run in Power Save mode that would save the equivalent of the power consumption of 15 average Dutch households :-)
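
The estimate can be reproduced with a short calculation. It assumes exactly 10,000 sensors running all year (8760 hours) and uses the 450 mW saving measured above; the per-household figure is what the "15 households" comparison implies, not a number taken from a statistics source.

```cpp
#include <cstdint>

constexpr std::int64_t kNormalMilliwatts    = 900;    // measured, normal mode
constexpr std::int64_t kPowerSaveMilliwatts = 450;    // measured, power save mode
constexpr std::int64_t kSensorCount         = 10000;  // assumption for "10K+"
constexpr std::int64_t kHoursPerYear        = 8760;
constexpr std::int64_t kHouseholds          = 15;

// Energy saved by all sensors together, in kWh per year.
// mW * h = mWh, and 1 kWh = 1,000,000 mWh.
constexpr std::int64_t savedKwhPerYear() {
    return (kNormalMilliwatts - kPowerSaveMilliwatts) * kSensorCount * kHoursPerYear / 1000000;
}

// Yearly consumption per household implied by the comparison above.
constexpr std::int64_t impliedKwhPerHousehold() {
    return savedKwhPerYear() / kHouseholds;
}
```

That gives 39420 kWh saved per year, or 2628 kWh per household for 15 households.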

@Petarkir2000
Contributor

Two stations have been working fine for 24 hours with the new 135 version. The problem occurred on one of them with the previous beta version: it didn’t want to update to the new one. After a re-flash with the non-beta version, the OTA update was done.

@Phaze-III
Contributor

> The problem occurred on one of them with previous beta version- didn’t want to update to the new one.

That is expected behaviour. The Auto-Update of the previous beta (NRZ-2021-134-B4) is broken. People running NRZ-2021-134-B4 can only update with a re-flash.

@dokape
Contributor

dokape commented Apr 8, 2024

Hi, long time no see,

My sensor did an update as expected during the night.
Checked now and have an issue.

I have both a DHT22 and a BME280.
After the update, the values of the BME280 are not shown anymore, just those of the DHT22.
Rebooting, and deactivating and reactivating the sensor, had no success.

@jmparatte

jmparatte commented Apr 8, 2024

Congratulations! Mine updated automatically about 4 hours ago,
OpenSenseMap is now reactivated.

@Phaze-III
Contributor

@dokape I don't have a DHT22 sensor so can't check but did you check what happens if you only activate BME280 and let it run for a while after a save/reboot?

@dokape
Contributor

dokape commented Apr 8, 2024

Yes, was not recognized.
The log is maximum loglevel, no DHT22, Just BME280.

airrohr, FW 134.txt
Screenshot 2024-04-08 140017

I installed the airrohr after moving to another house 4 years ago and had no problems, except that it stopped 2 times. A power reset restarted the sensor and it worked fine again. The last stable was really a very stable version!

Edit:
Power Reset did not change the behavior. BME values are gone.

@Phaze-III
Contributor

Is it possible for you to capture the debug information just after the restart? It should show something like:

Read SDS...: 
Stopping SDS011...
Read BMx280...
Trying BMx280 sensor on 77 ... not found
Trying BMx280 sensor on 76 ... found
Send to :
sensor.community
Madavi.de
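
The detection order in that output (address 77 first, then 76) can be pictured as a simple loop over candidate I2C addresses. This is a sketch only: `respondsAt` is a hypothetical callback standing in for the real bus access, and the actual firmware code may be structured differently.

```cpp
#include <cstdint>
#include <functional>

// Tries the two BMx280 I2C addresses in the order shown in the log output
// and returns the first one that answers, or -1 if neither does.
int findBmx280(const std::function<bool(std::uint8_t)>& respondsAt) {
    const std::uint8_t candidates[] = {0x77, 0x76};
    for (std::uint8_t address : candidates) {
        if (respondsAt(address)) {
            return address;
        }
    }
    return -1;
}
```

A result of -1 would correspond to "not found" on both addresses, which is the case worth capturing in the debug log.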

@dokape
Contributor

dokape commented Apr 8, 2024

Ah, I think I got something. It looks related to the translations.
I'm investigating.

@dokape
Contributor

dokape commented Apr 8, 2024

Translation issue.

Different screens in German and English.
Different values: German shows BMP, English shows BME, with a different number of values.

german-BME-missing

english-morre values

english-BME_works

@Phaze-III
Contributor

I can't really explain that from the code. Somehow it looks like in your case the German version detects a BMP280 instead of the actual BME280. On my sensor the German (and all other versions) correctly detect and show a BME280 . This is what I get when selecting both sensors (but as said I don't have a DHT22 connected):

Screenshot 2024-04-08 at 14 40 40

Do you get both sensors again when also selecting DHT22 using the English version?

@dokape
Contributor

dokape commented Apr 8, 2024

English: DHT and BME work fine together.

This sensor hardware is about 6 to 8 years old. It has worked fine so far. Only the DHT22 has shown wrong humidity for some years.
Wasn't there an issue about some weird IDs for the BME some years ago?

engl-BME-DHT_works

@Phaze-III
Contributor

Did you or could you try switching back to the German version again? I wouldn't be surprised that it will detect the BME280 again properly.

@dokape
Contributor

dokape commented Apr 8, 2024

switched back to German and …

You will surely not be surprised:

It works fine

I know, I am the one to find strange behaviours. As always.

I guess it would be no fun to reproduce this behaviour.

IMG_6596

@Phaze-III
Contributor

Most likely some issue with registers not cleared correctly during the update. Might also have been solved with removing power for a few minutes and powering back up. Thanks for reporting!

@dokape
Contributor

dokape commented Apr 8, 2024

The power reset did not work.
thanks for reading!

@ricki-z
Member Author

ricki-z commented Apr 8, 2024

@Phaze-III You've asked for some stats:
https://stats.sensor.community/scripts/active_sensors.php (sensors active in the last 5 minutes, was around 11,650 devices yesterday before publishing the update)
https://stats.sensor.community/scripts/firmware_versions.php (installed firmware versions, we are mostly done ;-) )

@Phaze-III
Contributor

Phaze-III commented Apr 8, 2024

@ricki-z Nice statistics. Good to see that the update went/is going smoothly!

@issteve

issteve commented Apr 9, 2024

Just to let you know, the update killed my connection to the sensor for about an hour. I was no longer able to get on the web interface and didn't receive any data via API (madavi and my own in the local network).
It was running fine for a couple of years before the update, with a DHT22 and a SHT3X (and an SDS011 that is connected in hardware but not activated in the menu). And while my other sensors got back online within about 15 minutes, this one took about an hour...

Is there any need for debugging the issue? Or is this behaviour acceptable as it solves itself within an hour?

@Phaze-III
Contributor

It happens sometimes. When a sensor detects new firmware on the server it starts downloading the firmware (the actual .bin file, the .md5 for checking for corruption, the loader and md5). Each of these downloads can fail and normally the sensor stops the update process, continues to operate normally and tries again after 24 hours.

It can however also crash during the downloads after which it will start the update process again after reboot. It can happen that such a crash/restart/try-again cycle happens a few times in a row. Most of the time one of those cycles will succeed. I haven't been able to pinpoint a cause but it tends to happen under less than optimal WiFi conditions (say below -80 dB with a lot of noise) or with older and perhaps more worn out hardware.

Another scenario is that during the first start after the update the sensor crashes with a Software Watchdog or Exception and might do that a few times in a row. Most of the time it can also recover after a while, but sometimes only a power reset (keep the power off for a few minutes) helps.

I would say it is acceptable.
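
The normal (non-crashing) path described above can be summarised in a few lines. This is a simplified sketch, not the firmware's update code: the file names are placeholders, `download` is a hypothetical callback, and the crash/restart cycles are deliberately left out.

```cpp
#include <functional>
#include <string>
#include <vector>

// Outcome of one update attempt.
struct UpdateAttempt {
    bool readyToFlash;         // true if every file was fetched
    unsigned retryAfterHours;  // 0 when ready, otherwise the wait before the next try
};

// Fetches the firmware, its MD5, the loader and the loader's MD5 in turn.
// If any download fails, the attempt is abandoned and retried a day later.
UpdateAttempt tryDownloadUpdate(const std::function<bool(const std::string&)>& download) {
    // Placeholder names, not the ones used by the real firmware.
    const std::vector<std::string> files = {
        "firmware.bin", "firmware.bin.md5", "loader.bin", "loader.bin.md5"};
    for (const std::string& file : files) {
        if (!download(file)) {
            return {false, 24};
        }
    }
    return {true, 0};
}
```

With four separate downloads per attempt, a weak WiFi link has four chances to interrupt the process, which fits the observation that poor reception makes failed attempts more likely.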

@GoetzGoerisch

GoetzGoerisch commented Apr 17, 2024

My sensor still did not get the OTA. (NRZ-2021-134-B4/DE)
Not listed in the two statistics published last week, but with up-to-date publishing on sensor.community.

Sorry for the noise, overlooked #1020 (comment)

@GoetzGoerisch

GoetzGoerisch commented Apr 18, 2024

@ricki-z Thank you for the new FW, existing sensor running fine after a manual update, now also shown in your stats above. (NRZ-2024-136-B1 (Beta))

To have a backup I wanted to set up a new device. The flashing works fine with the airrohr-flasher. However, I cannot initially set a WiFi password, as the new screenshots indicate.

Secondly, the sensor comes up in AP mode as airRohr-<ID>, but I cannot access the WiFi, as it asks for a password which I never set.

Found it in some other issue: #916 (comment)

Thank you.

@ricki-z
Member Author

ricki-z commented Apr 22, 2024

Yes, we have to update our instructions. For all looking for the AP mode password here, it's 'airrohrcfg'.
We needed to update the firmware faster than planned (certificate issues).

@whiskerp

whiskerp commented May 21, 2024

Mine has developed an OTA loop. It boots, connects to WiFi, downloads the three files, goes through stage 2, reboots, displays the 'new' version number, then simply downloads it again and reinstalls it again repeatedly. The two displayed MD5s always differ.

I can break the loop by disconnecting its wifi association on the AP, which fails the download, so it just runs normally. I have tried reverting back to the non-beta version NRZ-2024-135, but that now also OTA loops. This only started happening on Sunday 18th May. It boots and runs fine if I disable auto-update.

EDIT:
In the end I completely erased the flash and manually flashed it via the USB port. It now boots correctly. There must be something mangling the installed image MD5 which gets checked but not reset by a new flash image.
