Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Understanding SettingsLoad() logic #5054

Closed
5 tasks
hhaim opened this issue Jan 28, 2019 · 34 comments
Closed
5 tasks

Understanding SettingsLoad() logic #5054

hhaim opened this issue Jan 28, 2019 · 34 comments
Labels
enhancement Type - Enhancement that will be worked on fixed Result - The work on the issue has ended

Comments

@hhaim
Copy link

hhaim commented Jan 28, 2019

IMPORTANT NOTICE
If you do not complete the template below it is likely that your issue will not be addressed. When providing information about your issue please be as extensive as possible so that it can be solved by as little as possible responses.

FAILURE TO COMPLETE THE REQUESTED INFORMATION WILL RESULT IN YOUR ISSUE BEING CLOSED

Make sure these boxes are checked [x] before submitting your issue - Thank you!

source understanding issue
I'm trying to make the CRC issue more resilient but not sure I follow the current logic of the code
In case of default mode "(rotate 0 = Save in next flash slot)"

This is what I expect from this function:
To read the Setting from the sector that match the following

  1. Highest Settings.save_flag (most recent)
  2. Settings.CRC is valid
  3. Settings.cfg_holder == CFG_HOLDER

However It does not do that instead

    if ( (Settings.cfg_holder != _SettingsH.cfg_holder) || (Settings.save_flag > _SettingsH.save_flag)) {

I would expect it to be

    if ( ((Settings.cfg_holder ==CFG_HOLDER) && (Settings.save_flag > _SettingsH.save_flag))) {

This is the more resilient CRC code

// try to find the base valid CRC, with right HDR with higher save_flag
int FindBestSector(void){

  settings_location = SETTINGS_LOCATION +1;
  int last_index = -1;
  int last_flags = -1;

  for (byte i = 0; i < CFG_ROTATES; i++) {
    settings_location--;
    ESP.flashRead(settings_location * SPI_FLASH_SEC_SIZE, (uint32_t*)&Settings, sizeof(SYSCFG));

    bool bad_crc = false;
    if (Settings.version > 0x06000000) { 
        bad_crc = (Settings.cfg_crc != GetSettingsCrc()); 
    }
    if ( (bad_crc==false) && (Settings.cfg_holder == (uint16_t)CFG_HOLDER) ) {
        if (Settings.save_flag > last_flags ) {
            last_flags = Settings.save_flag;
            last_index = settings_location;
        }
    }
    delay(1);
  }

  if (last_index  >=0) {
      ESP.flashRead(last_index * SPI_FLASH_SEC_SIZE, (uint32_t*)&Settings, sizeof(SYSCFG));
  }

  return (last_index);
}
void SettingsLoad(void)
{
..
// replace this line
  if (bad_crc || (Settings.cfg_holder != (uint16_t)CFG_HOLDER)) {  SettingsDefault(); }

// with this line 
  if (bad_crc || (Settings.cfg_holder != (uint16_t)CFG_HOLDER)) {
      if (FindBestSector()  < 0) {
          SettingsDefault();
      }
  }


(Please, remember to close the issue when the problem has been addressed)

@ascillato
Copy link
Contributor

Just for reference, this is coming from the discussion on #5032
I have a different opinion about this but anyway it is for @arendst that @hhaim aimed this question.

@ascillato2 ascillato2 added question Type - Asking for Information Awaiting feedback from Project Owner Awaiting feedback / input from Project Owner labels Jan 28, 2019
@andrethomas
Copy link
Contributor

andrethomas commented Jan 28, 2019

This kind of reminds me of Windows' "Last known configuration"

I think I have two concerns (for now)

  1. How to be sure that the last flash page with the highest flag is indeed valid because if an OTA was just performed default values in the SPI Flash will persist so you may end up using a flash page thinking is valid when it is in fact not.
  2. If a write error occurs it could be because of a power outage at the exact moment of writing a new flash page it may be useful but it may as well also be because a hardware watchdog exception occurred (which could be more likely than a power failure) you would not want to use any of the flash pages if one of them has proven to be corrupted.

@hhaim
Copy link
Author

hhaim commented Jan 28, 2019 via email

@Jason2866
Copy link
Collaborator

Feel free to fork and use your own code. This is the nice thing of Open Source
If there are different points of understanding everyone can go his way.
It is up only to Theo Arendst it is his project.

@andrethomas
Copy link
Contributor

There is no intention to protect OTA under power failure (OTA is rare relative to flash updates)

So are CRC failures - if you're getting a lot of these its probably because you have bad spi flash, or your spi flash is end of life :)

@hhaim
Copy link
Author

hhaim commented Jan 28, 2019 via email

@andrethomas
Copy link
Contributor

I think you missed my point,

Yes, the flash write is after a setting has changed but if the watchdog resets due to flash failure then I definately do not want to use any of the flash blocks whether they can or cannot be verified to be valid in one way or another.

@hhaim
Copy link
Author

hhaim commented Jan 28, 2019 via email

@andrethomas
Copy link
Contributor

Could you say why NOT to protect this case?

I don’t think it is a good excuse not to use this protection beacuse there are cases that can’t be protected. We should do our best.

Its not an excuse - its a reason. There is a big difference between knowing with absolute certainty that a previously written flash page is good compared to restore it from a known good initial configuration.

But hey, I'm just an amateur programmer so I don't know much - lets wait for a response from @arendst before we play ping pong some more?

@hhaim
Copy link
Author

hhaim commented Jan 29, 2019

Looking at the code I see more issues

  1. Eeprom API is deprecated and could be removed

  2. flash API == two way to do the same, ESP class (ESP.flashxxx) or directly using lower level API spi_flash_xx why not to choose one way?

  3. This could be encapsulated into a function (there are too many copy of this line)
    ESP.flashRead(settings_location * SPI_FLASH_SEC_SIZE, (uint32_t*)&Settings, sizeof(SYSCFG));

read_setting_from sector(sector)

@andrethomas
Copy link
Contributor

Looking at the code I see more issues

Fork the code and feel free to make PR's so that they can be discussed there.

Opening an issue and suggesting code changes is not a constructive way of changing anyone's view on the way the code is implemented.

Lastly, opening an issue on the same topic because the previous one was closed is not the way to go.

@arendst
Copy link
Owner

arendst commented Jan 29, 2019

@hhaim I suggest you make your changes and then do some tests to see if it solves your issue. If so I might want to implement it.

As of now my 30+ devices never had a crc error causing them to reset to default. I also never have a need to remove AC to the devices either. When AC was lost unexpectedly they all came back up as expected.

Remember that there is only a flash write when a settings is going to be changed or the relay is switched. Then, depending on the state of savedata a flash update is written. Flash updates can be delayed or disabled too.

So your flash corruption could only happen during one of the above situations.

@arendst arendst removed the Awaiting feedback from Project Owner Awaiting feedback / input from Project Owner label Jan 29, 2019
@andrethomas2
Copy link
Collaborator

Closing this issue as it has been answered.

Support Information

See Wiki for more information.
See Chat for more user experience.

@hhaim
Copy link
Author

hhaim commented Jan 29, 2019

@arendst thanks for the quick answer. There are two things mixed in this issue.
My original issue (5032) was the boot loop issue which was resolved quickly by @ascillato
But I've looked into the code to fix the other potential CRC issue and had some questions about the logic of the current code.

  1. The logic of the code (this issue)
  2. The power failure while write causing CRC issue (5032)

While fixing #2 I get to issue #1

regarding #2:
I would create a stress test to prove that. ~10 years ago I've implemented the same idea for our customers, same flash, --spurious errors were there -- this is how flash works, it is all about the scale . When you have ~30 devices and 2 power failure a year you won't see this issue. In my case all the devices are configure to save to flash (using rotate)

regarding #1:
Would you want me to create a PR fixing the logic issue (if exits)? (finding the best sector with valid crc, last write, valid CFG_HOLDER) I can add more cleanup to the other things I saw

@benzino77
Copy link

@arendst is it possible to change BOOT_LOOP_TIME in user_config_override.h?
Will it be also possible to implement user defined values for situations when "fast restart" occurs. I think of lines 2487 till 2511 in sonoof.ino. Now after 3 fast restarts rule sets are disabled, then rules functionality is disabled. After 5 restarts the whole gpio configuration is reset and then the device is set as Sonoff Basic. Will it be possible to define those values (3 for rules sets, 4 for rules as functionality, etc.) somewhere in sonoff.h and let the user to ovewrite them in user_config_override.h?

I think that I have such situation in the past (my garage controller) and I lost GPIO configuration because of power flickering...
I could manipulate BOOT_LOOP_TIME (define lower value than 10sec) to mitigate such situation, or change thresholds for clearing GPIO configuration etc.

@mike2nl
Copy link
Contributor

mike2nl commented Jan 29, 2019

It is possible that you gus have hardware which is already damaged by yourself in a way you handle it?
Where you get 3 or more fast restarts? I think you manipulate things.
Why you get killed rules.

I never had these issue and i have a lot of sonoff devices.
It is possible an idea is to get a new peace of hardware and test it again.|
Poissible you both have the same cheap chinese hardware.

@benzino77
Copy link

Where you get 3 or more fast restarts?

by power flickering (the power which is delivered to my house). This is nothing unusual in the village I live, it happens from time to time (power flickering). There are moments that I have a "disco" in my house for 10-15 seconds.

If I correctly understand the lines from sonoff.ino then after 5 restarts during BOOT_LOOP_TIME I will lost my GPIO config.The device itself has nothing to do with the "gpio config reset", as I wrote, the power flickering and settings in Tasmota does.

If I understand it wrong (lines 2487 till 2511) then you can ignore my posts.

I never had these issue and i have a lot of sonoff devices.
It is possible an idea is to get a new peace of hardware and test it again.|
Poissible you both have the same cheap chinese hardware.

It is not a hardware problem (at least the scenario that I described) but the Tasmota behavior when there are "fast restarts" during BOOT_LOOP_TIME.

@ascillato
Copy link
Contributor

@benzino77

Please, check #4645

In that issue Theo has explained how to disable that feature if you don't need it. Thanks

@benzino77
Copy link

Sure, but there are no lines 2658 until 2680 in sonoff.ino anymore ;) I think those lines are now 2487 until 2511

What I'm asking is to allow user (by user_config_override.h) to slightly modify this feature by setting own BOOT_LOOP_TIME or change "boot number thresholds"

@hhaim
Copy link
Author

hhaim commented Jan 29, 2019 via email

@andrethomas
Copy link
Contributor

BOOT_LOOP_TIME can be changed in my_user_config.h or my_user_config_override.h (if you use the override config file) by adding the following two lines to your relevant config file.

#undef BOOT_LOOP_TIME
#define BOOT_LOOP_TIME                 20

@hhaim
Copy link
Author

hhaim commented Jan 29, 2019 via email

@andrethomas
Copy link
Contributor

Sent from my iPhone

Get off your phone and make a PR then :)

@Jason2866
Copy link
Collaborator

Jason2866 commented Jan 29, 2019

@hhaim We highly appreciate a PR to see what high skilled programmer like you can teach us

@jziolkowski
Copy link
Contributor

jziolkowski commented Jan 29, 2019

Why don't you guys buy a good UPS then?

@hhaim
Copy link
Author

hhaim commented Jan 29, 2019

tune boot-loop

@hhaim
Copy link
Author

hhaim commented Jan 29, 2019

@jziolkowski we are talking about UPS for all home electricity - In my case only a big generator will do the job.
Each IoT device is connected in a diffrent place and in a different configuraiton.

@benzino77
Copy link

@andrethomas thanks! Please correct me if I'm wrong: I should rather use lower value of BOOT_LOOP_TIME rather than higher? Lower value will lead to the situation that "power flickering" should be really fast to reset my configuration to safe state (disabled rules, GPIO, Sonoff Basic module, etc.).
I definitely don't want to remove this feature as suggested by @ascillato. I don't want to brick my device (and have to remove the device from the wall to reflash it by serial - this feature is here for a good reason) by wrong configuration which will lead to "boot loop" or whatever. I suspect that something like the code below (but I'm not Tasmota programmer) should give me (and maybe others, but again, please correct me if I'm wrong) better control of the "power flickering" situation:

  #define BOOT_COUNTS_TH 1
 // Disable functionality as possible cause of fast restart within BOOT_LOOP_TIME seconds (Exception, WDT or restarts)
  if (RtcReboot.fast_reboot_count > BOOT_COUNTS_TH + 1) {          // Restart twice
    Settings.flag3.user_esp8285_enable = 0;       // Disable ESP8285 Generic GPIOs interfering with flash SPI
    if (RtcReboot.fast_reboot_count > BOOT_COUNTS_TH + 2) {        // Restart 3 times
      for (uint8_t i = 0; i < MAX_RULE_SETS; i++) {
        if (bitRead(Settings.rule_stop, i)) {
          bitWrite(Settings.rule_enabled, i, 0);  // Disable rules causing boot loop
        }
      }
    }
    if (RtcReboot.fast_reboot_count > BOOT_COUNTS_TH + 3) {        // Restarted 4 times
      Settings.rule_enabled = 0;                  // Disable all rules
    }
    if (RtcReboot.fast_reboot_count > BOOT_COUNTS_TH + 4) {        // Restarted 5 times
      for (uint8_t i = 0; i < sizeof(Settings.my_gp); i++) {
        Settings.my_gp.io[i] = GPIO_NONE;         // Reset user defined GPIO disabling sensors
      }
    }
    if (RtcReboot.fast_reboot_count > BOOT_COUNTS_TH + 5) {        // Restarted 6 times
      Settings.module = SONOFF_BASIC;             // Reset module to Sonoff Basic
//      Settings.last_module = SONOFF_BASIC;

I can then redefine BOOT_COUNTS_TH and BOOT_LOOP_TIME to fit my environment/needs.

@hhaim
Copy link
Author

hhaim commented Jan 30, 2019 via email

@andrethomas
Copy link
Contributor

@benzino77

BOOT_LOOP_TIME defines the number of seconds after successful boot which RtcReboot.fast_reboot_count will be reset to 0 which essentially means that things are probably working as they should.

I think its fair to assume that if you have very unstable AC and you want the device to believe that everything is OK then you can decrease this value so that the reboot counter stops contributing to settings changes sooner (lets say 1 second) - So if no secondary resets within a 1 second period then the firmware will assume everything is OK and continue to attempt to function as configured.

I think its fair to say that if a reset/reboot occurs because of a hardware configuration issue it will most likely occur within milliseconds from boot so this will give you sufficient protection in case you configured a special function pin by accident which causes the wdt to meet one of its reset conditions.

The main challenge here is to find a balance between how many possible reboots you would expect within a certain time frame measured against the probability of having a rule that places your device in a boot loop.

I think implementing the BOOT_COUNTS_TH code you proposed will not make much difference if you're trying to mitigate power on-off transitions from your utility provider as it just adds another counter into the mix.

The question is what is the shortest possible time in which your utility company will power cycle your house/neighborhood... Will it cause your device to boot more than once in 10 seconds? If not, then redefining it to a lower value would not be beneficial.

Also note that the serious changes only start after the 3rd reboot within BOOT_LOOP_TIME seconds so there really has to be something wrong with the device - I don't think any power utility in the world can shed and restore electricity to a neighborhood 3 times in succession without exceeding the 10 second window at least once which would leave the configuration in tact as expected.

@benzino77
Copy link

benzino77 commented Jan 30, 2019

@andrethomas thanks again!

The main challenge here is to find a balance between how many possible reboots you would expect within a certain time frame measured against the probability of having a rule that places your device in a boot loop.

Yes, this is definitely a challenge to find that balance. It is not trivial and obvious.

I think implementing the BOOT_COUNTS_TH code you proposed will not make much difference if you're trying to mitigate power on-off transitions from your utility provider as it just adds another counter into the mix.

That was just a proposition for tasmota masters to comment or point weakness of that approach - thanks for that

I don't think any power utility in the world can shed and restore electricity to a neighborhood 3 times in succession without exceeding the 10 second

You are a man of little faith :)
Some AC lines are located near the trees and hanging on the air (from pole to pole). When the wind starts to blow, and it is nothing unusual in my neighborhood, that the wind speed is 80km/h or more , they (lines) start to "swing" or the branches on the trees are beginning to break. I live in a small village (400+ citizens) over 25km from the nearest "civilization"... "Strobo" or "disco" effect happens in my neighborhood - me and my neighbors got used to it :)

@hhaim thanks - looks promising

@andrethomas
Copy link
Contributor

Some AC lines are located near the trees and hanging on the air (from pole to pole). When the wind starts to blow, and it is nothing unusual in my neighborhood, that the wind speed is 80km/h or more , they (lines) start to "swing" or the branches on the trees are beginning to break. I live in a small village (400+ citizens) over 25km from the nearest "civilization"... "Strobo" or "disco" effect happens in my neighborhood - me and my neighbors got used to it :)

Sure, use 1 second then to give some tolerance to hardware configuration issues and make sure your rules are correct before enforcing them on a device which is inside a wall or a tree trunk :)

@andrethomas
Copy link
Contributor

@benzino77 Sounds like a nice village to retire in...

Theo made changes with the following merges:

9825d6f (Add resiliency to saved Settings)
2c164a8 (6.4.1.13 Add boot loop offset)

@andrethomas2 andrethomas2 added enhancement Type - Enhancement that will be worked on fixed Result - The work on the issue has ended and removed question Type - Asking for Information labels Jan 30, 2019
@benzino77
Copy link

WOW! There is even SetOptionXX for that!

@arendst @andrethomas @hhaim - thank you gents!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Type - Enhancement that will be worked on fixed Result - The work on the issue has ended
Projects
None yet
Development

No branches or pull requests

10 participants