[tracking] Improve failure tolerance in driver/mtd_spi_nor #20769

crasbe opened this issue Jul 2, 2024 · 0 comments
Description

The work in #20589 uncovered a lot of possible improvements for driver/mtd_spi_nor.
In its current form in particular, it is not very resilient against failing operations or non-responsive devices, which leads to endless loops and essentially a hanging program.

  • When mtd_spi_nor_power fails to read the JEDEC ID and no timers are used, it retries the read 560,000,000 times. This takes a considerable amount of time and essentially blocks the system during startup. The variable "retries" has to be set to a sensible value.
  • There is no timeout for the functions wait_for_write_enable_cleared and wait_for_write_enable_set, so the thread gets stuck if the chip does not answer. https://www.macronix.com/Lists/Datasheet/Attachments/8868/MX25R6435F,%20Wide%20Range,%2064Mb,%20v1.6.pdf Page 34 has a block diagram. Macronix says to issue the WREN command again if the flag has not been set. So the right thing to do is not to wait for the flag to be set, but to retry (a finite number of times).
  • The function wait_for_write_complete does not have a timeout either, but it counts the attempts and how many times it yielded the thread. These counters can be used as a basis for a timeout. The timeout should depend on the operation: a chip erase can take a very long time (in the range of multiple minutes), whereas other operations should not take very long.
  • Add a software fallback for the data integrity functions (to check if the program or erase was successful). Depending on the length of the operation, maybe only do some random checks to avoid doing a full blank check of a 128MB chip for example.
  • Patch the nRF52840DK board mtd.c to utilize the data integrity checks by default. Depends on driver/mtd_spi_nor, pkg/littlefs: improve reliability with corrupted flash (new PR) #20589.
  • When a chip reset is issued and the microcontroller is reset, reading out the JEDEC ID will fail because the flash is still busy with the reset. Therefore the JEDEC check should check the WIP flag to see whether an operation is still in progress. The function should then return -EBUSY (but I don't know how that would be handled by the rest of the MTD subsystem?)
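
To illustrate the second point, here is a minimal sketch of a bounded WREN retry, written against a simulated chip so it can run without hardware. It is not the existing driver code: nor_bus_t, write_enable_with_retry, WREN_MAX_ATTEMPTS and the fake_* helpers are hypothetical names, and the attempt limit is an arbitrary placeholder. The only thing taken from the datasheet flow is the idea of re-issuing WREN while the WEL bit is not set.

```c
#include <errno.h>
#include <stdint.h>

#define SR_WEL            (0x02u) /* Write Enable Latch bit of the status register */
#define WREN_MAX_ATTEMPTS (8u)    /* hypothetical limit, would need tuning */

/* Hypothetical bus abstraction (assumption, not the RIOT API). */
typedef struct {
    void (*send_wren)(void *ctx);      /* issue the WREN command */
    uint8_t (*read_status)(void *ctx); /* read the status register */
    void *ctx;
} nor_bus_t;

/* Re-issue WREN until WEL reads back as set, or give up with -EIO
 * after max_attempts instead of spinning forever. */
static int write_enable_with_retry(const nor_bus_t *bus, unsigned max_attempts)
{
    for (unsigned i = 0; i < max_attempts; i++) {
        bus->send_wren(bus->ctx);
        if (bus->read_status(bus->ctx) & SR_WEL) {
            return 0;
        }
    }
    return -EIO;
}

/* Simulated chip that ignores the first `ignore` WREN commands. */
typedef struct {
    unsigned ignore;
    unsigned wren_seen;
    uint8_t status;
} fake_chip_t;

static void fake_send_wren(void *ctx)
{
    fake_chip_t *chip = ctx;
    chip->wren_seen++;
    if (chip->wren_seen > chip->ignore) {
        chip->status |= SR_WEL;
    }
}

static uint8_t fake_read_status(void *ctx)
{
    const fake_chip_t *chip = ctx;
    return chip->status;
}
```

With a chip that never sets WEL, the caller gets an error after WREN_MAX_ATTEMPTS commands and can propagate it, rather than blocking the thread.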

The first three points are probably a single PR, the fourth one is a single PR, the fifth one too and the last one might become a can of worms again.
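
For the fourth point, a software blank check based on random samples could look roughly like the sketch below. All names here (nor_read_fn, sampled_blank_check, fake_flash_t, fake_read) are made up for illustration and the read callback is an assumption, not the MTD API. The sketch always checks the first and last byte of the range and then a configurable number of pseudo-random offsets, so it can miss isolated bad bytes by design.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical read callback: copies len bytes starting at addr into buf,
 * returns 0 on success and a negative value on error. */
typedef int (*nor_read_fn)(void *ctx, uint32_t addr, uint8_t *buf, size_t len);

/* Small xorshift32 generator, used only to pick sample offsets. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Returns 1 if every sampled byte is 0xFF, 0 if a non-blank byte was
 * found, or the negative error code of a failed read. */
static int sampled_blank_check(nor_read_fn read, void *ctx, uint32_t base,
                               uint32_t len, unsigned samples, uint32_t seed)
{
    if (len == 0) {
        return 1;
    }
    uint32_t state = seed ? seed : 1u; /* xorshift state must not be 0 */
    /* samples + 2 reads: first byte, last byte, then random offsets */
    for (unsigned i = 0; i < samples + 2; i++) {
        uint32_t off;
        if (i == 0) {
            off = 0;
        }
        else if (i == 1) {
            off = len - 1;
        }
        else {
            off = xorshift32(&state) % len;
        }
        uint8_t byte;
        int res = read(ctx, base + off, &byte, 1);
        if (res < 0) {
            return res;
        }
        if (byte != 0xFF) {
            return 0;
        }
    }
    return 1;
}

/* RAM-backed stand-in for the flash so the sketch runs without hardware. */
typedef struct {
    const uint8_t *data;
    uint32_t size;
} fake_flash_t;

static int fake_read(void *ctx, uint32_t addr, uint8_t *buf, size_t len)
{
    const fake_flash_t *flash = ctx;
    if (addr > flash->size || len > flash->size - addr) {
        return -1;
    }
    memcpy(buf, flash->data + addr, len);
    return 0;
}
```

The number of samples trades verification time against the chance of missing a bad byte, so it would probably have to scale with the length of the erased or programmed range.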

