Boot assessment must wait for network #1263

Closed
kkaempf opened this issue Mar 1, 2024 · 3 comments · Fixed by #1315
Labels: area/booting, kind/bug, kind/enhancement

Comments

@kkaempf
Contributor

kkaempf commented Mar 1, 2024

We have a user case where the system comes up, passes boot assessment, but fails to start the workload.

Turns out that there's a filesystem error, preventing even NetworkManager.service from starting.

Boot assessment should capture this case and reboot into fallback if network fails to start up.
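As a rough illustration of the kind of check this asks for, the sketch below reports whether a systemd unit is active by calling `systemctl is-active --quiet`. The helper name `unit_is_active` and the injectable `run` parameter are hypothetical and exist only for this example; this is not how boot assessment is currently implemented.

```python
import subprocess


def unit_is_active(unit, run=subprocess.run):
    """Return True when `systemctl is-active --quiet <unit>` exits with 0.

    `run` is injectable so the logic can be exercised without systemd
    (an assumption made for this sketch only).
    """
    result = run(["systemctl", "is-active", "--quiet", unit], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    # A failed or missing NetworkManager.service yields a non-zero exit code.
    if not unit_is_active("NetworkManager.service"):
        print("network is not up; boot should not be marked as good")
```

A check like this only covers the unit state, so it would catch the filesystem error described above only because that error keeps NetworkManager.service from starting.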

kkaempf added the kind/bug, kind/enhancement, and area/booting labels on Mar 1, 2024
kkaempf added this to the slem6 milestone on Mar 1, 2024
kkaempf removed this from the slem6 milestone on Mar 12, 2024
@davidcassany
Contributor

davidcassany commented Mar 13, 2024

This is a tricky topic. Currently, boot assessment relies on systemd being able to reach a certain stage at boot, and on the system rebooting automatically if it fails to do so.

We should explore how the health checker from SLE Micro works, and also check whether we need to define a more elaborate concept to verify that the system booted as expected. I'd say a sane check could be that the system managed to register itself once the default target is reached and before a certain timeout. I'd like the check to be configurable, for example by providing a check script that is executed a certain number of times at a predefined cadence before the boot is considered failed. On failure we could simply reboot.

This would give us the chance to also reboot when systemd booted but in a degraded state (e.g. no network), and it would also let us configure explicit constraints to consider for an upgrade.
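To make the idea concrete, here is a minimal sketch of a configurable check: a script is run up to a fixed number of times at a fixed interval, and a reboot is triggered only if every attempt fails. All names (`run_check`, `assess_boot`, the `attempts`, `interval` and `timeout` parameters, and the example script path) are hypothetical, chosen for illustration rather than taken from elemental-toolkit.

```python
import subprocess
import time


def run_check(script, attempts=5, interval=10.0, timeout=30.0,
              run=subprocess.run, sleep=time.sleep):
    """Run `script` up to `attempts` times; return True on the first exit code 0."""
    for attempt in range(1, attempts + 1):
        try:
            if run([script], timeout=timeout, check=False).returncode == 0:
                return True
        except (OSError, subprocess.TimeoutExpired):
            # A missing script or a hung check counts as a failed attempt.
            pass
        if attempt < attempts:
            sleep(interval)  # wait for the next attempt (the "cadence")
    return False


def assess_boot(script, reboot, **kwargs):
    """Call `reboot()` when the check never succeeds; return the check result."""
    ok = run_check(script, **kwargs)
    if not ok:
        reboot()
    return ok


if __name__ == "__main__":
    # Hypothetical script path; the reboot action is only printed here.
    assess_boot("/usr/local/bin/boot-check", lambda: print("would reboot"),
                attempts=3, interval=5.0)
```

Retrying before giving up matters for checks such as registration, which may legitimately need some time after the default target is reached.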

@davidcassany
Contributor

I just had a quick look at the Micro health checker and I do believe we should migrate to such a service. What I am wondering is whether we could easily generalize the concept, so that we could provide a health check system for elemental-toolkit that does not depend on Micro.

kkaempf added this to the Micro6 milestone on Mar 19, 2024
@davidcassany
Contributor

After a further look at the health checker from Micro, I believe there is little chance of us adopting it right now (it is tightly coupled to btrfs and to several Micro specifics such as grub, etc.). We need deeper integration and more perspective to make use of it. However, what we can do is build our own checker script and logic in a compatible way: something really close to a health checker plugin, so that we can easily adapt to or adopt the Micro system when there is a chance.
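One possible shape for such plugin-style logic is sketched below: every executable file in a directory is treated as one check, and the names of the failing ones are collected. The function name `run_plugins`, the directory layout, and the convention that exit code 0 means success are assumptions for illustration; the sketch does not reproduce the actual plugin interface of the Micro health checker.

```python
import os
import subprocess


def run_plugins(plugin_dir, run=subprocess.run):
    """Run each executable file in `plugin_dir`; return the names that failed."""
    failed = []
    for name in sorted(os.listdir(plugin_dir)):
        path = os.path.join(plugin_dir, name)
        # Skip directories and files without any execute permission bit.
        if not os.path.isfile(path) or not (os.stat(path).st_mode & 0o111):
            continue
        try:
            returncode = run([path], check=False).returncode
        except OSError:
            returncode = 1  # a plugin that cannot be started counts as failed
        if returncode != 0:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical plugin directory, used only for this example.
    plugin_dir = "/usr/libexec/boot-checks"
    if os.path.isdir(plugin_dir):
        print("failed checks:", run_plugins(plugin_dir))
```

Keeping each check as a standalone executable would leave room to swap the runner for another health check system later without rewriting the checks themselves.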
