Adding an network-online checker for elemental #1315
Signed-off-by: David Cassany <[email protected]>
e81b1f7 to 594981b
🤔 The scenario that worries me is a transient network connectivity issue (some remote edge deployment with unstable connectivity?).
Good point, that could happen if the network outage happens during the upgrade reboot, which is a corner case since the network is required to run the upgrade. The window in which this could happen is rather small, but I guess at scale it is not that hard to hit such a corner case on a few nodes... Yes, this is all annoying. That's why I am hesitant to include certain checks: they could easily backfire with undesired effects in certain corner cases. Also it is unclear what |
We could also use the |
This is quite interesting, but it seems to rely on https://uapi-group.org/specifications/specs/boot_loader_specification, which we are not following (if we do, it is by pure coincidence). Following this spec would be nice though; I wonder if it is possible to follow it with both grub2 and systemd-boot. If so, our bootloader setup could be agnostic to the underlying bootloader, which would be pretty nice. I think this requires some deeper investigation.
@fgiudici in rancher/elemental-toolkit#2027 the boot_assessment changes to only iterate over existing snapshots, without including recovery. Hence on the reboot after install, upgrade or reset the boot assessment kicks in and runs the checks, if any. In case of failure it reboots to the most recent fallback snapshot and from there keeps trying until there is a successful boot or until it reaches the oldest snapshot. If the oldest snapshot also fails to boot, it will essentially keep rebooting every few minutes in an infinite loop from that old snapshot. Instead of being stuck on the oldest, we could consider restarting from scratch: active first, then falling back from the most recent non-active snapshot down to the oldest again. I'd consider that a minor detail though; if no snapshot boots there isn't much we can do... Thoughts?
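To make the two fallback policies discussed above easier to compare, here is a minimal Python simulation. It is only a sketch: the snapshot names, the `boots_ok` callback and the `wrap_around` flag are hypothetical, and it does not reflect how elemental-toolkit actually implements the boot assessment.

```python
from typing import Callable, List


def boot_sequence(
    snapshots: List[str],
    boots_ok: Callable[[str], bool],
    max_boots: int,
    wrap_around: bool = False,
) -> List[str]:
    """Return the snapshots attempted, in order, stopping at the first good boot.

    `snapshots` is ordered from active (index 0) to oldest (last index).
    With wrap_around=False the node stays on the oldest snapshot once it
    gets there; with wrap_around=True it starts again from active.
    """
    attempts: List[str] = []
    idx = 0
    for _ in range(max_boots):
        snap = snapshots[idx]
        attempts.append(snap)
        if boots_ok(snap):
            break  # successful boot, assessment passes
        if idx < len(snapshots) - 1:
            idx += 1  # fall back to the next older snapshot
        elif wrap_around:
            idx = 0  # proposed alternative: restart from active
        # otherwise stay on the oldest snapshot and retry it
    return attempts
```

With no bootable snapshot, both policies loop forever on a real node (bounded here by `max_boots`); the only difference is which snapshots get retried.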
In the docs it's stated that it's a generic synchronization point so it should work using any bootloader. We would probably need to |
594981b to dbb38be
So, it makes sense then: if things go awry we may end up with an older snapshot, and one could always retry an OS upgrade later 👍🏼
LGTM! Approving from my side 👍🏼
I would still like to hear more of @frelon's PoV, though.
Because we saw a customer with a failed upgrade (filesystem corruption) where the system booted with many failed services (dbus, networkmanager, etc.), but since none of these is required it just booted in degraded mode and was not reachable anymore. Boot assessment did not catch the error because, from systemd's perspective, the system had booted. This is where the health check concept comes from. We could also consider a health check on the elemental-register service instead of network-online. If we go with the elemental-register check, a node that is not able to re-register after (re)booting will try to restart into a fallback snapshot. That could also be a more elemental-specific approach.
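The degraded-boot case above can be illustrated with a small Python sketch. It assumes unit states are read with `systemctl is-active <unit>`, which prints states such as `active`, `inactive` or `failed`; the unit names in the usage note and the helper names are illustrative assumptions, not the checker added by this PR.

```python
import subprocess
from typing import Dict, Iterable, List


def query_unit_state(unit: str) -> str:
    """Return the state printed by `systemctl is-active <unit>`."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit only means the unit is not active
    )
    return result.stdout.strip() or "unknown"


def failed_watched_units(states: Dict[str, str], watched: Iterable[str]) -> List[str]:
    """List the watched units whose reported state is 'failed'."""
    return [unit for unit in watched if states.get(unit, "unknown") == "failed"]


def boot_is_healthy(states: Dict[str, str], watched: Iterable[str]) -> bool:
    """A boot counts as healthy only if no watched unit has failed."""
    return not failed_watched_units(states, watched)
```

A caller would build `states` with `query_unit_state` for each watched unit. The point of the sketch is that the outcome depends entirely on which units are watched: a boot with failed dbus and NetworkManager units still looks healthy if neither is on the list.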
dbb38be to d46bc2b
Sorry, dismissing the review, the PR changed
If the elemental-register service fails to register on a reboot after install, reset or upgrade the checker will cause a reboot to a fallback OS.
Signed-off-by: David Cassany <[email protected]>
d46bc2b to deb8b0c
Well, the elemental-register sync-up after upgrade is mandatory IMO,
LGTM! Thanks David
LGTM!
With this checker, if network-online.target does not reach the active state, the system will reboot to fallback.
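One way to picture such a check, and the transient-connectivity concern raised in review, is a poll with a grace period. The Python sketch below is an assumption-laden illustration, not the script in this PR: `wait_until_active`, its timeout and its polling interval are hypothetical, and `get_state` stands in for whatever reads the target state (for example the output of `systemctl is-active network-online.target`).

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    timeout_s: float,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_state() until it returns 'active' or timeout_s elapses.

    Returns True if the target became active in time, False otherwise.
    A False result is what would trigger the reboot to fallback.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "active":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)  # hypothetical grace interval between probes
```

A longer timeout tolerates short outages at the cost of a slower fallback decision; it cannot tell a long outage apart from a broken upgrade.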
A few considerations for the current boot_assessment:
- Passive will also run the boot_assessment on upgrade in case active fails. This is relevant as it allows the system to keep trying different fallbacks (assuming there is more than one old snapshot). However, this also implies that if none is functional it will eventually reach the recovery system. Before, it used to stay on a single fallback, never reaching recovery automatically, because the boot assessment was not executed on passive.
- Boot_assessment is not active by default on regular boots, only for reboots after upgrading. This is relevant as the boot assessment is not executed after install. So it could happen that the boot assessment does not pass and you only realize this on an upgrade. In that case the system will land in recovery mode (out of k8s) after upgrading.
So I am wondering if it would make sense to default the boot_assessment to always run, not only after upgrades. This way we verify that the installed system actually passes the boot assessment.
@frelon @anmazzotti @fgiudici any thoughts about this?
Fixes #1263