Adding a network-online checker for elemental #1315

davidcassany · 2024-03-22T15:01:51Z

With this checker, if the network-online.target does not reach the active state, the system will reboot to a fallback.

A few considerations about the current boot_assessment:

Passive will also run the boot_assessment on upgrade in case active fails. This is relevant as it allows the system to keep trying different fallbacks (assuming there is more than one old snapshot). However, this also implies that if none is functional it will eventually reach the recovery system. Previously it stayed on a single fallback, never reaching recovery automatically, because the boot assessment was not executed on passive.
Boot_assessment is not active by default on regular boots, only for reboots after upgrading. This is relevant as the boot assessment is not executed after install. So it could happen that the boot assessment does not pass and you only realize this on an upgrade. In that case the system will land in recovery mode (out of k8s) after upgrading.

So I am wondering if it would make sense to default the boot_assessment to always run, not only after upgrades. This way we verify that the installed system actually passes the boot assessment.

@frelon @anmazzotti @fgiudici any thoughts about this?

Fixes #1263

Signed-off-by: David Cassany <[email protected]>

fgiudici · 2024-03-22T15:30:58Z

🤔 the scenario that worries me is if there is some transient network connectivity issue (some remote edge deployment with unstable connectivity?).
Will that mean that a system may end up in recovery state in that case?

davidcassany · 2024-03-22T17:10:28Z

> 🤔 the scenario that worries me is if there is some transient network connectivity issue (some remote edge deployment with unstable connectivity?). Will that mean that a system may end up in recovery state in that case?

Good point, that could happen if the network outage happens during the upgrade reboot, which is a corner case as the network is required to run the upgrade. The window in which this could happen is rather small, but I guess at scale it is not that hard to hit such a corner case in a few nodes...
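
To put the "at scale" remark in rough numbers, here is a minimal sketch. It assumes each node independently has the same small probability of losing network during the upgrade reboot; the function name and the figures (0.1% per node, 1000 nodes) are purely illustrative and not measured from any deployment.

```python
def prob_any_node_hit(p_per_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes hits the
    outage window, given a per-node probability `p_per_node`."""
    return 1.0 - (1.0 - p_per_node) ** nodes


def expected_nodes_hit(p_per_node: float, nodes: int) -> float:
    """Expected number of affected nodes under the same assumption."""
    return p_per_node * nodes


if __name__ == "__main__":
    # Hypothetical fleet: 1000 nodes, 0.1% chance per node.
    print(round(prob_any_node_hit(0.001, 1000), 3))
    print(expected_nodes_hit(0.001, 1000))
```

Under these made-up figures a single affected node is more likely than not, which is the sense in which a small per-node window still matters across a fleet.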

Yes, this is all annoying. That's why I am hesitant to include certain checks: they could easily backfire with undesired effects in certain corner cases. Also, it is unclear what network-online.target encompasses.

frelon · 2024-03-25T07:57:28Z

We could also use the boot-complete.target since that seems to be what the systemd boot assessment uses.

davidcassany · 2024-03-25T12:20:31Z

> We could also use the boot-complete.target since that seems to be what the systemd boot assessment uses.

This is quite interesting, but this seems to rely on https://uapi-group.org/specifications/specs/boot_loader_specification which we are not following (if we do, it is by pure coincidence). Following this spec would be nice though. I wonder if it is possible to follow it with both grub2 and systemd-boot. If so, we could have our bootloader setup agnostic to the underlying bootloader, which would be pretty nice. I think this requires some deeper investigation.

davidcassany · 2024-03-25T12:33:22Z

@fgiudici in rancher/elemental-toolkit#2027 the boot_assessment changes to only iterate over existing snapshots, without including recovery. Hence on reboot after install, upgrade or reset the boot assessment kicks in and runs the checks, if any. In case of failure it will reboot to the most recent fallback snapshot and from there keep trying until there is a successful boot or until it reaches the oldest snapshot. If the oldest snapshot also fails to boot, it will essentially keep rebooting every few minutes in an infinite loop from that old snapshot.

Probably, instead of being stuck on the oldest, we could consider restarting from scratch: active first, then falling back from the most recent non-active snapshot down to the oldest again. I'd consider that a minor detail though; if no snapshot boots there isn't much we can do...
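
The two behaviours described above (stay on the oldest snapshot versus restart from active) can be sketched as follows. This is only an illustrative model, not the elemental-toolkit implementation: `next_boot_entry`, `ACTIVE`, the `wrap_around` flag and the snapshot names are all hypothetical.

```python
from typing import List

ACTIVE = "active"  # hypothetical label for the active snapshot


def next_boot_entry(current: str, fallbacks: List[str], wrap_around: bool = False) -> str:
    """Return the entry to boot after `current` failed its boot assessment.

    `fallbacks` is ordered from the most recent to the oldest snapshot.
    Recovery is intentionally not part of the sequence.
    """
    order = [ACTIVE] + list(fallbacks)
    idx = order.index(current)
    if idx + 1 < len(order):
        # There is still an older snapshot to try.
        return order[idx + 1]
    # The oldest snapshot failed as well: either keep retrying it,
    # or start over from active (the alternative proposed above).
    return order[0] if wrap_around else order[-1]


if __name__ == "__main__":
    snaps = ["snap3", "snap2", "snap1"]  # hypothetical snapshot names
    print(next_boot_entry(ACTIVE, snaps))
    print(next_boot_entry("snap1", snaps))
    print(next_boot_entry("snap1", snaps, wrap_around=True))
```

With `wrap_around=False` the node keeps rebooting into the oldest snapshot; with `wrap_around=True` it cycles through every snapshot again, which only helps if the failure was transient.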

Thoughts?

frelon · 2024-03-25T12:39:20Z

> This is quite interesting, but this seems to rely on https://uapi-group.org/specifications/specs/boot_loader_specification which we are not following (if we do, it is by pure coincidence). Following this spec would be nice though. I wonder if it is possible to follow it with both grub2 and systemd-boot. If so, we could have our bootloader setup agnostic to the underlying bootloader, which would be pretty nice. I think this requires some deeper investigation.

In the docs it's stated that it's a generic synchronization point, so it should work with any bootloader. We would probably need to add a Requires= on it somewhere in our boot to actually activate it.

fgiudici · 2024-03-25T14:08:24Z

> @fgiudici in rancher/elemental-toolkit#2027 the boot_assessment changes to only iterate over existing snapshots, without including recovery. Hence on reboot after install, upgrade or reset the boot assessment kicks in and runs the checks, if any. In case of failure it will reboot to the most recent fallback snapshot and from there keep trying until there is a successful boot or until it reaches the oldest snapshot. If the oldest snapshot also fails to boot, it will essentially keep rebooting every few minutes in an infinite loop from that old snapshot.
>
> Probably, instead of being stuck on the oldest, we could consider restarting from scratch: active first, then falling back from the most recent non-active snapshot down to the oldest again. I'd consider that a minor detail though; if no snapshot boots there isn't much we can do...
>
> Thoughts?

So, it makes sense then: if things go awry we may end up with an older snapshot, and one could always retry an OS upgrade later 👍🏼
Why the network-online.target then? Is it because it is the last one?
The proposal from @frelon to use the boot-complete.target sounds more "generic" (even if the result would be the same, if I understood it correctly). And sure, we are not adopting the whole systemd scheme to manage rollbacks, so there is no real need to follow that spec; we would just align to the same target used by systemd.
But anyway, I think rebooting with a previous snapshot will handle the corner case well.
We have to remember to properly document that, in order to make it clear to anyone who uses Elemental in special scenarios (edge?).

fgiudici

LGTM! Approving from my side 👍🏼
I would still like to hear more of @frelon's PoV.

davidcassany · 2024-03-25T14:17:30Z

> Why the network-online.target then? Is it because it is the last one?

Because we saw a customer with a failed upgrade (filesystem corruption) and it booted with many failed services (dbus, networkmanager, etc.), but since none of these is required it just booted in degraded mode and was not reachable anymore. Boot assessment did not catch the error because the system booted from systemd's perspective. This is where the health check concept comes from. We could also consider having a health check on the elemental-register service instead of network-online.

If we stick to the elemental-register check, that means that a node that is not able to re-register after (re)booting will try to restart into a fallback snapshot. That could also be a more Elemental-specific approach.
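
As a rough illustration of the health check idea, the sketch below asks systemd whether a unit is active by relying on the exit status of `systemctl is-active --quiet <unit>` (0 means active). The helper names and the injectable `run` parameter are hypothetical, and this is not the checker added by this PR; in particular, a one-shot service such as elemental-register may need a different success criterion than "active", which this thread does not specify.

```python
import subprocess
from typing import Callable, Iterable, Optional, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(cmd, check=False).returncode


def unit_is_active(unit: str, run: Optional[Runner] = None) -> bool:
    """True if systemd reports `unit` as active (exit status 0)."""
    runner = run if run is not None else _run
    return runner(["systemctl", "is-active", "--quiet", unit]) == 0


def boot_is_healthy(units: Iterable[str], run: Optional[Runner] = None) -> bool:
    """True only if every listed unit is active.

    A boot assessment checker could treat a False result as a failed boot
    and trigger the reboot into a fallback snapshot.
    """
    return all(unit_is_active(u, run) for u in units)


if __name__ == "__main__":
    # The unit checked here is the one from the original proposal.
    print(boot_is_healthy(["network-online.target"]))
```

Passing a fake `run` callable makes the logic testable without a running systemd, for example in CI.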

Sorry, dismissing the review, the PR changed

If the elemental-register service fails to register on a reboot after install, reset or upgrade, the checker will cause a reboot to a fallback OS.

Signed-off-by: David Cassany <[email protected]>

fgiudici

Well, the elemental-register sync-up after upgrade is mandatory IMO.
LGTM! Thanks David

frelon

LGTM!

davidcassany requested a review from a team as a code owner March 22, 2024 15:01

Adding an network-online checker for elemental

594981b

Signed-off-by: David Cassany <[email protected]>

davidcassany force-pushed the boot_assessment_checker branch from e81b1f7 to 594981b Compare March 22, 2024 15:17

davidcassany force-pushed the boot_assessment_checker branch from 594981b to dbb38be Compare March 25, 2024 14:08

fgiudici previously approved these changes Mar 25, 2024

davidcassany force-pushed the boot_assessment_checker branch from dbb38be to d46bc2b Compare March 25, 2024 16:35

davidcassany requested review from fgiudici and a team March 25, 2024 17:19

Add an elemental-register checker

deb8b0c

If the elemental-register service fails to register on a reboot after install, reset or upgrade, the checker will cause a reboot to a fallback OS.

Signed-off-by: David Cassany <[email protected]>

davidcassany force-pushed the boot_assessment_checker branch from d46bc2b to deb8b0c Compare March 26, 2024 07:38

fgiudici approved these changes Mar 26, 2024

frelon approved these changes Mar 26, 2024

davidcassany merged commit c01a9a1 into rancher:main Mar 26, 2024
16 of 19 checks passed

davidcassany deleted the boot_assessment_checker branch March 26, 2024 08:48
