Adding a network-online checker for elemental #1315

Merged
merged 2 commits into from
Mar 26, 2024

Conversation

davidcassany
Contributor

@davidcassany davidcassany commented Mar 22, 2024

With this checker, if the network-online.target does not reach the active state, the system will reboot to the fallback.
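
As a rough illustration of what such a checker could look like: the sketch below asks systemd whether a unit is active via `systemctl is-active --quiet`, which exits with status 0 only when the unit is active. The helper names (`unit_is_active`, `checker_exit_code`), the injectable `run` parameter, and the convention that a non-zero exit marks the boot as failed are assumptions made for the example, not the code in this PR.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Hypothetical type: a runner takes a command and returns its exit status.
Runner = Callable[[Sequence[str]], int]


def _default_run(cmd: Sequence[str]) -> int:
    # Run the command and return its exit status without raising on failure.
    return subprocess.run(cmd, check=False).returncode


def unit_is_active(unit: str, run: Optional[Runner] = None) -> bool:
    """Return True if `systemctl is-active --quiet <unit>` exits with 0."""
    runner = run if run is not None else _default_run
    return runner(["systemctl", "is-active", "--quiet", unit]) == 0


def checker_exit_code(unit: str, run: Optional[Runner] = None) -> int:
    """Assumed convention: 0 means the check passed, 1 means it failed."""
    return 0 if unit_is_active(unit, run) else 1
```

Passing a fake `run` makes it possible to exercise the logic on a machine without systemd; on a real node the default runner would call `systemctl` directly.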

A few considerations for the current boot_assessment:

  1. Passive will also run the boot_assessment on upgrade in case active fails. This is relevant because it allows the system to keep trying different fallbacks (assuming there is more than one old snapshot). However, it also implies that if none is functional, the system will eventually reach the recovery system. Previously it stayed on a single fallback and never reached recovery automatically, because the boot assessment was not executed on passive.

  2. Boot_assessment is not active by default on regular boots, only for reboots after upgrading. This is relevant because the boot assessment is not executed after install. So it could happen that the boot assessment does not pass and you only realize this on an upgrade. In that case the system will land in recovery mode (out of k8s) after upgrading.

So I am wondering if it would make sense to default the boot_assessment to always run, not only after upgrades. This way we verify that the installed system actually passes the boot assessment.

@frelon @anmazzotti @fgiudici any thoughts about this?

Fixes #1263

@davidcassany davidcassany requested a review from a team as a code owner March 22, 2024 15:01
@fgiudici
Member

🤔 the scenario that worries me is if there is some transient network connectivity issue (some remote edge deployment with unstable connectivity?).
Will that mean that a system may end up in recovery state in that case?

@davidcassany
Contributor Author

🤔 the scenario that worries me is if there is some transient network connectivity issue (some remote edge deployment with unstable connectivity?). Will that mean that a system may end up in recovery state in that case?

Good point, that could happen if the network outage happens during the upgrade reboot, which is a corner case, as the network is required to run the upgrade. The window in which this could happen is rather small, but I guess at scale it is not that hard to hit such a corner case on a few nodes...

Yes, this is all annoying. That's why I am hesitant to include certain checks: they could easily come back to bite us with undesired effects in certain corner cases. Also, it is unclear what network-online.target encompasses.

@frelon
Contributor

frelon commented Mar 25, 2024

We could also use the boot-complete.target since that seems to be what the systemd boot assessment uses.

@davidcassany
Contributor Author

We could also use the boot-complete.target since that seems to be what the systemd boot assessment uses.

This is quite interesting, but it seems to rely on https://uapi-group.org/specifications/specs/boot_loader_specification which we are not following (if we do, it is by pure coincidence). Following this spec would be nice though; I wonder if it is possible to follow it with both grub2 and systemd-boot. If so, we could make our bootloader setup agnostic to the underlying bootloader, which would be pretty nice. I think this requires some deeper investigation.

@davidcassany
Contributor Author

@fgiudici in rancher/elemental-toolkit#2027 the boot_assessment changes to only iterate over existing snapshots, without including recovery. Hence, on reboot after install, upgrade or reset, the boot assessment will kick in and run the checks, if any. In case of failure it will reboot to the most recent fallback snapshot and from there keep trying until there is a successful boot or until it reaches the oldest snapshot. If the oldest snapshot also fails to boot, it will essentially keep rebooting every few minutes in an infinite loop from that old snapshot.

Probably, instead of being stuck on the oldest, we could consider restarting from scratch: active first, then falling back to the most recent non-active snapshot, down to the oldest again. I'd consider that a minor detail though; if no snapshot boots, there isn't much we can do...

Thoughts?
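
To make the two fallback strategies above concrete, here is a minimal sketch of the selection logic only. It assumes a list of snapshot names ordered from active (first) to oldest (last); the function name `next_boot_target`, the `wrap` flag and the snapshot names are hypothetical and do not come from elemental-toolkit.

```python
from typing import List


def next_boot_target(current: str, snapshots: List[str], wrap: bool = False) -> str:
    """Pick the snapshot to boot after `current` failed its assessment.

    `snapshots` is ordered from the active snapshot (index 0) to the
    oldest fallback (last index). Recovery is deliberately not in the list.
    """
    idx = snapshots.index(current)
    if idx + 1 < len(snapshots):
        # There is still an older snapshot to try.
        return snapshots[idx + 1]
    # Already on the oldest snapshot: either keep retrying it (behaviour
    # described for the toolkit change) or start over from active
    # (the alternative floated above).
    return snapshots[0] if wrap else current


# Hypothetical snapshot names, newest first.
order = ["active", "passive-3", "passive-2", "passive-1"]
print(next_boot_target("active", order))
print(next_boot_target("passive-1", order))
print(next_boot_target("passive-1", order, wrap=True))
```

With `wrap=False` the oldest snapshot is retried indefinitely; with `wrap=True` the cycle restarts from active.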

@frelon
Contributor

frelon commented Mar 25, 2024

This is quite interesting, but it seems to rely on https://uapi-group.org/specifications/specs/boot_loader_specification which we are not following (if we do, it is by pure coincidence). Following this spec would be nice though; I wonder if it is possible to follow it with both grub2 and systemd-boot. If so, we could make our bootloader setup agnostic to the underlying bootloader, which would be pretty nice. I think this requires some deeper investigation.

In the docs it's stated that it's a generic synchronization point, so it should work with any bootloader. We would probably need to add a Requires= dependency on it somewhere in our boot to actually activate it.

@fgiudici
Member

@fgiudici in rancher/elemental-toolkit#2027 the boot_assessment changes to only iterate over existing snapshots, without including recovery. Hence, on reboot after install, upgrade or reset, the boot assessment will kick in and run the checks, if any. In case of failure it will reboot to the most recent fallback snapshot and from there keep trying until there is a successful boot or until it reaches the oldest snapshot. If the oldest snapshot also fails to boot, it will essentially keep rebooting every few minutes in an infinite loop from that old snapshot.

Probably, instead of being stuck on the oldest, we could consider restarting from scratch: active first, then falling back to the most recent non-active snapshot, down to the oldest again. I'd consider that a minor detail though; if no snapshot boots, there isn't much we can do...

Thoughts?

So, it makes sense then: if things go awry we may end up on an older snapshot, and one could always retry an OS upgrade later 👍🏼
Why the network-online.target then? Is it because it is the last one?
The proposal from @frelon to use the boot-complete.target sounds more "generic" (even if the result would be the same, if I understood it correctly). And sure, we are not adopting the whole systemd scheme to manage rollbacks, so there is no real need to follow it; we would just align with the same target that systemd uses.
But anyway, I think rebooting into a previous snapshot will handle the corner case well.
We have to remember to document that properly, so that it is clear to anyone using Elemental in special scenarios (edge?).

fgiudici
fgiudici previously approved these changes Mar 25, 2024
Member

@fgiudici fgiudici left a comment


LGTM! Approving from my side 👍🏼
I would still like to get @frelon's PoV on this.

@davidcassany
Contributor Author

davidcassany commented Mar 25, 2024

Why the network-online.target then? Is it because it is the last one?

Because we saw a customer with a failed upgrade (filesystem corruption) and it booted with many failed services (dbus, networkmanager, etc.), but since none of these is required it just booted in degraded mode and was not reachable anymore. Boot assessment did not catch the error because, from systemd's perspective, the system had booted. This is where the health check concept comes from. We could also consider having a health check on the elemental-register service instead of network-online.
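
For reference, the degraded-boot case described above is something systemd itself reports: `systemctl is-system-running` prints `degraded` when the system is up but one or more units have failed. The sketch below shows how a health check might read that state. The function names and the decision to treat only `running` as healthy are assumptions for illustration, not what this PR implements.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Assumption for this sketch: only a fully "running" system counts as healthy.
HEALTHY_STATES = {"running"}


def _default_run(cmd: Sequence[str]) -> str:
    # `is-system-running` exits non-zero for any state other than
    # "running", so do not raise on failure; just capture what it prints.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


def system_state(run: Optional[Callable[[Sequence[str]], str]] = None) -> str:
    """Return the state printed by `systemctl is-system-running`."""
    runner = run if run is not None else _default_run
    return runner(["systemctl", "is-system-running"]).strip()


def boot_is_healthy(state: str) -> bool:
    """True only for states listed in HEALTHY_STATES."""
    return state in HEALTHY_STATES
```

A check this strict would also fail on unrelated unit failures, which is one reason a targeted check on a single service can be the more predictable choice.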

If we stick to the elemental-register check, that means a node that is not able to re-register after (re)booting will try to restart into a fallback snapshot. That could also be a more Elemental-specific approach.

@davidcassany davidcassany dismissed fgiudici’s stale review March 25, 2024 17:19

Sorry, dismissing the review, the PR changed

@davidcassany davidcassany requested review from fgiudici and a team March 25, 2024 17:19
If the elemental-register service fails to register on a
reboot after install, reset or upgrade the checker will
cause a reboot to a fallback OS.

Signed-off-by: David Cassany <[email protected]>
Member

@fgiudici fgiudici left a comment


Well, the elemental-register sync-up after upgrade is mandatory IMO,
LGTM! Thanks David

Contributor

@frelon frelon left a comment


LGTM!

@davidcassany davidcassany merged commit c01a9a1 into rancher:main Mar 26, 2024
16 of 19 checks passed
@davidcassany davidcassany deleted the boot_assessment_checker branch March 26, 2024 08:48
Development

Successfully merging this pull request may close these issues.

Boot assessment must wait for network
3 participants