
Control which nodes restart dockerd on MCR/Engine Upgrades via Launchpad #530

Open
james-nesbitt opened this issue Dec 9, 2024 · 11 comments

Comments

@james-nesbitt
Collaborator

A useful enhancement for launchpad would be to add functionality so that, during an MCR/Engine upgrade, users could specify on which nodes a restart of dockerd should occur. In some use cases, large clusters have their upgrades performed in batches of workers so that pods can be shifted around to avoid impact. The idea is that a user could specify the specific host(s) on which dockerd restarts should occur, instead of launchpad restarting dockerd on all hosts one by one in a linear fashion. This avoids unnecessary impact and disruption during an upgrade.

As a suggestion, perhaps a "don't restart" flag for launchpad, which would tell launchpad not to do anything during the "Restart MCR" phase. Thank you!

@abrainerd: this was migrated from the other repo

@ebourgeois

I really like the idea of "don't restart any", allowing users to restart as they see fit.

@james-nesbitt
Collaborator Author

There are two issues with preventing restarts:

  1. launchpad no longer causes the restarts, except where there is a change in the MCR daemon.json for a host (we did have a hypothesis that launchpad is detecting changes when there are none, but that needs to be verified). It is the packaging and process management systems (like systemd) which restart MCR now.
  2. if the MCR daemon, containerd or runc components are upgraded without any restarts, then the system will be in an unpredictable state, which could cause unknown problems and perhaps confuse MKE.

We have a couple of options:

  1. allow a staged upgrade of workers, allowing a launchpad run to limit worker upgrade to certain nodes only (managers would still be upgraded when needed)
  2. try to trick systemd into not restarting dockerd on the workers - feasibility unknown

@james-nesbitt
Collaborator Author

@abrainerd and I had a good chat about requirements for this. I am going to submit a "definition of problem" here, and then see if I can propose a UI for the solution.

@james-nesbitt
Collaborator Author

james-nesbitt commented Feb 12, 2025

Here is my proposal for node control during upgrade:

Strategic Goal: a system engineer who runs an upgrade can maintain the stability of a workload during the upgrade

Notes:

  1. it will most likely require a batched worker upgrade, so that specific swarm/kube nodes can be rotated/upgraded as a group, allowing a workload to be protected

Options:

  1. The engineer runs launchpad repeatedly, specifying nodes to be included/ignored (most likely ignored, which would allow launchpad to make the batching transparently optional)
  2. The engineer runs launchpad specifying a sequence of node groups which should be rotated

Considerations:

  1. between batches, engineers will want to perform manual operations such as drains/audits (it will not be a good idea to automate moving from one batch to the next)
  2. it makes more sense to give the engineer control of the nodes, as opposed to having the engineer specify swarm/kube components to target for rotation - the engineer gets complete control, and launchpad doesn't need to perform complicated discovery operations.

Because of this, option 2 seems like a bad approach, but option 1 should work.

I will use a second comment to go over what the first option would look like.

@james-nesbitt
Collaborator Author

Batching node MCR upgrade proposal:

Tactic: a system engineer can control node upgrade by running launchpad repeatedly, specifying nodes that should be ignored in each batch, by using a change in the launchpad yaml host declarations.

Risks:

  1. upgrading MCR managers in batches is likely a bad idea, as MCR may not like being at mixed versions (I'll confirm with the engine team on this). We may need to exempt managers from the batching concept (please comment on whether this is acceptable)

Details:

  1. an optional flag will be added to the launchpad host declaration to indicate that the host can be skipped (a rough sketch of what this could look like follows this list)
  2. the upgrade_mcr phase, when collecting machines for the existing worker batch collection, will ignore machines that have the flag.
  3. nodes that were upgraded in the previous run should be skipped automatically (although the upgrade script will still run) - meaning no upgrade and no MCR restart (this needs to be verified; otherwise they will need to be explicitly flagged as ignored)
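
For illustration, here is a minimal sketch of what such a host declaration could look like in launchpad.yaml. The flag name `mcrUpgradeSkip` is hypothetical and only stands in for whatever name the implementation settles on; the apiVersion/kind values are indicative and may differ by release:

```yaml
apiVersion: launchpad.mirantis.com/mke/v1.4
kind: mke
spec:
  hosts:
    - role: manager              # managers would still be upgraded when needed
      ssh:
        address: 10.0.0.10
        user: ubuntu
    - role: worker               # batch 1: upgraded in this run
      ssh:
        address: 10.0.0.20
        user: ubuntu
    - role: worker
      ssh:
        address: 10.0.0.21
        user: ubuntu
      mcrUpgradeSkip: true       # hypothetical flag: leave this host out of the upgrade_mcr phase
```

The engineer would then run launchpad repeatedly, moving the flag between batches of workers and performing any drains/audits manually between runs.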

@abrainerd

@ebourgeois - at your convenience, looking for your feedback on this plan, as discussed. Thank you!

@ebourgeois

Hi @james-nesbitt and @abrainerd,

I am good with this solution; it allows us the option of running anything we need in between restarts. @james-nesbitt, your details and risks are spot on, and I am good with this approach.

@james-nesbitt
Collaborator Author

Effort is proceeding.

I have the internal documents set up to line this up for a release, but I will try to get this moved to GitHub (project). My goal is to get a technical preview ready for review within 2 weeks, with a full release schedule lined up for a bit later (full integration testing capacity is limited at the moment).

@james-nesbitt
Collaborator Author

#558 is a first attempt. It is a small change, but it required a lot of evaluation of the behaviour of the installed components. The change allows skipping certain hosts when upgrading MCR.

The change is still a draft because I need to evaluate what workload risks there are when the MKE version upgrade updates the kubernetes components - if the MKE upgrade destabilizes the workload en masse, then this change may not be enough to allow host upgrade batching.

@james-nesbitt
Collaborator Author

I will do the PR fixups (linting, tests, etc.) and then probably put out a PR-specific release.

@james-nesbitt
Collaborator Author

james-nesbitt commented Mar 3, 2025

The PR has a tagged release for testing: https://github.com/Mirantis/launchpad/releases/tag/v1.5.11-530-tp2

Quick note: this is our first release since the repo restructure, so it is our first chance to properly test our release process.
