feat (cluster): [day2-ops] image update configuration node-level only #405

ferantivero · 2024-03-11T23:04:41Z

surgical changes to:

- Move from fully automatic updates (AKS version updates + nodes) to node-only updates with weekly security patches.
- Remove verbosity from the docs, cleaning up some terminology and ideas.

ckittel · 2024-03-12T13:26:25Z

07-bootstrap-validation.md


   > The node update phase of the cluster’s lifecycle belongs to day2 operations. Cluster operations will regularly update node images for two main reasons: 1) to update the Kubernetes cluster version, and 2) to keep up with node-level OS updates. A new AKS release introduces new features, such as addons and new Kubernetes versions, while new AKS node images bring changes at the OS level. Both types of releases adhere to Azure Safe Deployment Practices for rollout across all regions. For more information, please refer to [How to use the release tracker](https://learn.microsoft.com/azure/aks/release-tracker#how-to-use-the-release-tracker). Additionally, cluster operations aim to stay updated with supported Kubernetes versions for Service Level Agreement (SLA) compliance and to avoid accumulating updates, as version updates cannot be skipped at will. For more details, please see [Kubernetes version upgrades](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#kubernetes-version-upgrades).

-   > When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customer node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads.
+   > When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane which is what we recommmend for Kubernetes version updates. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customer node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads.


When a new update becomes available

Please be specific; we have at least three types of updates possible in AKS (not including add-on/extension updates).

Thanks for the feedback @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and is more specific about terms. Regarding the three types, we are now keeping those concepts for the RA.

done | addressed from abe4669
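
For reference, the `33%` max surge mentioned in the paragraph under review can be translated into a node count. This is a minimal Python sketch, assuming fractional results round up to a whole node (my understanding of how AKS treats percentage surge values, not something stated in this PR). `surge_node_count` is a hypothetical helper, not part of any AKS tooling.

```python
def surge_node_count(node_count: int, max_surge: str) -> int:
    """Return how many extra nodes a max surge setting adds during an upgrade.

    max_surge is either a percentage such as "33%" or an absolute count such as "1".
    """
    if max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        # Assumption: a fractional node count rounds up to the next whole node.
        # Integer ceiling division avoids floating-point rounding surprises.
        return (node_count * percent + 99) // 100
    # An absolute value is used as-is.
    return int(max_surge)
```

Under that rounding assumption, a six-node pool with a `33%` surge upgrades two nodes at a time instead of the default one.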


ckittel · 2024-03-12T13:31:32Z

07-bootstrap-validation.md

+> - By default, the OS-level updates maintenance window is scheduled on a daily cadence. This is because the OS channel is configured with `SecurityPatch`, where a new update can be shipped when available.
+> - If you choose the `NodeImage` channel, consider changing the maintenance window to weekly since updates get shipped on a weekly cadence.


I'd actually show the maintenance window being on Tue, Thurs, and the weekend. Just because you COULD get security updates daily doesn't mean you will nor does it mean your maintenance window MUST align with that possibility. It would be good to show that the cluster operator still has control here.

@ckittel auto-upgrade channels won't allow us to configure two different days of the week as requested. We are now moving to a weekly maintenance window, every Tuesday at 9 PM.

The weekly design decision is based on the official docs' recommendation for the NodeImage channel. Open to discussing alternatives further, as I also think the two-day catch-up is a great pattern.

partially done | attempted to address in abe4669, a5257c2 and 25d471b
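
To illustrate the point that the operator keeps control of timing even when updates ship more often, here is a small Python sketch that computes when the next weekly window opens. The window values (Tuesday, 21:00, every week) come from this thread; the dictionary keys and the `next_window_start` helper are hypothetical and do not mirror the AKS API schema. Time zone handling is left out, and AKS maintenance remains best-effort, so this shows only the scheduled start.

```python
from datetime import datetime, timedelta

# Hypothetical description of the weekly window discussed above.
# Python's datetime.weekday() uses Monday == 0, so Tuesday is 1.
NODE_IMAGE_WINDOW = {"day_of_week": 1, "start_hour": 21, "interval_weeks": 1}


def next_window_start(now: datetime, window: dict = NODE_IMAGE_WINDOW) -> datetime:
    """Return the first scheduled window start strictly after `now`."""
    days_ahead = (window["day_of_week"] - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=window["start_hour"], minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # Today's window has already opened; wait for the next interval.
        candidate += timedelta(weeks=window["interval_weeks"])
    return candidate
```

For example, an image published on Thursday, March 14, 2024 would wait until `next_window_start(datetime(2024, 3, 14, 12, 0))`, which is Tuesday, March 19 at 21:00.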


ckittel · 2024-03-12T13:32:58Z

07-bootstrap-validation.md

 >
 > Maintenance Operations:
 > - Keep in mind that performing maintenance operations is considered best-effort. They are not guaranteed to occur within a specific window.
-> - While it’s not strictly recommended, if you require greater control, consider manually updating your cluster.
+> - While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster.


While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster.

Which update are you suggesting they consider? It's not clear from this context: OS, node image, cluster, extensions/add-ons?

Thanks for the feedback @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and is more specific about terms.

The updates they should consider are AKS node-only updates (these cover security, kernel, and other OS-related and node-related updates).

done | addressed from abe4669

ckittel · 2024-03-12T13:35:06Z

07-bootstrap-validation.md

   }
   ```

-   > This cluster now receives weekly updates for both the Operating System (OS) and Kubernetes. For workloads that need to always run the most secure OS version, you can opt-in for regular updates by selecting the `SecurityPatch` channel.
+   > This cluster now receives daily updates for the Operating System (OS) security patches and leave that up to a customer to perform Kubernetes updates (after testing).

   > The node update phase of the cluster’s lifecycle belongs to day2 operations. Cluster operations will regularly update node images for two main reasons: 1) to update the Kubernetes cluster version, and 2) to keep up with node-level OS updates. A new AKS release introduces new features, such as addons and new Kubernetes versions, while new AKS node images bring changes at the OS level. Both types of releases adhere to Azure Safe Deployment Practices for rollout across all regions. For more information, please refer to [How to use the release tracker](https://learn.microsoft.com/azure/aks/release-tracker#how-to-use-the-release-tracker). Additionally, cluster operations aim to stay updated with supported Kubernetes versions for Service Level Agreement (SLA) compliance and to avoid accumulating updates, as version updates cannot be skipped at will. For more details, please see [Kubernetes version upgrades](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#kubernetes-version-upgrades).


Why does this section call node image updates AKS version updates? Can you please bring more distinction and clarity here? Please keep AKS version updates a completely separate conversation than in-cluster, node image / OS / add-on/extension updates.

Sure thing. For this distinction, I'm sharing the following understanding so we can fix this section and others.

AKS version updates are the ones users can find at https://github.com/Azure/AKS/releases, and these releases include:

- "k8s cluster version updates"
- "node images"
- addons (and more).

I plan to start using the understanding above, combined with the following naming convention, to align with the PG docs (I'm not listing every term in use, just the more relevant ones we might consider using in our docs), unless you think otherwise (open to changing the terminology as desired):

Automatically upgrade AKS node images (interchangeably called by its feature name, "auto-upgrade node OS images") | Manually upgrade AKS node images -> this is a Linux or Windows OS upgrade (AKSUbuntu-1604-2020.10.08 -> AKSUbuntu-1604-2020.10.28) that can interchangeably be called "node-level OS <security> updates", "node OS image <auto->upgrades", "node OS <auto->upgrades", or simply "node OS <automatic> <security> updates". Automatic updates are configured from the "node OS auto-upgrade channel".

Automatically upgrade AKS cluster (interchangeably called by its feature name, "cluster auto-upgrade") | Manually upgrade AKS cluster -> this is a "Kubernetes cluster upgrade", "Upgrade to the latest Kubernetes version", "Kubernetes patch or minor version updates", or "new features or patches from upstream Kubernetes" (e.g. an upgrade from 1.14.x to 1.15.x). Automatic updates are configured from the "<cluster> auto-upgrade channel".

Lastly, maintenance can be scheduled at a finely controlled cadence of your choice by creating a "<Node OS> planned maintenance window".

Please don't hesitate to simplify these terms to the ones we'd rather use in our docs, change or add new ones, or simply allow them all to be used interchangeably.

@ckittel as discussed offline, we are now using the following terminology where possible, and remain open to further discussion:

Node image updates: these include updates for security, the kernel, and the rest of the node-related components. In other words, everything node-related, not just the OS.

AKS cluster version update: the update of the Kubernetes version of a managed cluster running on AKS (e.g. 1.14.x to 1.15.x).

done | addressed from abe4669
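
To make the two terms concrete, the sketch below models them as two independent settings and checks for the "node-level only" posture this PR targets. It is a hedged illustration: `UpdatePosture` and `is_node_only` are hypothetical names, and the channel strings (`none` for the cluster channel, `NodeImage` and `SecurityPatch` for the node OS channel) are the values I understand AKS to accept, so verify them against the current AKS documentation.

```python
from dataclasses import dataclass

# Assumption: these are the node OS channels in which AKS itself manages node image updates.
AKS_MANAGED_NODE_CHANNELS = {"NodeImage", "SecurityPatch"}


@dataclass(frozen=True)
class UpdatePosture:
    cluster_upgrade_channel: str  # AKS cluster version update (Kubernetes version)
    node_os_upgrade_channel: str  # node image updates (security, kernel, node-related)

    def is_node_only(self) -> bool:
        """True when only node images update automatically and the Kubernetes version stays manual."""
        return (
            self.cluster_upgrade_channel == "none"
            and self.node_os_upgrade_channel in AKS_MANAGED_NODE_CHANNELS
        )
```

With this model, `UpdatePosture("none", "NodeImage").is_node_only()` is true, while a posture that also enables a cluster channel such as `stable` is not node-only.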


ferantivero added 2 commits March 11, 2024 20:02

change update strategy to os-only with security patches

aae3fe6

Address PR Feedback: reinstate preview feat registration

26643b6

ckittel requested changes Mar 12, 2024


ferantivero added 2 commits March 14, 2024 18:45

Address PR Feedback: perform weekly node image updates and leave sec …

a5257c2

…patches channel for regulated

Address PR Feedback: reserve guidance/notes for our RA instead

abe4669

ferantivero force-pushed the feature/192733_day2-node-updates-guidance-secpatch branch from 0b20930 to abe4669 Compare March 14, 2024 21:46

bug fix: move from thur to tue

25d471b

ferantivero marked this pull request as ready for review March 14, 2024 22:06

ferantivero requested a review from ckittel March 14, 2024 22:06

Address PR Feedback: streamline step docs removing unnecesary steps

868bce8

ferantivero changed the title ~~feat (cluster): [day2-ops] node update configuration OS-level only~~ feat (cluster): [day2-ops] image update configuration node-level only Mar 15, 2024

ferantivero merged commit 9754715 into feature/192733_day2-node-updates-guidance Mar 20, 2024
1 check passed

ferantivero deleted the feature/192733_day2-node-updates-guidance-secpatch branch March 20, 2024 16:17

ferantivero added a commit that referenced this pull request Mar 20, 2024

feat (cluster): [day2-ops] image update configuration node-level only (…

e99adff

…#405)
