feat (cluster): [day2-ops] image update configuration node-level only #405

ferantivero
Contributor

@ferantivero ferantivero commented Mar 11, 2024

surgical changes to:

  • move from fully automatic updates (AKS version updates + nodes) to node-only updates with weekly security patches.
  • remove verbosity from the docs, cleaning up some terminology and ideas.


> The node update phase of the cluster’s lifecycle belongs to day2 operations. Cluster operations will regularly update node images for two main reasons: 1) to update the Kubernetes cluster version, and 2) to keep up with node-level OS updates. A new AKS release introduces new features, such as addons and new Kubernetes versions, while new AKS node images bring changes at the OS level. Both types of releases adhere to Azure Safe Deployment Practices for rollout across all regions. For more information, please refer to [How to use the release tracker](https://learn.microsoft.com/azure/aks/release-tracker#how-to-use-the-release-tracker). Additionally, cluster operations aim to stay updated with supported Kubernetes versions for Service Level Agreement (SLA) compliance and to avoid accumulating updates, as version updates cannot be skipped at will. For more details, please see [Kubernetes version upgrades](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#kubernetes-version-upgrades).

> When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customize node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads.
> When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane, which is what we recommend for Kubernetes version updates. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customize node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads.
Member

When a new update becomes available

Please be specific; we have at least three types of updates possible in AKS (not including add-on/extension updates).

Contributor Author

@ferantivero ferantivero Mar 14, 2024

Thanks for the feedback, @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and it is more specific about terms. Regarding the three types, we are now preserving those concepts for the RA.

done | addressed in abe4669
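
As a rough illustration of the `33%` max surge guidance in the paragraph under review, the sketch below estimates how many extra nodes a percentage-based surge adds to a node pool. The helper name `surge_node_count` is hypothetical, and the round-up behavior is an assumption about how AKS treats fractional nodes; check the surge upgrade documentation before relying on it.

```python
def surge_node_count(pool_size: int, max_surge_percent: int = 33) -> int:
    """Estimate the extra nodes added during an upgrade (hypothetical helper)."""
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if not 1 <= max_surge_percent <= 100:
        raise ValueError("max_surge_percent must be between 1 and 100")
    # Assumption: a fractional node count is rounded up to the next whole node.
    # Integer arithmetic avoids floating-point rounding surprises.
    return (pool_size * max_surge_percent + 99) // 100


if __name__ == "__main__":
    for size in (3, 6, 9):
        print(f"{size} nodes at 33% surge -> {surge_node_count(size)} surge node(s)")
```

Under that assumption, a six-node pool would upgrade two nodes at a time instead of the default of one, which is the trade-off between upgrade speed and disruption that the quoted paragraph describes.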

07-bootstrap-validation.md (outdated)
Comment on lines 62 to 63
> - By default, the OS-level updates maintenance window is scheduled on a daily cadence. This is because the OS channel is configured with `SecurityPatch`, where a new update can be shipped when available.
> - If you choose the `NodeImage` channel, consider changing the maintenance window to weekly since updates get shipped on a weekly cadence.
Member

I'd actually show the maintenance window being on Tue, Thurs, and the weekend. Just because you COULD get security updates daily doesn't mean you will nor does it mean your maintenance window MUST align with that possibility. It would be good to show that the cluster operator still has control here.

Contributor Author

@ferantivero ferantivero Mar 14, 2024

@ckittel, auto-upgrade channels won't allow us to configure two different days of the week as requested. We are now moving to a weekly maintenance window every Tuesday at 9 PM.

The weekly design decision is based on the official docs' recommendation for the NodeImage channel. I'm open to discussing alternatives, as I also think the two-day catch-up is a great pattern.

partially done | attempted to address in abe4669, a5257c2, and 25d471b
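
To make the "every Tuesday at 9 PM" cadence concrete, here is a minimal sketch that computes when the next weekly window would open from a given moment. The function name `next_weekly_window` is hypothetical, naive datetimes are assumed (no time zone handling), and, as the PR notes elsewhere, maintenance is best-effort, so this is only the scheduled start and not a guarantee.

```python
from datetime import datetime, timedelta

TUESDAY = 1  # datetime.weekday() numbers Monday as 0


def next_weekly_window(now: datetime, weekday: int = TUESDAY, start_hour: int = 21) -> datetime:
    """Return the start of the next weekly window strictly after `now` (hypothetical helper)."""
    # Start from today's date at the window's start hour.
    candidate = now.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    # Move forward to the requested weekday (0 to 6 days ahead).
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If that moment has already passed (or is exactly now), use next week's window.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_weekly_window(datetime(2024, 3, 14, 12, 0)))
```

A cluster operator could use a calculation like this to estimate the longest wait between a patch being published and the next scheduled window, which is the concern behind the two-days-per-week suggestion.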

07-bootstrap-validation.md (outdated)
>
> Maintenance Operations:
> - Keep in mind that performing maintenance operations is considered best-effort. They are not guaranteed to occur within a specific window.
> - While it’s not strictly recommended, if you require greater control, consider manually updating your cluster.
> - While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster.
Member

While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster.

Which update are you suggesting they consider, it's not clear from this context. OS, node image, cluster, extensions/add-ons?

Contributor Author

@ferantivero ferantivero Mar 14, 2024

Thanks for the feedback, @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and it is more specific about terms.

The updates they should consider are AKS node-only updates (these cover security, kernel, and other OS-related and node-related updates).

done | addressed in abe4669

}
```

> This cluster now receives weekly updates for both the Operating System (OS) and Kubernetes. For workloads that need to always run the most secure OS version, you can opt-in for regular updates by selecting the `SecurityPatch` channel.
> This cluster now receives daily updates for Operating System (OS) security patches and leaves it up to the customer to perform Kubernetes updates (after testing).

> The node update phase of the cluster’s lifecycle belongs to day2 operations. Cluster operations will regularly update node images for two main reasons: 1) to update the Kubernetes cluster version, and 2) to keep up with node-level OS updates. A new AKS release introduces new features, such as addons and new Kubernetes versions, while new AKS node images bring changes at the OS level. Both types of releases adhere to Azure Safe Deployment Practices for rollout across all regions. For more information, please refer to [How to use the release tracker](https://learn.microsoft.com/azure/aks/release-tracker#how-to-use-the-release-tracker). Additionally, cluster operations aim to stay updated with supported Kubernetes versions for Service Level Agreement (SLA) compliance and to avoid accumulating updates, as version updates cannot be skipped at will. For more details, please see [Kubernetes version upgrades](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#kubernetes-version-upgrades).
Member

Why does this section call node image updates AKS version updates? Can you please bring more distinction and clarity here? Please keep AKS version updates a completely separate conversation than in-cluster, node image / OS / add-on/extension updates.

Contributor Author

@ferantivero ferantivero Mar 12, 2024

Sure thing. For this distinction, I'm sharing the following understanding so we can fix this section and others.

AKS version updates are the ones users can find from here https://github.com/Azure/AKS/releases and these releases include:

  • "k8s cluster version updates"
  • "node images"
  • addons (and more).

I plan to start using the understanding above, combined with the following naming convention, to stay aligned with the PG docs (I'm not listing every term in use, just the most relevant ones we might consider for our docs), unless you think otherwise (open to changing the terminology as desired):

  • Automatically upgrade AKS node images (interchangeably called by its feature name "auto-upgrade node OS images") | Manually upgrade AKS node images -> this is a Linux or Windows OS upgrade (AKSUbuntu-1604-2020.10.08 -> AKSUbuntu-1604-2020.10.28) that can interchangeably be called "node-level OS <security> updates" or "node OS image <auto->upgrades" or "node OS <auto->upgrades" or simply "node OS <automatic> <security> updates". Automatic updates are configured from the "node OS auto-upgrade channel".
  • Automatically upgrade AKS cluster (interchangeably called by its feature name "cluster auto-upgrade") | Manually upgrade AKS cluster -> this is a "Kubernetes cluster upgrade" or "Upgrade to the latest Kubernetes version" or "Kubernetes patch or minor version updates" or "new features or patches from upstream Kubernetes" (e.g. an upgrade from 1.14.x to 1.15.x). Automatic updates are configured from the "<cluster> auto-upgrade channel".

Lastly, maintenance can be scheduled on a finely controlled cadence of your choice by creating a "<Node OS> planned maintenance window".

Please don't hesitate to simplify all these terms to the ones we'd rather use in our docs, change or add new ones, or simply allow using them all interchangeably.

Contributor Author

@ckittel, as discussed offline, we are now using the following terminology where possible, and we're always open to discussing it further as needed:

  • Node image updates: these include security, kernel, and all other node-related updates. In other words, everything node-related, not just the OS.
  • AKS cluster version update: the update of the Kubernetes version of a managed cluster running on AKS (e.g. 1.14.x to 1.15.x).

done | addressed in abe4669
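
The two terms agreed on above line up with the two separate auto-upgrade channels discussed in this PR. The sketch below models that split as plain Python data. The key names `upgradeChannel` and `nodeOSUpgradeChannel` and the `none` and `SecurityPatch` values are assumptions based on the AKS auto-upgrade profile and on the `SecurityPatch` channel mentioned in the review thread; `is_node_only` is a hypothetical helper, not something from the repository.

```python
# Assumed shape of an auto-upgrade profile; not copied from this repository.
AUTO_UPGRADE_PROFILE = {
    # AKS cluster version update (e.g. 1.14.x to 1.15.x): left manual in this PR.
    "upgradeChannel": "none",
    # Node image updates (security, kernel, other node-related changes): automatic.
    "nodeOSUpgradeChannel": "SecurityPatch",
}


def is_node_only(profile: dict) -> bool:
    """True when only node image updates are automated (hypothetical helper)."""
    cluster_is_manual = profile.get("upgradeChannel", "none").lower() == "none"
    node_is_automated = profile.get("nodeOSUpgradeChannel", "None").lower() != "none"
    return cluster_is_manual and node_is_automated


if __name__ == "__main__":
    print("node-only automatic updates:", is_node_only(AUTO_UPGRADE_PROFILE))
```

Keeping the two settings as separate keys mirrors the reviewer's request to treat AKS cluster version updates as a separate conversation from node image updates.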

@ferantivero ferantivero force-pushed the feature/192733_day2-node-updates-guidance-secpatch branch from 0b20930 to abe4669 Compare March 14, 2024 21:46
@ferantivero ferantivero marked this pull request as ready for review March 14, 2024 22:06
@ferantivero ferantivero changed the title from "feat (cluster): [day2-ops] node update configuration OS-level only" to "feat (cluster): [day2-ops] image update configuration node-level only" Mar 15, 2024
@ferantivero ferantivero merged commit 9754715 into feature/192733_day2-node-updates-guidance Mar 20, 2024
1 check passed
@ferantivero ferantivero deleted the feature/192733_day2-node-updates-guidance-secpatch branch March 20, 2024 16:17