feat (cluster): [day2-ops] image update configuration node-level only #405
Changes from 1 commit
aae3fe6
26643b6
a5257c2
abe4669
25d471b
868bce8
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,16 +34,16 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their | |
|
||
```outcome | ||
{ | ||
"nodeOsUpgradeChannel": "NodeImage", | ||
"upgradeChannel": "node-image" | ||
"nodeOsUpgradeChannel": "SecurityPatch", | ||
"upgradeChannel": "none" | ||
} | ||
``` | ||
|
||
> This cluster now receives weekly updates for both the Operating System (OS) and Kubernetes. For workloads that need to always run the most secure OS version, you can opt-in for regular updates by selecting the `SecurityPatch` channel. | ||
> This cluster now receives daily Operating System (OS) security patches and leaves Kubernetes version updates for the customer to perform (after testing). | ||
|
||
> The node update phase of the cluster’s lifecycle belongs to day2 operations. Cluster operations will regularly update node images for two main reasons: 1) to update the Kubernetes cluster version, and 2) to keep up with node-level OS updates. A new AKS release introduces new features, such as addons and new Kubernetes versions, while new AKS node images bring changes at the OS level. Both types of releases adhere to Azure Safe Deployment Practices for rollout across all regions. For more information, please refer to [How to use the release tracker](https://learn.microsoft.com/azure/aks/release-tracker#how-to-use-the-release-tracker). Additionally, cluster operations aim to stay updated with supported Kubernetes versions for Service Level Agreement (SLA) compliance and to avoid accumulating updates, as version updates cannot be skipped at will. For more details, please see [Kubernetes version upgrades](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#kubernetes-version-upgrades). | ||
|
||
> When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customer node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads. | ||
> When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane, which is what we recommend for Kubernetes version updates. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customize node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads. | ||
Please be specific, we have at least three types of updates possible in AKS (not including add-on/extension updates).
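To make the max surge guidance in the paragraph under review more concrete, here is a minimal sketch that estimates how many extra nodes a percentage surge would add during an upgrade. It assumes a fractional result is rounded up to the next whole node, which should be checked against the AKS surge documentation. The function name `surge_nodes` is hypothetical and not part of any Azure SDK or CLI.

```python
def surge_nodes(node_count: int, max_surge_percent: int) -> int:
    """Estimate the extra nodes added during an upgrade for a percentage max surge.

    Assumption: a fractional node count is rounded up to the next whole node.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if not 1 <= max_surge_percent <= 100:
        raise ValueError("max_surge_percent must be between 1 and 100")
    # Ceiling division in integer arithmetic avoids floating-point rounding issues.
    return (node_count * max_surge_percent + 99) // 100


# Example: the 6-node, 33% case mentioned in the paragraph above.
extra = surge_nodes(6, 33)
print(f"A 6-node pool with a 33% surge adds about {extra} node(s) at a time")
```

Under that rounding assumption, a small pool still surges by at least one node, so the percentage mostly matters for larger pools.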
||
|
||
1. See your maintenance configuration | ||
|
||
|
@@ -56,18 +56,19 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their | |
> Mindful Timing for Upgrades: | ||
> - Be mindful of when upgrades should occur. If you have overlapping maintenance windows, AKS will determine the running order. | ||
> - To avoid conflicts, leave at least 24 hours between maintenance window configurations. The timing will depend on the number of nodes in your specific cluster and the duration required for upgrades. | ||
> - The current configuration should not cause a conflict, since Kubernetes version updates are applied manually when the customer sees fit. | ||
ferantivero marked this conversation as resolved.
|
||
> | ||
> OS-Level Updates: | ||
> - By default, the OS-level updates maintenance window is scheduled on a weekly cadence. This is because the OS channel is configured with `NodeImage`, where a new node image is shipped every week. | ||
> - If you choose the `SecurityPatch` channel, consider changing the maintenance window to daily for more frequent updates. | ||
> - By default, the OS-level updates maintenance window is scheduled on a daily cadence. This is because the OS channel is configured with `SecurityPatch`, where a new update can be shipped when available. | ||
> - If you choose the `NodeImage` channel, consider changing the maintenance window to weekly since updates get shipped on a weekly cadence. | ||
I'd actually show the maintenance window being on Tue, Thurs, and the weekend. Just because you COULD get security updates daily doesn't mean you will nor does it mean your maintenance window MUST align with that possibility. It would be good to show that the cluster operator still has control here.
@ckittel auto-upgrade channels won't allow us to configure two different days of the week as requested. We are now moving over to a weekly maintenance window every Tue at 9PM. The weekly design decision is based on official docs recommendations for
partially done | attempted to address from abe4669, a5257c2 and 25d471b
||
> | ||
> Kubernetes Version Management: | ||
> - To stay current with the latest Kubernetes version, a monthly cadence is generally sufficient. However, you can adjust this based on your specific needs. | ||
> - For more regular updates, configure your cluster to upgrade every two weeks. | ||
> - To ensure the Kubernetes version is a supported one, a monthly update is generally sufficient. However, it is recommended to track the AKS releases and adjust accordingly. | ||
> - Being proactive and keeping the Kubernetes version current is the best practice. | ||
> | ||
> Maintenance Operations: | ||
> - Keep in mind that performing maintenance operations is considered best-effort. They are not guaranteed to occur within a specific window. | ||
> - While it’s not strictly recommended, if you require greater control, consider manually updating your cluster. | ||
> - While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster. | ||
ferantivero marked this conversation as resolved.
Which update are you suggesting they consider? It's not clear from this context: OS, node image, cluster, extensions/add-ons?
Thanks for the feedback @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and be more specific about terms. The updates they should consider are AKS node-only updates (security, kernel, and other OS-related and node-related updates).
done | addressed from abe4669
||
> | ||
> Remember that these guidelines provide flexibility, allowing you to strike a balance between timely updates and operational control. Choose the approach that aligns best with your organization’s requirements. | ||
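The 24-hour spacing advice under "Mindful Timing for Upgrades" can be checked mechanically. The sketch below is only an illustration: `WeeklyWindow`, `min_gap_hours`, and `has_recommended_gap` are hypothetical names, not Azure SDK types. The Tuesday 9PM start comes from the review thread, while the 4-hour duration and the second window are assumptions made for the example.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 7 * 24
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


@dataclass(frozen=True)
class WeeklyWindow:
    """A weekly maintenance window (hypothetical model, not an Azure SDK type)."""
    day: str
    start_hour: int      # 0-23
    duration_hours: int  # assumed window length

    @property
    def start(self) -> int:
        # Hours elapsed since Monday 00:00.
        return DAYS.index(self.day) * 24 + self.start_hour


def min_gap_hours(a: WeeklyWindow, b: WeeklyWindow) -> int:
    """Smallest idle gap between two weekly windows; 0 if they overlap."""
    a_to_b = (b.start - a.start) % HOURS_PER_WEEK  # a's start to b's next start
    b_to_a = (a.start - b.start) % HOURS_PER_WEEK  # b's start to a's next start
    if a_to_b < a.duration_hours or b_to_a < b.duration_hours:
        return 0  # one window starts while the other is still open
    return min(a_to_b - a.duration_hours, b_to_a - b.duration_hours)


def has_recommended_gap(a: WeeklyWindow, b: WeeklyWindow, required_hours: int = 24) -> bool:
    """True when the windows are at least `required_hours` apart."""
    return min_gap_hours(a, b) >= required_hours


node_os = WeeklyWindow("Tuesday", 21, 4)       # Tue 9PM, duration assumed
k8s_next_day = WeeklyWindow("Wednesday", 21, 4)  # hypothetical second window
k8s_later = WeeklyWindow("Thursday", 21, 4)      # hypothetical second window

print(has_recommended_gap(node_os, k8s_next_day))  # only 20 idle hours between them
print(has_recommended_gap(node_os, k8s_later))     # 44 idle hours between them
```

With Kubernetes version updates applied manually, as this PR proposes, only the node OS window exists, so the check matters mainly if a second auto-upgrade window is added later.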
|
||
|
Why does this section call node image updates AKS version updates? Can you please bring more distinction and clarity here? Please keep AKS version updates a completely separate conversation than in-cluster, node image / OS / add-on/extension updates.
Sure thing. For this distinction I'm sharing the following understanding so we can fix this section and others.
AKS version updates are the ones users can find from here https://github.com/Azure/AKS/releases and these releases include:
I plan to start using the ^ understanding, combined with the following naming convention to be aligned with the PG docs (I'm not listing all the terms in use, just the more relevant ones we might consider using in our docs), unless you consider otherwise (open to changing the terminology as desired):
Lastly, maintenance cadence can be scheduled to a finely controlled cadence of your choice by creating a "<Node OS> planned maintenance window".
Please don't hesitate to simplify all these terms to the ones we'd rather use in our docs, change or add new ones, or simply allow them all to be used interchangeably.
@ckittel as discussed offline we are now using the following terminology when possible and always open to keep discussing as much as needed:
done | addressed from abe4669