feat (cluster): [day2-ops] image update configuration node-level only #405

Merged
Changes from 1 commit
19 changes: 10 additions & 9 deletions 07-bootstrap-validation.md
@@ -34,16 +34,16 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their

```outcome
 {
-  "nodeOsUpgradeChannel": "NodeImage",
-  "upgradeChannel": "node-image"
+  "nodeOsUpgradeChannel": "SecurityPatch",
+  "upgradeChannel": "none"
 }
```

-> This cluster now receives weekly updates for both the Operating System (OS) and Kubernetes. For workloads that need to always run the most secure OS version, you can opt-in for regular updates by selecting the `SecurityPatch` channel.
+> This cluster now receives daily Operating System (OS) security patches. Kubernetes version updates are left to the customer to perform manually, after testing.

> The node update phase of the cluster’s lifecycle belongs to day2 operations. Cluster operations will regularly update node images for two main reasons: 1) to update the Kubernetes cluster version, and 2) to keep up with node-level OS updates. A new AKS release introduces new features, such as addons and new Kubernetes versions, while new AKS node images bring changes at the OS level. Both types of releases adhere to Azure Safe Deployment Practices for rollout across all regions. For more information, please refer to [How to use the release tracker](https://learn.microsoft.com/azure/aks/release-tracker#how-to-use-the-release-tracker). Additionally, cluster operations aim to stay updated with supported Kubernetes versions for Service Level Agreement (SLA) compliance and to avoid accumulating updates, as version updates cannot be skipped at will. For more details, please see [Kubernetes version upgrades](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#kubernetes-version-upgrades).
Member

Why does this section call node image updates AKS version updates? Can you please bring more distinction and clarity here? Please keep AKS version updates a completely separate conversation than in-cluster, node image / OS / add-on/extension updates.

Contributor Author (@ferantivero, Mar 12, 2024)

Sure thing. To draw this distinction, I'm sharing my current understanding so we can fix this section and others.

AKS version updates are the releases listed at https://github.com/Azure/AKS/releases. These releases include:

  • "k8s cluster version updates"
  • "node images"
  • addons (and more).

I plan to use that understanding, combined with the following naming convention, to stay aligned with the PG docs. I'm not listing every term in use, just the most relevant ones we might adopt in our docs, and I'm open to changing the terminology as desired:

  • Automatically upgrade AKS node images (interchangeably called by its feature name, "auto-upgrade node OS images") | Manually upgrade AKS node images -> this is a Linux or Windows OS upgrade (AKSUbuntu-1604-2020.10.08 -> AKSUbuntu-1604-2020.10.28) that can interchangeably be called "node-level OS <security> updates", "node OS image <auto->upgrades", "node OS <auto->upgrades", or simply "node OS <automatic> <security> updates". Automatic updates are configured through the "node OS auto-upgrade channel".
  • Automatically upgrade AKS cluster (interchangeably called by its feature name, "cluster auto-upgrade") | Manually upgrade AKS cluster -> this is a "Kubernetes cluster upgrade", "Upgrade to the latest Kubernetes version", "Kubernetes patch or minor version updates", or "new features or patches from upstream Kubernetes" (i.e. an upgrade from 1.14.x to 1.15.x). Automatic updates are configured through the "<cluster> auto-upgrade channel".

Lastly, maintenance can be scheduled at a cadence of your choice by creating a "<Node OS> planned maintenance window".

Please don't hesitate to narrow these terms down to the ones we'd rather use in our docs, change or add new ones, or simply allow them all to be used interchangeably.

Contributor Author

@ckittel, as discussed offline, we are now using the following terminology where possible, and we remain open to further discussion:

  • Node image updates: these include security, kernel, and all other node-related updates. In other words, everything node-related, not just the OS.
  • AKS cluster version update: the update of the Kubernetes version of a managed cluster running on AKS (i.e. 1.14.x to 1.15.x).

done | addressed in abe4669


-> When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customer node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads.
+> When a new update becomes available, it can be manually applied for the greatest degree of control by making requests against the Azure control plane, which is what we recommend for Kubernetes version updates. Alternatively, the operations team can opt to automatically update to the latest version by configuring an update channel to follow the desired cadence. This can be combined with a planned maintenance window, one for Kubernetes version updates and another for OS-level upgrades. AKS offers two different configurable auto-upgrade channels dedicated to these update types. For more information, please refer to [Upgrade options for Azure Kubernetes Service (AKS) clusters](https://learn.microsoft.com/azure/aks/upgrade-cluster). Node pools in this AKS cluster span multiple availability zones. Therefore, it’s important to note that automatic updates are conducted based on a best-effort zone balancing in node groups. To prevent zone imbalance and increase availability, Nodes Max Surge and Pod Disruption Budget are configured in this baseline. By default, cluster nodes are updated one at a time. Max Surge can adjust the speed of a cluster upgrade. In clusters with 6+ nodes hosting disruption-sensitive workloads, a surge of up to `33%` is recommended for a safe upgrade pace. For more information, please see [Customer node surge upgrade](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#customize-node-surge-upgrade). To minimize disruption, production clusters should be configured with [node draining timeout](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-valuei) and [soak time](https://learn.microsoft.com/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-soak-time-value), taking into account the specific characteristics of their workloads.
Member

> When a new update becomes available

Please be specific; we have at least three types of updates possible in AKS (not including add-on/extension updates).

Contributor Author (@ferantivero, Mar 14, 2024)

Thanks for the feedback, @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and it is more specific about terms. Regarding the three types, we are now preserving those concepts for the RA.

done | addressed in abe4669
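
The surge, drain timeout, and soak time settings mentioned in the paragraph under review could be expressed per node pool roughly as follows. This is a minimal sketch, not the baseline's actual configuration: the `clusterName` parameter, the pool name, the VM size, and the timeout values are hypothetical, and the `upgradeSettings` property names are assumed from the AKS agent pool schema and should be verified against the API version in use.

```bicep
// Hypothetical parameter; not part of this PR.
param clusterName string

resource mc 'Microsoft.ContainerService/managedClusters@2024-01-02-preview' existing = {
  name: clusterName
}

// Illustrative user node pool with upgrade pacing settings.
resource userPool 'Microsoft.ContainerService/managedClusters/agentPools@2024-01-02-preview' = {
  parent: mc
  name: 'npuser01' // hypothetical pool name
  properties: {
    mode: 'User'
    count: 6
    vmSize: 'Standard_DS2_v2' // hypothetical size
    upgradeSettings: {
      maxSurge: '33%'              // surge pace suggested in the text for 6+ node clusters
      drainTimeoutInMinutes: 30    // assumed value; tune to how long pods need to drain
      nodeSoakDurationInMinutes: 5 // assumed value; wait after each node before continuing
    }
  }
}
```

The drain timeout and soak time depend on the workload, so the numbers above are placeholders rather than recommendations.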


1. See your maintenance configuration

@@ -56,18 +56,19 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their
> Mindful Timing for Upgrades:
> - Be mindful of when upgrades should occur. If you have overlapping maintenance windows, AKS will determine the running order.
> - To avoid conflicts, leave at least 24 hours between maintenance window configurations. The timing will depend on the number of nodes in your specific cluster and the duration required for upgrades.
+> - The current configuration should not cause a conflict, since Kubernetes version updates are applied manually when the customer sees fit.
ferantivero marked this conversation as resolved.
>
> OS-Level Updates:
-> - By default, the OS-level updates maintenance window is scheduled on a weekly cadence. This is because the OS channel is configured with `NodeImage`, where a new node image is shipped every week.
-> - If you choose the `SecurityPatch` channel, consider changing the maintenance window to daily for more frequent updates.
+> - By default, the OS-level updates maintenance window is scheduled on a daily cadence. This is because the OS channel is configured with `SecurityPatch`, where a new update ships as soon as it is available.
+> - If you choose the `NodeImage` channel, consider changing the maintenance window to weekly, since node images ship on a weekly cadence.
Member

I'd actually show the maintenance window being on Tue, Thurs, and the weekend. Just because you COULD get security updates daily doesn't mean you will nor does it mean your maintenance window MUST align with that possibility. It would be good to show that the cluster operator still has control here.

Contributor Author (@ferantivero, Mar 14, 2024)

@ckittel, auto-upgrade channels won't allow us to configure two different days of the week as requested. We are now moving to a weekly maintenance window every Tuesday at 9 PM.

The weekly design decision is based on the official docs' recommendation for the NodeImage channel. I'm open to discussing alternatives, as I also think the two-day catch-up is a great pattern.

partially done | attempted to address in abe4669, a5257c2, and 25d471b
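
A minimal sketch of the weekly window described in this reply (every Tuesday at 9 PM), written in Bicep. The `clusterName` parameter and the `nodeOsWindow` symbolic name are hypothetical. The configuration name `aksManagedNodeOSUpgradeSchedule` and the property layout are assumed from the AKS maintenance configuration schema, not taken from this PR, and should be checked against the API version in use.

```bicep
// Hypothetical parameter; not part of this PR.
param clusterName string

resource mc 'Microsoft.ContainerService/managedClusters@2024-01-02-preview' existing = {
  name: clusterName
}

// Node OS maintenance window: once a week, on Tuesday, starting at 21:00.
resource nodeOsWindow 'Microsoft.ContainerService/managedClusters/maintenanceConfigurations@2024-01-02-preview' = {
  parent: mc
  name: 'aksManagedNodeOSUpgradeSchedule' // assumed name for the node OS schedule
  properties: {
    maintenanceWindow: {
      durationHours: 12
      schedule: {
        weekly: {
          dayOfWeek: 'Tuesday'
          intervalWeeks: 1
        }
      }
      startTime: '21:00'
    }
  }
}
```

A `schedule` holds a single recurrence type, which matches the limitation noted above: one configuration cannot name two different weekdays.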

>
> Kubernetes Version Management:
-> - To stay current with the latest Kubernetes version, a monthly cadence is generally sufficient. However, you can adjust this based on your specific needs.
-> - For more regular updates, configure your cluster to upgrade every two weeks.
+> - To ensure the Kubernetes version is a supported one, a monthly update is generally sufficient. However, it is recommended to track the AKS releases and adjust accordingly.
+> - Proactively keeping the Kubernetes version current is the best practice.
>
> Maintenance Operations:
> - Keep in mind that performing maintenance operations is considered best-effort. They are not guaranteed to occur within a specific window.
-> - While it’s not strictly recommended, if you require greater control, consider manually updating your cluster.
+> - While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster.
ferantivero marked this conversation as resolved.
Member

> While it’s not recommended for OS-level updates, if you require greater control, consider manually updating your cluster.

Which update are you suggesting they consider? It's not clear from this context: OS, node image, cluster, or extensions/add-ons?

Contributor Author (@ferantivero, Mar 14, 2024)

Thanks for the feedback, @ckittel. The following commit should clear up most of the notes, details, and guidance we had, and it is more specific about terms.

The updates they should consider are AKS node-only updates (security, kernel, and other OS-related and node-related updates).

done | addressed in abe4669

>
> Remember that these guidelines provide flexibility, allowing you to strike a balance between timely updates and operational control. Choose the approach that aligns best with your organization’s requirements.

25 changes: 4 additions & 21 deletions cluster-stamp.bicep
@@ -1817,8 +1817,8 @@ resource mc 'Microsoft.ContainerService/managedClusters@2024-01-02-preview' = {
       enabled: false // Using Microsoft Entra Workload IDs for pod identities.
     }
     autoUpgradeProfile: {
-      nodeOSUpgradeChannel: 'NodeImage'
-      upgradeChannel: 'node-image'
+      nodeOSUpgradeChannel: 'SecurityPatch'
+      upgradeChannel: 'none'
     }
     azureMonitorProfile: {
       metrics: {
@@ -1932,32 +1932,15 @@ resource mc 'Microsoft.ContainerService/managedClusters@2024-01-02-preview' = {
       maintenanceWindow: {
         durationHours: 12
         schedule: {
-          weekly: {
-            dayOfWeek: 'Tuesday'
-            intervalWeeks: 1
+          daily: {
+            intervalDays: 2
           }
         }
         startTime: '09:00'
       }
     }
   }

-  resource k8s_maintenanceConfigurations 'maintenanceConfigurations' = {
-    name: 'aksManagedAutoUpgradeSchedule'
-    properties: {
-      maintenanceWindow: {
-        durationHours: 12
-        schedule: {
-          weekly: {
-            dayOfWeek: 'Wednesday'
-            intervalWeeks: 2
-          }
-        }
-        startTime: '21:00'
-      }
-    }
-  }
-
 }

 resource acrKubeletAcrPullRole_roleAssignment 'Microsoft.Authorization/roleAssignments@2020-10-01-preview' = {