AKS node pool k8s version not being updated #5541

Closed
jstevans opened this issue Jan 28, 2020 · 19 comments · Fixed by #7233
Comments

jstevans (Contributor) commented Jan 28, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

  • terraform version: 0.12.8
  • azurerm provider version: 1.41

Affected Resource(s)

  • azurerm_kubernetes_cluster

Terraform Configuration Files

resource "azurerm_kubernetes_cluster" "k8s" {
  name                = var.cluster_info.name
  location            = azurerm_resource_group.k8s.location
  dns_prefix          = var.cluster_info.dns_prefix
  resource_group_name = azurerm_resource_group.k8s.name
  kubernetes_version  = var.kubernetes_version

  role_based_access_control {
    enabled = ! local.aks_aad_skip_rbac
    dynamic "azure_active_directory" {
      for_each = "${! local.aks_aad_skip_rbac ? list(local.aks_rbac_setting) : []}"
      content {
        client_app_id     = local.aks_rbac_setting.client_app_id
        server_app_id     = local.aks_rbac_setting.server_app_id
        server_app_secret = local.aks_rbac_setting.server_app_secret
        tenant_id         = local.aks_rbac_setting.tenant_id
      }
    }
  }

  default_node_pool {
    name               = var.agent_pool.name
    node_count         = var.agent_pool.count
    vm_size            = "Standard_DS2_v2"
    type               = "VirtualMachineScaleSets"
    os_disk_size_gb    = 30
    max_pods           = 30
    availability_zones = local.sanitized_availability_zones

    enable_auto_scaling = true
    min_count           = 3
    max_count           = 12
  }

  service_principal {
    client_id     = var.aks_login.client_id
    client_secret = var.aks_login.client_secret
  }

  addon_profile {
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = azurerm_log_analytics_workspace.k8s_logs.id
    }
  }

  tags = {
    CreatedBy   = var.tags.created_by != "" ? var.tags.created_by : null
    ChangedBy   = var.tags.changed_by != "" ? var.tags.changed_by : null
    Environment = var.tags.environment != "" ? var.tags.environment : null
  }

  lifecycle {
    ignore_changes = [
      role_based_access_control,
      role_based_access_control["azure_active_directory"],
      agent_pool_profile.0.count,
    ]
  }

  network_profile {
    network_plugin    = "azure"
    load_balancer_sku = var.aks_load_balancer_sku
  }
}

Expected Behavior

With azurerm_provider == 1.39, changing var.kubernetes_version updated the kubelet version in our cluster's node pool (in addition to the AKS k8s version).

Actual Behavior

With azurerm_provider == 1.41, changing var.kubernetes_version updates the AKS k8s version but does not update the kubelet version in our cluster's node pool.

Important Factoids

  • Switching back to azurerm_provider == 1.39 and running terraform apply restores the expected behavior, even if the exact same config was previously applied with azurerm_provider == 1.41.
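
For reference, the upgrade is driven by a plain string variable matching the var.kubernetes_version reference in the config above; a minimal sketch of the variable and the value we bump (the description and version number here are illustrative, not taken from our actual repo):

variable "kubernetes_version" {
  description = "Kubernetes version for the AKS control plane (and, with provider 1.39, the node pool kubelets)"
  type        = string
}

# terraform.tfvars: bumping this value is the change that used to roll the node pool as well
kubernetes_version = "1.15.7"
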
jstevans (Contributor Author) commented Jan 30, 2020

Just checked -- the issue also occurs in azurerm_provider == 1.40.

EDIT: I take it back. I've been git bisecting, and v1.40.0 doesn't seem to repro the issue when built myself. I'm starting over from v1.39.0..v1.41.0.

EDIT2: Double-takeback, there's a bug in my testbed. Re-starting.

jstevans (Contributor Author):

A git bisect v1.39.0..v1.40.0 points to 87de8ae as the cause of the issue. The code changes in that commit seem unrelated, so I'd guess it has to do with the azure-sdk container-service version bump.

jstevans (Contributor Author) commented Feb 1, 2020

I'm having trouble finding authoritative documentation on what changed between azure-sdk/container-service 20190601 and 20191001, so I checked what the az CLI does on an az aks nodepool upgrade. It looks like OrchestratorVersion is settable when updating agent pools, so that's probably the fix.

If I were to guess, multiple agent pools are a new feature in AKS, and that probably drove the separation of the agent pool k8s upgrade logic from the AKS k8s upgrade logic.

jstevans (Contributor Author) commented Feb 1, 2020

@tombuildsstuff I'd like to contribute the PR, but would appreciate some guidance.

  • Would we rather

    • keep the old behavior for now, of coupling the agent pool k8s version to the cluster k8s version; or
    • support configuring the OrchestratorVersion field on agent pools individually?
  • Does terraform-provider-azurerm v1.x support multiple agent pools for AKS?

tombuildsstuff (Contributor):

@jstevans

A git bisect v1.39.0..v1.40.0 points to 87de8ae as the cause of the issue. The code changes in that commit seem unrelated, so I'd guess it has to do with the azure-sdk container-service version bump.

Yeah this is likely a change in behaviour between the different versions of the Container Service API

@tombuildsstuff I'd like to contribute the PR, but would appreciate some guidance.
Would we rather

  • keep the old behavior for now, of coupling the agent pool k8s version to the cluster k8s version; or
  • support configuring the OrchestratorVersion field on agent pools individually?
Does terraform-provider-azurerm v1.x support multiple agent pools for AKS?

👍 we support multiple node pools via the azurerm_kubernetes_cluster_node_pool resource - however there's a default node pool defined within the azurerm_kubernetes_cluster resource. As such I'm wondering if the behaviour should be:

  • Default Node Pool follows the same version of Kubernetes as the Cluster
  • Other node pools can be any version >= the cluster (which I believe is the API limitation)

Alternatively we can make them all configurable - but since the default node pool hosts system jobs, it's treated a little differently to the other node pools, so this probably wants some testing to confirm which way to go - maybe @jluk can confirm the expected behaviour here?

jluk commented Feb 3, 2020

The ruleset between the control plane and agent pools is defined in this public document. There is a window of config drift you are allowed to have between the control plane and each agent pool.
https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#upgrade-a-cluster-control-plane-with-multiple-node-pools

Rules for valid versions to upgrade node pools:

The node pool version must have the same major version as the control plane.
The node pool minor version must be within two minor versions of the control plane version.
The node pool version cannot be greater than the control plane major.minor.patch version.

Hope this helps.
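
To make those rules concrete, here's an illustration based on my reading of the linked doc (the version numbers are made up): with a control plane at 1.17.4, node pools at 1.15.x, 1.16.x, or 1.17.0 through 1.17.4 are allowed, while 1.14.x is rejected (more than two minor versions behind) and 1.17.5 or 1.18.x is rejected (ahead of the control plane).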

jstevans (Contributor Author) commented Feb 4, 2020

Great, thanks @jluk :) @tombuildsstuff:

  • So that I can unblock my team's upgrade to [email protected] while we continue this discussion, can I follow your first option to have default_node_pool.OrchestratorVersion := kubernetes_cluster.KubernetesVersion?

  • For the longer-term, what if kubernetes_cluster.default_node_pool and kubernetes_cluster_node_pool both have an optional string argument orchestrator_version which defaults to kubernetes_cluster.kubernetes_version? (A rough sketch follows this list.)

  • Do the version drift rules between control plane vs. agent pool need to be encoded on the client? Is it insufficient to make the REST call and let it fail?
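
To illustrate the second bullet, a rough sketch of what the proposed arguments could look like in config (this is my illustration of the proposal, not a feature that exists in the provider at this point; the version values are placeholders):

resource "azurerm_kubernetes_cluster" "k8s" {
  # ... arguments as in the config above ...
  kubernetes_version = "1.15.7"

  default_node_pool {
    # ... arguments as in the config above ...
    # optional; would default to kubernetes_version when omitted
    orchestrator_version = "1.15.7"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "extra" {
  name                  = "extra"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3

  # optional; may lag behind the control plane within the drift rules above
  orchestrator_version = "1.15.7"
}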

jstevans (Contributor Author) commented Feb 10, 2020

@tombuildsstuff I'm going to work on this tomorrow according to my previous comment, please shout if that's the wrong thing to do 🙂

tombuildsstuff (Contributor):

@jstevans

So that I can unblock my team's upgrade to [email protected] while we continue this discussion, can I follow your first option to have default_node_pool.OrchestratorVersion := kubernetes_cluster.KubernetesVersion?

That sounds fine for now - however the field shouldn't be exposed to users yet and should instead remain an internal behaviour, until...

For the longer-term, what if kubernetes_cluster.default_node_pool and kubernetes_cluster_node_pool both have an optional string argument orchestrator_version which defaults to kubernetes_cluster.kubernetes_version?
Do the version drift rules between control plane vs. agent pool need to be encoded on the client? Is it insufficient to make the REST call and let it fail?

Adding properties for kubernetes_version or similar to each node pool seems fine - however we may need to add locks into the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources to account for this. We'd need to test upgrading a major version and the node pools at the same time and confirm what happens; we may be able to get away with locking on the node pool names in each resource rather than on the name of the kubernetes_cluster, since that would still allow creating/deleting each node pool in parallel. Realistically, we'll only know which thing to lock on (and whether this is even needed) by testing it.

Hope that helps :)

jstevans added a commit to jstevans/terraform-provider-azurerm that referenced this issue Feb 13, 2020
Per hashicorp#5541, currently AKS node pool versions can never be updated. This
occurred due to a change in ARM behavior now that AKS clusters can have
multiple agent pools.

There's a more involved fix in the discussion on that issue which involves
exposing OrchestratorVersion as a settable attribute on agent_pool_profiles,
but this change focuses on recovering the old behavior.
derek-burdick:

Coupling by default could work, as long as it can be disabled. Upgrading the control plane and the default node pool with one terraform apply could be very impactful to a cluster. I'm commenting to vote for exposing OrchestratorVersion on the default node pool as well as on the node pool resource.

I would rather have OrchestratorVersion exposed ASAP, even without the proper locking, if it means I can control it myself. My current plan is to use the AKS REST API directly to upgrade node pools. https://docs.microsoft.com/en-us/rest/api/aks/agentpools/createorupdate

I'm available (with Azure resources as well) to test potential patches and I am able to write golang, but I lack terraform internals experience. Let me know how I can help.

Exodus commented Mar 18, 2020

Using the azurerm provider at version "=2.1.0", I've upgraded an azurerm_kubernetes_cluster resource from 1.14 to 1.15. The control plane appears to have upgraded to 1.15, but the VMSS node pool has stayed behind at 1.14.

What is the current expected behavior for Kubernetes upgrades and their node pools? I understand that https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#validation-rules-for-upgrades expresses certain validation conditions.

Will the node pool upgrade when it's obligated to, or can we control this with an input as @jstevans suggested?

williamayerst:

I'd like to know this!

tombuildsstuff self-assigned this May 14, 2020
carct commented May 28, 2020

@tombuildsstuff 🆙 any info on the above?

ghost commented Jun 11, 2020

This has been released in version 2.14.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 2.14.0"
}
# ... other configuration ...

nfsouzaj:

Hello,
I am having the same issue. I tried provider 2.14.0 and 2.16.0, and even though my cluster upgraded correctly, the node pools are still on the old version (1.15.10)...

EPinci commented Jun 26, 2020

Have you set the new orchestrator_version property of the (default) nodepool(s)?
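
For anyone landing here later, a minimal sketch of what that looks like, assuming provider >= 2.14.0 (the same argument also exists on the azurerm_kubernetes_cluster_node_pool resource):

resource "azurerm_kubernetes_cluster" "k8s" {
  # ... other arguments ...
  kubernetes_version = var.kubernetes_version

  default_node_pool {
    # ... other arguments ...
    # upgrade the default node pool's kubelets along with the control plane
    orchestrator_version = var.kubernetes_version
  }
}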

nfsouzaj:

@EPinci Hello, I haven't. Lemme try!
Thanks for the heads up.

nfsouzaj:

@EPinci Hey man, really appreciate the quick comment!
I set orchestrator_version to my AKS version and it worked just fine!
Sorry for the dumb mistake.
Cheers,

ghost commented Jul 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Jul 11, 2020