AKS node pool k8s version not being updated #5541

Closed
jstevans opened this issue Jan 28, 2020 · 19 comments · Fixed by #7233
Comments

jstevans (Contributor) commented Jan 28, 2020

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

  • terraform version: 0.12.8
  • azurerm provider version: 1.41

Affected Resource(s)

  • azurerm_kubernetes_cluster

Terraform Configuration Files

resource "azurerm_kubernetes_cluster" "k8s" {
  name                = var.cluster_info.name
  location            = azurerm_resource_group.k8s.location
  dns_prefix          = var.cluster_info.dns_prefix
  resource_group_name = azurerm_resource_group.k8s.name
  kubernetes_version  = var.kubernetes_version

  role_based_access_control {
    enabled = ! local.aks_aad_skip_rbac
    dynamic "azure_active_directory" {
      for_each = "${! local.aks_aad_skip_rbac ? list(local.aks_rbac_setting) : []}"
      content {
        client_app_id     = local.aks_rbac_setting.client_app_id
        server_app_id     = local.aks_rbac_setting.server_app_id
        server_app_secret = local.aks_rbac_setting.server_app_secret
        tenant_id         = local.aks_rbac_setting.tenant_id
      }
    }
  }

  default_node_pool {
    name               = var.agent_pool.name
    node_count         = var.agent_pool.count
    vm_size            = "Standard_DS2_v2"
    type               = "VirtualMachineScaleSets"
    os_disk_size_gb    = 30
    max_pods           = 30
    availability_zones = local.sanitized_availability_zones

    enable_auto_scaling = true
    min_count           = 3
    max_count           = 12
  }

  service_principal {
    client_id     = var.aks_login.client_id
    client_secret = var.aks_login.client_secret
  }

  addon_profile {
    oms_agent {
      enabled                    = true
      log_analytics_workspace_id = azurerm_log_analytics_workspace.k8s_logs.id
    }
  }

  tags = {
    CreatedBy   = var.tags.created_by != "" ? var.tags.created_by : null
    ChangedBy   = var.tags.changed_by != "" ? var.tags.changed_by : null
    Environment = var.tags.environment != "" ? var.tags.environment : null
  }

  lifecycle {
    ignore_changes = [
      role_based_access_control,
      role_based_access_control["azure_active_directory"],
      agent_pool_profile.0.count,
    ]
  }

  network_profile {
    network_plugin    = "azure"
    load_balancer_sku = var.aks_load_balancer_sku
  }
}

Expected Behavior

With azurerm_provider == 1.39, changing var.kubernetes_version updated the kubelet version in our cluster's node pool (in addition to the AKS k8s version).

Actual Behavior

With azurerm_provider == 1.41, changing var.kubernetes_version updates the AKS k8s version but does not update the kubelet version in our cluster's node pool.

Important Factoids

  • Switching back to azurerm_provider == 1.39 and running terraform apply restores the expected behavior, even if the exact same config was previously applied with azurerm_provider == 1.41.
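
For reference, the upgrade is driven by a plain string variable matching the var.kubernetes_version reference in the config above; a minimal sketch of the variable and the value we bump (the description and version number here are illustrative, not taken from our actual repo):

variable "kubernetes_version" {
  description = "Kubernetes version for the AKS control plane (and, with provider 1.39, the node pool kubelets)"
  type        = string
}

# terraform.tfvars: bumping this value is the change that used to roll the node pool as well
kubernetes_version = "1.15.7"
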
jstevans (Contributor Author) commented Jan 30, 2020

Just checked -- the issue also occurs in azurerm_provider == 1.40.

EDIT: I take it back. I've been git bisecting, and v1.40.0 doesn't seem to repro the issue when built myself. I'm starting over from v1.39.0..v1.41.0.

EDIT2: Double-takeback, there's a bug in my testbed. Re-starting.

jstevans (Contributor Author):

A git bisect v1.39.0..v1.40.0 points to 87de8ae as the cause of the issue. The code changes in that commit seem unrelated, so I'd guess it has to do with the azure-sdk container-service version bump.

jstevans (Contributor Author) commented Feb 1, 2020

I'm having trouble finding authoritative documentation on what changed between azure-sdk/container-service 20190601 and 20191001, so I checked what the az CLI does on an az aks nodepool upgrade. It looks like OrchestratorVersion is settable when updating agent pools, so that's probably the fix.

If I were to guess, multiple agent pools are a new feature in AKS, and that probably drove the separation of the agent pool k8s upgrade logic from the AKS k8s upgrade logic.

jstevans (Contributor Author) commented Feb 1, 2020

@tombuildsstuff I'd like to contribute the PR, but would appreciate some guidance.

  • Would we rather

    • keep the old behavior for now, of coupling the agent pool k8s version to the cluster k8s version; or
    • support configuring the OrchestratorVersion field on agent pools individually?
  • Does terraform-provider-azurerm v1.x support multiple agent pools for AKS?

tombuildsstuff (Contributor):

@jstevans

A git bisect v1.39.0..v1.40.0 points to 87de8ae as the cause of the issue. The code changes in that commit seem unrelated, so I'd guess it has to do with the azure-sdk container-service version bump.

Yeah this is likely a change in behaviour between the different versions of the Container Service API

@tombuildsstuff I'd like to contribute the PR, but would appreciate some guidance.
Would we rather

  • keep the old behavior for now, of coupling the agent pool k8s version to the cluster k8s version; or
  • support configuring the OrchestratorVersion field on agent pools individually?
Does terraform-provider-azurerm v1.x support multiple agent pools for AKS?

👍 we support multiple node pools via the azurerm_kubernetes_cluster_node_pool resource - however there's a default node pool defined within the azurerm_kubernetes_cluster resource. As such I'm wondering if the behaviour should be:

  • Default Node Pool follows the same version of Kubernetes as the Cluster
  • Other node pools can be any version >= the cluster (which I believe is the API limitation)

Alternatively we can make them all configurable - but since the default node pool hosts system jobs, it's treated a little differently to the other node pools, so this probably wants some testing to confirm which way to go - maybe @jluk can confirm the expected behaviour here?

jluk commented Feb 3, 2020

The ruleset between the control plane and agent pools is defined in this public document. There is a window of config drift you are allowed to have between the control plane and each agent pool.
https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#upgrade-a-cluster-control-plane-with-multiple-node-pools

Rules for valid versions to upgrade node pools:

The node pool version must have the same major version as the control plane.
The node pool minor version must be within two minor versions of the control plane version.
The node pool version cannot be greater than the control plane major.minor.patch version.

Hope this helps.
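
To make those rules concrete, here's an illustration based on my reading of the linked doc (the version numbers are made up): with a control plane at 1.17.4, node pools at 1.15.x, 1.16.x, or 1.17.0 through 1.17.4 are allowed, while 1.14.x is rejected (more than two minor versions behind) and 1.17.5 or 1.18.x is rejected (ahead of the control plane).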

jstevans (Contributor Author) commented Feb 4, 2020

Great, thanks @jluk :) @tombuildsstuff:

  • So that I can unblock my team's upgrade to [email protected] while we continue this discussion, can I follow your first option to have default_node_pool.OrchestratorVersion := kubernetes_cluster.KubernetesVersion?

  • For the longer-term, what if kubernetes_cluster.default_node_pool and kubernetes_cluster_node_pool both have an optional string argument orchestrator_version which defaults to kubernetes_cluster.kubernetes_version? (A rough sketch follows this list.)

  • Do the version drift rules between control plane vs. agent pool need to be encoded on the client? Is it insufficient to make the REST call and let it fail?
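
To illustrate the second bullet, a rough sketch of what the proposed arguments could look like in config (this is my illustration of the proposal, not a feature that exists in the provider at this point; the version values are placeholders):

resource "azurerm_kubernetes_cluster" "k8s" {
  # ... arguments as in the config above ...
  kubernetes_version = "1.15.7"

  default_node_pool {
    # ... arguments as in the config above ...
    # optional; would default to kubernetes_version when omitted
    orchestrator_version = "1.15.7"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "extra" {
  name                  = "extra"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3

  # optional; may lag behind the control plane within the drift rules above
  orchestrator_version = "1.15.7"
}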

jstevans (Contributor Author) commented Feb 10, 2020

@tombuildsstuff I'm going to work on this tomorrow according to my previous comment, please shout if that's the wrong thing to do 🙂

tombuildsstuff (Contributor):

@jstevans

So that I can unblock my team's upgrade to [email protected] while we continue this discussion, can I follow your first option to have default_node_pool.OrchestratorVersion := kubernetes_cluster.KubernetesVersion?

That sounds fine for now - however the field shouldn't be exposed to users yet and should instead remain an internal behaviour, until...

For the longer-term, what if kubernetes_cluster.default_node_pool and kubernetes_cluster_node_pool both have an optional string argument orchestrator_version which defaults to kubernetes_cluster.kubernetes_version?
Do the version drift rules between control plane vs. agent pool need to be encoded on the client? Is it insufficient to make the REST call and let it fail?

Adding properties for kubernetes_version or similar to each node pool seems fine - however we may need to add locks into the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources to account for this. We'd need to test upgrading a major version and the node pools at the same time and confirm what happens; we may be able to get away with locking on the node pool names in each resource rather than on the name of the kubernetes_cluster, since that would still allow creating/deleting each node pool in parallel. Realistically, we'll only know which thing to lock on (and whether this is even needed) by testing it.

Hope that helps :)

jstevans added a commit to jstevans/terraform-provider-azurerm that referenced this issue Feb 13, 2020
Per hashicorp#5541, currently AKS node pool versions can never be updated. This
occurred due to a change in ARM behavior now that AKS clusters can have
multiple agent pools.

There's a more involved fix in the discussion on that issue which involves
exposing OrchestratorVersion as a settable attribute on agent_pool_profiles,
but this change focuses on recovering the old behavior.
derek-burdick:

Coupling by default could work, as long as it can be disabled. Upgrading the control plane and the default node pool with one terraform apply could be very impactful to a cluster. I'm commenting to vote for exposing OrchestratorVersion on the default node pool as well as on the node pool resource.

I would rather have OrchestratorVersion exposed ASAP, even without the proper locking, if it means I can control it myself. My current plan is to use the AKS REST API directly to upgrade node pools. https://docs.microsoft.com/en-us/rest/api/aks/agentpools/createorupdate

I'm available (with Azure resources as well) to test potential patches and I am able to write golang, but I lack terraform internals experience. Let me know how I can help.

Exodus commented Mar 18, 2020

Using the azurerm provider at version "=2.1.0", I've upgraded an azurerm_kubernetes_cluster resource from 1.14 to 1.15. The control plane appears to have upgraded to 1.15, but the VMSS node pool has stayed behind at 1.14.

What is the current expected behavior for Kubernetes upgrades and their node pools? I understand that https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools#validation-rules-for-upgrades expresses certain validation conditions.

Will the node pool upgrade when it's obligated to, or can we control this with an input as @jstevans suggested?

williamayerst:

I'd like to know this!

tombuildsstuff self-assigned this May 14, 2020
carct commented May 28, 2020

@tombuildsstuff 🆙 any info on the above?

ghost commented Jun 11, 2020

This has been released in version 2.14.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 2.14.0"
}
# ... other configuration ...

nfsouzaj:

Hello,
I am having the same issue. I tried provider 2.14.0 and 2.16.0, and even though my cluster upgraded correctly, the node pools are still on the old version (1.15.10)...

EPinci commented Jun 26, 2020

Have you set the new orchestrator_version property of the (default) nodepool(s)?
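
For anyone landing here later, a minimal sketch of what that looks like, assuming provider >= 2.14.0 (the same argument also exists on the azurerm_kubernetes_cluster_node_pool resource):

resource "azurerm_kubernetes_cluster" "k8s" {
  # ... other arguments ...
  kubernetes_version = var.kubernetes_version

  default_node_pool {
    # ... other arguments ...
    # upgrade the default node pool's kubelets along with the control plane
    orchestrator_version = var.kubernetes_version
  }
}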

nfsouzaj:

@EPinci Hello, I haven't. Lemme try!
Thanks for the heads up.

nfsouzaj:

@EPinci Hey man, really appreciate the quick comment!
I set orchestrator_version to my AKS version and it worked just fine!
Sorry for the dumb mistake.
Cheers,

ghost commented Jul 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Jul 11, 2020