
[AKS] Cannot create nodepool with GPU #22084

Closed

sinc59 opened this issue Jun 8, 2023 · 2 comments
Comments


sinc59 commented Jun 8, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform Version

1.3.1

AzureRM Provider Version

3.59.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster_node_pool

Terraform Configuration Files

resource "azurerm_kubernetes_cluster_node_pool" "node_pools" {
    enable_auto_scaling   = true
    enable_node_public_ip = false
    kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
    max_count             = 4
    max_pods              = 30
    min_count             = 1
    mode                  = "User"
    name                  = "gpu"
    node_taints           = [
        "sku=gpu:NoSchedule",
        ]
    os_disk_size_gb       = 130
    os_disk_type          = "Managed"
    os_type               = "Linux"
    priority              = "Regular"
    scale_down_mode       = "Delete"
    spot_max_price        = -1
    ultra_ssd_enabled     = false
    vm_size               = "standard_nc4as_t4_v3"
    vnet_subnet_id        = var.subnet_id
    zones                 = [
        "1",
        "2",
        "3",
        ]
    }

Debug Output/Panic Output

Terraform debug logs (this request creates the node pool without GPU capabilities):

2023-06-07T18:42:56.598+0200 [DEBUG] provider.terraform-provider-azurerm_v3.54.0_x5: AzureRM Request:
PUT /subscriptions/{subscription_id}/resourceGroups/{resource-group_name}/providers/Microsoft.ContainerService/managedClusters/{cluster_name}/agentPools/gpu?api-version=2023-02-02-preview HTTP/1.1
Host: management.azure.com
User-Agent: Go/go1.19.3 (amd64-linux) go-autorest/v14.2.1 hashicorp/go-azure-sdk/agentpools/2023-02-02-preview HashiCorp Terraform/1.3.1 (+https://www.terraform.io) Terraform Plugin SDK/2.10.1 terraform-provider-azurerm/dev pid-222c6c49-1b0a-5959-a213-6608f9eb8820
Content-Length: 757
Content-Type: application/json; charset=utf-8
X-Ms-Authorization-Auxiliary: Bearer xxxxxxx
Accept-Encoding: gzip

{"name":"gpu","properties":{"availabilityZones":["2","3","1"],"count":1,"enableAutoScaling":true,"enableCustomCATrust":false,"enableEncryptionAtHost":false,"enableFIPS":false,"enableNodePublicIP":false,"enableUltraSSD":false,"kubeletDiskType":"","maxCount":4,"maxPods":30,"minCount":1,"mode":"User","nodeTaints":["sku=gpu:NoSchedule"],"osDiskSizeGB":130,"osDiskType":"Managed","osType":"Linux","scaleDownMode":"Delete","scaleSetPriority":"Regular","tags":{"env":"staging","stack":"aks"},"type":"VirtualMachineScaleSets","upgradeSettings":{},"vmSize":"standard_nc4as_t4_v3","vnetSubnetID":"/subscriptions/{subscription_id}/resourceGroups/{resource-group_name}/providers/Microsoft.Network/virtualNetworks/vnet-aks-eos-ue-staging/subnets/aks"}}: timestamp=2023-06-07T18:42:56.598+0200

Azure CLI debug logs (this request creates a working node pool with GPU capabilities):
Note that the API response shows a different, GPU-enabled nodeImageVersion, a good indicator that it works (confirmed by kubectl describe node, which lists the GPU under Capacity).

cli.azure.cli.core.sdk.policies: Request URL: 'https://management.azure.com/subscriptions/{subscription_id}/resourceGroups/{resource-group_name}/providers/Microsoft.ContainerService/managedClusters/{resource-group_name}2/agentPools/gpu?api-version=2023-04-02-preview'
cli.azure.cli.core.sdk.policies: Request method: 'PUT'
cli.azure.cli.core.sdk.policies: Request headers:
cli.azure.cli.core.sdk.policies:     'Content-Type': 'application/json'
cli.azure.cli.core.sdk.policies:     'Content-Length': '589'
cli.azure.cli.core.sdk.policies:     'UseGPUDedicatedVHD': 'true'
cli.azure.cli.core.sdk.policies:     'Accept': 'application/json'
cli.azure.cli.core.sdk.policies:     'x-ms-client-request-id': 'xxxxxxxxxxxxxxxxxxxxxx'
cli.azure.cli.core.sdk.policies:     'CommandName': 'aks nodepool add'
cli.azure.cli.core.sdk.policies:     'ParameterSetName': '--resource-group --cluster-name --name --node-count --node-vm-size --node-taints --aks-custom-headers --zones --debug'
cli.azure.cli.core.sdk.policies:     'User-Agent': 'AZURECLI/2.49.0 azsdk-python-azure-mgmt-containerservice/22.1.0b Python/3.10.11 (Linux-6.2.15-100.fc36.x86_64-x86_64-with-glibc2.35)'
cli.azure.cli.core.sdk.policies:     'Authorization': '*****'
cli.azure.cli.core.sdk.policies: Request body:
cli.azure.cli.core.sdk.policies: {"properties": {"count": 1, "vmSize": "standard_nc4as_t4_v3", "osDiskSizeGB": 0, "workloadRuntime": "OCIContainer", "osType": "Linux", "enableAutoScaling": false, "scaleDownMode": "Delete", "type": "VirtualMachineScaleSets", "mode": "User", "upgradeSettings": {}, "availabilityZones": ["1", "2", "3"], "enableNodePublicIP": false, "enableCustomCATrust": false, "scaleSetPriority": "Regular", "scaleSetEvictionPolicy": "Delete", "spotMaxPrice": -1.0, "nodeTaints": ["sku=gpu:NoSchedule"], "enableEncryptionAtHost": false, "enableUltraSSD": false, "enableFIPS": false, "networkProfile": {}}}
urllib3.connectionpool: Starting new HTTPS connection (1): management.azure.com:443
urllib3.connectionpool: https://management.azure.com:443 "PUT /subscriptions/{subscription_id}/resourceGroups/{resource-group_name}/providers/Microsoft.ContainerService/managedClusters/{resource-group_name}2/agentPools/gpu?api-version=2023-04-02-preview HTTP/1.1" 201 1189
cli.azure.cli.core.sdk.policies:     'Content-Type': 'application/json'
cli.azure.cli.core.sdk.policies: Response content:
cli.azure.cli.core.sdk.policies: {
 "id": "/subscriptions/{subscription_id}/resourcegroups/{resource-group_name}/providers/Microsoft.ContainerService/managedClusters/{resource-group_name}2/agentPools/gpu",
  "name": "gpu",
  "type": "Microsoft.ContainerService/managedClusters/agentPools",
  "properties": {
   "count": 1,
   "vmSize": "standard_nc4as_t4_v3",
   "osDiskSizeGB": 128,
   "osDiskType": "Ephemeral",
   "kubeletDiskType": "OS",
   "workloadRuntime": "OCIContainer",
   "maxPods": 110,
   "type": "VirtualMachineScaleSets",
   "availabilityZones": [
    "1",
    "2",
    "3"
   ],
   "enableAutoScaling": false,
   "scaleDownMode": "Delete",
   "provisioningState": "Creating",
   "powerState": {
    "code": "Running"
   },
   "orchestratorVersion": "1.25.6",
   "currentOrchestratorVersion": "1.25.6",
   "enableNodePublicIP": false,
   "enableCustomCATrust": false,
   "nodeTaints": [
    "sku=gpu:NoSchedule"
   ],
   "mode": "User",
   "enableEncryptionAtHost": false,
   "enableUltraSSD": false,
   "osType": "Linux",
   "osSKU": "Ubuntu",
   "nodeImageVersion": "AKSUbuntu-1804gen2gpucontainerd-202305.24.0",
   "upgradeSettings": {},
   "enableFIPS": false,
   "networkProfile": {}
  }
 }

Expected Behaviour

After provisioning, the nodes should have "nvidia.com/gpu" capacity.

$ kubectl describe node aks-gpu-xxxx
Capacity:
  cpu:                4
  nvidia.com/gpu:     1

Actual Behaviour

After provisioning, the nodes do not have GPU capacity:

$ kubectl describe node aks-gpu-xxxx
Capacity:
  cpu:                4
  ephemeral-storage:  129886128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             28748596Ki
  pods:               110

Steps to Reproduce

Using the Azure CLI, we can create a valid node pool with GPU capabilities (see the documentation).

Comparing the Azure CLI and azurerm requests, the request bodies are essentially the same, but the request header "UseGPUDedicatedVHD: true" is missing from the Terraform request; that is the point.

I have tested the same request that the azurerm provider sends (using Postman), adding the missing header, and it works.
Could you add an option on this Terraform resource to set this header?
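For reference, the working Azure CLI workaround looks roughly like this (resource group, cluster name, and zone values are placeholders; assumes the aks-preview extension, which provides the --aks-custom-headers flag seen in the debug log above):

```
# Placeholder names; requires the aks-preview extension for --aks-custom-headers.
az aks nodepool add \
  --resource-group my-resource-group \
  --cluster-name my-cluster \
  --name gpu \
  --node-count 1 \
  --node-vm-size standard_nc4as_t4_v3 \
  --node-taints "sku=gpu:NoSchedule" \
  --zones 1 2 3 \
  --aks-custom-headers UseGPUDedicatedVHD=true
```

The azurerm provider has no equivalent way to attach the UseGPUDedicatedVHD header, which is what this issue asks for.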

Thanks a lot,

Important Factoids

No response

References

No response

@stephybun
Member

Thanks for raising this issue @sinc59.

Support for headers is tracked in #6793. To consolidate discussion on the issue I'm going to close this in favour of #6793.


I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 22, 2024