Add AKS starter deployment E2E test #14351
Conversation
This adds a new end-to-end deployment test that validates Azure Kubernetes Service (AKS) infrastructure creation:
- Creates resource group, ACR, and AKS cluster
- Configures kubectl credentials
- Verifies cluster connectivity
- Cleans up resources after test

Phase 1 focuses on infrastructure only; Aspire deployment will be added in subsequent phases.
🚀 Dogfood this PR with:

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 14351

Or:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 14351"

/deployment-test

🚀 Deployment tests starting on PR #14351... This will deploy to real Azure infrastructure. Results will be posted here when complete.
Pull request overview
This PR adds a Phase 1 end-to-end test for deploying Aspire applications to Azure Kubernetes Service (AKS). The test validates the infrastructure provisioning foundation by creating a resource group, Azure Container Registry (ACR), and AKS cluster, then verifying cluster connectivity.
Changes:
- Adds AksStarterDeploymentTests.cs with infrastructure-only E2E test
- Creates minimal AKS cluster (1 node, Standard_B2s) with attached ACR
- Verifies cluster health using kubectl commands
- Implements fire-and-forget resource cleanup
```csharp
/// </summary>
public sealed class AksStarterDeploymentTests(ITestOutputHelper output)
{
    // Timeout set to 45 minutes to allow for AKS provisioning (~10-15 min) plus deployment.
```
The timeout comment mentions "plus deployment" but Phase 1 (current implementation) only performs infrastructure provisioning and verification, not deployment. The actual operations sum to approximately 25 minutes maximum (20 min AKS creation + ~5 min for other operations). While the 45-minute timeout may be intended for future phases, the comment could be misleading for the current implementation. Consider updating the comment to reflect the current phase or noting that the timeout accounts for future deployment steps.
Suggested change:

```diff
-// Timeout set to 45 minutes to allow for AKS provisioning (~10-15 min) plus deployment.
+// Timeout set to 45 minutes to allow for AKS provisioning and verification (currently ~25 min),
+// with additional headroom reserved for future phases that will include deployment.
```
```csharp
private static void TriggerCleanupResourceGroup(string resourceGroupName, ITestOutputHelper output)
{
    var process = new System.Diagnostics.Process
    {
        StartInfo = new System.Diagnostics.ProcessStartInfo
        {
            FileName = "az",
            Arguments = $"group delete --name {resourceGroupName} --yes --no-wait",
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        }
    };

    try
    {
        process.Start();
        output.WriteLine($"Cleanup triggered for resource group: {resourceGroupName}");
    }
    catch (Exception ex)
    {
        output.WriteLine($"Failed to trigger cleanup: {ex.Message}");
    }
}
```
The cleanup method pattern is inconsistent with the rest of the codebase. Most other deployment tests (AzureContainerRegistryDeploymentTests, AzureAppConfigDeploymentTests, AzureEventHubsDeploymentTests, etc.) use a private async method named CleanupResourceGroupAsync that awaits WaitForExitAsync() on the process. This test uses a synchronous TriggerCleanupResourceGroup method that doesn't wait for process completion or dispose the process object.
This inconsistency makes the codebase harder to maintain and could lead to resource leaks since the Process object is never disposed. Consider refactoring to match the established pattern in files like AzureContainerRegistryDeploymentTests.cs:204-237.
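A hedged sketch of the awaited pattern the reviewer describes (the method name comes from the comment; the exact body in AzureContainerRegistryDeploymentTests.cs may differ):

```csharp
private static async Task CleanupResourceGroupAsync(string resourceGroupName, ITestOutputHelper output)
{
    // 'using' ensures the Process handle is disposed even if Start/Wait throws.
    using var process = new System.Diagnostics.Process
    {
        StartInfo = new System.Diagnostics.ProcessStartInfo
        {
            FileName = "az",
            Arguments = $"group delete --name {resourceGroupName} --yes --no-wait",
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
            CreateNoWindow = true
        }
    };

    try
    {
        process.Start();
        // Wait for the az command itself to exit (the --no-wait deletion still runs server-side),
        // so the exit code can be checked before reporting cleanup status.
        await process.WaitForExitAsync();
        output.WriteLine(process.ExitCode == 0
            ? $"Cleanup triggered for resource group: {resourceGroupName}"
            : $"Cleanup command exited with code {process.ExitCode}");
    }
    catch (Exception ex)
    {
        output.WriteLine($"Failed to trigger cleanup: {ex.Message}");
    }
}
```

Process.WaitForExitAsync is available on .NET 5+, which the Aspire repo targets.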
The Process object created in the cleanup method is never disposed, which can lead to resource leaks. The process should be wrapped in a using statement or explicitly disposed after starting. This pattern is correctly implemented in other deployment tests like AzureContainerRegistryDeploymentTests.cs where the process is properly awaited and implicitly disposed through proper async patterns.
```csharp
// Clean up the resource group we created (includes AKS cluster and ACR)
output.WriteLine($"Triggering cleanup of resource group: {resourceGroupName}");
TriggerCleanupResourceGroup(resourceGroupName, output);
DeploymentReporter.ReportCleanupStatus(resourceGroupName, success: true, "Cleanup triggered (fire-and-forget)");
```
The cleanup status is reported as successful (success: true) even though the cleanup process is fire-and-forget and doesn't wait for completion or check if the deletion actually succeeded. This could be misleading in logs and reports. While this pattern is also used in AcaStarterDeploymentTests.cs:281, most other deployment tests (like AzureContainerRegistryDeploymentTests, AzureAppConfigDeploymentTests, etc.) use an async cleanup method that waits for the process and checks the exit code before reporting status. Consider either: (1) waiting for the process to complete and reporting actual success/failure, or (2) using a more accurate message like "Cleanup initiated" and a neutral status indicator.
🎬 CLI E2E Test Recordings

The following terminal recordings are available for commit.

📹 Recordings uploaded automatically from CI run #21740059443
❌ Deployment E2E Tests failed

Summary: 10 passed, 3 failed, 0 cancelled

Passed Tests
Failed Tests
🎬 Terminal Recordings
Add step to register Microsoft.ContainerService and Microsoft.ContainerRegistry resource providers before attempting to create AKS resources. This fixes the MissingSubscriptionRegistration error when the subscription hasn't been configured for AKS usage.
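The registration step could look roughly like the following, invoking the az CLI from test setup the same way the rest of the file does (a sketch; the actual step in this PR may be a CI script rather than inline C#):

```csharp
// Sketch: register the resource providers required by AKS and ACR before provisioning.
// '--wait' blocks until registration completes, avoiding MissingSubscriptionRegistration
// on subscriptions that have never created these resource types.
foreach (var ns in new[] { "Microsoft.ContainerService", "Microsoft.ContainerRegistry" })
{
    using var register = System.Diagnostics.Process.Start(new System.Diagnostics.ProcessStartInfo
    {
        FileName = "az",
        Arguments = $"provider register --namespace {ns} --wait",
        UseShellExecute = false
    })!;
    register.WaitForExit();
}
```

Registration is idempotent, so running it on an already-registered subscription is harmless.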
/deployment-test

🚀 Deployment tests starting on PR #14351... This will deploy to real Azure infrastructure. Results will be posted here when complete.
❌ Deployment E2E Tests failed

Summary: 10 passed, 3 failed, 0 cancelled

Passed Tests
Failed Tests
🎬 Terminal Recordings
When multiple endpoints resolve to the same port number, the Service manifest generator was creating duplicate port entries, which Kubernetes rejects as invalid. This fix deduplicates ports by (port, protocol) before adding them to the Service spec. Fixes the error: Service 'xxx-service' is invalid: spec.ports[1]: Duplicate value
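The (port, protocol) deduplication described above amounts to something like this sketch, using a hypothetical ServicePort record in place of the actual Aspire.Hosting.Kubernetes port model:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical record standing in for the K8s Service port entry (illustrative names).
public sealed record ServicePort(int Port, string Protocol);

public static class ServicePortDedup
{
    // Deduplicate by (Port, Protocol) while preserving first-seen order, so the
    // generated Service spec never contains the duplicate entries Kubernetes rejects.
    public static List<ServicePort> Deduplicate(IEnumerable<ServicePort> ports) =>
        ports
            .GroupBy(p => (p.Port, p.Protocol))
            .Select(g => g.First())
            .ToList();
}
```

Kubernetes validates spec.ports for uniqueness on exactly this pair, which is why two endpoints resolving to 8080/TCP produced the "Duplicate value" error.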
/deployment-test

🚀 Deployment tests starting on PR #14351... This will deploy to real Azure infrastructure. Results will be posted here when complete.
❌ Deployment E2E Tests failed

Summary: 10 passed, 3 failed, 0 cancelled

Passed Tests
Failed Tests
🎬 Terminal Recordings
Added Step 6 to explicitly run 'az aks update --attach-acr' after AKS cluster creation to ensure the AcrPull role assignment has properly propagated. This addresses potential image pull permission issues where AKS cannot pull images from the attached ACR. Also renumbered all subsequent steps to maintain proper ordering.
/deployment-test

🚀 Deployment tests starting on PR #14351... This will deploy to real Azure infrastructure. Results will be posted here when complete.
❌ Deployment E2E Tests failed

Summary: 10 passed, 3 failed, 0 cancelled

Passed Tests
Failed Tests
🎬 Terminal Recordings
```csharp
    },
};

// Deduplicate ports by port number and protocol to avoid invalid Service specs
```
Are we allocating ports properly? There should be no dupes unless the user explicitly added dupes.
The Kubernetes publisher was generating duplicate Service/container ports (both 8080/TCP) for ProjectResources with default http+https endpoints. The root cause is that GenerateDefaultProjectEndpointMapping assigns the same default port 8080 to every endpoint with None target port.

The proper fix mirrors the core framework's SetBothPortsEnvVariables() behavior: skip the DefaultHttpsEndpoint, which the container won't listen on (TLS termination happens at the ingress/service mesh). The https endpoint still gets an EndpointMapping (for service discovery) but reuses the http endpoint's HelmValue, so no duplicate K8s port is generated.

Added Aspire.Hosting.Kubernetes to InternalsVisibleTo to access ProjectResource.DefaultHttpsEndpoint. The downstream dedup in ToService() and WithContainerPorts() remains as defense-in-depth.

Fixes #14029
Validates that the Aspire starter template with Redis cache enabled deploys correctly to AKS. Exercises the full pipeline (webfrontend → apiservice → Redis) by hitting the /weather page (SSR, uses Redis output caching).

Key differences from the base AKS test:
- Selects 'Yes' for Redis Cache in aspire new prompts
- Redis uses a public container image (no ACR push needed)
- Verifies /weather page content (confirms the Redis integration works)
Both AKS tests generated the same ACR name from RunId+RunAttempt. Use different prefixes (acrs/acrr) to ensure uniqueness.
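The uniqueness fix amounts to something like the following (prefixes from the commit message; the runId/runAttempt parameter names are assumptions, standing in for however the tests read the CI run identifiers):

```csharp
// Sketch: derive distinct ACR names for the two AKS tests from the same CI run identifiers.
// ACR names must be lowercase alphanumeric and globally unique, so only the prefix differs:
// "acrs" for the starter test, "acrr" for the Redis variant.
static string AcrNameFor(string testPrefix, string runId, string runAttempt) =>
    $"{testPrefix}{runId}{runAttempt}";
```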
Work around K8s publisher bug where cross-resource secret references create Helm value paths under the consuming resource instead of referencing the owning resource's secret. The webfrontend template expects secrets.webfrontend.cache_password but values.yaml only has secrets.cache.REDIS_PASSWORD. Provide the missing value via --set.
Summary
This PR adds a new end-to-end deployment test that validates deploying Aspire applications to Azure Kubernetes Service (AKS).
Phase 1 (This PR)
Infrastructure foundation only:
- Creates resource group, ACR, and AKS cluster
- Configures kubectl credentials
- Verifies cluster connectivity (kubectl get nodes, kubectl cluster-info)
- Cleans up resources after the test

Future Phases
Test Details
Testing
Request
/deployment-test to run the new test against Azure infrastructure.