Optimize E2E CI: parallelize tests across languages, 3x wall-clock speedup #3405
berndverst merged 14 commits into Azure:dev
…rchestrator
- Replace LongRunningOrchestrator (100K activities) with HttpLongRunningOrchestrator (timer-based) to eliminate emulator operation queue contention
- Parallelize independent orchestration lifecycle operations within DedupeStatusesTests and PurgeOnlyPurgesTerminalOrchestrations using Task.WhenAll
- Add helper methods to PurgeInstancesTests for cleaner parallel test structure
Measured improvement: 14m 23s -> 2m 39s (5.4x speedup) for these tests
FYI @sophiatev - the Dedupe tests that we had were responsible for the majority of the CI slowness. Please take a look at this PR because the tests are significantly altered now.
Pull request overview
This PR speeds up the slowest Durable Task Scheduler (DTS) E2E tests by reducing long-running orchestration load and parallelizing independent test operations, while introducing helper methods to make the new structure easier to follow.
Changes:
- Replaces the activity-flooding long-running orchestrator usage with a timer-based alternative in the tests.
- Parallelizes orchestration start/transition/purge/retry phases to reduce end-to-end runtime.
- Refactors `PurgeInstancesTests` with helper methods to support the phased parallel structure.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| test/e2e/Tests/Tests/PurgeInstancesTests.cs | Parallelizes purge test orchestration lifecycle steps and adds helper methods; switches the “running” orchestrator used by the test. |
| test/e2e/Tests/Tests/DedupeStatusesTests.cs | Parallelizes dedupe-status test orchestration lifecycle steps and switches the “running” orchestrator used by the tests. |
…x cleanup
- Use HttpLongRunningOrchestrator only for dotnet-isolated, LongRunningOrchestrator for other languages (HttpLongRunningOrchestrator only exists in BasicDotNetIsolated)
- Dispose all HttpResponseMessage results from StartAndWaitForState/WithDedupeStatuses
- Assert StatusCode and dispose cleanup terminate responses in Phase 4
- Dispose non-terminal purge responses in PurgeOnlyPurgesTerminalOrchestrations
- Fix StartOrchAndWaitForStatus to pass orchestrationName in scheduled path
The e2e-dts pipeline ran 5 languages (dotnet, PowerShell, Python, Node, Java) sequentially in a single job, taking ~46 minutes total. This change:
- Split each language into a separate parallel matrix job (5x parallelism)
- Each job only builds its own test app (not all 5)
- Only install Python/Java SDKs when needed (conditional steps)
- Replace 30s hardcoded emulator sleep with health check polling (~5s)
Expected result: wall-clock time limited by the slowest single language (~15min for dotnet-isolated) instead of the sum of all 5.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
- Add pending instance cleanup in DedupeStatusesTests Phase 4
- Add non-terminal instance cleanup in PurgeOnlyPurgesTerminalOrchestrations
- Map port 8081 for DTS emulator health checks (HTTP/1)
- Poll 8081 instead of 8080 (gRPC/HTTP2 incompatible with Invoke-WebRequest)
- Fail script with exit 1 if emulator not ready (instead of warning)
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
- Parallelize e2e-azurestorage-linux, e2e-azurestorage-windows, e2e-mssql with the same language matrix pattern used for e2e-dts
- Each language runs as its own parallel job with only its app built
- Conditional Python/Java SDK setup (only when needed)
- Fix DTS health check: use TCP probe on port 8080 instead of HTTP on 8081
- Increase health check timeout to 60s with container log dump on failure
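The TCP probe with a timeout described in the commit above can be sketched as follows. This is an illustrative Python analog, not the PR's PowerShell from build-e2e-test.ps1; the function name `wait_for_tcp_port` and the one-second polling interval are assumptions made for the sketch, while port 8080 and the 60s timeout come from the commit message.

```python
import socket
import time


def wait_for_tcp_port(host: str, port: int, timeout_s: float = 60.0,
                      interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: a plain TCP connect avoids depending on the protocol
    spoken on the port (for example gRPC/HTTP2), which an HTTP client may not handle.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

A caller would treat a `False` result from `wait_for_tcp_port("localhost", 8080)` as a hard failure (dump the container logs, then exit non-zero) rather than continuing with a warning.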
…ents
- Add comments explaining LongRunningOrchestrator load is isolated per CI job
- Improve cleanup comments in PurgeOnlyPurgesTerminalOrchestrations
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
CI Performance Results
The latest CI runs confirm all optimizations are working. MSSQL is kept sequential (see Design Decisions in PR description).
Overall wall-clock improvement: ~3x (bottleneck was DTS at ~46m, now ~15m).
- Replace 30s hardcoded MSSQL sleep with sqlcmd readiness polling
- Pre-create DurableDB database after container starts so all languages can use it independently without relying on dotnet-isolated running first
- Exit with error if SQL Server doesn't become ready within 30s
Non-dotnet languages use LongRunningOrchestrator (100K activities per instance). Starting 3+ of these concurrently overwhelms the function host with activity spam, causing socket errors. For non-dotnet, long-running orchestration starts are now sequential while fast orchestrations (HelloCities, LargeOutput) remain concurrent. Dotnet-isolated uses HttpLongRunningOrchestrator (timer-based) and retains full parallelization.
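The split described above (heavy starts one at a time, light starts concurrent) can be sketched as below. This is a Python `asyncio` analog of the C# test structure, offered as an illustration only: `start_orchestration` is a hypothetical stand-in for the HTTP call that starts an orchestration, and the `LongRunning-N` labels are made up for the example.

```python
import asyncio


async def start_orchestration(name: str, log: list[str]) -> str:
    """Hypothetical stand-in for an HTTP call that starts an orchestration."""
    log.append(f"start:{name}")
    await asyncio.sleep(0.01)  # simulated network/host latency
    log.append(f"done:{name}")
    return name


async def start_heavy_sequentially(names: list[str], log: list[str]) -> list[str]:
    # Await each heavy start before issuing the next, so at most one is in flight.
    return [await start_orchestration(n, log) for n in names]


async def start_all(heavy: list[str], light: list[str], log: list[str]) -> list[str]:
    # The heavy chain runs as a single task alongside the concurrent light starts.
    results = await asyncio.gather(
        start_heavy_sequentially(heavy, log),
        *(start_orchestration(n, log) for n in light),
    )
    return results[0] + list(results[1:])
```

Calling `asyncio.run(start_all(["LongRunning-1", "LongRunning-2"], ["HelloCities", "LargeOutput"], []))` starts the two light orchestrations immediately while the second heavy one waits for the first to finish.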
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
- MSSQL: reverted to sequential (original) because non-dotnet apps lack the MSSQL extension DLLs and depend on dotnet-isolated running first to install them
- Reverted the canParallelize/sequential LongRunningOrchestrator branching in DedupeStatusesTests and PurgeInstancesTests; all phases are parallel again
- AzureStorage, DTS remain fully parallelized across languages
…anup
- Fix DTS readiness comment: 'emulator port' not 'gRPC port'
- Wrap TcpClient in try/finally for proper disposal on connect failure
- Add pendingId to Phase 4 cleanup in CanStartOrchestration test
- Remove misleading 'log' from cleanup comments (we only dispose)
- Remove StatusCode assertion from cleanup (terminate may fail for already-completed or pending instances)
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
…ests
Replace all blocking .Result accesses with await to prevent potential deadlocks per xUnit analyzer rule xUnit1031. The tasks are already completed after Task.WhenAll, so await returns immediately.
andystaples left a comment
Some questions/comments
Pull request overview
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
…container
- Restore removed comments describing orchestration state context and skipped test reasons in DedupeStatusesTests.cs and PurgeInstancesTests.cs
- Fix misleading CI topology comments (MSSQL runs multiple languages sequentially)
- Add cleanup response logging for non-OK TerminateInstance calls
- Add --name dts-emulator --rm to DTS docker run for reliable log retrieval
- Use explicit container name in docker logs instead of brittle ancestor filter
Summary
Optimize E2E CI pipeline performance through test parallelization, test-level optimizations, and infrastructure improvements. Reduces overall CI wall-clock time by about 3x: the DTS pipeline, previously the bottleneck at ~46 minutes, now takes ~15 minutes.
Changes
1. Test Optimizations (DedupeStatusesTests & PurgeInstancesTests)
- Replace `LongRunningOrchestrator` with `HttpLongRunningOrchestrator` for dotnet-isolated: the timer-based orchestrator avoids 100K activity dispatch spam that contends with the emulator/backend
- … `Task.WhenAll`: tests that previously ran 6 status chains sequentially now run them in 3-4 parallel phases
- `HttpLongRunningOrchestrator` for dotnet-isolated (timer-based, lightweight), `LongRunningOrchestrator` for other languages (only available there)
- Dispose all `HttpResponseMessage` results
2. Pipeline Parallelization (E2ETest.yml)
- … `DurableTask.SqlServer`) which are only bundled with the dotnet-isolated app via NuGet. The sequential approach ensures dotnet-isolated runs first and sets up the schema.
- … `-E2EAppName`), not all 5
3. Infrastructure Improvements (build-e2e-test.ps1)
- … `sqlcmd` readiness polling; pre-creates `DurableDB` database for independent language runs
Expected Performance
Design Decisions
MSSQL stays sequential because non-dotnet function apps use `func extensions sync`, which only installs the base Durable Task extension, not the MSSQL storage provider. The provider DLLs (`DurableTask.SqlServer.dll`, `DurableTask.SqlServer.AzureFunctions.dll`) are only available in the dotnet-isolated app's `.azurefunctions` folder. Each parallel GitHub Actions job runs on a separate VM with no shared filesystem.
`HttpLongRunningOrchestrator` only for dotnet-isolated because it's defined only in `BasicDotNetIsolated/HTTPFeature.cs`. Other language apps use `LongRunningOrchestrator` (100K activities), which is heavier but the only option available.
Test parallelization within tests uses `Task.WhenAll` for independent orchestration lifecycle chains (each uses unique instance IDs). Cleanup is best-effort (dispose without asserting StatusCode) since instances may have already completed.
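
The phased structure with best-effort cleanup described above can be illustrated with a small sketch. The PR's tests are C# and use `Task.WhenAll`; the version below is a Python analog using `asyncio.gather`, and `run_chain`, `terminate`, and the three state names are hypothetical stand-ins chosen for the example rather than the actual test helpers.

```python
import asyncio
import uuid


async def run_chain(target_state: str) -> tuple[str, str]:
    """Hypothetical lifecycle chain: start an instance and drive it to a state."""
    instance_id = str(uuid.uuid4())  # unique ID keeps the chains independent
    await asyncio.sleep(0.01)        # stands in for start + wait-for-state calls
    return instance_id, target_state


async def terminate(instance_id: str, state: str) -> None:
    """Hypothetical cleanup call; it fails for instances that already finished."""
    await asyncio.sleep(0)
    if state == "Completed":
        raise RuntimeError(f"{instance_id} already completed")


async def run_test() -> dict[str, str]:
    # Phase 1: drive all independent chains to their target states in parallel.
    states = ["Running", "Completed", "Pending"]
    chains = await asyncio.gather(*(run_chain(s) for s in states))

    # ... assertions on the instances would go here ...

    # Final phase: best-effort cleanup. return_exceptions=True collects
    # failures instead of raising, so a failed terminate does not fail the test.
    await asyncio.gather(
        *(terminate(i, s) for i, s in chains), return_exceptions=True
    )
    return dict(chains)
```

Because each chain generates its own instance ID, the chains share no state and can run in any interleaving; only the cleanup phase tolerates errors.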