Optimize E2E CI: parallelize tests across languages, 3x wall-clock speedup #3405

Merged
berndverst merged 14 commits into Azure:dev from berndverst:beverst/improveCI
Mar 23, 2026

berndverst (Member) commented Mar 22, 2026

Summary

Optimize E2E CI pipeline performance through test parallelization, test-level optimizations, and infrastructure improvements. The slowest job (e2e-dts) drops from ~46m to ~15m, roughly a 3x reduction in overall CI wall-clock time.

Changes

1. Test Optimizations (DedupeStatusesTests & PurgeInstancesTests)

  • Replace LongRunningOrchestrator with HttpLongRunningOrchestrator for dotnet-isolated — the timer-based orchestrator avoids 100K activity dispatch spam that contends with the emulator/backend
  • Parallelize independent orchestration lifecycle operations within each test using Task.WhenAll — tests that previously ran 6 status chains sequentially now run them in 3-4 parallel phases
  • Language-aware orchestrator selection: HttpLongRunningOrchestrator for dotnet-isolated (timer-based, lightweight), LongRunningOrchestrator for other languages (the only long-running orchestrator available there)
  • Proper cleanup — terminate pending/running/suspended instances after tests; dispose all HttpResponseMessage results
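
The phased parallelization described above can be sketched as follows. This is only an illustrative Python analogue (the PR's tests are C# and use Task.WhenAll); `start_and_wait`, the instance IDs, and the phase contents are hypothetical stand-ins, not code from the PR.

```python
import asyncio


async def start_and_wait(instance_id: str, target_state: str) -> tuple[str, str]:
    """Hypothetical stand-in for starting an orchestration and waiting for a state."""
    await asyncio.sleep(0.05)  # simulates the HTTP round trip plus status polling
    return instance_id, target_state


async def run_phase(chains: dict[str, str]) -> dict[str, str]:
    """Run all independent chains of one phase concurrently (analogue of Task.WhenAll)."""
    results = await asyncio.gather(
        *(start_and_wait(instance_id, state) for instance_id, state in chains.items())
    )
    return dict(results)


async def main() -> dict[str, str]:
    # Phase 1: chains that do not depend on each other; each uses a unique instance ID.
    phase1 = await run_phase(
        {"running-1": "Running", "completed-1": "Completed", "pending-1": "Pending"}
    )
    # Phase 2 begins only after every chain in phase 1 has finished.
    phase2 = await run_phase({"suspended-1": "Suspended", "terminated-1": "Terminated"})
    return {**phase1, **phase2}


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Within a phase the wall-clock cost is roughly that of the slowest chain rather than the sum of all chains, which is where the test-level speedup comes from.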

2. Pipeline Parallelization (E2ETest.yml)

  • Azure Storage (Linux & Windows) and DTS: Split each backend's 5-language sequential run into parallel matrix jobs (5 languages × 2 TFMs = 10 parallel jobs per backend)
  • MSSQL: Kept sequential — non-dotnet apps lack the MSSQL extension DLLs (DurableTask.SqlServer) which are only bundled with the dotnet-isolated app via NuGet. The sequential approach ensures dotnet-isolated runs first and sets up the schema.
  • Each parallel job only builds its own language's test app (-E2EAppName), not all 5
  • Python/Java SDKs conditionally installed only when needed
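
As a rough illustration of the matrix expansion above, the sketch below enumerates one job per language and TFM and decides which SDKs a job needs. It is written in Python for illustration only; the language labels, the TFM placeholders (`tfm-a`, `tfm-b`), and the `sdks_needed` helper are assumptions, not values taken from E2ETest.yml.

```python
from itertools import product

# Language labels are illustrative; the PR lists dotnet, PowerShell, Python, Node and Java.
LANGUAGES = ["dotnet-isolated", "powershell", "python", "node", "java"]
# Placeholder TFM names: the actual target frameworks are not given in the PR text.
TFMS = ["tfm-a", "tfm-b"]


def build_matrix() -> list[dict[str, str]]:
    """One job per (language, TFM) pair; each job builds only its own test app."""
    return [{"language": lang, "tfm": tfm} for lang, tfm in product(LANGUAGES, TFMS)]


def sdks_needed(job: dict[str, str]) -> list[str]:
    """Install the Python or Java SDK only for jobs that actually test that language."""
    return [job["language"]] if job["language"] in ("python", "java") else []


jobs = build_matrix()
print(len(jobs))  # 5 languages x 2 TFMs
```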

3. Infrastructure Improvements (build-e2e-test.ps1)

  • DTS emulator: Replaced 30s hardcoded sleep with TCP readiness polling (port 8080), 60s timeout, proper TcpClient disposal, error exit with container log dump
  • MSSQL container: Replaced 30s sleep with sqlcmd readiness polling; pre-creates DurableDB database for independent language runs
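
The readiness polling that replaces the fixed sleep can be sketched as below. This is a minimal Python illustration of the idea, not the PowerShell in build-e2e-test.ps1 (which uses TcpClient); the function name and the one-second poll interval are assumptions.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # The context manager closes the socket even if a later step fails,
            # mirroring the "proper TcpClient disposal" point above.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)  # not listening yet; retry until the deadline
    return False
```

A caller would treat a `False` result as fatal, for example by dumping the container logs and exiting non-zero, rather than continuing with a warning.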

Expected Performance

| Job Type | Before (sequential) | After (parallel) | Speedup |
|---|---|---|---|
| e2e-azurestorage-linux | ~12m | ~7m (wall-clock) | ~1.7x |
| e2e-azurestorage-windows | ~30m | ~13m (wall-clock) | ~2.3x |
| e2e-mssql | ~12m | ~12m | unchanged (sequential) |
| e2e-dts | ~46m | ~15m (wall-clock) | ~3x |
| DedupeStatuses tests | ~14m | ~2.5m | ~5.4x |
| PurgeOnlyPurgesTerminal | ~2m | ~24s | ~5x |

Design Decisions

  1. MSSQL stays sequential because non-dotnet function apps use func extensions sync which only installs the base Durable Task extension, not the MSSQL storage provider. The provider DLLs (DurableTask.SqlServer.dll, DurableTask.SqlServer.AzureFunctions.dll) are only available in the dotnet-isolated app's .azurefunctions folder. Each parallel GitHub Actions job runs on a separate VM with no shared filesystem.

  2. HttpLongRunningOrchestrator only for dotnet-isolated because it's defined only in BasicDotNetIsolated/HTTPFeature.cs. Other language apps use LongRunningOrchestrator (100K activities) which is heavier but the only option available.

  3. Test parallelization within tests uses Task.WhenAll for independent orchestration lifecycle chains (each uses unique instance IDs). Cleanup is best-effort (dispose without asserting StatusCode) since instances may have already completed.
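
The best-effort cleanup in decision 3 might look like the sketch below. It is a Python illustration with a fake response type; `FakeResponse`, `terminate_instance`, and the status codes are hypothetical and do not reflect the real HTTP API or the C# test helpers.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for an HTTP response that must be released after use."""
    status_code: int
    closed: bool = False

    def close(self) -> None:
        self.closed = True


def terminate_instance(instance_id: str, already_completed: set[str]) -> FakeResponse:
    """Hypothetical terminate call; the status codes are arbitrary illustration values."""
    return FakeResponse(status_code=409 if instance_id in already_completed else 202)


def cleanup(instance_ids: list[str], already_completed: set[str]) -> list[FakeResponse]:
    """Terminate every instance, always release the response, never assert on the status."""
    responses = []
    for instance_id in instance_ids:
        response = terminate_instance(instance_id, already_completed)
        try:
            # Deliberately no assertion on status_code: the instance may have finished
            # on its own, in which case a failed terminate is expected and harmless.
            responses.append(response)
        finally:
            response.close()
    return responses
```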

…rchestrator

- Replace LongRunningOrchestrator (100K activities) with HttpLongRunningOrchestrator
  (timer-based) to eliminate emulator operation queue contention
- Parallelize independent orchestration lifecycle operations within DedupeStatusesTests
  and PurgeOnlyPurgesTerminalOrchestrations using Task.WhenAll
- Add helper methods to PurgeInstancesTests for cleaner parallel test structure

Measured improvement: 14m 23s -> 2m 39s (5.4x speedup) for these tests
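
For reference, the quoted speedup follows directly from the two timings. A small Python check of the arithmetic (the helper names are just for illustration):

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds duration to seconds."""
    return minutes * 60 + seconds


def speedup(before_s: int, after_s: int) -> float:
    """Speedup factor: how many times shorter the new duration is."""
    return before_s / after_s


dedupe = speedup(to_seconds(14, 23), to_seconds(2, 39))  # 863 s / 159 s
purge = speedup(to_seconds(2), to_seconds(0, 24))        # 120 s / 24 s
print(f"{dedupe:.1f}x, {purge:.1f}x")  # prints "5.4x, 5.0x"
```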
Copilot AI review requested due to automatic review settings March 22, 2026 06:16

@berndverst (Member, Author) commented:

FYI @sophiatev - the Dedupe tests that we had were responsible for the majority of the CI slowness. Please take a look at this PR because the tests are significantly altered now.

Copilot AI (Contributor) left a comment

Pull request overview

This PR speeds up the slowest Durable Task Scheduler (DTS) E2E tests by reducing long-running orchestration load and parallelizing independent test operations, while introducing helper methods to make the new structure easier to follow.

Changes:

  • Replaces the activity-flooding long-running orchestrator usage with a timer-based alternative in the tests.
  • Parallelizes orchestration start/transition/purge/retry phases to reduce end-to-end runtime.
  • Refactors PurgeInstancesTests with helper methods to support the phased parallel structure.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| test/e2e/Tests/Tests/PurgeInstancesTests.cs | Parallelizes purge test orchestration lifecycle steps and adds helper methods; switches the “running” orchestrator used by the test. |
| test/e2e/Tests/Tests/DedupeStatusesTests.cs | Parallelizes dedupe-status test orchestration lifecycle steps and switches the “running” orchestrator used by the tests. |

Bernd Verst added 2 commits March 21, 2026 23:42
…x cleanup

- Use HttpLongRunningOrchestrator only for dotnet-isolated, LongRunningOrchestrator
  for other languages (HttpLongRunningOrchestrator only exists in BasicDotNetIsolated)
- Dispose all HttpResponseMessage results from StartAndWaitForState/WithDedupeStatuses
- Assert StatusCode and dispose cleanup terminate responses in Phase 4
- Dispose non-terminal purge responses in PurgeOnlyPurgesTerminalOrchestrations
- Fix StartOrchAndWaitForStatus to pass orchestrationName in scheduled path

The e2e-dts pipeline ran 5 languages (dotnet, PowerShell, Python, Node, Java)
sequentially in a single job, taking ~46 minutes total. This change:

- Split each language into a separate parallel matrix job (5x parallelism)
- Each job only builds its own test app (not all 5)
- Only install Python/Java SDKs when needed (conditional steps)
- Replace 30s hardcoded emulator sleep with health check polling (~5s)

Expected result: wall-clock time limited by the slowest single language
(~15min for dotnet-isolated) instead of the sum of all 5.
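
The "slowest language instead of the sum" reasoning can be made concrete with a small Python sketch. Only the ~46 minute total and the ~15 minute dotnet-isolated figure come from this PR; the other per-language durations are invented so that the numbers add up, and per-job setup overhead is ignored.

```python
# Per-language durations in minutes. Only the 46-minute total and the 15-minute
# dotnet-isolated value are from the PR; the rest are made-up illustration values.
durations_min = {
    "dotnet-isolated": 15,
    "powershell": 8,
    "python": 8,
    "node": 7,
    "java": 8,
}

sequential_min = sum(durations_min.values())  # one job running all languages in turn
parallel_min = max(durations_min.values())    # matrix jobs: bounded by the slowest one

print(sequential_min, parallel_min)  # prints "46 15"
```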
Copilot AI review requested due to automatic review settings March 22, 2026 18:44

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


Bernd Verst added 2 commits March 22, 2026 12:05
- Add pending instance cleanup in DedupeStatusesTests Phase 4
- Add non-terminal instance cleanup in PurgeOnlyPurgesTerminalOrchestrations
- Map port 8081 for DTS emulator health checks (HTTP/1)
- Poll 8081 instead of 8080 (gRPC/HTTP2 incompatible with Invoke-WebRequest)
- Fail script with exit 1 if emulator not ready (instead of warning)
Copilot AI review requested due to automatic review settings March 22, 2026 19:16

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


Bernd Verst added 2 commits March 22, 2026 13:36
- Parallelize e2e-azurestorage-linux, e2e-azurestorage-windows, e2e-mssql
  with the same language matrix pattern used for e2e-dts
- Each language runs as its own parallel job with only its app built
- Conditional Python/Java SDK setup (only when needed)
- Fix DTS health check: use TCP probe on port 8080 instead of HTTP on 8081
- Increase health check timeout to 60s with container log dump on failure

…ents

- Add comments explaining LongRunningOrchestrator load is isolated per CI job
- Improve cleanup comments in PurgeOnlyPurgesTerminalOrchestrations
Copilot AI review requested due to automatic review settings March 22, 2026 20:42

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


berndverst (Member, Author) commented Mar 22, 2026

CI Performance Results

The latest CI runs confirm all optimizations are working. MSSQL is kept sequential (see Design Decisions in PR description).

| Job Type | Before | After | Speedup | Status |
|---|---|---|---|---|
| azurestorage-linux | ~12m | ~7m | ~1.7x | ✅ All pass |
| azurestorage-windows | ~30m | ~13m | ~2.3x | ✅ All pass |
| mssql | ~12m | ~12m | (sequential, unchanged) | ✅ Pass |
| dts | ~46m | ~15m | ~3x | ✅ All pass |
| DedupeStatuses test | ~14m | ~2.5m | ~5.4x | |
| PurgeOnlyPurges test | ~2m | ~24s | ~5x | |

Overall wall-clock improvement: ~3x (bottleneck was DTS at ~46m, now ~15m).

Bernd Verst added 2 commits March 22, 2026 13:54
- Replace 30s hardcoded MSSQL sleep with sqlcmd readiness polling
- Pre-create DurableDB database after container starts so all languages
  can use it independently without relying on dotnet-isolated running first
- Exit with error if SQL Server doesn't become ready within 30s

Non-dotnet languages use LongRunningOrchestrator (100K activities per instance).
Starting 3+ of these concurrently overwhelms the function host with activity
spam, causing socket errors. For non-dotnet, long-running orchestration starts
are now sequential while fast orchestrations (HelloCities, LargeOutput) remain
concurrent. Dotnet-isolated uses HttpLongRunningOrchestrator (timer-based) and
retains full parallelization.
Copilot AI review requested due to automatic review settings March 22, 2026 21:05

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.


Bernd Verst added 2 commits March 22, 2026 14:19
- MSSQL: reverted to sequential (original) because non-dotnet apps lack the
  MSSQL extension DLLs and depend on dotnet-isolated running first to install them
- Reverted the canParallelize/sequential LongRunningOrchestrator branching in
  DedupeStatusesTests and PurgeInstancesTests; all phases are parallel again
- AzureStorage, DTS remain fully parallelized across languages

…anup

- Fix DTS readiness comment: 'emulator port' not 'gRPC port'
- Wrap TcpClient in try/finally for proper disposal on connect failure
- Add pendingId to Phase 4 cleanup in CanStartOrchestration test
- Remove misleading 'log' from cleanup comments (we only dispose)
- Remove StatusCode assertion from cleanup (terminate may fail for
  already-completed or pending instances)
Copilot AI review requested due to automatic review settings March 22, 2026 21:25
berndverst changed the title from "Optimize slow e2e DTS tests: 5.4x speedup (14min → 2.5min)" to "Optimize E2E CI: parallelize tests across languages, 5x test speedup" on Mar 22, 2026

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.


berndverst changed the title from "Optimize E2E CI: parallelize tests across languages, 5x test speedup" to "Optimize E2E CI: parallelize tests across languages, 3x wall-clock speedup" on Mar 22, 2026

…ests

Replace all blocking .Result accesses with await to prevent potential
deadlocks per xUnit analyzer rule xUnit1031. The tasks are already
completed after Task.WhenAll, so await returns immediately.
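
The "already completed, so await returns immediately" point has a close analogue in Python's asyncio, sketched below. The deadlock concern behind xUnit1031 is specific to blocking on .NET tasks and is not reproduced here; `fetch_status` and the instance IDs are hypothetical.

```python
import asyncio


async def fetch_status(instance_id: str) -> str:
    """Hypothetical status query for one orchestration instance."""
    await asyncio.sleep(0.01)
    return f"{instance_id}:Completed"


async def main() -> list[str]:
    tasks = [asyncio.create_task(fetch_status(i)) for i in ("a", "b")]
    await asyncio.gather(*tasks)  # analogue of awaiting Task.WhenAll
    # Every task is finished at this point, so awaiting each one again does not
    # wait; it simply yields the stored result.
    assert all(task.done() for task in tasks)
    return [await task for task in tasks]


if __name__ == "__main__":
    print(asyncio.run(main()))
```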

andystaples (Collaborator) left a comment

Some questions/comments

Copilot AI review requested due to automatic review settings March 23, 2026 16:54

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.


…container

- Restore removed comments describing orchestration state context and skipped test reasons in DedupeStatusesTests.cs and PurgeInstancesTests.cs
- Fix misleading CI topology comments (MSSQL runs multiple languages sequentially)
- Add cleanup response logging for non-OK TerminateInstance calls
- Add --name dts-emulator --rm to DTS docker run for reliable log retrieval
- Use explicit container name in docker logs instead of brittle ancestor filter
@berndverst berndverst merged commit 5aa36ee into Azure:dev Mar 23, 2026
69 of 70 checks passed
4 participants