fix(pipeline): timeout + retry InstallPlaywrightModule so a hung download fails fast#5889
Conversation
…load fails fast `npx playwright install --with-deps` ran with no per-attempt timeout and no retry. When the network stalled mid-download (~500 MB of browser binaries), the module sat at the outer 30-minute module budget before failing, cascading into RunPlaywrightTestsModule, UploadToNuGetModule, and CreateReleaseModule (run 25238011861). Adds a 10-minute per-attempt timeout via a linked CancellationTokenSource plus the standard ModuleConfiguration.WithRetryCount(2). Happy path (~2 min on a warm runner) is unaffected; a hang now fails at 10 min and retries instead of burning the full 30.
There was a problem hiding this comment.
Code Review
Overall: The design is correct and addresses a real production flakiness issue cleanly. Two points worth considering.
Concern: OperationCanceledException and retry behaviour
The linked CTS approach is correct for propagating external cancellation, but it introduces a subtle risk: when attemptCts fires after 10 minutes, BashCommandOptions execution throws OperationCanceledException. Whether ModularPipelines retries on OperationCanceledException depends on how its retry logic distinguishes between:
- Timeout-caused OCE (from
attemptCts.Token) → should retry - External-cancellation-caused OCE (from outer
cancellationToken) → should not retry
Because attemptCts is a linked source, attemptCts.Token != cancellationToken, so cancellationToken.IsCancellationRequested is false when the timeout fires — the framework should retry. But if ModularPipelines' retry handler catches OperationCanceledException and simply rethrows without inspecting the token, the retry won't happen and the module fails after one attempt despite WithRetryCount(2).
Suggested check: Add a quick smoke-test where the timeout fires before the command finishes and confirm the framework actually retries. Alternatively, consider wrapping the cancellation and rethrowing as a non-OCE exception explicitly:
catch (OperationCanceledException) when (attemptCts.IsCancellationRequested && !cancellationToken.IsCancellationRequested)
{
throw new TimeoutException($"Playwright install timed out after {PerAttemptTimeout.TotalMinutes} minutes");
}This makes the timeout a TimeoutException (which the retry framework will definitely retry), and re-throws the original OCE only when external cancellation was the cause.
Minor: multi-line comment violates project style
CLAUDE.md specifies "Never write multi-paragraph docstrings or multi-line comment blocks — one short line max." The 3-line block comment on PerAttemptTimeout violates this.
The WHY here is non-obvious enough to warrant a comment, but it should be a single line — the PR description already documents the full reasoning:
// 10 min: far beyond happy-path (~2 min) but short enough to retry instead of burning the 30-min module budget.
private static readonly TimeSpan PerAttemptTimeout = TimeSpan.FromMinutes(10);What's good
- Correctly mirrors the
WithRetryCountpattern fromAbstractTestNugetPackageModule. using var attemptCts+CancelAfteris the right modern pattern (nonew CancellationTokenSource(TimeSpan)anti-pattern).- Each retry gets a fresh
CancellationTokenSourcebecauseExecuteAsyncis re-entered per attempt — that part is correct. - 10-minute per-attempt ceiling is a sensible choice given the stated ~2 min happy path and 30-min outer budget.
Up to standards ✅🟢 Issues
|
Updated [TUnit.Core](https://github.com/thomhurst/TUnit) from 1.44.0 to 1.45.8. <details> <summary>Release notes</summary> _Sourced from [TUnit.Core's releases](https://github.com/thomhurst/TUnit/releases)._ ## 1.45.8 <!-- Release notes generated using configuration in .github/release.yml at v1.45.8 --> ## What's Changed ### Other Changes * fix(aspire): route CreateHttpClient through IHttpClientFactory by @thomhurst in thomhurst/TUnit#5957 ### Dependencies * chore(deps): update tunit to 1.45.0 by @thomhurst in thomhurst/TUnit#5949 * chore(deps): update dependency dompurify to v3.4.5 by @thomhurst in thomhurst/TUnit#5951 * chore(deps): update dependency microsoft.testing.extensions.codecoverage to 18.7.0 by @thomhurst in thomhurst/TUnit#5953 * chore(deps): update dependency coverlet.collector to 10.0.1 by @thomhurst in thomhurst/TUnit#5952 * chore(deps): update dependency polyfill to 10.6.0 by @thomhurst in thomhurst/TUnit#5955 * chore(deps): update dependency polyfill to 10.6.0 by @thomhurst in thomhurst/TUnit#5954 **Full Changelog**: thomhurst/TUnit@v1.45.0...v1.45.8 ## 1.45.0 <!-- Release notes generated using configuration in .github/release.yml at v1.45.0 --> ## What's Changed ### Other Changes * fix(generator): place CallerArgumentExpression before params in [GenerateAssertion] emit by @JohnVerheij in thomhurst/TUnit#5940 * fix(sourcegen): drop covariant TActual when [GenerateAssertion] method has its own type parameters by @JohnVerheij in thomhurst/TUnit#5935 * feat(assertions): add CancellationToken overload to WaitsFor and Eventually by @JohnVerheij in thomhurst/TUnit#5938 * fix(html-report): extract categories using MTP Key=name convention by @thomhurst in thomhurst/TUnit#5946 * feat(html-report): rewrite as split-pane design template by @thomhurst in thomhurst/TUnit#5947 ### Dependencies * chore(deps): update microsoft.testing to 2.2.3 by @thomhurst in thomhurst/TUnit#5927 * chore(deps): update mstest to 4.2.3 by @thomhurst in thomhurst/TUnit#5928 * chore(deps): update tunit to 1.44.39 by @thomhurst in thomhurst/TUnit#5929 * chore(deps): update aspire to 13.3.3 by @thomhurst in thomhurst/TUnit#5933 * chore(deps): update dependency dompurify to v3.4.4 by @thomhurst in thomhurst/TUnit#5944 * chore(deps): update dependency qs to v6.15.2 by @thomhurst in thomhurst/TUnit#5941 **Full Changelog**: thomhurst/TUnit@v1.44.39...v1.45.0 ## 1.44.39 <!-- Release notes generated using configuration in .github/release.yml at v1.44.39 --> ## What's Changed ### Other Changes * fix(tests): retry trx read to dodge MTP post-exit flush race on Windows by @thomhurst in thomhurst/TUnit#5888 * fix(pipeline): timeout + retry InstallPlaywrightModule so a hung download fails fast by @thomhurst in thomhurst/TUnit#5889 * fix(otel): require two consecutive idle windows in DrainAsync to catch in-transit POSTs by @thomhurst in thomhurst/TUnit#5890 * test(assertions): drop flaky wall-clock upper bound on WaitsFor timeout test by @thomhurst in thomhurst/TUnit#5886 * fix(sourcegen): drop spurious ')' in MethodAssertionGenerator Task<bool> emit by @JohnVerheij in thomhurst/TUnit#5920 * fix(sourcegen): merge generic parameter lists in [AssertionExtension] emit by @JohnVerheij in thomhurst/TUnit#5921 * fix(aspnetcore): scope correlation processor per-factory to stop cross-factory tag leak by @thomhurst in thomhurst/TUnit#5891 * Changed FSharp.Core version to 10.1.300 by @licon4812 in thomhurst/TUnit#5909 * feat(mocks): add Mock.HttpClientFactory() helper by @thomhurst in thomhurst/TUnit#5894 * Harden WaitsFor timeout test by @thomhurst in thomhurst/TUnit#5926 * fix(sourcegen): emit `default` literal for value-type assertion parameters by @JohnVerheij in thomhurst/TUnit#5919 ### Dependencies * chore(deps): update dependency nunit to 4.6.0 by @thomhurst in thomhurst/TUnit#5826 * chore(deps): update tunit to 1.44.0 by @thomhurst in thomhurst/TUnit#5882 * chore(deps): update dependency mockolate to 3.2.0 by @thomhurst in thomhurst/TUnit#5892 * chore(deps): update dependency yaml to v2.9.0 by @thomhurst in thomhurst/TUnit#5887 * chore(deps): update dependency nuget.protocol to 7.6.0 by @thomhurst in thomhurst/TUnit#5897 * chore(deps): update dependency microsoft.entityframeworkcore to 10.0.8 by @thomhurst in thomhurst/TUnit#5898 * chore(deps): update dependency microsoft.templateengine.authoring.cli to v10.0.300 by @thomhurst in thomhurst/TUnit#5899 * chore(deps): update microsoft.extensions by @thomhurst in thomhurst/TUnit#5905 * chore(deps): update microsoft.aspnetcore to 10.0.8 by @thomhurst in thomhurst/TUnit#5904 * chore(deps): update dependency microsoft.templateengine.authoring.templateverifier to 10.0.300 by @thomhurst in thomhurst/TUnit#5902 * chore(deps): update aspire to 13.3.1 by @thomhurst in thomhurst/TUnit#5900 * chore(deps): update dependency system.commandline to 2.0.8 by @thomhurst in thomhurst/TUnit#5903 * chore(deps): update dependency azure.storage.blobs to 12.28.0 by @thomhurst in thomhurst/TUnit#5910 * chore(deps): update dependency dotnet-sdk to v10.0.300 by @thomhurst in thomhurst/TUnit#5901 * chore(deps): update dependency stackexchange.redis to 2.13.1 by @thomhurst in thomhurst/TUnit#5906 * chore(deps): update aspire to 13.3.2 by @thomhurst in thomhurst/TUnit#5924 * chore(deps): bump mermaid from 11.12.2 to 11.15.0 in /docs by @dependabot[bot] in thomhurst/TUnit#5893 * chore(deps): update dependency streamjsonrpc to 2.24.92 by @thomhurst in thomhurst/TUnit#5915 * chore(deps): update dependency dompurify to v3.4.3 by @thomhurst in thomhurst/TUnit#5913 * chore(deps): update microsoft.build to 18.6.3 by @thomhurst in thomhurst/TUnit#5914 **Full Changelog**: thomhurst/TUnit@v1.44.0...v1.44.39 Commits viewable in [compare view](thomhurst/TUnit@v1.44.0...v1.45.8). </details> [](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Summary
npx playwright install --with-depsran with no per-attempt timeout and no retry. When the network stalled mid-download (~500 MB of browser binaries), the module sat at the outer 30-minute module budget before failing — cascading intoRunPlaywrightTestsModule,UploadToNuGetModule, andCreateReleaseModule(run 25238011861).Adds a 10-minute per-attempt timeout via a linked
CancellationTokenSourceplus the sameModuleConfiguration.WithRetryCount(2)pattern already used byTestNugetPackageModule. Happy path (~2 min on a warm runner) is unaffected; a hang now fails at 10 min and retries instead of burning the full 30.Notes
package.json, so the workflow'sactions/cache@v5step (.github/workflows/dotnet.yml:78-88) keys the cache againstdocs/**/package.jsonandDirectory.Packages.props— neither tracks the Playwright version. Cache will hit reliably across PR runs, but a Playwright release will silently invalidate which browsers are needed and trigger a fresh download. Out of scope to fix here.Test plan
dotnet build TUnit.Pipelinepasses