Broaden transient CI infrastructure rerun handling by radical · Pull Request #15157 · microsoft/aspire

radical · 2026-03-12T03:18:50Z

Description

This change broadens the transient CI rerun workflow's log-based infrastructure network detection beyond the older fixed build-step allowlist, while still keeping test execution failures out of the log override path.

It also improves the rerun reason surfaced in summaries and PR comments by explaining that the retry came from a likely transient infrastructure network failure and by including the specific matched pattern.

The workflow script, its documentation, and the focused workflow-script tests were updated together so the behavior and docs stay aligned.

Fixes # (issue)

Checklist

Is this feature complete?
- Yes. Ready to ship.
- No. Follow-up changes expected.
Are you including unit tests for the changes and scenario tests if relevant?
- Yes
- No
Did you add public API?
- Yes
  - If yes, did you have an API Review for it?
    - Yes
    - No
  - Did you add <remarks /> and <code /> elements on your triple slash comments?
    - Yes
    - No
- No
Does the change make any security assumptions or guarantees?
- Yes
  - If yes, have you done a threat model and had a security review?
    - Yes
    - No
- No
Does the change require an update in our Aspire docs?
- Yes
  - Link to aspire.dev issue:
    - New issue
- No

Validation

./restore.sh
dotnet test tests/Infrastructure.Tests/Infrastructure.Tests.csproj -- --filter-class "*.AutoRerunTransientCiFailuresTests" --filter-not-trait "quarantined=true" --filter-not-trait "outerloop=true"

Expand the log-based override beyond the old build-step allowlist, keep test execution failures excluded from the override, and clarify the retry reason surfaced in job summaries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Surface the specific infrastructure-network pattern that matched so transient rerun summaries explain exactly why a job was considered retryable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-03-12T03:18:59Z

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15157

Or

Run remotely in PowerShell:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15157"

mitchdenny

Reviewed the security model around log injection. The workflow_run trigger ensures the analysis script always runs from the default branch, preventing privilege escalation from PR code. The broadened log-based override surface (step-name denylist replacing allowlist, plus shorter generic git patterns like RPC failed) creates a low-severity log injection vector that could force one unnecessary CI rerun, but existing safety rails (attempt-1 gate, 5-job cap, same-code reruns, full audit trail) keep the blast radius minimal. The test-execution denylist correctly prevents retry when actual test steps fail.
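To make the safety rails above concrete, here is a minimal sketch of an attempt-1 gate combined with a 5-job cap. Every name here (`MAX_RERUN_JOBS`, `decideRerun`, the shape of the result object) is hypothetical, and the policy of skipping the rerun entirely when the cap is exceeded is an assumption; the actual workflow script may behave differently.

```javascript
// Illustrative sketch only: the names and the over-cap policy are assumptions,
// not the workflow script's actual implementation.
const MAX_RERUN_JOBS = 5; // the "5-job cap" mentioned in the review

function decideRerun(runAttempt, retryableJobs) {
    // Attempt-1 gate: only the first attempt of a run may trigger an automatic rerun,
    // so an injected log line can cause at most one extra rerun.
    if (runAttempt !== 1) {
        return { rerunJobs: [], skippedReason: `Run attempt ${runAttempt} is not the first attempt.` };
    }

    if (retryableJobs.length === 0) {
        return { rerunJobs: [], skippedReason: 'No retryable jobs were found.' };
    }

    // Job cap (assumed policy): too many retryable jobs means nothing is rerun automatically.
    if (retryableJobs.length > MAX_RERUN_JOBS) {
        return {
            rerunJobs: [],
            skippedReason: `${retryableJobs.length} retryable jobs exceed the cap of ${MAX_RERUN_JOBS}.`,
        };
    }

    return { rerunJobs: retryableJobs, skippedReason: null };
}
```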

Make rerun analysis output more human-readable, render clickable summary links, show explicit skipped versus rerun outcomes, and surface failed attempt, rerun attempt, and PR comment links in the workflow summary. Also add an optional workflow_dispatch dry_run input so manual runs default to real reruns while still supporting inspection-only execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR expands the transient CI auto-rerun workflow’s ability to detect likely infrastructure/network failures by scanning job logs (beyond the previous step-name allowlist), while preserving the existing “don’t override test execution failures” guardrails. It also refines the workflow summary/PR comment messaging to surface clearer rerun eligibility/outcome and the specific matched log pattern.

Changes:

- Broaden log-based infrastructure/network failure detection and improve retry/skipped reason messages.
- Update workflow summary output to use explicit outcome headings and clickable links (including PR comment links when posted).
- Update docs and Infrastructure workflow-script tests/harness to align with the new behavior and summary formatting.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


File	Description
`.github/workflows/auto-rerun-transient-ci-failures.js`	Expands log-based override matchers, adds reason formatting/helpers, and updates summary/comment formatting.
`.github/workflows/auto-rerun-transient-ci-failures.yml`	Adds `dry_run` input for `workflow_dispatch` and passes attempt metadata into summary output.
`docs/ci/auto-rerun-transient-ci-failures.md`	Updates documentation to reflect broader log override coverage and new manual dry-run behavior.
`tests/Infrastructure.Tests/WorkflowScripts/auto-rerun-transient-ci-failures.harness.js`	Extends the harness summary/GitHub recorder to support link/break events and comment URL responses.
`tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs`	Updates/extends tests to validate new reasons, broader override behavior, and new summary link events.

Copilot · 2026-03-12T05:45:36Z

.github/workflows/auto-rerun-transient-ci-failures.js

+function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
+    const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
+    return `Failed step '${failedStepText}' will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`;


getInfrastructureNetworkLogOverrideReason always starts with "Failed step" even when multiple steps failed (the caller passes failedStepText which can be a |-joined list). This will produce grammatically incorrect and potentially confusing retry reasons when more than one step failed. Consider using the existing pluralization helpers (formatFailedStepLabel/isSingleFailedStep) when building this reason, or pass failedSteps into the helper so it can choose "step" vs "steps".

Suggested change

-function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
-    const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
-    return `Failed step '${failedStepText}' will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`;
+function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern, failedSteps) {
+    const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
+    let failedStepLabel;
+    if (Array.isArray(failedSteps)) {
+        failedStepLabel = formatFailedStepLabel(failedSteps, failedStepText);
+    } else {
+        // Fall back to a simple heuristic when we only have the joined text.
+        const isSingle = !failedStepText.includes(' | ');
+        const label = isSingle ? 'Failed step' : 'Failed steps';
+        failedStepLabel = `${label} '${failedStepText}'`;
+    }
+    return `${failedStepLabel} will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`;

Copilot · 2026-03-12T05:45:37Z

.github/workflows/auto-rerun-transient-ci-failures.js

+function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
+    const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';


The retry reason appends the matched regex via string interpolation (e.g., /.../i). These reasons are rendered as Markdown in both the workflow summary and PR comments; unescaped regex text (notably underscores like _packaging) can cause unintended Markdown formatting and make the message hard to read/copy. Consider wrapping the matched pattern in inline code (backticks) or otherwise escaping/sanitizing it before emitting it in Markdown.

Suggested change

-function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
-    const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
+function formatMatchedPatternForMarkdown(matchedPattern) {
+    if (!matchedPattern) {
+        return '';
+    }
+    const safePattern = String(matchedPattern).replace(/`/g, '\\`');
+    return ` Matched pattern: \`${safePattern}\`.`;
+}
+function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
+    const patternText = formatMatchedPatternForMarkdown(matchedPattern);
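One caveat on escaping backticks with a backslash: the CommonMark specification states that backslash escapes do not work inside code spans, so a pattern that contains a backtick would still end the span early. A common alternative is to delimit the span with a backtick run longer than any run inside the text. The sketch below assumes a hypothetical helper named `formatInlineCode`; it is not part of the PR.

```javascript
// Hypothetical helper (not in the PR): wraps text in a Markdown code span
// whose delimiter is longer than any backtick run inside the text.
function formatInlineCode(text) {
    const value = String(text);

    // Length of the longest run of consecutive backticks in the value (0 if none).
    const runs = value.match(/`+/g) ?? [];
    const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
    const fence = '`'.repeat(longest + 1);

    // If the value starts or ends with a backtick, pad with one space on each side so the
    // delimiter stays distinct; CommonMark strips one leading and one trailing space.
    const needsPad = value.startsWith('`') || value.endsWith('`');
    const pad = needsPad ? ' ' : '';

    return `${fence}${pad}${value}${pad}${fence}`;
}
```

With this approach, a pattern such as `/_packaging/i` is emitted inside single backticks, and the underscores are no longer interpreted as emphasis markers.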

Copilot · 2026-03-12T05:45:37Z

.github/workflows/auto-rerun-transient-ci-failures.js

+        return 'The job annotations did not show a retry-safe transient infrastructure failure.';
+    }
+
+    return 'No retry-safe transient infrastructure signal was found in the job annotations or logs.';


The fallback message "No retry-safe transient infrastructure signal was found in the job annotations or logs." can be misleading for cases where log inspection wasn't performed (e.g., when failedSteps.length === 0, analyzeFailedJobs never fetches logs). Consider tailoring this message based on whether logs were actually inspected/available, or using wording that doesn't imply both sources were checked.

Suggested change

-    return 'No retry-safe transient infrastructure signal was found in the job annotations or logs.';
+    return 'No retry-safe transient infrastructure signal was found in the available job diagnostics.';
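The other option mentioned in this comment, tailoring the message to the sources that were actually inspected, might look like the following sketch. The function name `getNoSignalReason` and the `annotationsInspected` / `logsInspected` flags are hypothetical; the real script would need to track whether logs were fetched.

```javascript
// Hypothetical sketch: build the fallback reason from the diagnostics that were really checked.
function getNoSignalReason({ annotationsInspected, logsInspected }) {
    const sources = [];
    if (annotationsInspected) {
        sources.push('job annotations');
    }
    if (logsInspected) {
        sources.push('job logs');
    }

    if (sources.length === 0) {
        return 'No job diagnostics were available to inspect for a retry-safe transient infrastructure signal.';
    }

    return `No retry-safe transient infrastructure signal was found in the ${sources.join(' or ')}.`;
}
```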

Copilot · 2026-03-12T05:45:37Z

.github/workflows/auto-rerun-transient-ci-failures.js

        const shouldInspectLogs =
            !classification.retryable &&
            getJobLogTextForJob &&
-            failedSteps.some(step => matchesAny(step, feedNetworkFailureStepPatterns));
+            canUseInfrastructureNetworkLogOverride(failedSteps);



analyzeFailedJobs now fetches and scans full job logs for any non-test failure (canUseInfrastructureNetworkLogOverride), which can significantly increase API traffic and memory usage because getJobLogText loads the entire log into a single string. Consider adding additional gating (e.g., only fetch logs when annotations already show a likely network issue) and/or limiting the amount of log text downloaded/scanned (first N KB / streaming) to keep the workflow reliable at scale.
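As a rough illustration of bounding the scan, the sketch below only searches the last N characters of the log text. The names `MAX_LOG_SCAN_CHARS` and `findMatchingPatternInLogTail` and the 256 KB default are assumptions, as is the choice of the tail rather than the head (made on the guess that the failing step's output sits near the end of the log). Note that this limits only the scanning cost; reducing what is downloaded would need a ranged or streamed fetch, which is not shown here.

```javascript
// Hypothetical sketch: cap how much log text is scanned for transient-failure patterns.
const MAX_LOG_SCAN_CHARS = 256 * 1024; // assumed budget, not taken from the PR

function findMatchingPatternInLogTail(logText, patterns, maxChars = MAX_LOG_SCAN_CHARS) {
    // Keep only the trailing slice when the log exceeds the budget.
    const tail = logText.length > maxChars ? logText.slice(-maxChars) : logText;

    // Patterns are assumed to be RegExp objects without the `g` or `y` flag,
    // so `test` carries no state between calls.
    for (const pattern of patterns) {
        if (pattern.test(tail)) {
            return pattern;
        }
    }

    return null;
}
```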

Copilot · 2026-03-12T05:45:37Z

tests/Infrastructure.Tests/WorkflowScripts/auto-rerun-transient-ci-failures.harness.js

+                const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber]
+                    ?? `https://github.com/${requestPayload.owner}/${requestPayload.repo}/pull/${issueNumber}#issuecomment-${issueNumber}`;


The harness' fallback html_url for posted PR comments uses the PR number as the issuecomment-<id> fragment. In real GitHub URLs the issue comment ID is a distinct value, so this fallback produces an invalid/misleading URL. Consider returning null when commentHtmlUrlByNumber isn't provided (and letting the workflow code handle missing URLs), or generating a clearly placeholder URL that doesn't look like a real issue comment link.

Suggested change

-                const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber]
-                    ?? `https://github.com/${requestPayload.owner}/${requestPayload.repo}/pull/${issueNumber}#issuecomment-${issueNumber}`;
+                const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber] ?? null;
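If the harness returns `null` here, the code that renders the summary has to tolerate a missing URL. A minimal sketch of that handling, using a hypothetical `formatCommentReference` helper and assumed wording:

```javascript
// Hypothetical helper: render a Markdown link when a comment URL exists, plain text otherwise.
function formatCommentReference(htmlUrl) {
    if (typeof htmlUrl === 'string' && htmlUrl.length > 0) {
        return `[PR comment](${htmlUrl})`;
    }

    // No URL available (for example, the API response omitted html_url).
    return 'PR comment (link unavailable)';
}
```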

radical and others added 2 commits March 11, 2026 22:55

Include matched pattern in rerun reason

e84dd8f

Surface the specific infrastructure-network pattern that matched so transient rerun summaries explain exactly why a job was considered retryable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

mitchdenny approved these changes Mar 12, 2026


radical marked this pull request as ready for review March 12, 2026 05:41

Copilot AI review requested due to automatic review settings March 12, 2026 05:41

radical enabled auto-merge (squash) March 12, 2026 05:41

Copilot started reviewing on behalf of radical March 12, 2026 05:41

Copilot AI reviewed Mar 12, 2026


radical merged commit 859d350 into microsoft:release/13.2 Mar 12, 2026
492 of 505 checks passed

dotnet-policy-service bot added this to the 13.2 milestone Mar 12, 2026

github-actions bot mentioned this pull request Mar 12, 2026

[13.2-burndown] Daily Burndown Report - March 12, 2026 #15179

Closed


radical merged 3 commits into microsoft:release/13.2 from radical:improve-re-run

3 participants
