Skip to content

Broaden transient CI infrastructure rerun handling#15157

Merged
radical merged 3 commits intomicrosoft:release/13.2from
radical:improve-re-run
Mar 12, 2026
Merged

Broaden transient CI infrastructure rerun handling#15157
radical merged 3 commits intomicrosoft:release/13.2from
radical:improve-re-run

Conversation

@radical
Copy link
Member

@radical radical commented Mar 12, 2026

Description

This change broadens the transient CI rerun workflow's log-based infrastructure network detection beyond the older fixed build-step allowlist, while still keeping test execution failures out of the log override path.

It also improves the rerun reason surfaced in summaries and PR comments by explaining that the retry came from a likely transient infrastructure network failure and by including the specific matched pattern.

The workflow script, its documentation, and the focused workflow-script tests were updated together so the behavior and docs stay aligned.

Fixes # (issue)

Checklist

  • Is this feature complete?
    • Yes. Ready to ship.
    • No. Follow-up changes expected.
  • Are you including unit tests for the changes and scenario tests if relevant?
    • Yes
    • No
  • Did you add public API?
    • Yes
      • If yes, did you have an API Review for it?
        • Yes
        • No
      • Did you add <remarks /> and <code /> elements on your triple slash comments?
        • Yes
        • No
    • No
  • Does the change make any security assumptions or guarantees?
    • Yes
      • If yes, have you done a threat model and had a security review?
        • Yes
        • No
    • No
  • Does the change require an update in our Aspire docs?

Validation

  • ./restore.sh
  • dotnet test tests/Infrastructure.Tests/Infrastructure.Tests.csproj -- --filter-class "*.AutoRerunTransientCiFailuresTests" --filter-not-trait "quarantined=true" --filter-not-trait "outerloop=true"

radical and others added 2 commits March 11, 2026 22:55
Expand the log-based override beyond the old build-step allowlist, keep test execution failures excluded from the override, and clarify the retry reason surfaced in job summaries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Surface the specific infrastructure-network pattern that matched so transient rerun summaries explain exactly why a job was considered retryable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions
Copy link
Contributor

github-actions bot commented Mar 12, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15157

Or

  • Run remotely in PowerShell:
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15157"

Copy link
Member

@mitchdenny mitchdenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the security model around log injection. The workflow_run trigger ensures the analysis script always runs from the default branch, preventing privilege escalation from PR code. The broadened log-based override surface (step-name denylist replacing allowlist, plus shorter generic git patterns like RPC failed) creates a low-severity log injection vector that could force one unnecessary CI rerun, but existing safety rails (attempt-1 gate, 5-job cap, same-code reruns, full audit trail) keep the blast radius minimal. The test-execution denylist correctly prevents retry when actual test steps fail.

Make rerun analysis output more human-readable, render clickable summary links, show explicit skipped versus rerun outcomes, and surface failed attempt, rerun attempt, and PR comment links in the workflow summary.

Also add an optional workflow_dispatch dry_run input so manual runs default to real reruns while still supporting inspection-only execution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@radical radical marked this pull request as ready for review March 12, 2026 05:41
Copilot AI review requested due to automatic review settings March 12, 2026 05:41
@radical radical enabled auto-merge (squash) March 12, 2026 05:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the transient CI auto-rerun workflow’s ability to detect likely infrastructure/network failures by scanning job logs (beyond the previous step-name allowlist), while preserving the existing “don’t override test execution failures” guardrails. It also refines the workflow summary/PR comment messaging to surface clearer rerun eligibility/outcome and the specific matched log pattern.

Changes:

  • Broaden log-based infrastructure/network failure detection and improve retry/skipped reason messages.
  • Update workflow summary output to use explicit outcome headings and clickable links (including PR comment links when posted).
  • Update docs and Infrastructure workflow-script tests/harness to align with the new behavior and summary formatting.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
.github/workflows/auto-rerun-transient-ci-failures.js Expands log-based override matchers, adds reason formatting/helpers, and updates summary/comment formatting.
.github/workflows/auto-rerun-transient-ci-failures.yml Adds dry_run input for workflow_dispatch and passes attempt metadata into summary output.
docs/ci/auto-rerun-transient-ci-failures.md Updates documentation to reflect broader log override coverage and new manual dry-run behavior.
tests/Infrastructure.Tests/WorkflowScripts/auto-rerun-transient-ci-failures.harness.js Extends the harness summary/GitHub recorder to support link/break events and comment URL responses.
tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs Updates/extends tests to validate new reasons, broader override behavior, and new summary link events.

Comment on lines +179 to +181
function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
return `Failed step '${failedStepText}' will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`;
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getInfrastructureNetworkLogOverrideReason always starts with "Failed step" even when multiple steps failed (the caller passes failedStepText which can be a |-joined list). This will produce grammatically incorrect and potentially confusing retry reasons when more than one step failed. Consider using the existing pluralization helpers (formatFailedStepLabel/isSingleFailedStep) when building this reason, or pass failedSteps into the helper so it can choose "step" vs "steps".

Suggested change
function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
return `Failed step '${failedStepText}' will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`;
function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern, failedSteps) {
const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
let failedStepLabel;
if (Array.isArray(failedSteps)) {
failedStepLabel = formatFailedStepLabel(failedSteps, failedStepText);
} else {
// Fall back to a simple heuristic when we only have the joined text.
const isSingle = !failedStepText.includes(' | ');
const label = isSingle ? 'Failed step' : 'Failed steps';
failedStepLabel = `${label} '${failedStepText}'`;
}
return `${failedStepLabel} will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`;

Copilot uses AI. Check for mistakes.
Comment on lines +179 to +180
function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry reason appends the matched regex via string interpolation (e.g., /.../i). These reasons are rendered as Markdown in both the workflow summary and PR comments; unescaped regex text (notably underscores like _packaging) can cause unintended Markdown formatting and make the message hard to read/copy. Consider wrapping the matched pattern in inline code (backticks) or otherwise escaping/sanitizing it before emitting it in Markdown.

Suggested change
function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : '';
function formatMatchedPatternForMarkdown(matchedPattern) {
if (!matchedPattern) {
return '';
}
const safePattern = String(matchedPattern).replace(/`/g, '\\`');
return ` Matched pattern: \`${safePattern}\`.`;
}
function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) {
const patternText = formatMatchedPatternForMarkdown(matchedPattern);

Copilot uses AI. Check for mistakes.
return 'The job annotations did not show a retry-safe transient infrastructure failure.';
}

return 'No retry-safe transient infrastructure signal was found in the job annotations or logs.';
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fallback message "No retry-safe transient infrastructure signal was found in the job annotations or logs." can be misleading for cases where log inspection wasn't performed (e.g., when failedSteps.length === 0, analyzeFailedJobs never fetches logs). Consider tailoring this message based on whether logs were actually inspected/available, or using wording that doesn't imply both sources were checked.

Suggested change
return 'No retry-safe transient infrastructure signal was found in the job annotations or logs.';
return 'No retry-safe transient infrastructure signal was found in the available job diagnostics.';

Copilot uses AI. Check for mistakes.
Comment on lines 321 to 325
const shouldInspectLogs =
!classification.retryable &&
getJobLogTextForJob &&
failedSteps.some(step => matchesAny(step, feedNetworkFailureStepPatterns));
canUseInfrastructureNetworkLogOverride(failedSteps);

Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analyzeFailedJobs now fetches and scans full job logs for any non-test failure (canUseInfrastructureNetworkLogOverride), which can significantly increase API traffic and memory usage because getJobLogText loads the entire log into a single string. Consider adding additional gating (e.g., only fetch logs when annotations already show a likely network issue) and/or limiting the amount of log text downloaded/scanned (first N KB / streaming) to keep the workflow reliable at scale.

Copilot uses AI. Check for mistakes.
Comment on lines +126 to +127
const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber]
?? `https://github.com/${requestPayload.owner}/${requestPayload.repo}/pull/${issueNumber}#issuecomment-${issueNumber}`;
Copy link

Copilot AI Mar 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The harness' fallback html_url for posted PR comments uses the PR number as the issuecomment-<id> fragment. In real GitHub URLs the issue comment ID is a distinct value, so this fallback produces an invalid/misleading URL. Consider returning null when commentHtmlUrlByNumber isn't provided (and letting the workflow code handle missing URLs), or generating a clearly placeholder URL that doesn't look like a real issue comment link.

Suggested change
const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber]
?? `https://github.com/${requestPayload.owner}/${requestPayload.repo}/pull/${issueNumber}#issuecomment-${issueNumber}`;
const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber] ?? null;

Copilot uses AI. Check for mistakes.
@radical radical merged commit 859d350 into microsoft:release/13.2 Mar 12, 2026
492 of 505 checks passed
@dotnet-policy-service dotnet-policy-service bot added this to the 13.2 milestone Mar 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants