Broaden transient CI infrastructure rerun handling#15157
Broaden transient CI infrastructure rerun handling#15157radical merged 3 commits intomicrosoft:release/13.2from
Conversation
Expand the log-based override beyond the old build-step allowlist, keep test execution failures excluded from the override, and clarify the retry reason surfaced in job summaries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Surface the specific infrastructure-network pattern that matched so transient rerun summaries explain exactly why a job was considered retryable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
🚀 Dogfood this PR with:
curl -fsSL https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.sh | bash -s -- 15157Or
iex "& { $(irm https://raw.githubusercontent.com/dotnet/aspire/main/eng/scripts/get-aspire-cli-pr.ps1) } 15157" |
mitchdenny
left a comment
There was a problem hiding this comment.
Reviewed the security model around log injection. The workflow_run trigger ensures the analysis script always runs from the default branch, preventing privilege escalation from PR code. The broadened log-based override surface (step-name denylist replacing allowlist, plus shorter generic git patterns like RPC failed) creates a low-severity log injection vector that could force one unnecessary CI rerun, but existing safety rails (attempt-1 gate, 5-job cap, same-code reruns, full audit trail) keep the blast radius minimal. The test-execution denylist correctly prevents retry when actual test steps fail.
Make rerun analysis output more human-readable, render clickable summary links, show explicit skipped versus rerun outcomes, and surface failed attempt, rerun attempt, and PR comment links in the workflow summary. Also add an optional workflow_dispatch dry_run input so manual runs default to real reruns while still supporting inspection-only execution. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR expands the transient CI auto-rerun workflow’s ability to detect likely infrastructure/network failures by scanning job logs (beyond the previous step-name allowlist), while preserving the existing “don’t override test execution failures” guardrails. It also refines the workflow summary/PR comment messaging to surface clearer rerun eligibility/outcome and the specific matched log pattern.
Changes:
- Broaden log-based infrastructure/network failure detection and improve retry/skipped reason messages.
- Update workflow summary output to use explicit outcome headings and clickable links (including PR comment links when posted).
- Update docs and Infrastructure workflow-script tests/harness to align with the new behavior and summary formatting.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
.github/workflows/auto-rerun-transient-ci-failures.js |
Expands log-based override matchers, adds reason formatting/helpers, and updates summary/comment formatting. |
.github/workflows/auto-rerun-transient-ci-failures.yml |
Adds dry_run input for workflow_dispatch and passes attempt metadata into summary output. |
docs/ci/auto-rerun-transient-ci-failures.md |
Updates documentation to reflect broader log override coverage and new manual dry-run behavior. |
tests/Infrastructure.Tests/WorkflowScripts/auto-rerun-transient-ci-failures.harness.js |
Extends the harness summary/GitHub recorder to support link/break events and comment URL responses. |
tests/Infrastructure.Tests/WorkflowScripts/AutoRerunTransientCiFailuresTests.cs |
Updates/extends tests to validate new reasons, broader override behavior, and new summary link events. |
| function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) { | ||
| const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : ''; | ||
| return `Failed step '${failedStepText}' will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`; |
There was a problem hiding this comment.
getInfrastructureNetworkLogOverrideReason always starts with "Failed step" even when multiple steps failed (the caller passes failedStepText which can be a |-joined list). This will produce grammatically incorrect and potentially confusing retry reasons when more than one step failed. Consider using the existing pluralization helpers (formatFailedStepLabel/isSingleFailedStep) when building this reason, or pass failedSteps into the helper so it can choose "step" vs "steps".
| function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) { | |
| const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : ''; | |
| return `Failed step '${failedStepText}' will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`; | |
| function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern, failedSteps) { | |
| const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : ''; | |
| let failedStepLabel; | |
| if (Array.isArray(failedSteps)) { | |
| failedStepLabel = formatFailedStepLabel(failedSteps, failedStepText); | |
| } else { | |
| // Fall back to a simple heuristic when we only have the joined text. | |
| const isSingle = !failedStepText.includes(' | '); | |
| const label = isSingle ? 'Failed step' : 'Failed steps'; | |
| failedStepLabel = `${label} '${failedStepText}'`; | |
| } | |
| return `${failedStepLabel} will be retried because the job log shows a likely transient infrastructure network failure.${patternText}`; |
| function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) { | ||
| const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : ''; |
There was a problem hiding this comment.
The retry reason appends the matched regex via string interpolation (e.g., /.../i). These reasons are rendered as Markdown in both the workflow summary and PR comments; unescaped regex text (notably underscores like _packaging) can cause unintended Markdown formatting and make the message hard to read/copy. Consider wrapping the matched pattern in inline code (backticks) or otherwise escaping/sanitizing it before emitting it in Markdown.
| function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) { | |
| const patternText = matchedPattern ? ` Matched pattern: ${matchedPattern}.` : ''; | |
| function formatMatchedPatternForMarkdown(matchedPattern) { | |
| if (!matchedPattern) { | |
| return ''; | |
| } | |
| const safePattern = String(matchedPattern).replace(/`/g, '\\`'); | |
| return ` Matched pattern: \`${safePattern}\`.`; | |
| } | |
| function getInfrastructureNetworkLogOverrideReason(failedStepText, matchedPattern) { | |
| const patternText = formatMatchedPatternForMarkdown(matchedPattern); |
| return 'The job annotations did not show a retry-safe transient infrastructure failure.'; | ||
| } | ||
|
|
||
| return 'No retry-safe transient infrastructure signal was found in the job annotations or logs.'; |
There was a problem hiding this comment.
The fallback message "No retry-safe transient infrastructure signal was found in the job annotations or logs." can be misleading for cases where log inspection wasn't performed (e.g., when failedSteps.length === 0, analyzeFailedJobs never fetches logs). Consider tailoring this message based on whether logs were actually inspected/available, or using wording that doesn't imply both sources were checked.
| return 'No retry-safe transient infrastructure signal was found in the job annotations or logs.'; | |
| return 'No retry-safe transient infrastructure signal was found in the available job diagnostics.'; |
| const shouldInspectLogs = | ||
| !classification.retryable && | ||
| getJobLogTextForJob && | ||
| failedSteps.some(step => matchesAny(step, feedNetworkFailureStepPatterns)); | ||
| canUseInfrastructureNetworkLogOverride(failedSteps); | ||
|
|
There was a problem hiding this comment.
analyzeFailedJobs now fetches and scans full job logs for any non-test failure (canUseInfrastructureNetworkLogOverride), which can significantly increase API traffic and memory usage because getJobLogText loads the entire log into a single string. Consider adding additional gating (e.g., only fetch logs when annotations already show a likely network issue) and/or limiting the amount of log text downloaded/scanned (first N KB / streaming) to keep the workflow reliable at scale.
| const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber] | ||
| ?? `https://github.com/${requestPayload.owner}/${requestPayload.repo}/pull/${issueNumber}#issuecomment-${issueNumber}`; |
There was a problem hiding this comment.
The harness' fallback html_url for posted PR comments uses the PR number as the issuecomment-<id> fragment. In real GitHub URLs the issue comment ID is a distinct value, so this fallback produces an invalid/misleading URL. Consider returning null when commentHtmlUrlByNumber isn't provided (and letting the workflow code handle missing URLs), or generating a clearly placeholder URL that doesn't look like a real issue comment link.
| const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber] | |
| ?? `https://github.com/${requestPayload.owner}/${requestPayload.repo}/pull/${issueNumber}#issuecomment-${issueNumber}`; | |
| const htmlUrl = payload.commentHtmlUrlByNumber?.[issueNumber] ?? null; |
Description
This change broadens the transient CI rerun workflow's log-based infrastructure network detection beyond the older fixed build-step allowlist, while still keeping test execution failures out of the log override path.
It also improves the rerun reason surfaced in summaries and PR comments by explaining that the retry came from a likely transient infrastructure network failure and by including the specific matched pattern.
The workflow script, its documentation, and the focused workflow-script tests were updated together so the behavior and docs stay aligned.
Fixes # (issue)
Checklist
<remarks />and<code />elements on your triple slash comments?aspire.devissue:Validation
./restore.shdotnet test tests/Infrastructure.Tests/Infrastructure.Tests.csproj -- --filter-class "*.AutoRerunTransientCiFailuresTests" --filter-not-trait "quarantined=true" --filter-not-trait "outerloop=true"