feat: pulse CI failure pattern detection — identify systemic workflow bugs #2979
marcusquinn merged 4 commits into main from
Conversation
Walkthrough
Updated the pulse CI handling rule to introduce a pre-dispatch systemic failure check. When CI failures appear identically on multiple PRs, the check skips per-PR worker dispatch and routes to a workflow-level issue instead, distinguishing systemic failures from per-PR failures.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
This PR appears orphaned — no active worker process found and no activity for 6+ hours. Flagging for re-dispatch. If work is still in progress, remove the
Add a new 'CI failure pattern detection' subsection to pulse.md Step 3 that correlates CI failures across PRs to identify workflow-level bugs. When 3+ PRs share the same failing check, the pulse now files a single workflow-level issue instead of dispatching N individual fix workers that would try to fix code when the problem is the CI configuration. This implements the 'Multiple PRs fail CI with the same error pattern' signal from AGENTS.md Self-Improvement, which was guidance without implementation — the pulse processed PRs individually without cross-PR correlation, causing GH#2973 to go undetected. Closes #2975
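The cross-PR correlation described in the commit message can be sketched as follows. This is a minimal illustration, not the actual pulse.md logic: the sample `(PR, check)` failure pairs stand in for output you would collect by running `gh pr checks` on each open PR, and the 3-PR threshold matches the description above.

```shell
# Hypothetical failure pairs "<pr-number> <check-name>", one per line,
# as might be parsed from 'gh pr checks' across open PRs.
failures='101 review-bot-gate
102 review-bot-gate
103 review-bot-gate
103 unit-tests'

# A check is "systemic" when 3 or more distinct PRs fail on it:
# de-duplicate pairs, count occurrences per check name, keep counts >= 3.
systemic=$(printf '%s\n' "$failures" | sort -u | awk '{print $2}' \
  | sort | uniq -c | awk '$1 >= 3 {print $2}')

echo "$systemic"   # → review-bot-gate
```

With this sample data, `review-bot-gate` crosses the threshold (3 PRs) and would get a single workflow-level issue, while `unit-tests` (1 PR) would still get a per-PR fix worker.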
Force-pushed 6506858 to 9557d5c (Compare)
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Fri Mar 6 17:10:37 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Re-ran 'Wait for AI Review Bots' check (stale failure from pre-fix workflow). PR #3006 merged the review-bot-gate rate-limit fix — this re-run uses the updated workflow on main.
Re-triggering review-bot-gate after GH#3007 fix (bidirectional prefix matching for CodeRabbit status context).
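The actual GH#3007 fix is not shown in this thread, but "bidirectional prefix matching" plausibly means treating two status-context names as matching when either is a prefix of the other. A hypothetical sketch of that rule:

```shell
# Illustrative only — not the real review-bot-gate implementation.
# Two status-context names "match" when either one is a prefix of the other.
context_matches() {
  case "$1" in "$2"*) return 0 ;; esac
  case "$2" in "$1"*) return 0 ;; esac
  return 1
}

context_matches "coderabbitai" "coderabbitai/review" && echo "match"   # required name is a prefix
context_matches "coderabbitai/review" "coderabbitai" && echo "match"   # reported name is a prefix
context_matches "coderabbitai" "unit-tests" || echo "no match"         # unrelated contexts
```

The bidirectional check covers both directions of drift: a bot reporting a longer context than the gate expects, or the gate configured with a longer name than the bot reports.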
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/commands/pulse.md (1)
137-146: ⚠️ Potential issue | 🟠 Major
Implement missing rate limit and strengthen deduplication.
Several features from the PR objectives are not implemented:
Rate limit (PR objective): The PR description specifies "Rate-limited to a maximum of 2 systemic CI issues per repo per pulse cycle," but there's no counter or enforcement logic here. Without this, a repo with 10 failing checks could create 10 issues in one pulse cycle.
Deduplication weakness: Line 144's search query `"<check name> failing"` is not specific enough. Multiple unrelated issues could match this pattern. Consider using a more specific marker (e.g., label `systemic-ci-failure` + search by check name) or a structured comment format that can be parsed deterministically.
Security: command injection risk: The `<check name>` variable in the search query should be sanitized or quoted. If a check name contains special characters, it could break the command or cause unintended behavior.
Missing command: Line 145 says "file one" but doesn't show the actual `gh issue create` command. Compare this to lines 599-614, which provide the complete command template for quality findings.
🛡️ Proposed fixes
Add rate limit tracking:
````diff
+# Track systemic CI issues created this cycle (per-repo counter)
+SYSTEMIC_CI_ISSUES_CREATED=0
+SYSTEMIC_CI_ISSUES_MAX=2
+
 **What to do when a systemic pattern is found:**

 1. **Do NOT dispatch workers** to fix individual PRs for that check — the fix is in the workflow, not the PR code
-2. **Search for an existing issue** describing the pattern: `gh issue list --repo <slug> --search "<check name> failing" --state open`
-3. **If no issue exists**, file one describing: which check is failing, how many PRs are affected, the error message (from `gh run view <run_id> --log-failed`), and a hypothesis about the root cause
+2. **Skip if rate limit reached:** If `SYSTEMIC_CI_ISSUES_CREATED >= SYSTEMIC_CI_ISSUES_MAX`, skip to next check (log: "Rate limit reached for systemic CI issues this cycle")
+3. **Search for an existing issue** describing the pattern: `gh issue list --repo <slug> --label "systemic-ci-failure" --search "in:title \"$CHECK_NAME\"" --state open`
+4. **If no issue exists**, file one:
+
+   ```bash
+   gh issue create --repo <slug> \
+     --title "systemic: CI check \"$CHECK_NAME\" failing on ${AFFECTED_PR_COUNT} PRs" \
+     --label "bug,auto-dispatch,systemic-ci-failure" \
+     --body "Detected by pulse CI failure pattern detection.
+
+   **Check name**: $CHECK_NAME
+   **Affected PRs**: ${AFFECTED_PR_COUNT} (#${PR_LIST})
+   **Error sample**: (from gh run view <run_id> --log-failed)
+
+   **Hypothesis**: This is a workflow-level bug (misconfigured bot, permissions, or workflow logic) affecting all PRs identically. Fix the workflow file or infrastructure, not individual PR code.
+
+   See pulse.md 'CI failure pattern detection' for detection criteria."
+
+   SYSTEMIC_CI_ISSUES_CREATED=$((SYSTEMIC_CI_ISSUES_CREATED + 1))
+   ```
-4. **Label it** `bug` + `auto-dispatch` so a worker picks it up and fixes the workflow itself
````

Sanitize check name in search query:
```bash
# Sanitize check name for safe use in search queries
CHECK_NAME_SAFE=$(printf '%s' "$CHECK_NAME" | sed 's/[^a-zA-Z0-9 _-]/_/g')
```
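To see what the proposed sanitization does in practice, here is a quick check with a hypothetical hostile check name containing quotes and shell metacharacters:

```shell
# Hypothetical check name with quotes and command-substitution characters.
CHECK_NAME='Wait for "AI Review" Bots; $(rm -rf /)'

# Replace every character outside [a-zA-Z0-9 _-] with an underscore.
CHECK_NAME_SAFE=$(printf '%s' "$CHECK_NAME" | sed 's/[^a-zA-Z0-9 _-]/_/g')

echo "$CHECK_NAME_SAFE"   # → Wait for _AI Review_ Bots_ __rm -rf __
```

After sanitization, the value can be interpolated into a `gh issue list --search` query without breaking the quoting or injecting search operators.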
🧹 Nitpick comments (2)
.agents/scripts/commands/pulse.md (2)
125-125: Clarify the execution order for systemic failure detection.
The reference to "check whether this is a systemic failure (see 'CI failure pattern detection' below)" creates potential ordering ambiguity. The correlation logic (lines 133-178) states it runs "After processing individual PRs," but this gate at line 125 needs the correlation results BEFORE deciding whether to dispatch.
Suggested flow clarification:
- Scan all PRs (merge ready ones, note failing ones)
- Run the CI failure pattern detection (lines 133-178) to identify systemic check names
- For each failing PR, consult the systemic check list before dispatching
Consider adding a note at line 133 like: "Run this correlation BEFORE dispatching workers for failing PRs above." to make the execution order explicit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/commands/pulse.md at line 125, Clarify the ordering so the CI correlation runs before per-PR dispatch: update the "Failing CI or changes requested" paragraph to state that the workflow should first scan all PRs and then run the "CI failure pattern detection" correlation step (the section titled "CI failure pattern detection") to produce a list of systemic checks; only after that, for each failing PR, consult that systemic check list before deciding to dispatch a fix worker. Add a short explicit note in the "CI failure pattern detection" section such as "Run this correlation BEFORE dispatching workers for failing PRs above" to remove ambiguity.
157-178: Provide implementation details for self-healing guard rails.
The self-healing logic is conceptually sound but lacks concrete implementation:
Line 161: "check whether the issue has already been resolved" — no command shown for how to verify this. Need to query issue state and search for merged PRs that reference it.
Line 171: "Only re-run checks where you have evidence the fix is on main" — no implementation for gathering that evidence. Should this compare check results on recently-created PRs vs. older PRs?
Line 176: "Limit to 10 re-runs per pulse cycle" — good guard rail, but no code to enforce it. Need a counter variable and break condition.
Missing mixed-failure handling: PR objectives mention "PRs with both systemic and per-PR failures receive per-PR fix workers for the per-PR failures only." This scenario isn't addressed. A PR could have check A (systemic) and check B (per-PR) both failing — the pulse should re-run A (if fixed) and dispatch a worker for B.
📋 Implementation sketch for guard rails
```bash
# Track re-runs per cycle
RERUN_COUNT=0
RERUN_MAX=10

# For each PR with stale failed/cancelled run for a systemic check:
for pr_number in "${AFFECTED_PRS[@]}"; do
  # Guard: rate limit
  if [[ $RERUN_COUNT -ge $RERUN_MAX ]]; then
    echo "Re-run rate limit reached ($RERUN_MAX per cycle)"
    break
  fi

  # Guard: evidence that fix is on main
  # Option 1: Check if the systemic issue is closed with a merged PR
  ISSUE_STATE=$(gh issue view "$SYSTEMIC_ISSUE_NUMBER" --repo "$slug" --json state --jq '.state')
  if [[ "$ISSUE_STATE" != "CLOSED" ]]; then
    echo "Systemic issue #$SYSTEMIC_ISSUE_NUMBER not yet resolved, skipping re-run"
    continue
  fi

  # Option 2: Check if the same check now passes on recently-created PRs
  RECENT_PR_CHECK_STATUS=$(gh pr view "$RECENT_PR_NUMBER" --repo "$slug" --json statusCheckRollup \
    --jq ".statusCheckRollup[] | select(.name==\"$CHECK_NAME\") | .conclusion")
  if [[ "$RECENT_PR_CHECK_STATUS" != "SUCCESS" ]]; then
    echo "Check still failing on recent PRs, fix may not be working"
    continue
  fi

  # Re-run the specific failed workflow
  gh run rerun "$RUN_ID" --repo "$slug"
  RERUN_COUNT=$((RERUN_COUNT + 1))
  echo "Re-ran $CHECK_NAME on PR #$pr_number (stale failure from pre-fix workflow)"
done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/commands/pulse.md around lines 157 - 178, Add concrete guard-rail logic to the self-healing script: implement a RERUN_COUNT/RERUN_MAX counter (e.g., RERUN_COUNT starts at 0, RERUN_MAX=10) and increment/break when the limit is hit; verify "issue resolved" by querying the systemic issue state (SYSTEMIC_ISSUE_NUMBER via gh issue view) and/or ensure evidence of a merged fix on main (search for merged PRs referencing the issue or check that RECENT_PR_NUMBER has CHECK_NAME returning success); only proceed to gh run rerun for the specific RUN_ID when those checks pass; and handle mixed failures by splitting per-PR failures (dispatch per-PR workers for per-pr checks) versus systemic checks (only rerun RUN_ID for systemic CHECK_NAME) before incrementing RERUN_COUNT and logging the action.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e103007e-e948-4c04-bd88-a8dd100b2f65
📒 Files selected for processing (1)
.agents/scripts/commands/pulse.md
Summary
Adds a new 'CI failure pattern detection' subsection to pulse.md Step 3 that correlates CI failures across PRs to detect workflow-level bugs.
Problem
The pulse processed PRs individually without cross-PR correlation. When a workflow bug caused the same CI check to fail on all PRs (e.g., GH#2973: regex false positive + concurrency cancellation), the pulse dispatched individual fix workers for each PR — workers that tried to fix code when the problem was the CI configuration. This wasted worker slots and never fixed the root cause.
Design decisions
Runs `gh pr checks` only when 3+ PRs are failing, avoiding unnecessary API calls on normal cycles.
Closes #2975