feat: pulse CI failure pattern detection — identify systemic workflow bugs #2979
marcusquinn merged 4 commits into main from
Conversation
Walkthrough
Updated the pulse CI handling rule to introduce a pre-dispatch systemic failure check. When CI failures appear identically on multiple PRs, the check skips per-PR worker dispatch and routes to a workflow-level issue instead, distinguishing systemic failures from per-PR failures.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
This PR appears orphaned — no active worker process found and no activity for 6+ hours. Flagging for re-dispatch. If work is still in progress, remove the
Add a new 'CI failure pattern detection' subsection to pulse.md Step 3 that correlates CI failures across PRs to identify workflow-level bugs. When 3+ PRs share the same failing check, the pulse now files a single workflow-level issue instead of dispatching N individual fix workers that would try to fix code when the problem is the CI configuration. This implements the 'Multiple PRs fail CI with the same error pattern' signal from AGENTS.md Self-Improvement, which was guidance without implementation — the pulse processed PRs individually without cross-PR correlation, causing GH#2973 to go undetected. Closes #2975
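The cross-PR correlation described in the commit message can be sketched as follows. This is a minimal illustration, not the actual pulse.md logic: the sample `(PR, check)` failure pairs stand in for output you would collect by running `gh pr checks` on each open PR, and the 3-PR threshold matches the description above.

```shell
# Hypothetical failure pairs "<pr-number> <check-name>", one per line,
# as might be parsed from 'gh pr checks' across open PRs.
failures='101 review-bot-gate
102 review-bot-gate
103 review-bot-gate
103 unit-tests'

# A check is "systemic" when 3 or more distinct PRs fail on it:
# de-duplicate pairs, count occurrences per check name, keep counts >= 3.
systemic=$(printf '%s\n' "$failures" | sort -u | awk '{print $2}' \
  | sort | uniq -c | awk '$1 >= 3 {print $2}')

echo "$systemic"   # → review-bot-gate
```

With this sample data, `review-bot-gate` crosses the threshold (3 PRs) and would get a single workflow-level issue, while `unit-tests` (1 PR) would still get a per-PR fix worker.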
Force-pushed 6506858 to 9557d5c (Compare)
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Fri Mar 6 17:10:37 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Re-ran 'Wait for AI Review Bots' check (stale failure from pre-fix workflow). PR #3006 merged the review-bot-gate rate-limit fix — this re-run uses the updated workflow on main.
Re-triggering review-bot-gate after GH#3007 fix (bidirectional prefix matching for CodeRabbit status context).
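The actual GH#3007 fix is not shown in this thread, but "bidirectional prefix matching" plausibly means treating two status-context names as matching when either is a prefix of the other. A hypothetical sketch of that rule:

```shell
# Illustrative only — not the real review-bot-gate implementation.
# Two status-context names "match" when either one is a prefix of the other.
context_matches() {
  case "$1" in "$2"*) return 0 ;; esac
  case "$2" in "$1"*) return 0 ;; esac
  return 1
}

context_matches "coderabbitai" "coderabbitai/review" && echo "match"   # required name is a prefix
context_matches "coderabbitai/review" "coderabbitai" && echo "match"   # reported name is a prefix
context_matches "coderabbitai" "unit-tests" || echo "no match"         # unrelated contexts
```

The bidirectional check covers both directions of drift: a bot reporting a longer context than the gate expects, or the gate configured with a longer name than the bot reports.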
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/commands/pulse.md (1)
137-146: ⚠️ Potential issue | 🟠 Major
Implement missing rate limit and strengthen deduplication.
Several features from the PR objectives are not implemented:
Rate limit (PR objective): The PR description specifies "Rate-limited to a maximum of 2 systemic CI issues per repo per pulse cycle," but there's no counter or enforcement logic here. Without this, a repo with 10 failing checks could create 10 issues in one pulse cycle.
Deduplication weakness: Line 144's search query `"<check name> failing"` is not specific enough. Multiple unrelated issues could match this pattern. Consider using a more specific marker (e.g., label `systemic-ci-failure` + search by check name) or a structured comment format that can be parsed deterministically.
Security: command injection risk: The `<check name>` variable in the search query should be sanitized or quoted. If a check name contains special characters, it could break the command or cause unintended behavior.
Missing command: Line 145 says "file one" but doesn't show the actual `gh issue create` command. Compare this to lines 599-614, which provide the complete command template for quality findings.
🛡️ Proposed fixes
Add rate limit tracking:
````diff
+# Track systemic CI issues created this cycle (per-repo counter)
+SYSTEMIC_CI_ISSUES_CREATED=0
+SYSTEMIC_CI_ISSUES_MAX=2
+
 **What to do when a systemic pattern is found:**

 1. **Do NOT dispatch workers** to fix individual PRs for that check — the fix is in the workflow, not the PR code
-2. **Search for an existing issue** describing the pattern: `gh issue list --repo <slug> --search "<check name> failing" --state open`
-3. **If no issue exists**, file one describing: which check is failing, how many PRs are affected, the error message (from `gh run view <run_id> --log-failed`), and a hypothesis about the root cause
+2. **Skip if rate limit reached:** If `SYSTEMIC_CI_ISSUES_CREATED >= SYSTEMIC_CI_ISSUES_MAX`, skip to next check (log: "Rate limit reached for systemic CI issues this cycle")
+3. **Search for an existing issue** describing the pattern: `gh issue list --repo <slug> --label "systemic-ci-failure" --search "in:title \"$CHECK_NAME\"" --state open`
+4. **If no issue exists**, file one:
+
+   ```bash
+   gh issue create --repo <slug> \
+     --title "systemic: CI check \"$CHECK_NAME\" failing on ${AFFECTED_PR_COUNT} PRs" \
+     --label "bug,auto-dispatch,systemic-ci-failure" \
+     --body "Detected by pulse CI failure pattern detection.
+
+   **Check name**: $CHECK_NAME
+   **Affected PRs**: ${AFFECTED_PR_COUNT} (#${PR_LIST})
+   **Error sample**: (from gh run view <run_id> --log-failed)
+
+   **Hypothesis**: This is a workflow-level bug (misconfigured bot, permissions, or workflow logic) affecting all PRs identically. Fix the workflow file or infrastructure, not individual PR code.
+
+   See pulse.md 'CI failure pattern detection' for detection criteria."
+
+   SYSTEMIC_CI_ISSUES_CREATED=$((SYSTEMIC_CI_ISSUES_CREATED + 1))
+   ```
-4. **Label it** `bug` + `auto-dispatch` so a worker picks it up and fixes the workflow itself
````

Sanitize check name in search query:
```bash
# Sanitize check name for safe use in search queries
CHECK_NAME_SAFE=$(printf '%s' "$CHECK_NAME" | sed 's/[^a-zA-Z0-9 _-]/_/g')
```
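To see what the proposed sanitization does in practice, here is a quick check with a hypothetical hostile check name containing quotes and shell metacharacters:

```shell
# Hypothetical check name with quotes and command-substitution characters.
CHECK_NAME='Wait for "AI Review" Bots; $(rm -rf /)'

# Replace every character outside [a-zA-Z0-9 _-] with an underscore.
CHECK_NAME_SAFE=$(printf '%s' "$CHECK_NAME" | sed 's/[^a-zA-Z0-9 _-]/_/g')

echo "$CHECK_NAME_SAFE"   # → Wait for _AI Review_ Bots_ __rm -rf __
```

After sanitization, the value can be interpolated into a `gh issue list --search` query without breaking the quoting or injecting search operators.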
🧹 Nitpick comments (2)
.agents/scripts/commands/pulse.md (2)
125-125: Clarify the execution order for systemic failure detection.
The reference to "check whether this is a systemic failure (see 'CI failure pattern detection' below)" creates potential ordering ambiguity. The correlation logic (lines 133-178) states it runs "After processing individual PRs," but this gate at line 125 needs the correlation results BEFORE deciding whether to dispatch.
Suggested flow clarification:
- Scan all PRs (merge ready ones, note failing ones)
- Run the CI failure pattern detection (lines 133-178) to identify systemic check names
- For each failing PR, consult the systemic check list before dispatching
Consider adding a note at line 133 like: "Run this correlation BEFORE dispatching workers for failing PRs above." to make the execution order explicit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/commands/pulse.md at line 125, Clarify the ordering so the CI correlation runs before per-PR dispatch: update the "Failing CI or changes requested" paragraph to state that the workflow should first scan all PRs and then run the "CI failure pattern detection" correlation step (the section titled "CI failure pattern detection") to produce a list of systemic checks; only after that, for each failing PR, consult that systemic check list before deciding to dispatch a fix worker. Add a short explicit note in the "CI failure pattern detection" section such as "Run this correlation BEFORE dispatching workers for failing PRs above" to remove ambiguity.
157-178: Provide implementation details for self-healing guard rails.
The self-healing logic is conceptually sound but lacks concrete implementation:
Line 161: "check whether the issue has already been resolved" — no command shown for how to verify this. Need to query issue state and search for merged PRs that reference it.
Line 171: "Only re-run checks where you have evidence the fix is on main" — no implementation for gathering that evidence. Should this compare check results on recently-created PRs vs. older PRs?
Line 176: "Limit to 10 re-runs per pulse cycle" — good guard rail, but no code to enforce it. Need a counter variable and break condition.
Missing mixed-failure handling: PR objectives mention "PRs with both systemic and per-PR failures receive per-PR fix workers for the per-PR failures only." This scenario isn't addressed. A PR could have check A (systemic) and check B (per-PR) both failing — the pulse should re-run A (if fixed) and dispatch a worker for B.
📋 Implementation sketch for guard rails
```bash
# Track re-runs per cycle
RERUN_COUNT=0
RERUN_MAX=10

# For each PR with stale failed/cancelled run for a systemic check:
for pr_number in "${AFFECTED_PRS[@]}"; do
  # Guard: rate limit
  if [[ $RERUN_COUNT -ge $RERUN_MAX ]]; then
    echo "Re-run rate limit reached ($RERUN_MAX per cycle)"
    break
  fi

  # Guard: evidence that fix is on main
  # Option 1: Check if the systemic issue is closed with a merged PR
  ISSUE_STATE=$(gh issue view "$SYSTEMIC_ISSUE_NUMBER" --repo "$slug" --json state --jq '.state')
  if [[ "$ISSUE_STATE" != "CLOSED" ]]; then
    echo "Systemic issue #$SYSTEMIC_ISSUE_NUMBER not yet resolved, skipping re-run"
    continue
  fi

  # Option 2: Check if the same check now passes on recently-created PRs
  RECENT_PR_CHECK_STATUS=$(gh pr view "$RECENT_PR_NUMBER" --repo "$slug" --json statusCheckRollup \
    --jq ".statusCheckRollup[] | select(.name==\"$CHECK_NAME\") | .conclusion")
  if [[ "$RECENT_PR_CHECK_STATUS" != "SUCCESS" ]]; then
    echo "Check still failing on recent PRs, fix may not be working"
    continue
  fi

  # Re-run the specific failed workflow
  gh run rerun "$RUN_ID" --repo "$slug"
  RERUN_COUNT=$((RERUN_COUNT + 1))
  echo "Re-ran $CHECK_NAME on PR #$pr_number (stale failure from pre-fix workflow)"
done
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/commands/pulse.md around lines 157 - 178, Add concrete guard-rail logic to the self-healing script: implement a RERUN_COUNT/RERUN_MAX counter (e.g., RERUN_COUNT starts at 0, RERUN_MAX=10) and increment/break when the limit is hit; verify "issue resolved" by querying the systemic issue state (SYSTEMIC_ISSUE_NUMBER via gh issue view) and/or ensure evidence of a merged fix on main (search for merged PRs referencing the issue or check that RECENT_PR_NUMBER has CHECK_NAME returning success); only proceed to gh run rerun for the specific RUN_ID when those checks pass; and handle mixed failures by splitting per-PR failures (dispatch per-PR workers for per-pr checks) versus systemic checks (only rerun RUN_ID for systemic CHECK_NAME) before incrementing RERUN_COUNT and logging the action.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e103007e-e948-4c04-bd88-a8dd100b2f65
📒 Files selected for processing (1)
.agents/scripts/commands/pulse.md
Summary
Adds a new 'CI failure pattern detection' subsection to pulse.md Step 3 that correlates CI failures across PRs to detect workflow-level bugs.
Problem
The pulse processed PRs individually without cross-PR correlation. When a workflow bug caused the same CI check to fail on all PRs (e.g., GH#2973: regex false positive + concurrency cancellation), the pulse dispatched individual fix workers for each PR — workers that tried to fix code when the problem was the CI configuration. This wasted worker slots and never fixed the root cause.
Design decisions
Runs `gh pr checks` only when 3+ PRs are failing, avoiding unnecessary API calls on normal cycles.
Closes #2975