
feat: pulse CI failure pattern detection — identify systemic workflow bugs#2979

Merged
marcusquinn merged 4 commits into main from feature/ci-failure-pattern-detection
Mar 6, 2026

Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Mar 6, 2026

Summary

  • Adds a new `CI failure pattern detection` subsection to pulse.md Step 3 that correlates CI failures across PRs to detect workflow-level bugs
  • When 3+ PRs share the same failing check name, the pulse files a single workflow-level issue instead of dispatching N individual fix workers
  • Modifies the existing "Failing CI" bullet to cross-reference the new detection section before dispatching per-PR fix workers

Problem

The pulse processed PRs individually without cross-PR correlation. When a workflow bug caused the same CI check to fail on all PRs (e.g., GH#2973: regex false positive + concurrency cancellation), the pulse dispatched individual fix workers for each PR — workers that tried to fix code when the problem was the CI configuration. This wasted worker slots and never fixed the root cause.

Design decisions

  • Threshold of 3+ PRs: Guideline, not hard rule — the pulse agent uses judgment (2 PRs with identical errors = probably systemic; 4 PRs with different errors = probably coincidence)
  • On-demand check fetching: Individual check names are fetched via gh pr checks only when 3+ PRs are failing, avoiding unnecessary API calls on normal cycles
  • Rate limited: Max 2 systemic CI issues per repo per pulse cycle
  • Dedup: Checks for existing issues before creating new ones
  • Mixed handling: PRs with both systemic and per-PR failures get fix workers for the per-PR failures only
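
The correlation itself ships as agent guidance in pulse.md rather than as a script, but the counting step behind the 3+ threshold can be sketched in bash. This is a hypothetical illustration: in the real pulse the failing checks would come from `gh pr checks`, whereas here they are hard-coded sample data in `<pr>:<check name>` form, and `THRESHOLD` is the guideline value, not a hard rule.

```bash
#!/usr/bin/env bash
# Sketch of the cross-PR correlation (sample data, not real gh output).
FAILING=(
  "101:Wait for AI Review Bots"
  "102:Wait for AI Review Bots"
  "103:Wait for AI Review Bots"
  "104:unit-tests"
)
THRESHOLD=3

# Count how many PRs share each failing check name.
declare -A count
for entry in "${FAILING[@]}"; do
  check="${entry#*:}"                         # strip the "<pr>:" prefix
  count[$check]=$(( ${count[$check]:-0} + 1 ))
done

# Checks at or above the threshold are treated as systemic:
# one workflow-level issue instead of N per-PR fix workers.
systemic=(); perpr=()
for check in "${!count[@]}"; do
  if [ "${count[$check]}" -ge "$THRESHOLD" ]; then
    systemic+=("$check")   # file a single workflow-level issue
  else
    perpr+=("$check")      # dispatch an individual fix worker
  fi
done
echo "systemic: ${systemic[*]}"
echo "per-PR:   ${perpr[*]}"
```

With this sample data the three "Wait for AI Review Bots" failures collapse into one systemic finding, while the lone unit-tests failure still routes to a per-PR worker.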

Closes #2975

Summary by CodeRabbit

  • Chores
    • Enhanced CI failure handling with an intelligent pre-dispatch check that distinguishes between systemic workflow issues and per-PR problems, optimizing the dispatch process and reducing unnecessary worker invocations.

@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Contributor

coderabbitai bot commented Mar 6, 2026

Walkthrough

Updated the pulse CI handling rule to introduce a pre-dispatch systemic failure check. When CI failures appear on multiple PRs identically, the check skips per-PR worker dispatch and routes to a workflow-level issue instead, distinguishing systemic from per-PR failures.

Changes

Cohort / File(s) Summary
Pulse CI dispatch gating
.agents/scripts/commands/pulse.md
Added pre-dispatch gate to detect systemic CI failures across PRs. If a check fails consistently on 3+ PRs, skip per-PR worker dispatch and file workflow issue; otherwise proceed with normal dispatch logic.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested labels

enhancement

Poem

🔄 One gate to catch them all—
Systemic bugs won't slip through
Pulse sees the pattern now
Not N fixes, one truth,
Workflow bugs meet their match ✨

🚥 Pre-merge checks | ✅ 5 passed

  • Description Check: ✅ Passed (skipped: CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title accurately describes the main change, adding CI failure pattern detection to identify systemic workflow bugs, which is the primary focus of this PR.
  • Linked Issues Check: ✅ Passed. The PR implements all core coding requirements from #2975: CI failure pattern detection in pulse.md Step 3, systemic vs per-PR failure distinction, and cross-PR check correlation logic.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly scoped to the CI failure pattern detection feature described in issue #2975; no unrelated modifications were introduced.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate; docstring coverage check skipped.


@marcusquinn
Owner Author

This PR appears orphaned — no active worker process found and no activity for 6+ hours. Flagging for re-dispatch. If work is still in progress, remove the status:orphaned label.

Add a new 'CI failure pattern detection' subsection to pulse.md Step 3
that correlates CI failures across PRs to identify workflow-level bugs.

When 3+ PRs share the same failing check, the pulse now files a single
workflow-level issue instead of dispatching N individual fix workers
that would try to fix code when the problem is the CI configuration.

This implements the 'Multiple PRs fail CI with the same error pattern'
signal from AGENTS.md Self-Improvement, which was guidance without
implementation — the pulse processed PRs individually without cross-PR
correlation, causing GH#2973 to go undetected.

Closes #2975
@marcusquinn marcusquinn force-pushed the feature/ci-failure-pattern-detection branch from 6506858 to 9557d5c on March 6, 2026 at 17:10
@github-actions

github-actions bot commented Mar 6, 2026

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 108 code smells

[INFO] Recent monitoring activity:
Fri Mar 6 17:10:34 UTC 2026: Code review monitoring started
Fri Mar 6 17:10:34 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 108

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 108
  • VULNERABILITIES: 0

Generated on: Fri Mar 6 17:10:37 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@marcusquinn
Owner Author

The Wait for AI Review Bots check is failing due to a systemic workflow issue — bots are rate-limited and the workflow treats rate-limit notices as non-reviews. Tracked in #3005. This is not a code issue with this PR.

@marcusquinn
Owner Author

Re-ran 'Wait for AI Review Bots' check (stale failure from pre-fix workflow). PR #3006 merged the review-bot-gate rate-limit fix — this re-run uses the updated workflow on main.

@marcusquinn
Owner Author

Re-triggering review-bot-gate after GH#3007 fix (bidirectional prefix matching for CodeRabbit status context).



@github-actions github-actions bot added the `enhancement` label (auto-created from TODO.md tag) Mar 6, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/scripts/commands/pulse.md (1)

137-146: ⚠️ Potential issue | 🟠 Major

Implement missing rate limit and strengthen deduplication.

Several features from the PR objectives are not implemented:

  1. Rate limit (PR objective): The PR description specifies "Rate-limited to a maximum of 2 systemic CI issues per repo per pulse cycle," but there's no counter or enforcement logic here. Without this, a repo with 10 failing checks could create 10 issues in one pulse cycle.

  2. Deduplication weakness: Line 144's search query "<check name> failing" is not specific enough. Multiple unrelated issues could match this pattern. Consider using a more specific marker (e.g., label systemic-ci-failure + search by check name) or a structured comment format that can be parsed deterministically.

  3. Security: command injection risk: The <check name> variable in the search query should be sanitized or quoted. If a check name contains special characters, it could break the command or cause unintended behavior.

  4. Missing command: Line 145 says "file one" but doesn't show the actual gh issue create command. Compare this to lines 599-614, which provide the complete command template for quality findings.

🛡️ Proposed fixes

Add rate limit tracking:

+# Track systemic CI issues created this cycle (per-repo counter)
+SYSTEMIC_CI_ISSUES_CREATED=0
+SYSTEMIC_CI_ISSUES_MAX=2
+
 **What to do when a systemic pattern is found:**

 1. **Do NOT dispatch workers** to fix individual PRs for that check — the fix is in the workflow, not the PR code
-2. **Search for an existing issue** describing the pattern: `gh issue list --repo <slug> --search "<check name> failing" --state open`
-3. **If no issue exists**, file one describing: which check is failing, how many PRs are affected, the error message (from `gh run view <run_id> --log-failed`), and a hypothesis about the root cause
+2. **Skip if rate limit reached:** If `SYSTEMIC_CI_ISSUES_CREATED >= SYSTEMIC_CI_ISSUES_MAX`, skip to next check (log: "Rate limit reached for systemic CI issues this cycle")
+3. **Search for an existing issue** describing the pattern: `gh issue list --repo <slug> --label "systemic-ci-failure" --search "in:title \"$CHECK_NAME\"" --state open`
+4. **If no issue exists**, file one:
+
+   ```bash
+   gh issue create --repo <slug> \
+     --title "systemic: CI check \"$CHECK_NAME\" failing on ${AFFECTED_PR_COUNT} PRs" \
+     --label "bug,auto-dispatch,systemic-ci-failure" \
+     --body "Detected by pulse CI failure pattern detection.
+
+   **Check name**: $CHECK_NAME
+   **Affected PRs**: ${AFFECTED_PR_COUNT} (#${PR_LIST})
+   **Error sample**: (from gh run view <run_id> --log-failed)
+
+   **Hypothesis**: This is a workflow-level bug (misconfigured bot, permissions, or workflow logic) affecting all PRs identically. Fix the workflow file or infrastructure, not individual PR code.
+
+   See pulse.md 'CI failure pattern detection' for detection criteria."
+   
+   SYSTEMIC_CI_ISSUES_CREATED=$((SYSTEMIC_CI_ISSUES_CREATED + 1))
+   ```
-4. **Label it** `bug` + `auto-dispatch` so a worker picks it up and fixes the workflow itself

Sanitize the check name before using it in a search query:

```bash
# Sanitize check name for safe use in search queries
CHECK_NAME_SAFE=$(printf '%s' "$CHECK_NAME" | sed 's/[^a-zA-Z0-9 _-]/_/g')
```
🧹 Nitpick comments (2)
.agents/scripts/commands/pulse.md (2)

125-125: Clarify the execution order for systemic failure detection.

The reference to "check whether this is a systemic failure (see 'CI failure pattern detection' below)" creates potential ordering ambiguity. The correlation logic (lines 133-178) states it runs "After processing individual PRs," but this gate at line 125 needs the correlation results BEFORE deciding whether to dispatch.

Suggested flow clarification:

  1. Scan all PRs (merge ready ones, note failing ones)
  2. Run the CI failure pattern detection (lines 133-178) to identify systemic check names
  3. For each failing PR, consult the systemic check list before dispatching

Consider adding a note at line 133 like: "Run this correlation BEFORE dispatching workers for failing PRs above." to make the execution order explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/commands/pulse.md at line 125, Clarify the ordering so the
CI correlation runs before per-PR dispatch: update the "Failing CI or changes
requested" paragraph to state that the workflow should first scan all PRs and
then run the "CI failure pattern detection" correlation step (the section titled
"CI failure pattern detection") to produce a list of systemic checks; only after
that, for each failing PR, consult that systemic check list before deciding to
dispatch a fix worker. Add a short explicit note in the "CI failure pattern
detection" section such as "Run this correlation BEFORE dispatching workers for
failing PRs above" to remove ambiguity.

157-178: Provide implementation details for self-healing guard rails.

The self-healing logic is conceptually sound but lacks concrete implementation:

  1. Line 161: "check whether the issue has already been resolved" — no command shown for how to verify this. Need to query issue state and search for merged PRs that reference it.

  2. Line 171: "Only re-run checks where you have evidence the fix is on main" — no implementation for gathering that evidence. Should this compare check results on recently-created PRs vs. older PRs?

  3. Line 176: "Limit to 10 re-runs per pulse cycle" — good guard rail, but no code to enforce it. Need a counter variable and break condition.

  4. Missing mixed-failure handling: PR objectives mention "PRs with both systemic and per-PR failures receive per-PR fix workers for the per-PR failures only." This scenario isn't addressed. A PR could have check A (systemic) and check B (per-PR) both failing — the pulse should re-run A (if fixed) and dispatch a worker for B.

📋 Implementation sketch for guard rails

```bash
# Track re-runs per cycle
RERUN_COUNT=0
RERUN_MAX=10

# For each PR with stale failed/cancelled run for a systemic check:
for pr_number in "${AFFECTED_PRS[@]}"; do
  # Guard: rate limit
  if [[ $RERUN_COUNT -ge $RERUN_MAX ]]; then
    echo "Re-run rate limit reached ($RERUN_MAX per cycle)"
    break
  fi

  # Guard: evidence that fix is on main
  # Option 1: Check if the systemic issue is closed with a merged PR
  ISSUE_STATE=$(gh issue view "$SYSTEMIC_ISSUE_NUMBER" --repo "$slug" --json state --jq '.state')
  if [[ "$ISSUE_STATE" != "CLOSED" ]]; then
    echo "Systemic issue #$SYSTEMIC_ISSUE_NUMBER not yet resolved, skipping re-run"
    continue
  fi

  # Option 2: Check if the same check now passes on recently-created PRs
  RECENT_PR_CHECK_STATUS=$(gh pr view "$RECENT_PR_NUMBER" --repo "$slug" --json statusCheckRollup --jq ".statusCheckRollup[] | select(.name==\"$CHECK_NAME\") | .conclusion")
  if [[ "$RECENT_PR_CHECK_STATUS" != "SUCCESS" ]]; then
    echo "Check still failing on recent PRs, fix may not be working"
    continue
  fi

  # Re-run the specific failed workflow
  gh run rerun "$RUN_ID" --repo "$slug"
  RERUN_COUNT=$((RERUN_COUNT + 1))
  echo "Re-ran $CHECK_NAME on PR #$pr_number (stale failure from pre-fix workflow)"
done
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/commands/pulse.md around lines 157 - 178, Add concrete
guard-rail logic to the self-healing script: implement a RERUN_COUNT/RERUN_MAX
counter (e.g., RERUN_COUNT starts at 0, RERUN_MAX=10) and increment/break when
the limit is hit; verify "issue resolved" by querying the systemic issue state
(SYSTEMIC_ISSUE_NUMBER via gh issue view) and/or ensure evidence of a merged fix
on main (search for merged PRs referencing the issue or check that
RECENT_PR_NUMBER has CHECK_NAME returning success); only proceed to gh run rerun
for the specific RUN_ID when those checks pass; and handle mixed failures by
splitting per-PR failures (dispatch per-PR workers for per-pr checks) versus
systemic checks (only rerun RUN_ID for systemic CHECK_NAME) before incrementing
RERUN_COUNT and logging the action.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e103007e-e948-4c04-bd88-a8dd100b2f65

📥 Commits

Reviewing files that changed from the base of the PR and between 35f34ca and 59cfedc.

📒 Files selected for processing (1)
  • .agents/scripts/commands/pulse.md


@sonarqubecloud

sonarqubecloud bot commented Mar 6, 2026

@marcusquinn marcusquinn merged commit 128e611 into main Mar 6, 2026
12 checks passed
@marcusquinn marcusquinn deleted the feature/ci-failure-pattern-detection branch March 6, 2026 21:02

Labels

enhancement (auto-created from TODO.md tag), status:orphaned

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: pulse CI failure pattern detection — identify systemic workflow bugs

1 participant