fix(supervisor): detect gh auth failures to prevent clean_exit_no_signal retry waste (t198)#924
fix(supervisor): detect gh auth failures to prevent clean_exit_no_signal retry waste (t198)#924marcusquinn wants to merge 1 commit intomainfrom
Conversation
…nal retry waste (t198) Root cause analysis of 560 worker logs found 79% (38/48) of clean_exit_no_signal cases are workers that completed work but couldn't push/create PR due to expired GitHub auth. These burned 3 retries each at Opus cost with identical failures. Changes: - extract_log_metadata(): add gh_auth_failure detection scanning full log for specific gh auth patterns (safe from false positives — patterns are specific) - evaluate_worker(): add blocked:gh_auth_expired check before clean_exit_no_signal fallback, so auth failures block immediately instead of wasting retries - Tests: 4 new test cases covering gh auth failure detection in both test suites Log analysis breakdown (560 total logs): - 169 backend errors (short logs, already handled) - 127 PR created but no signal (already handled by t192) - 92 success with signal - 57 clean_exit_no_signal candidates (48 after filtering) - 38 gh auth failures (NOW BLOCKED instead of retried) - 2 task already done (handled by task_obsolete detection) - 7 context exhaustion / short logs (legitimate retries) - 1 shell unavailable
WalkthroughThis PR adds GitHub authentication failure detection to supervisor task evaluation logic. The Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves resource efficiency by addressing a common cause of wasted computational cycles: tasks failing due to expired GitHub authentication. By implementing proactive detection of these authentication issues, the system now immediately blocks such tasks, preventing unnecessary retries and associated costs. This change allows for more efficient resolution of credential problems and optimizes overall task processing. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Tue Feb 10 13:19:56 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.agents/scripts/supervisor-helper.sh:
- Around line 4769-4780: The current full-log grep that sets
gh_auth_failure=true is too broad and matches generic phrases; update the grep
in the block that defines gh_auth_failure to restrict matches to explicit auth
failure lines only (references: variable gh_auth_failure, the grep invocation
using "$log_file"): match patterns like '^.*gh auth (status|login).*fail',
'(^|[:space])gh:', 'fatal: Authentication failed', 'remote: Permission denied',
and 'authentication token.*expired' (case-insensitive), and remove or
de-prioritize generic phrases such as "not logged in" or "failed to
authenticate" unless they appear in those auth-specific contexts; keep the
2>/dev/null and echo "gh_auth_failure=$gh_auth_failure" behavior.
| # GitHub auth failure detection (t198): workers that complete their work but | ||
| # can't push/create a PR because gh auth is expired. This is the #1 cause of | ||
| # clean_exit_no_signal (79% of cases in production logs). Without this check, | ||
| # the supervisor retries them 3x at Opus cost, each failing identically. | ||
| # Search the FULL log (not just tail) because auth failures appear in tool | ||
| # output mid-log, not at the end. This is safe because gh auth patterns are | ||
| # specific enough to avoid false positives from documentation content. | ||
| local gh_auth_failure="false" | ||
| if grep -qiE 'gh auth (status|login).*fail|authentication token.*expired|not logged in|try authenticating|gh: To use .* in a non-interactive context|could not authenticate|failed to authenticate' "$log_file" 2>/dev/null; then | ||
| gh_auth_failure="true" | ||
| fi | ||
| echo "gh_auth_failure=$gh_auth_failure" |
There was a problem hiding this comment.
Tighten auth-failure regex to avoid false positives in full-log scan.
Generic phrases like “not logged in” can appear in docs or code samples within logs, which would incorrectly block a task. Scope matches to gh/git auth error lines (e.g., gh auth, gh:, fatal: Authentication failed, remote: Permission denied) before setting gh_auth_failure=true.
Suggested refinement
- if grep -qiE 'gh auth (status|login).*fail|authentication token.*expired|not logged in|try authenticating|gh: To use .* in a non-interactive context|could not authenticate|failed to authenticate' "$log_file" 2>/dev/null; then
+ if grep -qiE '(gh auth (status|login).*(fail|not logged in|authenticate)|gh:.*(not logged in|authenticate)|GitHub CLI.*(not logged in|authenticate)|fatal:.*authentication failed|remote:.*(authentication|permission) denied)' "$log_file" 2>/dev/null; then
gh_auth_failure="true"
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # GitHub auth failure detection (t198): workers that complete their work but | |
| # can't push/create a PR because gh auth is expired. This is the #1 cause of | |
| # clean_exit_no_signal (79% of cases in production logs). Without this check, | |
| # the supervisor retries them 3x at Opus cost, each failing identically. | |
| # Search the FULL log (not just tail) because auth failures appear in tool | |
| # output mid-log, not at the end. This is safe because gh auth patterns are | |
| # specific enough to avoid false positives from documentation content. | |
| local gh_auth_failure="false" | |
| if grep -qiE 'gh auth (status|login).*fail|authentication token.*expired|not logged in|try authenticating|gh: To use .* in a non-interactive context|could not authenticate|failed to authenticate' "$log_file" 2>/dev/null; then | |
| gh_auth_failure="true" | |
| fi | |
| echo "gh_auth_failure=$gh_auth_failure" | |
| # GitHub auth failure detection (t198): workers that complete their work but | |
| # can't push/create a PR because gh auth is expired. This is the `#1` cause of | |
| # clean_exit_no_signal (79% of cases in production logs). Without this check, | |
| # the supervisor retries them 3x at Opus cost, each failing identically. | |
| # Search the FULL log (not just tail) because auth failures appear in tool | |
| # output mid-log, not at the end. This is safe because gh auth patterns are | |
| # specific enough to avoid false positives from documentation content. | |
| local gh_auth_failure="false" | |
| if grep -qiE '(gh auth (status|login).*(fail|not logged in|authenticate)|gh:.*(not logged in|authenticate)|GitHub CLI.*(not logged in|authenticate)|fatal:.*authentication failed|remote:.*(authentication|permission) denied)' "$log_file" 2>/dev/null; then | |
| gh_auth_failure="true" | |
| fi | |
| echo "gh_auth_failure=$gh_auth_failure" |
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor-helper.sh around lines 4769 - 4780, The current
full-log grep that sets gh_auth_failure=true is too broad and matches generic
phrases; update the grep in the block that defines gh_auth_failure to restrict
matches to explicit auth failure lines only (references: variable
gh_auth_failure, the grep invocation using "$log_file"): match patterns like
'^.*gh auth (status|login).*fail', '(^|[:space])gh:', 'fatal: Authentication
failed', 'remote: Permission denied', and 'authentication token.*expired'
(case-insensitive), and remove or de-prioritize generic phrases such as "not
logged in" or "failed to authenticate" unless they appear in those auth-specific
contexts; keep the 2>/dev/null and echo "gh_auth_failure=$gh_auth_failure"
behavior.
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a significant source of wasted retries by detecting GitHub authentication failures early and blocking the task. The logic is sound and the implementation is clean. The addition of new tests to cover this case is also a great improvement. I have one minor suggestion to better align with the repository's shell scripting style guide.
| if grep -qiE 'gh auth (status|login).*fail|authentication token.*expired|not logged in|try authenticating|gh: To use .* in a non-interactive context|could not authenticate|failed to authenticate' "$log_file" 2>/dev/null; then | ||
| gh_auth_failure="true" | ||
| fi |
There was a problem hiding this comment.
This use of 2>/dev/null to suppress grep's error output violates the repository's style guide (line 50), which states that 2>/dev/null is only acceptable when redirecting to log files. A safer and more idiomatic way to handle cases where the log file might not exist is to check for the file's existence before calling grep.
| if grep -qiE 'gh auth (status|login).*fail|authentication token.*expired|not logged in|try authenticating|gh: To use .* in a non-interactive context|could not authenticate|failed to authenticate' "$log_file" 2>/dev/null; then | |
| gh_auth_failure="true" | |
| fi | |
| if [[ -f "$log_file" ]] && grep -qiE 'gh auth (status|login).*fail|authentication token.*expired|not logged in|try authenticating|gh: To use .* in a non-interactive context|could not authenticate|failed to authenticate' "$log_file"; then | |
| gh_auth_failure="true" | |
| fi |
References
- The style guide at line 50 bans blanket suppression of stderr with
2>/dev/null, allowing it only when redirecting to log files. The current code uses it to hide potential 'file not found' errors fromgrep. (link)
|
Closing: t198 was already completed in PR #834 (merged 2026-02-10). This PR was created by a re-dispatch of an already-completed task. |



Summary
clean_exit_no_signalcases are workers that completed work but couldn't push/create PR due to expired GitHub authblocked:gh_auth_expireddetection so auth failures block immediately instead of wasting retriesRoot Cause Analysis
Of the 48
clean_exit_no_signalcandidates:task_obsoletedetectionChanges
extract_log_metadata(): Addedgh_auth_failuredetection scanning full log for specificgh authpatternsevaluate_worker(): Addedblocked:gh_auth_expiredcheck beforeclean_exit_no_signalfallbackTesting
test-supervisor-state-machine.sh: 4 new tests pass, 13 pre-existing failures unchangedtest-dispatch-worktree-evaluate.sh: 1 new test passes, 12 pre-existing failures unchangedCloses #817
Summary by CodeRabbit
Release Notes
New Features
Tests