t1113: Add worker_never_started diagnostic and auto-retry with environment check#1980
t1113: Add worker_never_started diagnostic and auto-retry with environment check#1980marcusquinn merged 2 commits intomainfrom
Conversation
…ed failures (t1113) Before dispatch, verify the AI CLI binary exists and can execute. If the check fails, defer dispatch to next pulse instead of spawning a worker that will never produce output. This prevents burning retries on environment issues (missing binary, broken installation, PATH problems). The check_cli_health() function: - Verifies binary exists in PATH (command -v) - Runs lightweight version check with 10s timeout - Caches result per-pulse and for 5 minutes (file-based) - Logs diagnostic detail on failure for troubleshooting Also adds cli_health=ok to dispatch metadata log for auditability.
…without burning retries (t1113) Reclassify worker_never_started, log_file_missing, log_file_empty, and related dispatch infrastructure failures from LOGIC to new ENVIRONMENT category. These failures indicate the CLI/environment is broken, not the task itself. When the pulse detects an ENVIRONMENT failure: - Re-queues the task (evaluating -> queued) without incrementing retry count - Invalidates CLI health cache so next dispatch re-verifies the environment - Stores failure pattern for diagnostics without marking as task failure - Logs proof-log entry with failure_mode=ENVIRONMENT for auditability This prevents the Feb 13 scenario where 5 tasks exhausted their retries on worker_never_started:no_sentinel when the underlying issue was a broken CLI environment that no amount of task retries would fix.
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the robustness of the task dispatch system by intelligently handling environment-related failures. It introduces proactive checks for AI CLI health and a new failure classification, ensuring that tasks encountering infrastructure problems are re-queued without consuming valuable retries, thereby improving system efficiency and preventing tasks from being prematurely marked as failed due to external factors. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughAdded CLI health verification function with caching to validate AI CLI binary availability before dispatch. Introduced ENVIRONMENT failure category to classify infrastructure/dispatch issues separately from logic errors. Integrated environment-aware failure handling in pulse Phase 1 that requeues without consuming retries, invalidates health cache, and performs pre-dispatch verification in dispatch flow. Changes
Sequence DiagramsequenceDiagram
participant Pulse as Pulse Supervisor
participant Dispatch as Dispatch Flow
participant Evaluate as Evaluate (Classify)
participant Health as CLI Health Check
participant Worker as Worker Process
participant Cache as Health Cache
Pulse->>Dispatch: Phase 1 - Check for work
Dispatch->>Health: check_cli_health()
Health->>Cache: Check 5-min cache
alt Cache Valid
Health->>Health: Return cached result
else Cache Expired
Health->>Health: Verify CLI binary exists
Health->>Health: Verify executable & version
Health->>Cache: Store result
end
Health-->>Dispatch: Health status (0=ok, 1=fail)
alt CLI Healthy
Dispatch->>Worker: Spawn worker process
Dispatch-->>Pulse: Return 0
else CLI Unhealthy
Dispatch-->>Pulse: Return 3 (defer)
end
alt Evaluation Phase
Pulse->>Evaluate: classify_failure_mode()
Evaluate-->>Pulse: ENVIRONMENT or other
alt Is ENVIRONMENT
Pulse->>Pulse: Requeue (no retry++)
Pulse->>Cache: Invalidate health cache
Pulse->>Worker: Cleanup process tree
else Non-ENVIRONMENT
Pulse->>Pulse: Original failure logic
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Thu Feb 19 22:48:09 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
.agents/scripts/supervisor/dispatch.sh (2)
1530-1534: Document the externalunsetcontract for_PULSE_CLI_VERIFIED.
_PULSE_CLI_VERIFIEDis never cleared within this function or file — its lifecycle depends onpulse.shunsetting it onENVIRONMENTfailures. The function comment only describes the fast-path behaviour, not the obligation on the caller. A future refactor or new entrypoint that bypassespulse.shwould silently skip all CLI health checks for the entire process lifetime.Consider adding a brief note (matching the existing
_PULSE_HEALTH_VERIFIEDpattern) so the coupling is explicit:✏️ Suggested doc addition
# Pulse-level fast path: if CLI was already verified in this pulse, skip + # NOTE: _PULSE_CLI_VERIFIED must be unset by pulse.sh after ENVIRONMENT failures + # so the next pulse re-probes the CLI. See pulse.sh ENVIRONMENT failure handler. if [[ -n "${_PULSE_CLI_VERIFIED:-}" ]]; then🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/dispatch.sh around lines 1530 - 1534, The fast-path uses the external flag _PULSE_CLI_VERIFIED but its lifecycle is controlled elsewhere (pulse.sh) and is never unset in this file; add a short comment above the check for _PULSE_CLI_VERIFIED in dispatch.sh (matching the existing _PULSE_HEALTH_VERIFIED pattern) documenting that callers (pulse.sh) must unset this variable on ENVIRONMENT failures and that other entrypoints must honor/replicate that contract to avoid silently skipping CLI health checks for the process lifetime; reference the variable name _PULSE_CLI_VERIFIED and pulse.sh in the comment so future maintainers know where the unset happens and the obligation to clear it.
978-984:check_cli_healthis not called indo_prompt_repeatorcmd_reprompt, leaving those dispatch paths unguarded.The PR's stated goal is to avoid wasting retries when the CLI environment is broken. Both
do_prompt_repeat(line 980) andcmd_reprompt(line 3001) callcheck_model_healthonly, without verifying CLI binary health. Since both can spawn real workers and consume retry counts, an environment where theopencodebinary is missing or broken would be caught on the first dispatch viacmd_dispatch, but subsequent prompt-repeat and reprompt cycles would still proceed and fail.Consider adding the same guard to both functions:
♻️ Proposed addition to do_prompt_repeat (~Line 978)
# Pre-dispatch health check local health_model health_exit=0 + local cli_health_rc=0 + check_cli_health "$ai_cli" >/dev/null || cli_health_rc=$? + if [[ "$cli_health_rc" -ne 0 ]]; then + log_error "do_prompt_repeat: CLI health check failed ($ai_cli) — deferring" + return 1 + fi health_model=$(resolve_model "health" "$ai_cli")Apply the same pattern to
cmd_repromptnear line 3001. The redirect to/dev/nulldiscards diagnostic stdout while preserving stderr emission.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/dispatch.sh around lines 978 - 984, do_prompt_repeat and cmd_reprompt currently only call check_model_health, leaving CLI binary failures unguarded; add the same check_cli_health guard used in cmd_dispatch: after resolving the ai_cli and health_model (using resolve_model "$ai_cli"), call check_cli_health "$ai_cli" "$health_model" and capture its exit code (e.g., cli_exit=0; check_cli_health ... || cli_exit=$?); if cli_exit is non‑zero log a warning like the existing log_warn message and return 1 to abort the prompt-repeat/reprompt path so retries aren't wasted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1562-1595: The health-check incorrectly initializes version_exit=1
and uses the "cmd || version_exit=$?" idiom, causing successful commands to
leave version_exit at 1 and producing false positives; change each invocation
that currently does version_output=$(... ) || version_exit=$? to first capture
output then immediately set version_exit=$? (e.g., version_output=$(... 2>&1);
version_exit=$?) for all branches that run the CLI (the blocks referencing
timeout_cmd, ai_cli, and the "opencode" vs "claude CLI" branches), leaving the
rest of the logic (the check using version_exit, version_output, cli_cache_file,
_PULSE_CLI_VERIFIED, and log_info) unchanged.
---
Nitpick comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1530-1534: The fast-path uses the external flag
_PULSE_CLI_VERIFIED but its lifecycle is controlled elsewhere (pulse.sh) and is
never unset in this file; add a short comment above the check for
_PULSE_CLI_VERIFIED in dispatch.sh (matching the existing _PULSE_HEALTH_VERIFIED
pattern) documenting that callers (pulse.sh) must unset this variable on
ENVIRONMENT failures and that other entrypoints must honor/replicate that
contract to avoid silently skipping CLI health checks for the process lifetime;
reference the variable name _PULSE_CLI_VERIFIED and pulse.sh in the comment so
future maintainers know where the unset happens and the obligation to clear it.
- Around line 978-984: do_prompt_repeat and cmd_reprompt currently only call
check_model_health, leaving CLI binary failures unguarded; add the same
check_cli_health guard used in cmd_dispatch: after resolving the ai_cli and
health_model (using resolve_model "$ai_cli"), call check_cli_health "$ai_cli"
"$health_model" and capture its exit code (e.g., cli_exit=0; check_cli_health
... || cli_exit=$?); if cli_exit is non‑zero log a warning like the existing
log_warn message and return 1 to abort the prompt-repeat/reprompt path so
retries aren't wasted.
| local version_output="" | ||
| local version_exit=1 | ||
|
|
||
| # Use timeout to prevent hanging on broken installations | ||
| local timeout_cmd="" | ||
| if command -v gtimeout &>/dev/null; then | ||
| timeout_cmd="gtimeout" | ||
| elif command -v timeout &>/dev/null; then | ||
| timeout_cmd="timeout" | ||
| fi | ||
|
|
||
| if [[ "$ai_cli" == "opencode" ]]; then | ||
| if [[ -n "$timeout_cmd" ]]; then | ||
| version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1) || version_exit=$? | ||
| else | ||
| version_output=$("$ai_cli" version 2>&1) || version_exit=$? | ||
| fi | ||
| else | ||
| # claude CLI | ||
| if [[ -n "$timeout_cmd" ]]; then | ||
| version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1) || version_exit=$? | ||
| else | ||
| version_output=$("$ai_cli" --version 2>&1) || version_exit=$? | ||
| fi | ||
| fi | ||
|
|
||
| # If version command succeeded (exit 0) or produced output, CLI is working | ||
| if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then | ||
| # Cache the healthy result | ||
| date +%s >"$cli_cache_file" 2>/dev/null || true | ||
| _PULSE_CLI_VERIFIED="true" | ||
| log_info "CLI health: OK ($ai_cli: ${version_output:0:80})" | ||
| return 0 | ||
| fi |
There was a problem hiding this comment.
version_exit=1 initialization makes the exit_code==0 branch unreachable and enables a false-positive health result.
The || version_exit=$? idiom only assigns when the command exits non-zero (short-circuit evaluation). Because version_exit is initialized to 1, if opencode version / claude --version exits successfully (code 0), version_exit stays at 1 — the "$version_exit" -eq 0 guard on Line 1589 is permanently dead code.
More importantly this creates a false positive: a binary that exits non-zero (broken installation, startup crash) but still emits any output to stderr/stdout (which 2>&1 captures) will satisfy the second condition [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]] and be reported as healthy — exactly the class of environment failure this check is designed to catch.
🐛 Proposed fix — separate assignment from capture
- local version_output=""
- local version_exit=1
+ local version_output=""
+ local version_exit=0
if [[ "$ai_cli" == "opencode" ]]; then
if [[ -n "$timeout_cmd" ]]; then
- version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1) || version_exit=$?
+ version_output=$("$timeout_cmd" 10 "$ai_cli" version 2>&1)
+ version_exit=$?
else
- version_output=$("$ai_cli" version 2>&1) || version_exit=$?
+ version_output=$("$ai_cli" version 2>&1)
+ version_exit=$?
fi
else
if [[ -n "$timeout_cmd" ]]; then
- version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1) || version_exit=$?
+ version_output=$("$timeout_cmd" 10 "$ai_cli" --version 2>&1)
+ version_exit=$?
else
- version_output=$("$ai_cli" --version 2>&1) || version_exit=$?
+ version_output=$("$ai_cli" --version 2>&1)
+ version_exit=$?
fi
fiWith this change, a clean exit 0 (even with empty output) is treated as healthy via the first condition, and a non-zero exit is correctly treated as a failure unless there is output (which preserves the "output-even-on-non-zero" lenient path intentionally).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/dispatch.sh around lines 1562 - 1595, The
health-check incorrectly initializes version_exit=1 and uses the "cmd ||
version_exit=$?" idiom, causing successful commands to leave version_exit at 1
and producing false positives; change each invocation that currently does
version_output=$(... ) || version_exit=$? to first capture output then
immediately set version_exit=$? (e.g., version_output=$(... 2>&1);
version_exit=$?) for all branches that run the CLI (the blocks referencing
timeout_cmd, ai_cli, and the "opencode" vs "claude CLI" branches), leaving the
rest of the logic (the check using version_exit, version_output, cli_cache_file,
_PULSE_CLI_VERIFIED, and log_info) unchanged.
There was a problem hiding this comment.
Code Review
The pull request introduces a valuable pre-dispatch health check for the AI CLI, improving the overall reliability of the supervisor by preventing wasted retries on environment-related issues. The classification of failures into a new ENVIRONMENT category is a sound architectural decision. However, there are some technical issues in the health check implementation, particularly regarding arithmetic safety with potentially empty variables and incorrect exit code handling due to initialization values.
| local cached_at | ||
| cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0") | ||
| local now | ||
| now=$(date +%s) | ||
| local age=$((now - cached_at)) |
There was a problem hiding this comment.
The arithmetic expansion at line 1545 is unsafe because cached_at can be empty if the cache file exists but is empty (in which case cat succeeds and the || echo "0" fallback is not triggered). This would cause a syntax error in the arithmetic expansion. Additionally, following the repository style guide (line 11), declarations and assignments for expansions that can fail should be separate to ensure exit code safety.
| local cached_at | |
| cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0") | |
| local now | |
| now=$(date +%s) | |
| local age=$((now - cached_at)) | |
| local cached_at | |
| cached_at=$(cat "$cli_cache_file" 2>/dev/null || echo "0") | |
| local now | |
| now=$(date +%s) | |
| local age | |
| age=$((now - ${cached_at:-0})) |
References
- Use
local var="$1"pattern in functions (declare and assign separately for exit code safety) (link) - In shell scripts with
set -e,var=$(cat file || echo 0)is not safe for arithmetic if the file can be empty.caton an empty file succeeds, makingvaran empty string and causing arithmetic expansion to fail. Use parameter expansion with a default value (e.g.,$((... - ${var:-0}))) to handle empty strings robustly.
|
|
||
| # Check 2: binary is executable and can produce version output | ||
| local version_output="" | ||
| local version_exit=1 |
There was a problem hiding this comment.
Initializing version_exit to 1 causes a logic error in the health check. When the version command succeeds (exit 0), the || version_exit=$? part is skipped, leaving version_exit at its initial value of 1. This causes the first condition of the health check at line 1589 ([[ "$version_exit" -eq 0 ]]) to fail even on success, forcing the check to rely on the more lenient output-based condition. It should be initialized to 0.
| local version_exit=1 | |
| local version_exit=0 |
| fi | ||
|
|
||
| # If version command succeeded (exit 0) or produced output, CLI is working | ||
| if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then |
There was a problem hiding this comment.
The health check logic is currently too lenient. If the version command fails (e.g., exit 1) but still produces some output (like an error message), it will be considered 'healthy' because of the [[ -n "$version_output" ... ]] condition. Given that the goal is to detect broken installations, a non-zero exit code should generally be treated as a failure unless there's a specific reason to trust the output of a failed command.
| if [[ "$version_exit" -eq 0 ]] || [[ -n "$version_output" && "$version_exit" -ne 124 && "$version_exit" -ne 137 ]]; then | |
| if [[ "$version_exit" -eq 0 ]]; then |
Auto-dismissed: bot review does not block autonomous pipeline



Summary
Prevents wasting retries on environment issues where the AI CLI is broken. On Feb 13, 5 tasks failed with
worker_never_started:no_sentinel— the CLI was invoked but never produced output. Each failure burned a retry, exhausting max_retries on an infrastructure problem that no task retry would fix.Changes
1. Pre-dispatch CLI health check (
dispatch.sh)check_cli_health()function verifies the AI CLI binary exists and can execute before spawning workerscommand -vfor PATH presence, then lightweight version/help command with 10s timeoutcli_health=okto dispatch metadata log for auditability2. ENVIRONMENT failure mode (
evaluate.sh)ENVIRONMENTfor dispatch infrastructure failuresworker_never_started*,log_file_missing*,log_file_empty,no_log_path_in_db*,dispatch_script_not_executablefromLOGICtoENVIRONMENT3. Re-queue without burning retries (
pulse.sh)ENVIRONMENTfailure, re-queues the task (evaluating -> queued) without incrementing retry countfailure_mode=ENVIRONMENT,retry_preserved=trueVerification
.shfiles pass ShellCheck with zero violationsevaluating:queuedis an existing valid state transitionfailed:queuedis also valid (used by existing recovery paths)Ref #1664
Summary by CodeRabbit
Release Notes
New Features
Improvements