t1190: Fix worker_never_started:no_sentinel dispatch failures #1981

marcusquinn merged 2 commits into main from
Conversation
… and wrapper sentinel (t1190)

Root cause: dispatch/wrapper scripts used fixed filenames (e.g., t001-dispatch.sh), causing a race condition where a second dispatch overwrites the scripts before the first wrapper process reads them. The first wrapper then executes the new dispatch script, which writes WORKER_STARTED to a different log file, leaving the original log with only the metadata header (no WORKER_STARTED → no_sentinel failure).

Fixes:
1. Timestamped dispatch/wrapper filenames (t001-dispatch-20260213142302.sh) prevent the overwrite race — each dispatch gets unique scripts that cannot be clobbered.
2. WRAPPER_STARTED sentinel written to the log before running the dispatch script, making it possible to distinguish 'wrapper never ran' from 'dispatch exec failed'.
3. Wrapper stderr redirected to the log file (was /dev/null) so wrapper startup errors (bash not found, permission denied) are captured for diagnosis.
4. evaluate.sh: improved no_sentinel sub-classification using the WRAPPER_STARTED signal.
5. cleanup.sh: remove timestamped scripts when a task is cleaned up.

Chose timestamped filenames over temp files (mktemp) to preserve debuggability — the timestamp matches the log file timestamp, making correlation easy.
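The timestamp scheme in fix 1 can be sketched as follows. This is a hedged illustration, not the actual dispatch code: `task_id`, `SUPERVISOR_DIR`, and the `pids/` layout are assumptions made for the example.

```shell
# Sketch of timestamped script naming (hypothetical names throughout).
task_id="t001"
SUPERVISOR_DIR="${SUPERVISOR_DIR:-$(mktemp -d)}"
mkdir -p "${SUPERVISOR_DIR}/pids"

# Second-resolution timestamp, e.g. 20260213142302, matching the log naming.
ts="$(date +%Y%m%d%H%M%S)"
dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch-${ts}.sh"

# A concurrent dispatch computes its own ${ts}, so it writes a different
# file instead of clobbering this one before the wrapper reads it.
printf '#!/usr/bin/env bash\necho "worker would start here"\n' > "${dispatch_script}"
chmod +x "${dispatch_script}"
echo "${dispatch_script}"
```

One trade-off of second-resolution timestamps is that two dispatches of the same task within the same second would still collide; per the commit message, the PR accepts this in exchange for easy log correlation, which mktemp-style random names would lose.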
Walkthrough

The supervisor scripts receive enhancements for reliability and observability: timestamped dispatch and wrapper scripts prevent race conditions, startup sentinels (WRAPPER_STARTED) enable fine-grained diagnostics, improved logging routes stderr to task log files, and a new helper function standardizes wrapper script launches with consistent process isolation.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Dispatcher
    participant WrapperScript as Wrapper Script<br/>(timestamped)
    participant DispatchScript as Dispatch Script<br/>(timestamped)
    participant WorkerProcess as Worker Process
    participant TaskLog as Task Log
    participant Evaluator
    Dispatcher->>Dispatcher: Generate timestamp
    Dispatcher->>WrapperScript: Create with timestamp suffix
    Dispatcher->>DispatchScript: Create with timestamp suffix
    Dispatcher->>WrapperScript: Launch via _launch_wrapper_script
    WrapperScript->>TaskLog: Write WRAPPER_STARTED sentinel
    WrapperScript->>DispatchScript: Execute with setsid/nohup
    DispatchScript->>TaskLog: Write START and dispatch logs
    DispatchScript->>WorkerProcess: Launch worker
    WorkerProcess->>TaskLog: Write WORKER_STARTED sentinel
    WorkerProcess->>TaskLog: Execute and log results
    Evaluator->>TaskLog: Extract log metadata
    Evaluator->>Evaluator: Check WRAPPER_STARTED presence
    Evaluator->>Evaluator: Check WORKER_STARTED presence
    Evaluator->>Evaluator: Distinguish failure modes
    Evaluator->>Evaluator: Report diagnostics
```
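The wrapper steps in the diagram above — sentinel first, then the dispatch script, with stderr kept in the log rather than /dev/null — can be sketched minimally. The sentinel names match the PR; the file handling and field layout here are illustrative assumptions.

```shell
# Hypothetical stand-ins for the generated wrapper/dispatch pair.
log_file="$(mktemp)"
dispatch_script="$(mktemp)"
printf 'echo "WORKER_STARTED pid=$$"\n' > "${dispatch_script}"

{
  # Sentinel first: if WRAPPER_STARTED is present but WORKER_STARTED is not,
  # the wrapper ran and the dispatch script itself failed to start.
  echo "WRAPPER_STARTED wrapper_pid=$$ ts=$(date -u +%FT%TZ)"
  bash "${dispatch_script}"
} >> "${log_file}" 2>&1   # stderr goes to the log, not /dev/null

grep -c 'STARTED' "${log_file}"   # prints 2 (both sentinels landed in the log)
```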
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics
Generated on: Thu Feb 19 22:52:50 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Cancelled tasks are administrative cleanup actions (orphaned tasks, superseded work, cross-repo misregistration cleanup) — not worker failures. Including them in the failure count inflated the 7-day failure rate from <1% to 11%, causing false alarms and masking the true worker reliability signal.

Root cause analysis (t1248):
- Feb 12: 5 hung workers (t302, t303, t311.2x2, t311.3) — all hit the 1800s default timeout because their ~1h estimates weren't parsed. Fixed by t1199 (estimate-based timeout) and t1222 (graceful SIGTERM at 50% timeout).
- Feb 13: 12 no_sentinel failures (t1010, t1030, t1032.1x4, t1032.2x2) at 14:40-14:56 UTC — caused by a race condition where concurrent dispatches overwrote fixed-filename wrapper scripts. Fixed by t1190 (timestamped filenames, WRAPPER_STARTED sentinel) — PR #1981 pending merge.
- Feb 18-19: 67 cancellations (orphaned DB tasks, superseded tasks, cross-repo misregistration cleanup via t1237) — these are intentional, not failures.

Fix: split 'failed' and 'cancelled' into separate metrics. The success rate now only counts status='failed' in the denominator, giving an accurate signal. Cancelled tasks get their own row for visibility without distorting the rate.

True 7-day worker failure rate: 2/475 = 0.4% (not 11%)
Pattern tracker overall rate: 94% (977/1037) — unchanged, counts retries
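The failed/cancelled split can be illustrated with a toy status log. The real metrics come from the task database, so the file format, field layout, and variable names here are all invented for the sketch.

```shell
# Toy status log: one "task_id status" line per task (invented format).
status_log="$(mktemp)"
cat > "${status_log}" <<'EOF'
t001 done
t002 failed
t003 cancelled
t004 done
EOF

done_n=$(grep -c ' done$' "${status_log}")
failed_n=$(grep -c ' failed$' "${status_log}")
cancelled_n=$(grep -c ' cancelled$' "${status_log}")

# Only status='failed' enters the rate; cancelled gets its own row
# so administrative cleanup no longer distorts the reliability signal.
echo "failed_rate=$(( failed_n * 100 / (done_n + failed_n) ))%"
echo "cancelled=${cancelled_n}"
```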
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical race condition in the supervisor's worker dispatch mechanism, which previously caused 'worker_never_started:no_sentinel' failures. The changes enhance the robustness of worker startup by ensuring unique script execution and provide significantly improved diagnostic information when workers fail to initialize. This leads to more reliable task execution and easier debugging of startup issues.
Code Review
This pull request effectively addresses a race condition in script dispatching by using timestamped filenames, and significantly improves error diagnostics by adding a WRAPPER_STARTED sentinel and redirecting wrapper script stderr to log files. The changes are logical and well-commented. My review includes a few suggestions to improve code consistency, adhere to the repository's shell scripting style guide, and reduce code duplication for better long-term maintainability.
…1983) Cancelled tasks are administrative cleanup actions (orphaned tasks, superseded work, cross-repo misregistration cleanup) — not worker failures.
- cleanup.sh: normalize glob patterns ([0-9]* -> *) for reprompt/prompt-repeat scripts; remove redundant 2>/dev/null from rm -f (already guarded by [[ -f ]])
- dispatch.sh: extract duplicated setsid/nohup wrapper launch into _launch_wrapper_script() helper to eliminate code duplication
- evaluate.sh: remove blanket 2>/dev/null suppression from grep calls per style guide (log file existence already checked before these calls)
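The no_sentinel sub-classification that the WRAPPER_STARTED sentinel enables in evaluate.sh might look like the sketch below. Only the sentinel strings come from the PR; the function name and surrounding logic are invented for illustration.

```shell
# Hypothetical classifier: which startup stage did the task log reach?
classify_startup_failure() {
  local tlog="$1"
  if grep -q 'WORKER_STARTED' "${tlog}"; then
    echo "worker_started"          # worker launched; not a startup failure
  elif grep -q 'WRAPPER_STARTED' "${tlog}"; then
    echo "dispatch_exec_failed"    # wrapper ran, dispatch script did not
  else
    echo "no_sentinel"             # wrapper itself never ran
  fi
}

log="$(mktemp)"
echo "WRAPPER_STARTED wrapper_pid=123" > "${log}"
classify_startup_failure "${log}"   # prints dispatch_exec_failed
```

Before this PR, only the WORKER_STARTED branch existed, so both remaining cases collapsed into the ambiguous no_sentinel outcome.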
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics
Generated on: Thu Feb 19 23:36:22 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/evaluate.sh (1)
1028-1043: ⚠️ Potential issue | 🟡 Minor

The startup_error grep may false-match previous-run errors from the REPROMPT METADATA header.

The startup_error extraction at Line 1031 scans the entire log. On reprompt retries the log file is pre-seeded with a === REPROMPT METADATA === header containing previous_error=${terror:-none}. If terror from a prior run contained WORKER_FAILED: or command not found, the grep will match that stale header entry — producing a misleading failed:worker_never_started:previous_error=WORKER_FAILED_... detail instead of no_sentinel or dispatch_exec_failed.

This doesn't break the routing pipeline (outcome remains in the worker_never_started family), but it degrades diagnostic accuracy and could misattribute the cause of the new failure to the old one.

🛡️ Proposed fix — exclude header lines from startup_error search
```diff
-    startup_error=$(grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' "$tlog" | head -1 | head -c 200 || echo "")
+    # Skip REPROMPT METADATA header lines (previous_error= could match stale errors)
+    startup_error=$(grep -v '^=== \(REPROMPT METADATA\|END REPROMPT METADATA\)\|^task_id=\|^timestamp=\|^retry=\|^work_dir=\|^previous_error=\|^fresh_worktree=\|^strategy=' "$tlog" \
+        | grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' \
+        | head -1 | head -c 200 || echo "")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 1028 - 1043, The grep used to set startup_error scans the whole log and can match stale entries in the REPROMPT METADATA header; change the extraction to exclude header lines by piping the log through a filter before matching (e.g., cat "$tlog" | grep -v '^=== REPROMPT METADATA ===' -A0 -B9999 or simply grep -v 'previous_error=' | grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' | head -1 | head -c 200) so the startup_error assignment (variable startup_error, using "$tlog") ignores metadata header lines and only considers actual runtime log lines. Ensure the rest of the logic (the WRAPPER_STARTED check and the echo branches) remains unchanged.
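As a rough, self-contained illustration of the filtering idea suggested above — drop metadata lines before grepping for startup errors — the following sketch uses an invented log layout; only the sentinel-header strings echo the PR.

```shell
# Toy task log pre-seeded with a stale previous_error in its header.
tlog="$(mktemp)"
cat > "${tlog}" <<'EOF'
=== REPROMPT METADATA ===
previous_error=WORKER_FAILED: old crash
=== END REPROMPT METADATA ===
bash: /usr/local/bin/worker: Permission denied
EOF

# Filter header lines first, then look for genuine startup errors.
startup_error=$(grep -v '^previous_error=\|^=== ' "${tlog}" \
  | grep -i 'WORKER_FAILED\|command not found\|Permission denied' \
  | head -1 | head -c 200 || echo "")
echo "${startup_error}"   # prints the runtime error, not the stale header entry
```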
🧹 Nitpick comments (2)
.agents/scripts/supervisor/dispatch.sh (1)
2725-2741: _launch_wrapper_script relies on implicit $! propagation — fragile contract worth hardening.

Bash functions run inline (not as subprocesses), but running a function in the background or as part of a pipeline forces it to run as a subprocess. When called normally, $! in the caller correctly reflects the last background process started inside the function. return 0 only sets $?, not $!, so the current local worker_pid=$! at Line 2708 captures the correct PID.

The fragility: any future addition of another background command (&) inside _launch_wrapper_script — even a log write or heartbeat — would silently break worker_pid. Making the PID return explicit costs two lines and removes the implicit contract entirely.

♻️ Explicit PID return via stdout
```diff
 _launch_wrapper_script() {
     local wrapper_script="$1"
     local log_file="$2"
     if command -v setsid &>/dev/null; then
         nohup setsid bash "${wrapper_script}" >>"${log_file}" 2>&1 &
     else
         nohup bash "${wrapper_script}" >>"${log_file}" 2>&1 &
     fi
-    return 0
+    echo "$!"
 }
```

Caller update (Line 2708):

```diff
-_launch_wrapper_script "${wrapper_script}" "${log_file}"
-local worker_pid=$!
+local worker_pid
+worker_pid=$(_launch_wrapper_script "${wrapper_script}" "${log_file}")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/dispatch.sh around lines 2725 - 2741, The helper _launch_wrapper_script currently relies on the caller reading $! externally which is fragile; change it to explicitly emit the launched wrapper PID on stdout so callers can capture it deterministically (e.g., have _launch_wrapper_script print the PID of the backgrounded nohup/setsid process before returning), then update callers that set local worker_pid=$! to instead capture that stdout (e.g., worker_pid="$( _launch_wrapper_script ... )"); ensure the function still redirects logs and returns non-zero on failure while printing only the PID on success so downstream parsing remains stable.

.agents/scripts/supervisor/cleanup.sh (1)
287-298: Redundant existence guard — rm -f already handles missing files gracefully.

[[ -f "$script" ]] && rm -f "$script" || true is more complex than it needs to be. The -f flag to rm suppresses "no such file or directory" errors and exits 0 on missing operands, so the guard is redundant. When no glob matches, bash passes the literal pattern as $script, and rm -f on that non-existent path exits 0 without error.

♻️ Simplified cleanup loop
```diff
-    local script
-    for script in "$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh \
-        "$SUPERVISOR_DIR/pids/${task_id}"-wrapper-*.sh \
-        "$SUPERVISOR_DIR/pids/${task_id}"-reprompt-*.sh \
-        "$SUPERVISOR_DIR/pids/${task_id}"-reprompt-wrapper-*.sh \
-        "$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-*.sh \
-        "$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-wrapper-*.sh; do
-        [[ -f "$script" ]] && rm -f "$script" || true
-    done
+    rm -f \
+        "$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh \
+        "$SUPERVISOR_DIR/pids/${task_id}"-wrapper-*.sh \
+        "$SUPERVISOR_DIR/pids/${task_id}"-reprompt-*.sh \
+        "$SUPERVISOR_DIR/pids/${task_id}"-reprompt-wrapper-*.sh \
+        "$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-*.sh \
+        "$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-wrapper-*.sh \
+        2>/dev/null || true
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/cleanup.sh around lines 287 - 298, The loop over files using the variable script (for script in "$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh ... ) has a redundant existence guard ([[ -f "$script" ]] && rm -f "$script" || true); simplify it by removing the conditional and calling rm -f "$script" directly inside the loop so that the cleanup uses rm -f "$script" for each glob-expanded entry (referencing SUPERVISOR_DIR, task_id and the script variable) and thereby reduce complexity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1128-1132: The inline daemon-launch block duplicated in
do_prompt_repeat and cmd_reprompt should be replaced with a call to the existing
helper _launch_wrapper_script to complete the DRY refactor: locate the inline if
command -v setsid ... nohup ... fi blocks inside the functions do_prompt_repeat
and cmd_reprompt and remove them, then invoke _launch_wrapper_script passing the
same wrapper_script and new_log_file variables (or their local equivalents) so
both functions use the centralized launch logic implemented for cmd_dispatch.
- Line 1076: The generated wrapper sentinel currently writes an inner echo that
uses single quotes so wrapper_pid=$$ is literal; update the three sentinel
emission sites that produce the line like echo "echo 'WRAPPER_STARTED
task_id=${task_id} strategy=prompt_repeat wrapper_pid=\$\$ ...' >>
'${new_log_file}'" to instead produce a fully double-quoted inner echo so $$
expands at runtime (i.e., make the inner echo use "WRAPPER_STARTED ...
wrapper_pid=$$ ..." rather than single quotes); locate the three occurrences in
.agents/scripts/supervisor/dispatch.sh that emit the WRAPPER_STARTED sentinel
(the same pattern at the lines noted in the review) and change the generated
echo quoting accordingly so the wrapper PID is recorded.
---
Outside diff comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1028-1043: The grep used to set startup_error scans the whole log
and can match stale entries in the REPROMPT METADATA header; change the
extraction to exclude header lines by piping the log through a filter before
matching (e.g., cat "$tlog" | grep -v '^=== REPROMPT METADATA ===' -A0 -B9999 or
simply grep -v 'previous_error=' | grep -i
'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such
file\|Permission denied' | head -1 | head -c 200) so the startup_error
assignment (variable startup_error, using "$tlog") ignores metadata header lines
and only considers actual runtime log lines. Ensure the rest of the logic (the
WRAPPER_STARTED check and the echo branches) remains unchanged.
---
Nitpick comments:
In @.agents/scripts/supervisor/cleanup.sh:
- Around line 287-298: The loop over files using the variable script (for script
in "$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh ... ) has a redundant
existence guard ([[ -f "$script" ]] && rm -f "$script" || true); simplify it by
removing the conditional and calling rm -f "$script" directly inside the loop so
that the cleanup uses rm -f "$script" for each glob-expanded entry (referencing
SUPERVISOR_DIR, task_id and the script variable) and thereby reduce complexity.
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 2725-2741: The helper _launch_wrapper_script currently relies on
the caller reading $! externally which is fragile; change it to explicitly emit
the launched wrapper PID on stdout so callers can capture it deterministically
(e.g., have _launch_wrapper_script print the PID of the backgrounded
nohup/setsid process before returning), then update callers that set local
worker_pid=$! to instead capture that stdout (e.g., worker_pid="$(
_launch_wrapper_script ... )"); ensure the function still redirects logs and
returns non-zero on failure while printing only the PID on success so downstream
parsing remains stable.
Auto-dismissed: bot review does not block autonomous pipeline



WIP - incremental commits
Root cause: dispatch/wrapper scripts used fixed filenames, causing a race condition where a second dispatch overwrites the scripts before the first wrapper process reads them.
Fixes:
Ref #1799