t1190: Fix worker_never_started:no_sentinel dispatch failures #1981

Merged
marcusquinn merged 2 commits into main from feature/t1190 on Feb 20, 2026

Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 19, 2026

WIP - incremental commits

Root cause: dispatch/wrapper scripts used fixed filenames, causing a race condition where a second dispatch overwrites the scripts before the first wrapper process reads them.

Fixes:

  • Timestamped dispatch/wrapper filenames prevent overwrite race condition
  • WRAPPER_STARTED sentinel distinguishes 'wrapper never ran' vs 'dispatch exec failed'
  • Wrapper stderr redirected to log file (was /dev/null) for diagnosis
  • evaluate.sh: improved no_sentinel sub-classification
  • cleanup.sh: remove timestamped scripts on task cleanup

Ref #1799
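
The overwrite race described above can be reproduced in miniature. The sketch below is an illustration only: the directory, the second timestamp, and the script contents are made up, and the two writes happen sequentially rather than from concurrent dispatches.

```shell
#!/usr/bin/env bash
# Sketch of the overwrite race; all paths and script bodies are illustrative.
set -euo pipefail

pids_dir="$(mktemp -d)"
task_id="t001"

# Fixed filename: a second dispatch clobbers the script before the first
# wrapper has read it, so the first wrapper runs the second dispatch's commands.
fixed="${pids_dir}/${task_id}-dispatch.sh"
echo "echo first"  >"${fixed}"      # dispatch 1 writes its script
echo "echo second" >"${fixed}"      # dispatch 2 overwrites it
fixed_result="$(bash "${fixed}")"   # wrapper 1 finally reads the file

# Timestamped filenames: each dispatch owns its own script.
first="${pids_dir}/${task_id}-dispatch-20260213142302.sh"
second="${pids_dir}/${task_id}-dispatch-20260213142305.sh"   # hypothetical later dispatch
echo "echo first"  >"${first}"
echo "echo second" >"${second}"
stamped_result="$(bash "${first}")"

echo "fixed name ran:       ${fixed_result}"
echo "timestamped name ran: ${stamped_result}"
rm -rf "${pids_dir}"
```

With a second-resolution timestamp, uniqueness still assumes that two dispatches of the same task never land within the same second.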

Summary by CodeRabbit

  • Bug Fixes
    • Fixed race conditions occurring during simultaneous task dispatches by implementing timestamped script identifiers.
    • Enhanced startup failure detection with improved diagnostics distinguishing between wrapper initialization and execution errors.
    • Improved error logging and diagnostics to better surface issues during task execution and startup phases.
    • Enhanced process cleanup to remove residual timestamped task-related scripts.

… and wrapper sentinel (t1190)

Root cause: dispatch/wrapper scripts used fixed filenames (e.g., t001-dispatch.sh),
causing a race condition where a second dispatch overwrites the scripts before the
first wrapper process reads them. The first wrapper then executes the new dispatch
script which writes WORKER_STARTED to a different log file, leaving the original
log with only the metadata header (no WORKER_STARTED → no_sentinel failure).

Fixes:
1. Timestamped dispatch/wrapper filenames (t001-dispatch-20260213142302.sh) prevent
   overwrite race — each dispatch gets unique scripts that cannot be clobbered.
2. WRAPPER_STARTED sentinel written to log before running dispatch script, making it
   possible to distinguish 'wrapper never ran' from 'dispatch exec failed'.
3. Wrapper stderr redirected to log file (was /dev/null) so wrapper startup errors
   (bash not found, permission denied) are captured for diagnosis.
4. evaluate.sh: improved no_sentinel sub-classification using WRAPPER_STARTED signal.
5. cleanup.sh: remove timestamped scripts when task is cleaned up.

Chose timestamped filenames over temp files (mktemp) to preserve debuggability —
the timestamp matches the log file timestamp, making correlation easy.
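
A minimal sketch of fixes 1 to 3 taken together, assuming a simplified layout: the temporary directory, the sentinel fields, and the stand-in dispatch script are hypothetical and are not the contents of dispatch.sh.

```shell
#!/usr/bin/env bash
# Sketch only: file layout and sentinel fields are assumptions.
set -euo pipefail

work_dir="$(mktemp -d)"
task_id="t001"
ts="$(date +%Y%m%d%H%M%S)"   # one timestamp shared by log and scripts
log_file="${work_dir}/${task_id}-${ts}.log"
dispatch_script="${work_dir}/${task_id}-dispatch-${ts}.sh"
wrapper_script="${work_dir}/${task_id}-wrapper-${ts}.sh"

# Dispatch script: stands in for the real worker launch.
cat >"${dispatch_script}" <<EOF
#!/usr/bin/env bash
echo "WORKER_STARTED task_id=${task_id}" >>"${log_file}"
EOF

# Wrapper script: records that it ran before handing off to the dispatch script.
cat >"${wrapper_script}" <<EOF
#!/usr/bin/env bash
echo "WRAPPER_STARTED task_id=${task_id}" >>"${log_file}"
bash "${dispatch_script}"
EOF

# stdout and stderr both go to the task log, so wrapper startup errors are kept.
nohup bash "${wrapper_script}" >>"${log_file}" 2>&1 &
wait "$!"

log_contents="$(cat "${log_file}")"
echo "${log_contents}"
rm -rf "${work_dir}"
```

Because the scripts and the log share one timestamp, a surviving script can be matched to its log by name.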
@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Walkthrough

The supervisor scripts receive enhancements for reliability and observability: timestamped dispatch and wrapper scripts prevent race conditions, startup sentinels (WRAPPER_STARTED) enable fine-grained diagnostics, improved logging routes stderr to task log files, and a new helper function standardizes wrapper script launches with consistent process isolation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Timestamped Script Management: .agents/scripts/supervisor/cleanup.sh | Extends cleanup_worker_processes to remove timestamped dispatch, wrapper, reprompt, and prompt-repeat scripts from task's pids directory via conditional file deletion loop. |
| Dispatch and Launch Enhancement: .agents/scripts/supervisor/dispatch.sh | Introduces timestamped filenames for dispatch and wrapper scripts; adds WRAPPER_STARTED startup sentinel and dispatch-level log output; creates _launch_wrapper_script helper for consistent resilient background launches; redirects stderr to task log files for improved traceability across prompt-repeat, reprompt, and standard dispatch paths. |
| Startup Diagnostics: .agents/scripts/supervisor/evaluate.sh | Adds wrapper_started detection in log metadata extraction; expands startup failure diagnostics in evaluate_worker to distinguish three cases: startup_error present, WRAPPER_STARTED present but no WORKER_STARTED (dispatch_exec_failed), or neither (no_sentinel); improves error message capture by removing stderr suppression. |
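
The three-way startup diagnosis summarized for evaluate.sh can be sketched as follows. The function name, the returned labels for the first branch, and the sample log lines are hypothetical; only the sentinel names and the dispatch_exec_failed / no_sentinel sub-classes come from the PR.

```shell
#!/usr/bin/env bash
# Sketch of the classification for a log that lacks WORKER_STARTED.
set -euo pipefail

classify_startup_failure() {
	local log_file="$1"
	local startup_error
	# First captured startup error, if any (empty string when none is found).
	startup_error="$(grep -E -i -m1 'command not found|No such file|Permission denied' "${log_file}" || true)"
	if [[ -n "${startup_error}" ]]; then
		echo "startup_error"
	elif grep -q 'WRAPPER_STARTED' "${log_file}"; then
		# Wrapper ran, but the dispatch script never reached the worker.
		echo "dispatch_exec_failed"
	else
		# Not even the wrapper left a trace.
		echo "no_sentinel"
	fi
}

tmp="$(mktemp -d)"
printf 'task_id=t001\n' >"${tmp}/none.log"
printf 'task_id=t001\nWRAPPER_STARTED task_id=t001\n' >"${tmp}/wrapper.log"
printf 'task_id=t001\nbash: worker-cli: command not found\n' >"${tmp}/error.log"

none_class="$(classify_startup_failure "${tmp}/none.log")"
wrapper_class="$(classify_startup_failure "${tmp}/wrapper.log")"
error_class="$(classify_startup_failure "${tmp}/error.log")"

echo "header only:       ${none_class}"
echo "wrapper sentinel:  ${wrapper_class}"
echo "captured error:    ${error_class}"
rm -rf "${tmp}"
```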

Sequence Diagram

sequenceDiagram
    participant Dispatcher
    participant WrapperScript as Wrapper Script<br/>(timestamped)
    participant DispatchScript as Dispatch Script<br/>(timestamped)
    participant WorkerProcess as Worker Process
    participant TaskLog as Task Log
    participant Evaluator

    Dispatcher->>Dispatcher: Generate timestamp
    Dispatcher->>WrapperScript: Create with timestamp suffix
    Dispatcher->>DispatchScript: Create with timestamp suffix
    Dispatcher->>WrapperScript: Launch via _launch_wrapper_script
    WrapperScript->>TaskLog: Write WRAPPER_STARTED sentinel
    WrapperScript->>DispatchScript: Execute with setsid/nohup
    DispatchScript->>TaskLog: Write START and dispatch logs
    DispatchScript->>WorkerProcess: Launch worker
    WorkerProcess->>TaskLog: Write WORKER_STARTED sentinel
    WorkerProcess->>TaskLog: Execute and log results
    Evaluator->>TaskLog: Extract log metadata
    Evaluator->>Evaluator: Check WRAPPER_STARTED presence
    Evaluator->>Evaluator: Check WORKER_STARTED presence
    Evaluator->>Evaluator: Distinguish failure modes
    Evaluator->>Evaluator: Report diagnostics

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

⏰ Timestamps now guard against racing time,
Sentinels light the way through startup's climb,
Wrapper speaks first, then worker takes the stage,
Logs reveal the truth on every page,
Reliability scripted, one shell at a time! 🚀

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically identifies the fix for a known issue (t1190) addressing a precise failure classification problem (worker_never_started:no_sentinel), making it immediately clear to reviewers what bug is being resolved. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 22:52:47 UTC 2026: Code review monitoring started
Thu Feb 19 22:52:48 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 22:52:50 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

marcusquinn added a commit that referenced this pull request Feb 19, 2026
Cancelled tasks are administrative cleanup actions (orphaned tasks, superseded
work, cross-repo misregistration cleanup) — not worker failures. Including them
in the failure count inflated the 7-day failure rate from <1% to 11%, causing
false alarms and masking the true worker reliability signal.

Root cause analysis (t1248):
- Feb 12: 5 hung workers (t302, t303, t311.2x2, t311.3) — all hit the 1800s
  default timeout because their ~1h estimates weren't parsed. Fixed by t1199
  (estimate-based timeout) and t1222 (graceful SIGTERM at 50% timeout).
- Feb 13: 12 no_sentinel failures (t1010, t1030, t1032.1x4, t1032.2x2) at
  14:40-14:56 UTC — caused by a race condition where concurrent dispatches
  overwrote fixed-filename wrapper scripts. Fixed by t1190 (timestamped
  filenames, WRAPPER_STARTED sentinel) — PR #1981 pending merge.
- Feb 18-19: 67 cancellations (orphaned DB tasks, superseded tasks, cross-repo
  misregistration cleanup via t1237) — these are intentional, not failures.

Fix: split 'failed' and 'cancelled' into separate metrics. Success rate now
only counts status='failed' in the denominator, giving an accurate signal.
Cancelled tasks get their own row for visibility without distorting the rate.

True 7-day worker failure rate: 2/475 = 0.4% (not 11%)
Pattern tracker overall rate: 94% (977/1037) — unchanged, counts retries
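
The two rates quoted above can be re-derived with a short calculation. This is a sketch of the arithmetic only; the helper name is hypothetical and the counts are the ones stated in the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Percentage of part in whole, rounded to one decimal place.
rate() {
	LC_ALL=C awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f", 100 * part / whole }'
}

failure_rate="$(rate 2 475)"      # failed tasks only, cancellations excluded
pattern_rate="$(rate 977 1037)"   # pattern tracker figure, which counts retries

echo "7-day worker failure rate: ${failure_rate}%"
echo "pattern tracker rate:      ${pattern_rate}%"
```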
@marcusquinn marcusquinn marked this pull request as ready for review February 19, 2026 23:15
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical race condition in the supervisor's worker dispatch mechanism, which previously caused 'worker_never_started:no_sentinel' failures. The changes enhance the robustness of worker startup by ensuring unique script execution and provide significantly improved diagnostic information when workers fail to initialize. This leads to more reliable task execution and easier debugging of startup issues.

Highlights

  • Race Condition Fix: Implemented timestamped filenames for dispatch and wrapper scripts to prevent overwrites during rapid dispatches, resolving a critical race condition.
  • Improved Diagnostics: Introduced a 'WRAPPER_STARTED' sentinel to distinguish between cases where the wrapper process never ran and where the dispatch script failed to execute, providing clearer failure insights.
  • Enhanced Error Logging: Redirected wrapper script stderr to log files instead of /dev/null, capturing startup errors for better diagnosis of worker failures.
  • Refined Failure Classification: Updated 'evaluate.sh' to provide more granular sub-classifications for 'worker_never_started' failures, differentiating between 'dispatch_exec_failed' and 'no_sentinel'.
  • Cleanup Mechanism: Added logic to 'cleanup.sh' to remove the newly introduced timestamped dispatch and wrapper scripts, ensuring proper system hygiene.
Changelog
  • .agents/scripts/supervisor/cleanup.sh
    • Added logic to remove timestamped dispatch, wrapper, reprompt, and prompt-repeat scripts associated with a task during cleanup.
  • .agents/scripts/supervisor/dispatch.sh
    • Modified 'do_prompt_repeat' to use timestamped filenames for 'prompt-repeat' dispatch and wrapper scripts.
    • Added a 'WRAPPER_STARTED' sentinel echo to the 'prompt-repeat' wrapper script.
    • Changed 'nohup' commands in 'do_prompt_repeat' to redirect wrapper stderr to 'new_log_file'.
    • Modified 'cmd_dispatch' to use timestamped filenames for main 'dispatch' and 'wrapper' scripts.
    • Added a 'WRAPPER_STARTED' sentinel echo to the main 'wrapper' script.
    • Changed 'nohup' commands in 'cmd_dispatch' to redirect wrapper stderr to 'log_file'.
    • Modified 'do_reprompt' to use timestamped filenames for 'reprompt' dispatch and wrapper scripts.
    • Added a 'WRAPPER_STARTED' sentinel echo to the 'reprompt' wrapper script.
    • Changed 'nohup' commands in 'do_reprompt' to redirect wrapper stderr to 'new_log_file'.
  • .agents/scripts/supervisor/evaluate.sh
    • Added 'wrapper_started' metadata extraction based on the 'WRAPPER_STARTED' sentinel.
    • Updated the 'evaluate_worker' function to differentiate 'worker_never_started' into 'dispatch_exec_failed' (if 'WRAPPER_STARTED' is present but 'WORKER_STARTED' is not) and 'no_sentinel' (if neither started).

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a race condition in script dispatching by using timestamped filenames, and significantly improves error diagnostics by adding a WRAPPER_STARTED sentinel and redirecting wrapper script stderr to log files. The changes are logical and well-commented. My review includes a few suggestions to improve code consistency, adhere to the repository's shell scripting style guide, and reduce code duplication for better long-term maintainability.

- cleanup.sh: normalize glob patterns ([0-9]* -> *) for reprompt/prompt-repeat
  scripts; remove redundant 2>/dev/null from rm -f (already guarded by [[ -f ]])
- dispatch.sh: extract duplicated setsid/nohup wrapper launch into
  _launch_wrapper_script() helper to eliminate code duplication
- evaluate.sh: remove blanket 2>/dev/null suppression from grep calls per
  style guide (log file existence already checked before these calls)
@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 23:36:19 UTC 2026: Code review monitoring started
Thu Feb 19 23:36:19 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 23:36:22 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

coderabbitai[bot]
coderabbitai bot previously requested changes Feb 19, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/evaluate.sh (1)

1028-1043: ⚠️ Potential issue | 🟡 Minor

startup_error grep may false-match previous-run errors from the REPROMPT METADATA header.

The startup_error extraction at Line 1031 scans the entire log. On reprompt retries the log file is pre-seeded with a === REPROMPT METADATA === header containing previous_error=${terror:-none}. If terror from a prior run contained WORKER_FAILED: or command not found, the grep will match that stale header entry — producing a misleading failed:worker_never_started:previous_error=WORKER_FAILED_... detail instead of no_sentinel or dispatch_exec_failed.

This doesn't break the routing pipeline (outcome remains in the worker_never_started family), but it degrades diagnostic accuracy and could misattribute the cause of the new failure to the old one.

🛡️ Proposed fix — exclude header lines from startup_error search
-		startup_error=$(grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' "$tlog" | head -1 | head -c 200 || echo "")
+		# Skip REPROMPT METADATA header lines (previous_error= could match stale errors)
+		startup_error=$(grep -v '^=== \(REPROMPT METADATA\|END REPROMPT METADATA\)\|^task_id=\|^timestamp=\|^retry=\|^work_dir=\|^previous_error=\|^fresh_worktree=\|^strategy=' "$tlog" \
+			| grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' \
+			| head -1 | head -c 200 || echo "")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1028 - 1043, The grep
used to set startup_error scans the whole log and can match stale entries in the
REPROMPT METADATA header; change the extraction to exclude header lines by
piping the log through a filter before matching (e.g., cat "$tlog" | grep -v
'^=== REPROMPT METADATA ===' -A0 -B9999 or simply grep -v 'previous_error=' |
grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such
file\|Permission denied' | head -1 | head -c 200) so the startup_error
assignment (variable startup_error, using "$tlog") ignores metadata header lines
and only considers actual runtime log lines. Ensure the rest of the logic (the
WRAPPER_STARTED check and the echo branches) remains unchanged.
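
The stale-header false match can be demonstrated on a synthetic log. This is a sketch under assumptions: the header field values are invented, and the filter drops only previous_error= lines rather than the full header, which is the simpler of the two variants mentioned above.

```shell
#!/usr/bin/env bash
set -euo pipefail

tlog="$(mktemp)"
# A reprompt log whose header carries the previous run's error, while the new
# run produced no output at all. Field values are made up for illustration.
cat >"${tlog}" <<'EOF'
=== REPROMPT METADATA ===
task_id=t001
retry=1
previous_error=WORKER_FAILED: exit 1
=== END REPROMPT METADATA ===
EOF

pattern='WORKER_FAILED|WORKER_DISPATCH_ERROR|command not found|No such file|Permission denied'

# Whole-log scan: picks up the stale header entry.
naive="$(grep -E -i "${pattern}" "${tlog}" | head -1 || true)"

# Header-aware scan: drop previous_error= lines before matching.
filtered="$(grep -v '^previous_error=' "${tlog}" | grep -E -i "${pattern}" | head -1 || true)"

echo "naive:    '${naive}'"
echo "filtered: '${filtered}'"
rm -f "${tlog}"
```

With the filter in place the result is empty, so the evaluator would fall through to the sentinel checks instead of reporting the old error.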
🧹 Nitpick comments (2)
.agents/scripts/supervisor/dispatch.sh (1)

2725-2741: _launch_wrapper_script relies on implicit $! propagation — fragile contract worth hardening.

Bash functions run inline (not as subprocesses), but running a function in the background or as part of a pipeline forces it to run as a subprocess. When called normally, $! in the caller correctly reflects the last background process started inside the function. return 0 only sets $?, not $!, so the current local worker_pid=$! at Line 2708 captures the correct PID.

The fragility: any future addition of another background command (&) inside _launch_wrapper_script — even a log write or heartbeat — would silently break worker_pid. Making the PID return explicit costs two lines and removes the implicit contract entirely.

♻️ Explicit PID return via stdout
 _launch_wrapper_script() {
 	local wrapper_script="$1"
 	local log_file="$2"
 
 	if command -v setsid &>/dev/null; then
 		nohup setsid bash "${wrapper_script}" >>"${log_file}" 2>&1 &
 	else
 		nohup bash "${wrapper_script}" >>"${log_file}" 2>&1 &
 	fi
-	return 0
+	echo "$!"
 }

Caller update (Line 2708):

-_launch_wrapper_script "${wrapper_script}" "${log_file}"
-local worker_pid=$!
+local worker_pid
+worker_pid=$(_launch_wrapper_script "${wrapper_script}" "${log_file}")
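
Both PID-passing contracts can be compared side by side in a small sketch. The function names are hypothetical and sleep stands in for the wrapper; note that the explicit variant only returns promptly because the background job's stdout is redirected, as it is in the helper above via the log file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Implicit contract: the caller reads $! after the function returns.
launch_implicit() {
	sleep 5 >/dev/null 2>&1 &
	return 0
}

# Explicit contract: the function prints the PID it started.
launch_explicit() {
	sleep 5 >/dev/null 2>&1 &
	echo "$!"
}

launch_implicit
implicit_pid=$!

explicit_pid="$(launch_explicit)"

alive=0
for pid in "${implicit_pid}" "${explicit_pid}"; do
	# kill -0 sends no signal; it only checks that the process exists.
	if kill -0 "${pid}" 2>/dev/null; then
		alive=$((alive + 1))
	fi
	kill "${pid}" 2>/dev/null || true
done
echo "started ${alive} of 2 background processes"
```

One trade-off of the explicit form: the process is started from the command-substitution subshell, so the caller can still signal it by PID but cannot `wait` on it.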
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/dispatch.sh around lines 2725 - 2741, The helper
_launch_wrapper_script currently relies on the caller reading $! externally
which is fragile; change it to explicitly emit the launched wrapper PID on
stdout so callers can capture it deterministically (e.g., have
_launch_wrapper_script print the PID of the backgrounded nohup/setsid process
before returning), then update callers that set local worker_pid=$! to instead
capture that stdout (e.g., worker_pid="$( _launch_wrapper_script ... )"); ensure
the function still redirects logs and returns non-zero on failure while printing
only the PID on success so downstream parsing remains stable.
.agents/scripts/supervisor/cleanup.sh (1)

287-298: Redundant existence guard — rm -f already handles missing files gracefully.

[[ -f "$script" ]] && rm -f "$script" || true is more complex than it needs to be. The -f flag to rm suppresses "no such file or directory" errors and exits 0 on missing operands, so the guard is redundant. When no glob matches, bash passes the literal pattern as $script, and rm -f on that non-existent path exits 0 without error.

♻️ Simplified cleanup loop
-	local script
-	for script in "$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh \
-		"$SUPERVISOR_DIR/pids/${task_id}"-wrapper-*.sh \
-		"$SUPERVISOR_DIR/pids/${task_id}"-reprompt-*.sh \
-		"$SUPERVISOR_DIR/pids/${task_id}"-reprompt-wrapper-*.sh \
-		"$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-*.sh \
-		"$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-wrapper-*.sh; do
-		[[ -f "$script" ]] && rm -f "$script" || true
-	done
+	rm -f \
+		"$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh \
+		"$SUPERVISOR_DIR/pids/${task_id}"-wrapper-*.sh \
+		"$SUPERVISOR_DIR/pids/${task_id}"-reprompt-*.sh \
+		"$SUPERVISOR_DIR/pids/${task_id}"-reprompt-wrapper-*.sh \
+		"$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-*.sh \
+		"$SUPERVISOR_DIR/pids/${task_id}"-prompt-repeat-wrapper-*.sh \
+		2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/cleanup.sh around lines 287 - 298, The loop over
files using the variable script (for script in
"$SUPERVISOR_DIR/pids/${task_id}"-dispatch-*.sh ... ) has a redundant existence
guard ([[ -f "$script" ]] && rm -f "$script" || true); simplify it by removing
the conditional and calling rm -f "$script" directly inside the loop so that the
cleanup uses rm -f "$script" for each glob-expanded entry (referencing
SUPERVISOR_DIR, task_id and the script variable) and thereby reduce complexity.
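
The claim about rm -f and unmatched globs is easy to check in isolation. The sketch below uses a throwaway directory and invented filenames, and assumes bash's default glob settings (nullglob and failglob off).

```shell
#!/usr/bin/env bash
set -euo pipefail

pids_dir="$(mktemp -d)"
task_id="t001"

# No matching files: bash passes the literal pattern through and rm -f
# ignores the missing path.
rm -f "${pids_dir}/${task_id}"-dispatch-*.sh
status_empty=$?

# Matching files are removed in the same kind of call.
touch "${pids_dir}/${task_id}-dispatch-20260213142302.sh" \
	"${pids_dir}/${task_id}-wrapper-20260213142302.sh"
rm -f "${pids_dir}/${task_id}"-dispatch-*.sh "${pids_dir}/${task_id}"-wrapper-*.sh
remaining="$(find "${pids_dir}" -type f | wc -l | tr -d ' ')"

echo "exit status with no matches: ${status_empty}"
echo "files left after cleanup:    ${remaining}"
rm -rf "${pids_dir}"
```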
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1128-1132: The inline daemon-launch block duplicated in
do_prompt_repeat and cmd_reprompt should be replaced with a call to the existing
helper _launch_wrapper_script to complete the DRY refactor: locate the inline if
command -v setsid ... nohup ... fi blocks inside the functions do_prompt_repeat
and cmd_reprompt and remove them, then invoke _launch_wrapper_script passing the
same wrapper_script and new_log_file variables (or their local equivalents) so
both functions use the centralized launch logic implemented for cmd_dispatch.
- Line 1076: The generated wrapper sentinel currently writes an inner echo that
uses single quotes so wrapper_pid=$$ is literal; update the three sentinel
emission sites that produce the line like echo "echo 'WRAPPER_STARTED
task_id=${task_id} strategy=prompt_repeat wrapper_pid=\$\$ ...' >>
'${new_log_file}'" to instead produce a fully double-quoted inner echo so $$
expands at runtime (i.e., make the inner echo use "WRAPPER_STARTED ...
wrapper_pid=$$ ..." rather than single quotes); locate the three occurrences in
.agents/scripts/supervisor/dispatch.sh that emit the WRAPPER_STARTED sentinel
(the same pattern at the lines noted in the review) and change the generated
echo quoting accordingly so the wrapper PID is recorded.
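
The quoting problem in the generated sentinel line can be reproduced with two tiny generated scripts. This is a sketch: the file names are hypothetical and the sentinel is shortened to the fields relevant here.

```shell
#!/usr/bin/env bash
set -euo pipefail

work_dir="$(mktemp -d)"
task_id="t001"

# Single-quoted inner echo: $$ reaches the log as literal text.
echo "echo 'WRAPPER_STARTED task_id=${task_id} wrapper_pid=\$\$'" >"${work_dir}/single.sh"

# Double-quoted inner echo: $$ expands when the generated wrapper runs.
echo "echo \"WRAPPER_STARTED task_id=${task_id} wrapper_pid=\$\$\"" >"${work_dir}/double.sh"

single_out="$(bash "${work_dir}/single.sh")"
double_out="$(bash "${work_dir}/double.sh")"

echo "${single_out}"
echo "${double_out}"
rm -rf "${work_dir}"
```

In both cases the outer `\$\$` keeps the generator from expanding its own PID; only the inner quoting decides whether the wrapper expands it later.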


@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 20, 2026 00:07

Auto-dismissed: bot review does not block autonomous pipeline

@marcusquinn marcusquinn merged commit 4dae331 into main Feb 20, 2026
27 checks passed
@marcusquinn marcusquinn deleted the feature/t1190 branch February 20, 2026 00:27