t1031: Modularize supervisor-helper.sh into domain modules#1359
marcusquinn merged 2 commits into main from
Conversation
…to 13 modules (t1031)

Moved real implementations from the 16K-line monolith into existing stub modules under `.agents/scripts/supervisor/`. The monolith now retains only: `main()`, `show_usage()`, `cmd_contest()`, and the case-statement router (~680 lines).

Module mapping:

- database.sh: DB init, migrate, backup, restore (9 functions)
- state.sh: state transitions, status, list, reset, cancel (9 functions)
- batch.sh: task add, batch create (2 functions)
- dispatch.sh: dispatch, model resolution, quality gates (16 functions)
- evaluate.sh: worker evaluation, PR discovery, AI eval (10 functions)
- pulse.sh: pulse cycle, post-PR lifecycle (3 functions)
- cleanup.sh: worktree mgmt, worker cleanup (8 functions)
- cron.sh: cron scheduling, auto-pickup, watch (4 functions)
- deploy.sh: PR lifecycle, merge, deploy, verify (21 functions)
- self-heal.sh: diagnostics, model escalation (6 functions)
- utility.sh: proof-logs, monitoring, dashboard, notifications (22 functions)
- issue-sync.sh: GitHub issues, labels, claiming, staleness (20 functions)
- memory-integration.sh: memory recall, patterns, retrospectives (7 functions)

Preserved modules already containing real implementations:

- todo-sync.sh (1117 lines, 12 functions)
- release.sh (289 lines, 2 functions)

Log/db/sql_escape functions remain in `_common.sh` (sourced first). Constants (`VALID_STATES`, `VALID_TRANSITIONS`, etc.) remain in the monolith header; modules access them at call time via bash late binding.
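The late-binding note above can be sketched in a few lines (`VALID_STATES` is named in the PR text; the function body is hypothetical):

```shell
#!/usr/bin/env bash
# Late binding: a module function's body references a global that is only
# defined by the sourcing script afterwards. Bash expands $VALID_STATES at
# call time, not at definition/source time, so the reference still resolves.
list_states() {
  echo "valid states: ${VALID_STATES}"
}

# The "monolith header" defines the constant after modules are sourced:
VALID_STATES="queued dispatched running complete"

list_states   # prints: valid states: queued dispatched running complete
```

This is why the modules can be sourced before the constants block without breaking, as long as no module function runs at source time.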
Walkthrough

This PR implements a comprehensive supervisor system across 13 shell script modules, transforming placeholder stubs into fully functional automation for task management, worker dispatch, GitHub integration, state transitions, and multi-phase orchestration. The implementation includes database management, proof logging, quality gates, model escalation, and batch processing workflows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Pulse as Pulse Orchestrator<br/>(pulse.sh)
    participant Batch as Batch/State<br/>(batch.sh, state.sh)
    participant DB as SQLite DB<br/>(database.sh)
    participant Dispatch as Dispatcher<br/>(dispatch.sh)
    participant Worker as Worker Process<br/>(subprocess)
    participant Evaluate as Evaluator<br/>(evaluate.sh)
    participant GitHub as GitHub API<br/>(issue-sync.sh)
    User->>Pulse: cmd_pulse [--batch]
    activate Pulse
    Pulse->>DB: ensure_db() init schema
    activate DB
    DB-->>Pulse: ready
    deactivate DB
    Pulse->>Batch: cmd_auto_pickup() discover tasks
    activate Batch
    Batch->>DB: insert tasks
    DB-->>Batch: task_ids
    Batch-->>Pulse: queued tasks
    deactivate Batch
    Pulse->>Batch: Phase 1: evaluate existing
    Pulse->>Dispatch: Phase 2: dispatch queued tasks
    activate Dispatch
    Dispatch->>Dispatch: resolve_task_model()
    Dispatch->>Dispatch: check_model_health()
    Dispatch->>Worker: spawn worker subprocess
    activate Worker
    Worker-->>Dispatch: worker_pid
    deactivate Worker
    deactivate Dispatch
    Pulse->>Pulse: Phase 3-4: post-PR lifecycle
    loop per completed task
        Pulse->>Evaluate: evaluate_worker(task_id)
        activate Evaluate
        Evaluate->>Worker: check worker logs
        alt worker success
            Evaluate->>Evaluate: discover/create PR
            Evaluate->>GitHub: link_pr_to_task()
            Evaluate->>DB: transition to evaluating→complete
        else worker needs retry
            Evaluate->>DB: transition to retry
        end
        Evaluate-->>Pulse: outcome
        deactivate Evaluate
    end
    Pulse->>GitHub: Phase 6: sync issues, labels
    Pulse->>DB: Phase 5: summary metrics
    Pulse-->>User: pulse complete
    deactivate Pulse
```
```mermaid
sequenceDiagram
    participant Task as Task (queued)
    participant Dispatch as cmd_dispatch<br/>(dispatch.sh)
    participant Worker as Worker Process
    participant Evaluate as cmd_evaluate<br/>(evaluate.sh)
    participant PR as PR Lifecycle<br/>(deploy.sh)
    participant DB as State Machine<br/>(state.sh)
    Task->>Dispatch: dispatch triggered
    activate Dispatch
    Dispatch->>Dispatch: resolve_task_model() select AI model
    Dispatch->>Dispatch: check_model_health() verify availability
    Dispatch->>Dispatch: create_task_worktree() setup workspace
    Dispatch->>Dispatch: build_dispatch_cmd() construct worker cmd
    Dispatch->>Worker: spawn worker (nohup/disown)
    activate Worker
    Dispatch->>DB: transition dispatched
    DB->>Task: state=dispatched
    deactivate Dispatch
    Worker->>Worker: run task (AI inference)
    Worker->>Worker: write logs, push branch
    Worker-->>Worker: exit with status
    deactivate Worker
    Evaluate->>DB: evaluate task
    activate Evaluate
    Evaluate->>Evaluate: extract_log_metadata() parse logs
    alt auto create PR needed
        Evaluate->>Evaluate: auto_create_pr_for_task()
        Evaluate->>DB: link_pr_to_task() record PR
    end
    Evaluate->>DB: transition evaluating→complete
    deactivate Evaluate
    PR->>DB: check_pr_status() fetch PR
    activate PR
    PR->>PR: check_review_threads() triage feedback
    alt needs fixes
        PR->>Worker: dispatch_review_fix_worker()
    else ready to merge
        PR->>PR: merge_task_pr()
        PR->>PR: run_postflight_for_task()
        PR->>PR: run_deploy_for_task()
    end
    PR->>DB: transition deployed
    deactivate PR
    DB-->>Task: state=deployed/complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics

Generated on: Fri Feb 13 13:57:16 UTC 2026 · Generated by AI DevOps Framework Code Review Monitoring
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant architectural overhaul by modularizing the core supervisor script into dedicated functional units. This refactoring aims to enhance the system's scalability, maintainability, and extensibility by distributing responsibilities across specialized modules. The changes enable more intelligent task processing, robust error handling, and comprehensive lifecycle management for automated development workflows.
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🤖 Fix all issues with AI agents
In @.agents/scripts/supervisor/batch.sh:
- Around line 49-55: Validate the max_retries value before using it in SQL:
ensure the variable max_retries (set in the --max-retries parsing block and
later used in the SQL interpolation at the other occurrence) matches a
non-negative integer pattern (e.g. regex like ^[0-9]+$) and optionally enforce
an upper bound, reject or normalize invalid input (call log_error and return
non-zero) and convert to a sanitized integer form before embedding into SQL;
apply the same validation for the second occurrence (the block referenced at
lines 150-153) so both parsing sites enforce and sanitize max_retries prior to
SQL interpolation.
- Around line 196-227: The flag parsing currently assigns user-supplied values
directly to concurrency, max_concurrency, tasks, and max_load_factor; add
numeric validation for each before assignment and before any SQL insert: verify
the value matches a non-negative integer (or appropriate numeric/range for
max_load_factor) using a regex or POSIX arithmetic check, log an error with
log_error and return 1 on invalid input, coerce/normalize valid values (e.g.,
strip leading zeros) or set a safe default, and ensure the same validation is
applied again where these variables are used for SQL inserts (sanitize or
parameterize the query) — update the parsing branches for --concurrency,
--max-concurrency, --tasks, and --max-load and the corresponding usage site
around the SQL insert to enforce this.
In @.agents/scripts/supervisor/cron.sh:
- Around line 50-65: The current logic embeds GH_TOKEN directly into the crontab
entry via cron_cmd; instead, if gh_token is detected, write it to a protected
env file (e.g., create a file under SUPERVISOR_DIR like cron_env, chmod 600) and
update cron_cmd to source that env file before invoking ${script_path} (or
simply rely on calling ${script_path} which uses `gh auth token` internally), so
GH_TOKEN is not placed directly into the crontab or process list; update
references around cron.sh, cron_cmd, GH_TOKEN, SUPERVISOR_DIR, script_path,
pulse, and cron_marker accordingly.
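A minimal sketch of that env-file approach (all paths and the token value here are illustrative assumptions, not the PR's code):

```shell
#!/usr/bin/env bash
# Sketch only: SUPERVISOR_DIR, script_path, and the token are illustrative.
SUPERVISOR_DIR="$(mktemp -d)"
script_path="/usr/local/bin/supervisor-helper.sh"   # hypothetical path
interval=5

# Store the secret in a 0600 env file instead of inlining it in crontab,
# so it never appears in the cron table or the process list.
env_file="${SUPERVISOR_DIR}/cron_env"
umask 077                                  # new files created as 0600
printf 'export GH_TOKEN=%q\n' "example-token" >"$env_file"

# The cron entry sources the env file before running the pulse command.
cron_cmd="*/${interval} * * * * . ${env_file}; ${script_path} pulse >> ${SUPERVISOR_DIR}/cron.log 2>&1"
echo "$cron_cmd"
```

The entry would be installed like any other; `crontab -l` then shows only the env-file path, not the token itself.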
In @.agents/scripts/supervisor/database.sh:
- Around line 325-358: The migration rebuilds the tasks table but the INSERT ...
SELECT omits the triage_result column, causing data loss; update the INSERT INTO
tasks (...) column list and the corresponding SELECT ... FROM tasks_old_t148 to
include triage_result in the exact position/order used in the CREATE TABLE so
values from tasks_old_t148.triage_result are preserved when copying rows back
into tasks.
In @.agents/scripts/supervisor/deploy.sh:
- Around line 84-96: The if/then loses the deployment script's exit code because
$? is read after the if compound; change the pattern to capture the exit code
immediately after running the deploy command (assign deploy_output by running
"$deploy_script" "${deploy_args[@]}" 2>&1, then right away set deploy_exit=$?),
then branch on deploy_exit: if deploy_exit -eq 0 handle success (log_success,
write deploy_output to deploy_log and return 0), if deploy_exit -eq 2 log_info
"No agent changes to deploy for $task_id" and return 0, else log_failure/fail
accordingly; update references to deploy_output, deploy_exit, deploy_script,
deploy_args, deploy_log, and task_id in the same block.
- Around line 2216-2225: The current code builds merge_cmd and uses eval, which
risks command injection via pr_number or repo_slug; change to building a command
array (e.g., cmd=(gh pr merge "$pr_number" "--repo" "$repo_slug" "--squash") and
append "--admin" to the array when use_admin_flag is true), then execute the
array without eval and capture output (e.g., merge_output="$( "${cmd[@]}" 2>&1
)" and check its exit status) so you avoid word-splitting and shell
interpretation of user-controlled values; update references to merge_cmd and
merge_output accordingly.
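The array pattern described above can be sketched as follows. The hostile-looking `pr_number` is illustrative; with an argv array it stays one inert argument and is never re-parsed by the shell, unlike with `eval`:

```shell
#!/usr/bin/env bash
# Sketch of the no-eval pattern: build the command as an array so
# user-controlled values are passed as single arguments.
pr_number='123; rm -rf /tmp/x'    # hostile value stays a single argument
repo_slug='owner/repo'
use_admin_flag=true

cmd=(gh pr merge "$pr_number" --repo "$repo_slug" --squash)
if [[ "$use_admin_flag" == "true" ]]; then
  cmd+=(--admin)
fi

# Real code would run: merge_output="$("${cmd[@]}" 2>&1)"; here we just
# render the argv to show each element survives as one argument.
merge_output=$(printf '[%s]' "${cmd[@]}")
echo "$merge_output"
```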
- Around line 725-726: The block that reassigns escaped_id using sed should be
replaced to use the project's sql_escape() helper so you don't shadow or diverge
from the shared escaping behavior; locate the reassignment of escaped_id
(currently using printf/sed on task_id) and remove it, then assign escaped_id by
calling sql_escape with task_id (or reuse the earlier escaped_id) and ensure you
don't redeclare it with a new local that would overwrite the value used later in
the function.
- Around line 490-513: The SQL statements in the no_pr case interpolate the raw
variable task_id (id='$task_id') which risks SQL injection and quoting errors;
change those to use the precomputed escaped_id from sql_escape (replace
id='$task_id' with id='$escaped_id' in both the SELECT that reads no_pr_retries
and the UPDATE that writes error/json_set) so the queries use the properly
escaped identifier (keep no_pr_count logic and json_set usage unchanged).
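For context, the quote-doubling that a `sql_escape`-style helper typically performs can be sketched like this (the project's real `sql_escape` lives in `_common.sh`; this standalone version is illustrative only):

```shell
#!/usr/bin/env bash
# Illustrative stand-in for the project's sql_escape helper: double every
# single quote so the value stays inside a SQLite string literal.
sql_escape() {
  local s=$1
  s=${s//\'/\'\'}
  printf '%s' "$s"
}

task_id="t1'; DROP TABLE tasks; --"
escaped_id=$(sql_escape "$task_id")
# The hostile quote is now doubled, so it cannot terminate the literal:
echo "UPDATE tasks SET error='x' WHERE id='$escaped_id';"
```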
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1387-1909: The verify-mode branch incorrectly calls resolve_model
"coding" (which maps to Opus) causing higher-cost models to be used; change the
verify-mode model resolution in the cmd_dispatch function where resolved_model
is set for verify_mode to call resolve_model "sonnet" (or the intended cheaper
tier) and adjust any log text referencing "coding-tier model" to reflect the
sonnet/verify tier so verify dispatches use the cheaper model when building the
verify dispatch command (see resolved_model, verify_mode, resolve_model,
build_verify_dispatch_cmd).
- Around line 569-666: The check_output_quality function hardcodes "main" as the
base branch when calling git (git -C "$tworktree" diff --stat "main..HEAD" and
git -C "$tworktree" diff --name-only "main..HEAD"), which breaks repos that use
master or other defaults; compute the repository's default branch first (e.g.,
using git -C "$tworktree" symbolic-ref refs/remotes/origin/HEAD or git -C
"$tworktree" rev-parse --abbrev-ref origin/HEAD and strip the
refs/remotes/origin/ prefix), fall back to "main" then "master" if detection
fails, store it in a local variable like base_branch, and replace the two
hardcoded "main..HEAD" occurrences with "$base_branch..HEAD" (also use that
variable for both diff --stat and diff --name-only checks in
check_output_quality).
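The detection steps above can be sketched as a helper (function name and fallback order are illustrative, not the project's actual code):

```shell
#!/usr/bin/env bash
# Sketch: detect the repo's default branch, falling back to main/master.
detect_base_branch() {
  local tworktree="$1" base_branch=""
  # Prefer the remote's recorded default branch (refs/remotes/origin/HEAD).
  base_branch=$(git -C "$tworktree" symbolic-ref --short \
    refs/remotes/origin/HEAD 2>/dev/null || true)
  base_branch="${base_branch#origin/}"
  # Fall back to main, then master, if detection fails.
  if [[ -z "$base_branch" ]]; then
    if git -C "$tworktree" show-ref --verify --quiet refs/heads/main 2>/dev/null; then
      base_branch="main"
    elif git -C "$tworktree" show-ref --verify --quiet refs/heads/master 2>/dev/null; then
      base_branch="master"
    fi
  fi
  printf '%s\n' "${base_branch:-main}"
}

# Usage in check_output_quality would then be:
#   base_branch=$(detect_base_branch "$tworktree")
#   git -C "$tworktree" diff --stat "${base_branch}..HEAD"
#   git -C "$tworktree" diff --name-only "${base_branch}..HEAD"
```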
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 675-1086: In evaluate_worker(), the billing/credits detection
block incorrectly references an undefined variable log_file causing grep to read
stdin or misclassify errors; update that grep call to use the log path variable
tlog (used throughout the function) so the billing detection uses the actual
task log (refer to evaluate_worker, the block checking meta_backend_error_count
and meta_content_lines before echoing "blocked:billing_credits_exhausted").
In @.agents/scripts/supervisor/issue-sync.sh:
- Around line 1267-1347: The check_task_already_done function is vulnerable to
regex mis-matches because task_id is used raw; replace uses of the raw task_id
with the already-built escaped_task_regex and adjust boundary_pattern to use it
(e.g., boundary_pattern="${escaped_task_regex}([^.0-9]|$)"), and pass
escaped_task_regex into git grep (git -C "$project_root" log --oneline ...
--grep="$escaped_task_regex") and into the exclusion grep (grep -ivE "add
${escaped_task_regex}|claim ${escaped_task_regex}|mark
${escaped_task_regex}|queue ${escaped_task_regex}|blocked") so dotted IDs like
t128.10 are treated literally across git log, grep -E, and grep -ivE checks;
leave the rest of check_task_already_done logic unchanged.
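The escaping step can be sketched with the sed expression the review suggests elsewhere (helper name is illustrative):

```shell
#!/usr/bin/env bash
# Sketch: backslash-escape ERE metacharacters so a dotted task ID such as
# t128.10 matches literally in grep -E patterns.
escape_task_regex() {
  printf '%s' "$1" | sed 's/[.[\*^$()+?{|\\]/\\&/g'
}

task_id="t128.10"
escaped_task_regex=$(escape_task_regex "$task_id")
boundary_pattern="${escaped_task_regex}([^.0-9]|\$)"

# The escaped dot no longer matches arbitrary characters:
if echo "fix t128.10 bug" | grep -qE "$boundary_pattern"; then
  echo "match"
fi
if ! echo "fix t128910 bug" | grep -qE "$boundary_pattern"; then
  echo "no match"
fi
```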
In @.agents/scripts/supervisor/state.sh:
- Around line 817-848: The cmd_next function interpolates the unvalidated
variable limit directly into the SQL LIMIT clause; validate and sanitize limit
before use by checking it is a positive integer, applying a sensible default and
an upper cap (e.g., max_limit) if out of range, and then use the sanitized
numeric variable (not the raw input) when building the db query; update the
cmd_next function (references: cmd_next, limit, ensure_db, db, SUPERVISOR_DB) to
perform this numeric validation/normalization and only then substitute the
sanitized value into the SQL.
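The validation described above can be sketched as a small normalizer (default and cap values are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: sanitize a user-supplied LIMIT before it touches SQL.
sanitize_limit() {
  local limit="$1" default=10 max_limit=100
  if ! [[ "$limit" =~ ^[0-9]+$ ]]; then
    limit="$default"              # non-numeric input -> default
  else
    limit=$((10#$limit))          # force base 10, strips leading zeros
    if [[ "$limit" -eq 0 ]]; then
      limit="$default"
    elif [[ "$limit" -gt "$max_limit" ]]; then
      limit="$max_limit"          # cap runaway values
    fi
  fi
  printf '%s\n' "$limit"
}

# Only the sanitized value would be interpolated into the query, e.g.:
#   db "$SUPERVISOR_DB" "SELECT id FROM tasks ... LIMIT $(sanitize_limit "$1");"
sanitize_limit '5; DROP TABLE tasks'   # prints 10
```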
In @.agents/scripts/supervisor/utility.sh:
- Around line 459-463: The threshold calculation is wrong: because load_ratio is
already a percent, do not multiply by cpu_cores again; compute threshold from
max_load_factor as a percent (e.g., threshold = max_load_factor * 100,
converting floats to an integer percent if needed) and keep the existing
comparison against load_ratio; update the assignment to the threshold variable
(refer to threshold, load_ratio, cpu_cores, max_load_factor) and ensure numeric
conversion/rounding is handled so the if [[ "$load_ratio" -gt "$threshold" ]]
branch can trigger correctly.
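The corrected arithmetic can be sketched as follows (sample values are illustrative; variable names follow the comment):

```shell
#!/usr/bin/env bash
# load_ratio is already a percentage (load average / cores * 100), so the
# threshold must also be a percentage of capacity: max_load_factor * 100,
# with no extra multiplication by cpu_cores.
max_load_factor="1.5"        # e.g. allow load up to 150% of core count
load_ratio=160               # integer percent, computed elsewhere

# Convert the float factor to an integer percent (awk handles the float).
threshold=$(awk -v f="$max_load_factor" 'BEGIN { printf "%d", f * 100 }')

if [[ "$load_ratio" -gt "$threshold" ]]; then
  echo "over threshold (${load_ratio}% > ${threshold}%)"
fi
```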
🟡 Minor comments (8)
.agents/scripts/supervisor/utility.sh-991-1015 (1)
991-1015: ⚠️ Potential issue | 🟡 Minor

`--json` output can be invalid when evidence/metadata contain backslashes or newlines.

Only quotes are escaped today; JSON consumers will break on `\`, `\n`, `\r`, or tabs.

🔧 Suggested hardening (apply similarly in task-mode JSON block)

```diff
-local _esc_evidence="${pevidence:-}"
-_esc_evidence="${_esc_evidence//\"/\\\"}"
-local _esc_meta="${pmeta:-}"
-_esc_meta="${_esc_meta//\"/\\\"}"
+local _esc_evidence="${pevidence:-}"
+_esc_evidence="${_esc_evidence//\\/\\\\}"
+_esc_evidence="${_esc_evidence//$'\n'/\\n}"
+_esc_evidence="${_esc_evidence//$'\r'/\\r}"
+_esc_evidence="${_esc_evidence//$'\t'/\\t}"
+_esc_evidence="${_esc_evidence//\"/\\\"}"
+local _esc_meta="${pmeta:-}"
+_esc_meta="${_esc_meta//\\/\\\\}"
+_esc_meta="${_esc_meta//$'\n'/\\n}"
+_esc_meta="${_esc_meta//$'\r'/\\r}"
+_esc_meta="${_esc_meta//$'\t'/\\t}"
+_esc_meta="${_esc_meta//\"/\\\"}"
```

Also applies to: 1081-1105
.agents/scripts/supervisor/cron.sh-693-696 (1)
693-696: ⚠️ Potential issue | 🟡 Minor

`metadata` column update will fail without schema support.

The current DB schema doesn't define `tasks.metadata`, so this UPDATE is a no-op. Either add a migration for `metadata` or drop the update to avoid silent failures.

.agents/scripts/supervisor/pulse.sh-682-689 (1)
682-689: ⚠️ Potential issue | 🟡 Minor

Cancel stale diagnostics when parent reaches verified states.

If a parent hits `verified`/`verify_failed`, the diagnostic can remain queued indefinitely. Treat those as terminal too.

🔧 Proposed fix

```diff
-    AND p.status IN ('deployed', 'cancelled', 'failed', 'complete', 'merged');
+    AND p.status IN ('deployed', 'verified', 'verify_failed', 'cancelled', 'failed', 'complete', 'merged');
```

.agents/scripts/supervisor/dispatch.sh-35-50 (1)
35-50: ⚠️ Potential issue | 🟡 Minor

Fix logger typo to avoid "command not found".

`log_warning` isn't defined in the shared helpers; use `log_warn`.

🔧 Proposed fix

```diff
 if command -v claude &>/dev/null; then
-    log_warning "Using deprecated claude CLI fallback. Install opencode: npm i -g opencode"
+    log_warn "Using deprecated claude CLI fallback. Install opencode: npm i -g opencode"
     echo "claude"
     return 0
 fi
```

.agents/scripts/supervisor/deploy.sh-2501-2509 (1)
2501-2509: ⚠️ Potential issue | 🟡 Minor

`bash "$check_arg"` executes an arbitrary path sourced from VERIFY.md; verify the trust boundary.

The `bash` check type runs whatever script path is written into `do/VERIFY.md`. If `populate_verify_queue` is the sole writer, this is internally consistent. But if the file is ever committed by a worker or modified via a PR, an attacker could inject a malicious script path. Consider allowlisting script paths or at minimum documenting the trust assumption.

.agents/scripts/supervisor/deploy.sh-1278-1278 (1)
1278-1278: ⚠️ Potential issue | 🟡 Minor

`((dismissed_count++))` returns exit code 1 on the first increment (from 0), which is unsafe with `set -e`.

Postfix `++` evaluates the old value; `((0))` has exit status 1 in bash. If `set -e` (or an `ERR` trap) is ever active in the call chain, the first successful dismissal will cause the function to abort.

🐛 Proposed fix

```diff
-((dismissed_count++))
+dismissed_count=$((dismissed_count + 1))
```
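The pitfall is easy to demonstrate in isolation (counter name follows the comment; the script is a self-contained sketch):

```shell
#!/usr/bin/env bash
set -e
# The arithmetic command (( )) returns exit status 1 when the expression
# evaluates to 0, and postfix ++ yields the OLD value. So the commented
# line below would abort this script on the very first increment from 0.
dismissed_count=0

# ((dismissed_count++))   # would exit here: old value is 0 -> status 1

# The assignment form is a plain variable assignment, always status 0:
dismissed_count=$((dismissed_count + 1))
dismissed_count=$((dismissed_count + 1))
echo "$dismissed_count"   # prints 2
```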
.agents/scripts/supervisor/deploy.sh-1161-1172 (1)

1161-1172: ⚠️ Potential issue | 🟡 Minor

Paths embedded in generated scripts via single quotes are vulnerable to quoting breakage.

Lines 1164 and 1167 embed `${work_dir}` and `${worker_xdg_config}` inside single-quoted strings in the generated script. If any path contains a literal `'`, the resulting script will have a syntax error or, worse, unintended command execution. The `printf '%q'` pattern already used at line 1170 for `cmd_parts` is the safer approach.

🛡️ Proposed fix: use printf '%q' for path embedding

```diff
-echo "cd '${work_dir}' || { echo 'WORKER_FAILED: cd to work_dir failed: ${work_dir}'; exit 1; }"
+printf 'cd %q || { echo %q; exit 1; }\n' "$work_dir" "WORKER_FAILED: cd to work_dir failed: $work_dir"
 # Redirect worker to use MCP config with heavy indexers disabled (t221)
 if [[ -n "$worker_xdg_config" ]]; then
-    echo "export XDG_CONFIG_HOME='${worker_xdg_config}'"
+    printf 'export XDG_CONFIG_HOME=%q\n' "$worker_xdg_config"
 fi
```

.agents/scripts/supervisor/deploy.sh-1640-1645 (1)
1640-1645: ⚠️ Potential issue | 🟡 Minor

Regex word-boundary match treats `.` in subtask IDs as a wildcard.

`jq`'s `test("\\b" + $tid + "\\b")` uses `$tid` verbatim as a regex. For subtask IDs like `t215.3`, the `.` matches any character, so a PR titled "t21503" would falsely match. The downstream `link_pr_to_task` likely validates, but escaping the ID would be more precise.

🛡️ Suggestion: escape regex metacharacters in task ID

```diff
-matched_pr_url=$(echo "$all_prs" | jq -r --arg tid "$tid" '
+local escaped_tid
+escaped_tid=$(printf '%s' "$tid" | sed 's/[.[\*^$()+?{|\\]/\\&/g')
+matched_pr_url=$(echo "$all_prs" | jq -r --arg tid "$escaped_tid" '
```

Alternatively, in the `jq` filter, use `gsub` to escape the ID before passing to `test`.
🧹 Nitpick comments (2)
.agents/scripts/supervisor/deploy.sh (2)
172-187: Manual JSON string construction is fragile; consider using `jq` throughout.

`stage_name` and `stage_time` are concatenated into a JSON object via string interpolation. If any value ever contains a `"` or `\`, the resulting JSON will be malformed and the `jq -n --argjson` call on line 195 will silently fail. Since the data is currently all internally generated timing strings, this is low-risk, but a `jq` pipeline would be more robust.
2260-2267: `cmd_pr_merge` rejects `unstable_sonarcloud` status, but `merge_task_pr` supports it with `--admin`.

The manual `pr-merge` command only allows `ready_to_merge`, while the lifecycle flow handles `unstable_sonarcloud` via `--admin`. If this is intentional (forcing users through `pr-lifecycle` for SonarCloud bypasses), a brief comment would clarify intent. Otherwise, consider accepting `unstable_sonarcloud` here too.
```shell
--max-retries)
    [[ $# -lt 2 ]] && {
        log_error "--max-retries requires a value"
        return 1
    }
    max_retries="$2"
    shift 2
```
Validate --max-retries before embedding in SQL.
It’s currently unvalidated and interpolated directly, which can cause SQL errors or injection if malformed input slips in.
🔧 Proposed fix

```diff
 --max-retries)
     [[ $# -lt 2 ]] && {
         log_error "--max-retries requires a value"
         return 1
     }
     max_retries="$2"
     shift 2
     ;;
@@
-db "$SUPERVISOR_DB" "
+if ! [[ "$max_retries" =~ ^[0-9]+$ ]]; then
+    log_error "--max-retries must be a non-negative integer"
+    return 1
+fi
+db "$SUPERVISOR_DB" "
     INSERT INTO tasks (id, repo, description, model, max_retries)
     VALUES ('$escaped_id', '$escaped_repo', '$escaped_desc', '$escaped_model', $max_retries);
"
```

Also applies to: 150-153
```shell
--concurrency)
    [[ $# -lt 2 ]] && {
        log_error "--concurrency requires a value"
        return 1
    }
    concurrency="$2"
    shift 2
    ;;
--max-concurrency)
    [[ $# -lt 2 ]] && {
        log_error "--max-concurrency requires a value"
        return 1
    }
    max_concurrency="$2"
    shift 2
    ;;
--tasks)
    [[ $# -lt 2 ]] && {
        log_error "--tasks requires a value"
        return 1
    }
    tasks="$2"
    shift 2
    ;;
--max-load)
    [[ $# -lt 2 ]] && {
        log_error "--max-load requires a value"
        return 1
    }
    max_load_factor="$2"
    shift 2
    ;;
```
Validate numeric batch flags before SQL insert.
concurrency, max_concurrency, and max_load_factor are injected raw; invalid values will break inserts or allow injection.
🔧 Proposed fix

```diff
 while [[ $# -gt 0 ]]; do
@@
 done
+for v in concurrency max_concurrency max_load_factor; do
+    if ! [[ "${!v}" =~ ^[0-9]+$ ]]; then
+        log_error "--$v must be a non-negative integer"
+        return 1
+    fi
+done
+
 if [[ -z "$name" ]]; then
     log_error "Usage: supervisor-helper.sh batch <name> [--concurrency N] [--max-concurrency N] [--tasks \"t001,t002\"] [--release-on-complete] [--release-type patch|minor|major] [--skip-quality-gate]"
     return 1
 fi
```

Also applies to: 276-279
```shell
# Detect GH_TOKEN from gh CLI if available (t1006)
local gh_token=""
if command -v gh &>/dev/null; then
    gh_token=$(gh auth token 2>/dev/null || true)
fi

# Build cron command with environment variables
local env_vars=""
if [[ -n "$user_path" ]]; then
    env_vars="PATH=${user_path}"
fi
if [[ -n "$gh_token" ]]; then
    env_vars="${env_vars:+${env_vars} }GH_TOKEN=${gh_token}"
fi

local cron_cmd="*/${interval} * * * * ${env_vars:+${env_vars} }${script_path} pulse ${batch_arg} >> ${SUPERVISOR_DIR}/cron.log 2>&1 ${cron_marker}"
```
Avoid embedding GH_TOKEN directly in crontab.
This writes a long-lived secret into a plaintext cron entry (and process list), which is a security foot-gun. Prefer a protected env file (chmod 600) or rely on `gh auth token` inside the script.
```shell
db "$SUPERVISOR_DB" <<'MIGRATE_T148'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old_t148;
CREATE TABLE tasks (
    id TEXT PRIMARY KEY,
    repo TEXT NOT NULL,
    description TEXT,
    status TEXT NOT NULL DEFAULT 'queued'
        CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
    session_id TEXT,
    worktree TEXT,
    branch TEXT,
    log_file TEXT,
    retries INTEGER NOT NULL DEFAULT 0,
    max_retries INTEGER NOT NULL DEFAULT 3,
    model TEXT DEFAULT 'anthropic/claude-opus-4-6',
    error TEXT,
    pr_url TEXT,
    issue_url TEXT,
    diagnostic_of TEXT,
    triage_result TEXT,
    created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
    started_at TEXT,
    completed_at TEXT,
    updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
    created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
    created_at, started_at, completed_at, updated_at
FROM tasks_old_t148;
```
Preserve triage_result during the t148 table rebuild.
The INSERT/SELECT lists omit triage_result, so any existing values are dropped when recreating tasks. That’s silent data loss for triage workflows.
🔧 Proposed fix

```diff
-INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
-    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
-    created_at, started_at, completed_at, updated_at)
-SELECT id, repo, description, status, session_id, worktree, branch,
-    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
-    created_at, started_at, completed_at, updated_at
+INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
+    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
+    triage_result, created_at, started_at, completed_at, updated_at)
+SELECT id, repo, description, status, session_id, worktree, branch,
+    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
+    triage_result, created_at, started_at, completed_at, updated_at
 FROM tasks_old_t148;
```
```shell
local deploy_output
if deploy_output=$("$deploy_script" "${deploy_args[@]}" 2>&1); then
    log_success "Targeted deploy complete for $task_id"
    echo "$deploy_output" >"$deploy_log" 2>/dev/null || true
    return 0
fi

local deploy_exit=$?
if [[ "$deploy_exit" -eq 2 ]]; then
    # Exit 2 = nothing to deploy (no changes detected)
    log_info "No agent changes to deploy for $task_id"
    return 0
fi
```
Bug: `$?` after an `if` statement does not capture the command's exit code.

After `if deploy_output=$(...); then ... fi`, `$?` is 0 (the exit status of the `if` compound command when the body is not executed). The actual exit code from the deploy script is lost, so `deploy_exit` will always be 0 and the exit-2 ("nothing to deploy") branch on line 92 is dead code.
🐛 Proposed fix: capture exit code explicitly
local deploy_output
- if deploy_output=$("$deploy_script" "${deploy_args[@]}" 2>&1); then
+ local deploy_exit=0
+ deploy_output=$("$deploy_script" "${deploy_args[@]}" 2>&1) || deploy_exit=$?
+
+ if [[ "$deploy_exit" -eq 0 ]]; then
log_success "Targeted deploy complete for $task_id"
echo "$deploy_output" >"$deploy_log" 2>/dev/null || true
return 0
fi
- local deploy_exit=$?
if [[ "$deploy_exit" -eq 2 ]]; then
# Exit 2 = nothing to deploy (no changes detected)
    log_info "No agent changes to deploy for $task_id"
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/deploy.sh around lines 84-96: the if/then loses
the deployment script's exit code because $? is read after the if compound;
change the pattern to capture the exit code immediately after running the deploy
command (assign deploy_output by running "$deploy_script" "${deploy_args[@]}"
2>&1, then right away set deploy_exit=$?), then branch on deploy_exit: if
deploy_exit -eq 0 handle success (log_success, write deploy_output to deploy_log
and return 0), if deploy_exit -eq 2 log_info "No agent changes to deploy for
$task_id" and return 0, else log_failure/fail accordingly; update references to
deploy_output, deploy_exit, deploy_script, deploy_args, deploy_log, and task_id
in the same block.
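The pitfall generalizes beyond deploy.sh; a minimal, self-contained sketch (the `fake_deploy` function is hypothetical, standing in for the deploy script):

```shell
# fake_deploy stands in for a deploy script that exits 2 ("nothing to deploy")
fake_deploy() { echo "no changes"; return 2; }

# Broken pattern: $? is read after the `if` compound command completes, so it
# reports the status of the if statement (0 when the condition failed and no
# else ran), not the status of fake_deploy
out=""
if out=$(fake_deploy 2>&1); then
  echo "success branch"
fi
broken_exit=$?

# Fixed pattern: capture the exit code on the same command with || var=$?
fixed_exit=0
out=$(fake_deploy 2>&1) || fixed_exit=$?

echo "broken_exit=$broken_exit fixed_exit=$fixed_exit"
# prints: broken_exit=0 fixed_exit=2
```

The `|| fixed_exit=$?` capture also behaves under `set -e`, since the failing command substitution sits on the left of an `||` list.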
cmd_dispatch() {
  local task_id="" batch_id=""

  # First positional arg is task_id
  if [[ $# -gt 0 && ! "$1" =~ ^-- ]]; then
    task_id="$1"
    shift
  fi

  while [[ $# -gt 0 ]]; do
    case "$1" in
      --batch)
        [[ $# -lt 2 ]] && {
          log_error "--batch requires a value"
          return 1
        }
        batch_id="$2"
        shift 2
        ;;
      *)
        log_error "Unknown option: $1"
        return 1
        ;;
    esac
  done

  if [[ -z "$task_id" ]]; then
    log_error "Usage: supervisor-helper.sh dispatch <task_id>"
    return 1
  fi

  ensure_db

  # Get task details
  local escaped_id
  escaped_id=$(sql_escape "$task_id")
  local task_row
  task_row=$(db -separator $'\t' "$SUPERVISOR_DB" "
    SELECT id, repo, description, status, model, retries, max_retries
    FROM tasks WHERE id = '$escaped_id';
  ")

  if [[ -z "$task_row" ]]; then
    log_error "Task not found: $task_id"
    return 1
  fi

  local tid trepo tdesc tstatus tmodel tretries tmax_retries
  IFS=$'\t' read -r tid trepo tdesc tstatus tmodel tretries tmax_retries <<<"$task_row"

  # Validate task is in dispatchable state
  if [[ "$tstatus" != "queued" ]]; then
    log_error "Task $task_id is in '$tstatus' state, must be 'queued' to dispatch"
    return 1
  fi

  # Pre-dispatch verification: check if task was already completed in a prior batch.
  # Searches git history for commits referencing this task ID. If a merged PR commit
  # exists, the task is already done — cancel it instead of wasting an Opus session.
  # This prevents the exact bug from backlog-10 where 6 t135 subtasks were dispatched
  # despite being completed months earlier.
  if check_task_already_done "$task_id" "${trepo:-.}"; then
    log_warn "Task $task_id appears already completed in git history — cancelling"
    db "$SUPERVISOR_DB" "UPDATE tasks SET status='cancelled', error='Pre-dispatch: already completed in git history' WHERE id='$(sql_escape "$task_id")';"
    return 0
  fi

  # Pre-dispatch reverification: detect previously-worked tasks (t1008)
  # If a task was dispatched before (dead worker, unclaimed, re-queued, quality
  # escalation), dispatch a lightweight verify worker instead of full implementation.
  # Cost: ~$0.10-0.20 (sonnet) vs ~$1.00 (full session). The verify worker checks
  # if deliverables exist and work; if incomplete, it continues implementation.
  local verify_mode="" verify_reason=""
  if [[ "${SUPERVISOR_SKIP_VERIFY_MODE:-false}" != "true" ]]; then
    # Skip verify mode if the last error was verify_not_started_needs_full —
    # the verify worker already confirmed no prior work exists, so a full
    # implementation dispatch is needed (avoids infinite verify loop).
    local last_error=""
    last_error=$(db "$SUPERVISOR_DB" "SELECT COALESCE(error, '') FROM tasks WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null) || last_error=""
    if [[ "$last_error" == "verify_not_started_needs_full" || "$last_error" == "verify_incomplete_no_pr" ]]; then
      log_info "Task $task_id: skipping verify mode (last error: $last_error) — using full dispatch"
    else
      verify_reason=$(was_previously_worked "$task_id" 2>/dev/null) || true
      if [[ -n "$verify_reason" ]]; then
        verify_mode="true"
        log_info "Task $task_id was previously worked ($verify_reason) — using verify dispatch mode (t1008)"
      fi
    fi
  fi

  # Check if task is claimed by someone else via TODO.md assignee: field (t165)
  local claimed_by=""
  claimed_by=$(check_task_claimed "$task_id" "${trepo:-.}" 2>/dev/null) || true
  if [[ -n "$claimed_by" ]]; then
    # t1024: Check if the claim is stale (no active worker, claimed >2h ago)
    # This prevents tasks from being stuck forever when a worker dies
    local stale_threshold_seconds=7200 # 2 hours
    local is_stale="false"

    # Check if there's an active worker process for this task
    local active_session=""
    local escaped_id
    escaped_id=$(sql_escape "$task_id")
    active_session=$(db "$SUPERVISOR_DB" "SELECT session_id FROM tasks WHERE id = '$escaped_id' AND session_id IS NOT NULL AND status IN ('dispatched','running');" 2>/dev/null) || active_session=""

    if [[ -z "$active_session" ]]; then
      # No active worker — check how long the claim has been held
      local todo_file="${trepo:-.}/TODO.md"
      local task_line=""
      task_line=$(grep -m1 "^[[:space:]]*- \[ \] $task_id " "$todo_file" 2>/dev/null) || task_line=""
      if [[ -n "$task_line" ]]; then
        local started_ts=""
        started_ts=$(echo "$task_line" | sed -n 's/.*started:\([0-9T:Z-]*\).*/\1/p' 2>/dev/null) || started_ts=""
        if [[ -n "$started_ts" ]]; then
          local started_epoch now_epoch
          started_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$started_ts" "+%s" 2>/dev/null) ||
            started_epoch=$(date -d "$started_ts" "+%s" 2>/dev/null) || started_epoch=0
          now_epoch=$(date "+%s")
          if [[ "$started_epoch" -gt 0 ]] && ((now_epoch - started_epoch > stale_threshold_seconds)); then
            is_stale="true"
          fi
        else
          # No started: timestamp but claimed — treat as stale if task is queued in DB
          local db_status=""
          db_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null) || db_status=""
          if [[ "$db_status" == "queued" ]]; then
            is_stale="true"
          fi
        fi
      fi
    fi

    if [[ "$is_stale" == "true" ]]; then
      log_warn "Task $task_id: stale claim by assignee:$claimed_by (no active worker, >2h) — auto-unclaiming (t1024)"
      cmd_unclaim "$task_id" "${trepo:-.}" --force 2>/dev/null || true
    else
      log_warn "Task $task_id is claimed by assignee:$claimed_by — skipping dispatch"
      return 0
    fi
  fi

  # Claim the task before dispatching (t165 — TODO.md primary, GH Issue sync optional)
  # CRITICAL: abort dispatch if claim fails (race condition = another worker claimed first)
  # Pass trepo so claim works from cron (where $PWD != repo dir)
  if ! cmd_claim "$task_id" "${trepo:-.}"; then
    log_error "Failed to claim $task_id — aborting dispatch"
    return 1
  fi

  # Authoritative concurrency check with adaptive load awareness (t151, t172)
  # This is the single source of truth for concurrency enforcement.
  # cmd_next() intentionally does NOT check concurrency to avoid a TOCTOU race
  # where the count becomes stale between cmd_next() and cmd_dispatch() calls
  # within the same pulse loop.
  if [[ -n "$batch_id" ]]; then
    local escaped_batch
    escaped_batch=$(sql_escape "$batch_id")
    local base_concurrency max_load_factor batch_max_concurrency
    base_concurrency=$(db "$SUPERVISOR_DB" "SELECT concurrency FROM batches WHERE id = '$escaped_batch';")
    max_load_factor=$(db "$SUPERVISOR_DB" "SELECT max_load_factor FROM batches WHERE id = '$escaped_batch';")
    batch_max_concurrency=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_concurrency, 0) FROM batches WHERE id = '$escaped_batch';" 2>/dev/null || echo "0")
    local concurrency
    concurrency=$(calculate_adaptive_concurrency "${base_concurrency:-4}" "${max_load_factor:-2}" "${batch_max_concurrency:-0}")
    local active_count
    active_count=$(cmd_running_count "$batch_id")

    if [[ "$active_count" -ge "$concurrency" ]]; then
      log_warn "Concurrency limit reached ($active_count/$concurrency, base:$base_concurrency, adaptive) for batch $batch_id"
      return 2
    fi
  else
    # Global concurrency check with adaptive load awareness (t151)
    local base_global_concurrency="${SUPERVISOR_MAX_CONCURRENCY:-4}"
    local global_concurrency
    global_concurrency=$(calculate_adaptive_concurrency "$base_global_concurrency")
    local global_active
    global_active=$(cmd_running_count)
    if [[ "$global_active" -ge "$global_concurrency" ]]; then
      log_warn "Global concurrency limit reached ($global_active/$global_concurrency, base:$base_global_concurrency)"
      return 2
    fi
  fi

  # Check max retries
  if [[ "$tretries" -ge "$tmax_retries" ]]; then
    log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)"
    cmd_transition "$task_id" "failed" --error "Max retries exceeded"
    return 1
  fi

  # Resolve AI CLI
  local ai_cli
  ai_cli=$(resolve_ai_cli) || return 1

  # Pre-dispatch model availability check (t233 — replaces simple health check)
  # Calls model-availability-helper.sh check before spawning workers.
  # Distinct exit codes prevent wasted dispatch attempts:
  # exit 0 = healthy, proceed
  # exit 1 = provider unavailable, defer dispatch
  # exit 2 = rate limited, defer dispatch (retry next pulse)
  # exit 3 = API key invalid/credits exhausted, block dispatch
  # Previously: 9 wasted failures from ambiguous_ai_unavailable + backend_quota_error
  # because the health check collapsed all failures to a single exit code.
  local health_model health_exit=0
  health_model=$(resolve_model "health" "$ai_cli")
  check_model_health "$ai_cli" "$health_model" || health_exit=$?
  if [[ "$health_exit" -ne 0 ]]; then
    case "$health_exit" in
      2)
        log_warn "Provider rate-limited for $task_id ($health_model via $ai_cli) — deferring dispatch to next pulse"
        return 3 # Return 3 = provider unavailable (distinct from concurrency limit 2)
        ;;
      3)
        log_error "API key invalid/credits exhausted for $task_id ($health_model via $ai_cli) — blocking dispatch"
        log_error "Human action required: check API key or billing. Task will not auto-retry."
        return 3
        ;;
      *)
        log_error "Provider unavailable for $task_id ($health_model via $ai_cli) — deferring dispatch"
        return 3
        ;;
    esac
  fi

  # Pre-dispatch GitHub auth check — verify the worker can push before
  # creating worktrees and burning compute. Workers spawned via nohup/cron
  # may lack SSH keys; gh auth git-credential only works with HTTPS remotes.
  if ! check_gh_auth; then
    log_error "GitHub auth unavailable for $task_id — check_gh_auth failed"
    log_error "Workers need 'gh auth login' or GH_TOKEN set. Skipping dispatch."
    return 3
  fi

  # Verify repo remote uses HTTPS (not SSH) — workers in cron can't use SSH keys
  local remote_url
  remote_url=$(git -C "${trepo:-.}" remote get-url origin 2>/dev/null || echo "")
  if [[ "$remote_url" == git@* || "$remote_url" == ssh://* ]]; then
    log_warn "Remote URL is SSH ($remote_url) — switching to HTTPS for worker compatibility"
    local https_url
    https_url=$(echo "$remote_url" | sed -E 's|^git@github\.com:|https://github.com/|; s|^ssh://git@github\.com/|https://github.com/|; s|\.git$||').git
    git -C "${trepo:-.}" remote set-url origin "$https_url" 2>/dev/null || true
    log_info "Remote URL updated to $https_url"
  fi

  # Create worktree
  log_info "Creating worktree for $task_id..."
  local worktree_path
  worktree_path=$(create_task_worktree "$task_id" "$trepo") || {
    log_error "Failed to create worktree for $task_id"
    cmd_transition "$task_id" "failed" --error "Worktree creation failed"
    return 1
  }

  # Validate worktree path is an actual directory (guards against stdout
  # pollution from git commands inside create_task_worktree)
  if [[ ! -d "$worktree_path" ]]; then
    log_error "Worktree path is not a directory: '$worktree_path'"
    log_error "This usually means a git command leaked stdout into the path variable"
    cmd_transition "$task_id" "failed" --error "Worktree path invalid: $worktree_path"
    return 1
  fi

  local branch_name="feature/${task_id}"

  # Set up log file
  local log_dir="$SUPERVISOR_DIR/logs"
  mkdir -p "$log_dir"
  local log_file
  log_file="$log_dir/${task_id}-$(date +%Y%m%d%H%M%S).log"

  # Pre-create log file with dispatch metadata (t183)
  # If the worker fails to start (opencode not found, permission error, etc.),
  # the log file still exists with context for diagnosis instead of no_log_file.
  {
    echo "=== DISPATCH METADATA (t183) ==="
    echo "task_id=$task_id"
    echo "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "worktree=$worktree_path"
    echo "branch=$branch_name"
    echo "model=${resolved_model:-${tmodel:-default}}"
    echo "ai_cli=$(resolve_ai_cli 2>/dev/null || echo unknown)"
    echo "dispatch_mode=$(detect_dispatch_mode 2>/dev/null || echo unknown)"
    echo "dispatch_type=${verify_mode:+verify}"
    echo "verify_reason=${verify_reason:-}"
    echo "=== END DISPATCH METADATA ==="
    echo ""
  } >"$log_file" 2>/dev/null || true

  # Transition to dispatched
  cmd_transition "$task_id" "dispatched" \
    --worktree "$worktree_path" \
    --branch "$branch_name" \
    --log-file "$log_file"

  # Detect dispatch mode
  local dispatch_mode
  dispatch_mode=$(detect_dispatch_mode)

  # Recall relevant memories before dispatch (t128.6)
  local memory_context=""
  memory_context=$(recall_task_memories "$task_id" "$tdesc" 2>/dev/null || echo "")
  if [[ -n "$memory_context" ]]; then
    log_info "Injecting ${#memory_context} bytes of memory context for $task_id"
  fi

  # Resolve model via frontmatter + fallback chain (t132.5)
  # t1008: For verify-mode dispatches, prefer sonnet tier (cheaper, sufficient for
  # verification checks). The verify worker can escalate to full implementation if
  # it discovers the work is incomplete, but starts cheap.
  local resolved_model
  if [[ "$verify_mode" == "true" ]]; then
    resolved_model=$(resolve_model "coding" "$ai_cli" 2>/dev/null) || resolved_model=""
    log_info "Verify mode: using coding-tier model ($resolved_model) instead of task-specific model"
  else
    resolved_model=$(resolve_task_model "$task_id" "$tmodel" "${trepo:-.}" "$ai_cli")
  fi

  # Contest mode intercept (t1011): if model resolves to CONTEST, delegate to
  # contest-helper.sh which dispatches the same task to top-3 models in parallel.
  # The original task stays in 'running' state while contest entries execute.
  if [[ "$resolved_model" == "CONTEST" ]]; then
    log_info "Contest mode activated for $task_id — delegating to contest-helper.sh"
    local contest_helper="${SCRIPT_DIR}/contest-helper.sh"
    if [[ -x "$contest_helper" ]]; then
      local contest_id
      contest_id=$("$contest_helper" create "$task_id" ${batch_id:+--batch "$batch_id"} 2>/dev/null)
      if [[ -n "$contest_id" ]]; then
        "$contest_helper" dispatch "$contest_id" 2>/dev/null || {
          log_error "Contest dispatch failed for $task_id"
          cmd_transition "$task_id" "failed" --error "Contest dispatch failed"
          return 1
        }
        # Keep original task in running state — pulse Phase 2.5 will check contest completion
        db "$SUPERVISOR_DB" "UPDATE tasks SET error = 'contest:${contest_id}' WHERE id = '$(sql_escape "$task_id")';"
        log_success "Contest $contest_id dispatched for $task_id"
        echo "contest:${contest_id}"
        return 0
      else
        log_error "Failed to create contest for $task_id — falling back to default model"
        resolved_model=$(resolve_model "coding" "$ai_cli")
      fi
    else
      log_warn "contest-helper.sh not found — falling back to default model"
      resolved_model=$(resolve_model "coding" "$ai_cli")
    fi
  fi

  # Secondary availability check: verify the resolved model's provider (t233)
  # The initial health check uses the "health" tier (typically anthropic).
  # If the resolved model uses a different provider (e.g., google/gemini for pro tier),
  # we need to verify that provider too. Skip if same provider or if using OpenCode
  # (which manages routing internally).
  if [[ "$ai_cli" != "opencode" && -n "$resolved_model" && "$resolved_model" == *"/"* ]]; then
    local resolved_provider="${resolved_model%%/*}"
    local health_provider="${health_model%%/*}"
    if [[ "$resolved_provider" != "$health_provider" ]]; then
      local availability_helper="${SCRIPT_DIR}/model-availability-helper.sh"
      if [[ -x "$availability_helper" ]]; then
        local resolved_avail_exit=0
        "$availability_helper" check "$resolved_provider" --quiet || resolved_avail_exit=$?
        if [[ "$resolved_avail_exit" -ne 0 ]]; then
          case "$resolved_avail_exit" in
            2)
              log_warn "Resolved model provider '$resolved_provider' is rate-limited (exit $resolved_avail_exit) for $task_id — deferring dispatch"
              ;;
            3)
              log_error "Resolved model provider '$resolved_provider' has invalid key/credits (exit $resolved_avail_exit) for $task_id — blocking dispatch"
              ;;
            *)
              log_warn "Resolved model provider '$resolved_provider' unavailable (exit $resolved_avail_exit) for $task_id — deferring dispatch"
              ;;
          esac
          return 3
        fi
      fi
    fi
  fi

  local dispatch_type="full"
  if [[ "$verify_mode" == "true" ]]; then
    dispatch_type="verify"
  fi
  log_info "Dispatching $task_id via $ai_cli ($dispatch_mode mode, $dispatch_type dispatch)"
  log_info "Worktree: $worktree_path"
  log_info "Model: $resolved_model"
  log_info "Log: $log_file"

  # Build and execute dispatch command
  # t1008: Use verify dispatch for previously-worked tasks (cheaper, focused)
  # Use NUL-delimited read to preserve multi-line prompts as single arguments
  local -a cmd_parts=()
  if [[ "$verify_mode" == "true" ]]; then
    while IFS= read -r -d '' part; do
      cmd_parts+=("$part")
    done < <(build_verify_dispatch_cmd "$task_id" "$worktree_path" "$log_file" "$ai_cli" "$memory_context" "$resolved_model" "$tdesc" "$verify_reason")
  else
    while IFS= read -r -d '' part; do
      cmd_parts+=("$part")
    done < <(build_dispatch_cmd "$task_id" "$worktree_path" "$log_file" "$ai_cli" "$memory_context" "$resolved_model" "$tdesc")
  fi

  # Ensure PID directory exists before dispatch
  mkdir -p "$SUPERVISOR_DIR/pids"

  # Set FULL_LOOP_HEADLESS for all supervisor-dispatched workers (t174)
  # This ensures headless mode even if the AI doesn't parse --headless from the prompt
  local headless_env="FULL_LOOP_HEADLESS=true"

  # Generate worker-specific MCP config with heavy indexers disabled (t221)
  # Saves ~4 CPU cores per worker by preventing osgrep from indexing
  local worker_xdg_config=""
  worker_xdg_config=$(generate_worker_mcp_config "$task_id" 2>/dev/null) || true

  # Write dispatch script to a temp file to avoid bash -c quoting issues
  # with multi-line prompts (newlines in printf '%q' break bash -c strings)
  local dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch.sh"
  {
    echo '#!/usr/bin/env bash'
    echo "# Startup sentinel (t183): if this line appears in the log, the script started"
    echo "echo 'WORKER_STARTED task_id=${task_id} pid=\$\$ timestamp='\$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "cd '${worktree_path}' || { echo 'WORKER_FAILED: cd to worktree failed: ${worktree_path}'; exit 1; }"
    echo "export ${headless_env}"
    # Redirect worker to use MCP config with heavy indexers disabled (t221)
    if [[ -n "$worker_xdg_config" ]]; then
      echo "export XDG_CONFIG_HOME='${worker_xdg_config}'"
    fi
    # Write each cmd_part as a properly quoted array element
    printf 'exec '
    printf '%q ' "${cmd_parts[@]}"
    printf '\n'
  } >"$dispatch_script"
  chmod +x "$dispatch_script"

  # Wrapper script (t183): captures errors from the dispatch script itself.
  # Previous approach used nohup bash -c with &>/dev/null which swallowed
  # errors when the dispatch script failed to start (e.g., opencode not found).
  # Now errors are appended to the log file for diagnosis.
  # t253: Add cleanup handlers to prevent orphaned children when wrapper exits
  local wrapper_script="${SUPERVISOR_DIR}/pids/${task_id}-wrapper.sh"
  {
    echo '#!/usr/bin/env bash'
    echo '# t253: Recursive cleanup to kill all descendant processes'
    echo '_kill_descendants_recursive() {'
    echo '  local parent_pid="$1"'
    echo '  local children'
    echo '  children=$(pgrep -P "$parent_pid" 2>/dev/null || true)'
    echo '  if [[ -n "$children" ]]; then'
    echo '    for child in $children; do'
    echo '      _kill_descendants_recursive "$child"'
    echo '    done'
    echo '  fi'
    echo '  kill -TERM "$parent_pid" 2>/dev/null || true'
    echo '}'
    echo ''
    echo 'cleanup_children() {'
    echo '  local wrapper_pid=$$'
    echo '  local children'
    echo '  children=$(pgrep -P "$wrapper_pid" 2>/dev/null || true)'
    echo '  if [[ -n "$children" ]]; then'
    echo '    # Recursively kill all descendants'
    echo '    for child in $children; do'
    echo '      _kill_descendants_recursive "$child"'
    echo '    done'
    echo '    sleep 0.5'
    echo '    # Force kill any survivors'
    echo '    for child in $children; do'
    echo '      pkill -9 -P "$child" 2>/dev/null || true'
    echo '      kill -9 "$child" 2>/dev/null || true'
    echo '    done'
    echo '  fi'
    echo '}'
    echo '# Register cleanup on EXIT, INT, TERM (KILL cannot be trapped)'
    echo 'trap cleanup_children EXIT INT TERM'
    echo ''
    echo "'${dispatch_script}' >> '${log_file}' 2>&1"
    echo "rc=\$?"
    echo "echo \"EXIT:\${rc}\" >> '${log_file}'"
    echo "if [ \$rc -ne 0 ]; then"
    echo "  echo \"WORKER_DISPATCH_ERROR: dispatch script exited with code \${rc}\" >> '${log_file}'"
    echo "fi"
  } >"$wrapper_script"
  chmod +x "$wrapper_script"

  if [[ "$dispatch_mode" == "tabby" ]]; then
    # Tabby: attempt to open in a new tab via OSC 1337 escape sequence
    log_info "Opening Tabby tab for $task_id..."
    printf '\e]1337;NewTab=%s\a' "'${wrapper_script}'" 2>/dev/null || true
    # Also start background process as fallback (Tabby may not support OSC 1337)
    # t253: Use setsid if available (Linux) for process group isolation
    # Use nohup + disown to survive parent (cron) exit
    if command -v setsid &>/dev/null; then
      nohup setsid bash "${wrapper_script}" &>/dev/null &
    else
      nohup bash "${wrapper_script}" &>/dev/null &
    fi
  else
    # Headless: background process
    # t253: Use setsid if available (Linux) for process group isolation
    # Use nohup + disown to survive parent (cron) exit — without this,
    # workers die after ~2 minutes when the cron pulse script exits
    if command -v setsid &>/dev/null; then
      nohup setsid bash "${wrapper_script}" &>/dev/null &
    else
      nohup bash "${wrapper_script}" &>/dev/null &
    fi
  fi

  local worker_pid=$!
  disown "$worker_pid" 2>/dev/null || true

  # Store PID for monitoring
  echo "$worker_pid" >"$SUPERVISOR_DIR/pids/${task_id}.pid"

  # Transition to running
  cmd_transition "$task_id" "running" --session "pid:$worker_pid"

  # Add dispatched:model label to GitHub issue (t1010)
  add_model_label "$task_id" "dispatched" "$resolved_model" "${trepo:-.}" 2>>"$SUPERVISOR_LOG" || true

  log_success "Dispatched $task_id (PID: $worker_pid)"
  echo "$worker_pid"
  return 0
}
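The NUL-delimited handoff between `build_dispatch_cmd` and the read loop above can be exercised in isolation; `emit_parts` below is a hypothetical stand-in that emits one NUL-terminated record per argument, the same framing the dispatch code assumes:

```shell
# emit_parts mimics the producer side: printf '%s\0' terminates each
# argument with a NUL byte, so embedded newlines stay inside one record
emit_parts() {
  printf '%s\0' "opencode" "run" $'line one\nline two of the prompt'
}

# Consumer side: read -d '' splits on NUL, -r keeps backslashes literal,
# IFS= preserves leading/trailing whitespace in each part
parts=()
while IFS= read -r -d '' part; do
  parts+=("$part")
done < <(emit_parts)

echo "count=${#parts[@]}"   # count=3 — the multi-line prompt is one element
printf '%s\n' "${parts[2]}"
```

Newline-delimited `read` would split the prompt into two elements; NUL is the only byte that cannot appear in a shell argument, which is why it is a safe delimiter here.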
Verify-mode should resolve to a cheaper tier (sonnet), not coding.
The comment says “prefer sonnet tier,” but resolve_model "coding" maps to Opus; this inflates cost and defeats the optimization.
🔧 Proposed fix
if [[ "$verify_mode" == "true" ]]; then
- resolved_model=$(resolve_model "coding" "$ai_cli" 2>/dev/null) || resolved_model=""
- log_info "Verify mode: using coding-tier model ($resolved_model) instead of task-specific model"
+ resolved_model=$(resolve_model "sonnet" "$ai_cli" 2>/dev/null) || resolved_model=""
+ log_info "Verify mode: using sonnet-tier model ($resolved_model) instead of task-specific model"
else
resolved_model=$(resolve_task_model "$task_id" "$tmodel" "${trepo:-.}" "$ai_cli")
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cmd_dispatch() { | |
| local task_id="" batch_id="" | |
| # First positional arg is task_id | |
| if [[ $# -gt 0 && ! "$1" =~ ^-- ]]; then | |
| task_id="$1" | |
| shift | |
| fi | |
| while [[ $# -gt 0 ]]; do | |
| case "$1" in | |
| --batch) | |
| [[ $# -lt 2 ]] && { | |
| log_error "--batch requires a value" | |
| return 1 | |
| } | |
| batch_id="$2" | |
| shift 2 | |
| ;; | |
| *) | |
| log_error "Unknown option: $1" | |
| return 1 | |
| ;; | |
| esac | |
| done | |
| if [[ -z "$task_id" ]]; then | |
| log_error "Usage: supervisor-helper.sh dispatch <task_id>" | |
| return 1 | |
| fi | |
| ensure_db | |
| # Get task details | |
| local escaped_id | |
| escaped_id=$(sql_escape "$task_id") | |
| local task_row | |
| task_row=$(db -separator $'\t' "$SUPERVISOR_DB" " | |
| SELECT id, repo, description, status, model, retries, max_retries | |
| FROM tasks WHERE id = '$escaped_id'; | |
| ") | |
| if [[ -z "$task_row" ]]; then | |
| log_error "Task not found: $task_id" | |
| return 1 | |
| fi | |
| local tid trepo tdesc tstatus tmodel tretries tmax_retries | |
| IFS=$'\t' read -r tid trepo tdesc tstatus tmodel tretries tmax_retries <<<"$task_row" | |
| # Validate task is in dispatchable state | |
| if [[ "$tstatus" != "queued" ]]; then | |
| log_error "Task $task_id is in '$tstatus' state, must be 'queued' to dispatch" | |
| return 1 | |
| fi | |
| # Pre-dispatch verification: check if task was already completed in a prior batch. | |
| # Searches git history for commits referencing this task ID. If a merged PR commit | |
| # exists, the task is already done — cancel it instead of wasting an Opus session. | |
| # This prevents the exact bug from backlog-10 where 6 t135 subtasks were dispatched | |
| # despite being completed months earlier. | |
| if check_task_already_done "$task_id" "${trepo:-.}"; then | |
| log_warn "Task $task_id appears already completed in git history — cancelling" | |
| db "$SUPERVISOR_DB" "UPDATE tasks SET status='cancelled', error='Pre-dispatch: already completed in git history' WHERE id='$(sql_escape "$task_id")';" | |
| return 0 | |
| fi | |
| # Pre-dispatch reverification: detect previously-worked tasks (t1008) | |
| # If a task was dispatched before (dead worker, unclaimed, re-queued, quality | |
| # escalation), dispatch a lightweight verify worker instead of full implementation. | |
| # Cost: ~$0.10-0.20 (sonnet) vs ~$1.00 (full session). The verify worker checks | |
| # if deliverables exist and work; if incomplete, it continues implementation. | |
| local verify_mode="" verify_reason="" | |
| if [[ "${SUPERVISOR_SKIP_VERIFY_MODE:-false}" != "true" ]]; then | |
| # Skip verify mode if the last error was verify_not_started_needs_full — | |
| # the verify worker already confirmed no prior work exists, so a full | |
| # implementation dispatch is needed (avoids infinite verify loop). | |
| local last_error="" | |
| last_error=$(db "$SUPERVISOR_DB" "SELECT COALESCE(error, '') FROM tasks WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null) || last_error="" | |
| if [[ "$last_error" == "verify_not_started_needs_full" || "$last_error" == "verify_incomplete_no_pr" ]]; then | |
| log_info "Task $task_id: skipping verify mode (last error: $last_error) — using full dispatch" | |
| else | |
| verify_reason=$(was_previously_worked "$task_id" 2>/dev/null) || true | |
| if [[ -n "$verify_reason" ]]; then | |
| verify_mode="true" | |
| log_info "Task $task_id was previously worked ($verify_reason) — using verify dispatch mode (t1008)" | |
| fi | |
| fi | |
| fi | |
| # Check if task is claimed by someone else via TODO.md assignee: field (t165) | |
| local claimed_by="" | |
| claimed_by=$(check_task_claimed "$task_id" "${trepo:-.}" 2>/dev/null) || true | |
| if [[ -n "$claimed_by" ]]; then | |
| # t1024: Check if the claim is stale (no active worker, claimed >2h ago) | |
| # This prevents tasks from being stuck forever when a worker dies | |
| local stale_threshold_seconds=7200 # 2 hours | |
| local is_stale="false" | |
| # Check if there's an active worker process for this task | |
| local active_session="" | |
| local escaped_id | |
| escaped_id=$(sql_escape "$task_id") | |
| active_session=$(db "$SUPERVISOR_DB" "SELECT session_id FROM tasks WHERE id = '$escaped_id' AND session_id IS NOT NULL AND status IN ('dispatched','running');" 2>/dev/null) || active_session="" | |
| if [[ -z "$active_session" ]]; then | |
| # No active worker — check how long the claim has been held | |
| local todo_file="${trepo:-.}/TODO.md" | |
| local task_line="" | |
| task_line=$(grep -m1 "^[[:space:]]*- \[ \] $task_id " "$todo_file" 2>/dev/null) || task_line="" | |
| if [[ -n "$task_line" ]]; then | |
| local started_ts="" | |
| started_ts=$(echo "$task_line" | sed -n 's/.*started:\([0-9T:Z-]*\).*/\1/p' 2>/dev/null) || started_ts="" | |
| if [[ -n "$started_ts" ]]; then | |
| local started_epoch now_epoch | |
| started_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$started_ts" "+%s" 2>/dev/null) || | |
| started_epoch=$(date -d "$started_ts" "+%s" 2>/dev/null) || started_epoch=0 | |
| now_epoch=$(date "+%s") | |
| if [[ "$started_epoch" -gt 0 ]] && ((now_epoch - started_epoch > stale_threshold_seconds)); then | |
| is_stale="true" | |
| fi | |
| else | |
| # No started: timestamp but claimed — treat as stale if task is queued in DB | |
| local db_status="" | |
| db_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null) || db_status="" | |
| if [[ "$db_status" == "queued" ]]; then | |
| is_stale="true" | |
| fi | |
| fi | |
| fi | |
| fi | |
| if [[ "$is_stale" == "true" ]]; then | |
| log_warn "Task $task_id: stale claim by assignee:$claimed_by (no active worker, >2h) — auto-unclaiming (t1024)" | |
| cmd_unclaim "$task_id" "${trepo:-.}" --force 2>/dev/null || true | |
| else | |
| log_warn "Task $task_id is claimed by assignee:$claimed_by — skipping dispatch" | |
| return 0 | |
| fi | |
| fi | |
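The staleness check above relies on a two-flavor timestamp parse: BSD `date` (macOS) needs `-j -f` with an explicit format, while GNU `date` parses `-d` directly. A self-contained sketch of that fallback, assuming one of the two flavors is present:

```shell
# BSD attempt first (-j -f), GNU fallback (-d); either way we end up with an epoch.
started_ts="2024-01-01T00:00:00Z"
started_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$started_ts" "+%s" 2>/dev/null) ||
  started_epoch=$(date -d "$started_ts" "+%s" 2>/dev/null) || started_epoch=0
now_epoch=$(date "+%s")
# Age in seconds, as used by the 2h stale-claim threshold
age=$((now_epoch - started_epoch))
echo "$age"
```

Note the original parses the UTC timestamp without forcing `TZ=UTC` on the BSD path, so BSD hosts in non-UTC zones skew the epoch by the local offset; the 2h threshold absorbs small skews.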
| # Claim the task before dispatching (t165 — TODO.md primary, GH Issue sync optional) | |
| # CRITICAL: abort dispatch if claim fails (race condition = another worker claimed first) | |
| # Pass trepo so claim works from cron (where $PWD != repo dir) | |
| if ! cmd_claim "$task_id" "${trepo:-.}"; then | |
| log_error "Failed to claim $task_id — aborting dispatch" | |
| return 1 | |
| fi | |
| # Authoritative concurrency check with adaptive load awareness (t151, t172) | |
| # This is the single source of truth for concurrency enforcement. | |
| # cmd_next() intentionally does NOT check concurrency to avoid a TOCTOU race | |
| # where the count becomes stale between cmd_next() and cmd_dispatch() calls | |
| # within the same pulse loop. | |
| if [[ -n "$batch_id" ]]; then | |
| local escaped_batch | |
| escaped_batch=$(sql_escape "$batch_id") | |
| local base_concurrency max_load_factor batch_max_concurrency | |
| base_concurrency=$(db "$SUPERVISOR_DB" "SELECT concurrency FROM batches WHERE id = '$escaped_batch';") | |
| max_load_factor=$(db "$SUPERVISOR_DB" "SELECT max_load_factor FROM batches WHERE id = '$escaped_batch';") | |
| batch_max_concurrency=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_concurrency, 0) FROM batches WHERE id = '$escaped_batch';" 2>/dev/null || echo "0") | |
| local concurrency | |
| concurrency=$(calculate_adaptive_concurrency "${base_concurrency:-4}" "${max_load_factor:-2}" "${batch_max_concurrency:-0}") | |
| local active_count | |
| active_count=$(cmd_running_count "$batch_id") | |
| if [[ "$active_count" -ge "$concurrency" ]]; then | |
| log_warn "Concurrency limit reached ($active_count/$concurrency, base:$base_concurrency, adaptive) for batch $batch_id" | |
| return 2 | |
| fi | |
| else | |
| # Global concurrency check with adaptive load awareness (t151) | |
| local base_global_concurrency="${SUPERVISOR_MAX_CONCURRENCY:-4}" | |
| local global_concurrency | |
| global_concurrency=$(calculate_adaptive_concurrency "$base_global_concurrency") | |
| local global_active | |
| global_active=$(cmd_running_count) | |
| if [[ "$global_active" -ge "$global_concurrency" ]]; then | |
| log_warn "Global concurrency limit reached ($global_active/$global_concurrency, base:$base_global_concurrency)" | |
| return 2 | |
| fi | |
| fi | |
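The real `calculate_adaptive_concurrency` lives elsewhere in dispatch.sh; as a hypothetical sketch only, the idea is to shrink the base concurrency when the 1-minute load average exceeds `max_load_factor` times the CPU count, then apply an optional hard cap:

```shell
# Hypothetical sketch — not the real calculate_adaptive_concurrency.
sketch_adaptive_concurrency() {
  local base="${1:-4}" max_load_factor="${2:-2}" hard_cap="${3:-0}"
  local cpus load
  cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  load=$(cut -d' ' -f1 /proc/loadavg 2>/dev/null ||
    uptime | sed -E 's/.*load averages?: *([0-9.]+).*/\1/')
  # Halve the base when integer load exceeds the allowed factor
  if [ "${load%.*}" -ge $((cpus * max_load_factor)) ] 2>/dev/null; then
    base=$((base / 2 > 0 ? base / 2 : 1))
  fi
  # hard_cap of 0 means uncapped (matches batch max_concurrency default)
  if [ "$hard_cap" -gt 0 ] && [ "$base" -gt "$hard_cap" ]; then
    base="$hard_cap"
  fi
  echo "$base"
}
result=$(sketch_adaptive_concurrency 4 2 0)
echo "$result"
```

Halving under load is one plausible policy; the production helper may scale differently, but the shape (base, load factor, optional batch cap) matches the three arguments passed above.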
| # Check max retries | |
| if [[ "$tretries" -ge "$tmax_retries" ]]; then | |
| log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)" | |
| cmd_transition "$task_id" "failed" --error "Max retries exceeded" | |
| return 1 | |
| fi | |
| # Resolve AI CLI | |
| local ai_cli | |
| ai_cli=$(resolve_ai_cli) || return 1 | |
| # Pre-dispatch model availability check (t233 — replaces simple health check) | |
| # Calls model-availability-helper.sh check before spawning workers. | |
| # Distinct exit codes prevent wasted dispatch attempts: | |
| # exit 0 = healthy, proceed | |
| # exit 1 = provider unavailable, defer dispatch | |
| # exit 2 = rate limited, defer dispatch (retry next pulse) | |
| # exit 3 = API key invalid/credits exhausted, block dispatch | |
| # Previously: 9 wasted failures from ambiguous_ai_unavailable + backend_quota_error | |
| # because the health check collapsed all failures to a single exit code. | |
| local health_model health_exit=0 | |
| health_model=$(resolve_model "health" "$ai_cli") | |
| check_model_health "$ai_cli" "$health_model" || health_exit=$? | |
| if [[ "$health_exit" -ne 0 ]]; then | |
| case "$health_exit" in | |
| 2) | |
| log_warn "Provider rate-limited for $task_id ($health_model via $ai_cli) — deferring dispatch to next pulse" | |
| return 3 # Return 3 = provider unavailable (distinct from concurrency limit 2) | |
| ;; | |
| 3) | |
| log_error "API key invalid/credits exhausted for $task_id ($health_model via $ai_cli) — blocking dispatch" | |
| log_error "Human action required: check API key or billing. Task will not auto-retry." | |
| return 3 | |
| ;; | |
| *) | |
| log_error "Provider unavailable for $task_id ($health_model via $ai_cli) — deferring dispatch" | |
| return 3 | |
| ;; | |
| esac | |
| fi | |
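The `check_model_health "$ai_cli" "$health_model" || health_exit=$?` line uses a common set -e-safe idiom: the `||` captures a non-zero status into a variable without aborting the script, so the `case` can branch on the exact code. A minimal illustration with a hypothetical `probe` function:

```shell
# "cmd || var=$?" captures the exact exit code without tripping set -e.
set -e
probe() { return "${1:-0}"; }  # hypothetical stand-in for check_model_health
rc=0
probe 2 || rc=$?
# rc now holds 2 and the script is still running; a bare `probe 2` under
# set -e would have terminated it here.
echo "$rc"
```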
| # Pre-dispatch GitHub auth check — verify the worker can push before | |
| # creating worktrees and burning compute. Workers spawned via nohup/cron | |
| # may lack SSH keys; gh auth git-credential only works with HTTPS remotes. | |
| if ! check_gh_auth; then | |
| log_error "GitHub auth unavailable for $task_id — check_gh_auth failed" | |
| log_error "Workers need 'gh auth login' or GH_TOKEN set. Skipping dispatch." | |
| return 3 | |
| fi | |
| # Verify repo remote uses HTTPS (not SSH) — workers in cron can't use SSH keys | |
| local remote_url | |
| remote_url=$(git -C "${trepo:-.}" remote get-url origin 2>/dev/null || echo "") | |
| if [[ "$remote_url" == git@* || "$remote_url" == ssh://* ]]; then | |
| log_warn "Remote URL is SSH ($remote_url) — switching to HTTPS for worker compatibility" | |
| local https_url | |
| https_url=$(echo "$remote_url" | sed -E 's|^git@github\.com:|https://github.com/|; s|^ssh://git@github\.com/|https://github.com/|; s|\.git$||').git | |
| git -C "${trepo:-.}" remote set-url origin "$https_url" 2>/dev/null || true | |
| log_info "Remote URL updated to $https_url" | |
| fi | |
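The rewrite above handles both SSH remote forms, and the strip-then-append of `.git` normalizes the result to exactly one `.git` suffix whether or not the original remote had one. Exercised standalone:

```shell
# Same sed program as the remote rewrite, applied to both SSH URL shapes.
to_https() {
  echo "$1" | sed -E 's|^git@github\.com:|https://github.com/|; s|^ssh://git@github\.com/|https://github.com/|; s|\.git$||'
}
# Strip any trailing .git inside to_https, then append one — idempotent suffix.
a="$(to_https 'git@github.com:org/repo.git').git"
b="$(to_https 'ssh://git@github.com/org/repo').git"
echo "$a"
echo "$b"
```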
| # Create worktree | |
| log_info "Creating worktree for $task_id..." | |
| local worktree_path | |
| worktree_path=$(create_task_worktree "$task_id" "$trepo") || { | |
| log_error "Failed to create worktree for $task_id" | |
| cmd_transition "$task_id" "failed" --error "Worktree creation failed" | |
| return 1 | |
| } | |
| # Validate worktree path is an actual directory (guards against stdout | |
| # pollution from git commands inside create_task_worktree) | |
| if [[ ! -d "$worktree_path" ]]; then | |
| log_error "Worktree path is not a directory: '$worktree_path'" | |
| log_error "This usually means a git command leaked stdout into the path variable" | |
| cmd_transition "$task_id" "failed" --error "Worktree path invalid: $worktree_path" | |
| return 1 | |
| fi | |
| local branch_name="feature/${task_id}" | |
| # Set up log file | |
| local log_dir="$SUPERVISOR_DIR/logs" | |
| mkdir -p "$log_dir" | |
| local log_file | |
| log_file="$log_dir/${task_id}-$(date +%Y%m%d%H%M%S).log" | |
| # Pre-create log file with dispatch metadata (t183) | |
| # If the worker fails to start (opencode not found, permission error, etc.), | |
| # the log file still exists with context for diagnosis instead of no_log_file. | |
| { | |
| echo "=== DISPATCH METADATA (t183) ===" | |
| echo "task_id=$task_id" | |
| echo "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)" | |
| echo "worktree=$worktree_path" | |
| echo "branch=$branch_name" | |
| echo "model=${resolved_model:-${tmodel:-default}}" | |
| echo "ai_cli=$(resolve_ai_cli 2>/dev/null || echo unknown)" | |
| echo "dispatch_mode=$(detect_dispatch_mode 2>/dev/null || echo unknown)" | |
| echo "dispatch_type=$([ -n "$verify_mode" ] && echo verify || echo full)" | |
| echo "verify_reason=${verify_reason:-}" | |
| echo "=== END DISPATCH METADATA ===" | |
| echo "" | |
| } >"$log_file" 2>/dev/null || true | |
| # Transition to dispatched | |
| cmd_transition "$task_id" "dispatched" \ | |
| --worktree "$worktree_path" \ | |
| --branch "$branch_name" \ | |
| --log-file "$log_file" | |
| # Detect dispatch mode | |
| local dispatch_mode | |
| dispatch_mode=$(detect_dispatch_mode) | |
| # Recall relevant memories before dispatch (t128.6) | |
| local memory_context="" | |
| memory_context=$(recall_task_memories "$task_id" "$tdesc" 2>/dev/null || echo "") | |
| if [[ -n "$memory_context" ]]; then | |
| log_info "Injecting ${#memory_context} bytes of memory context for $task_id" | |
| fi | |
| # Resolve model via frontmatter + fallback chain (t132.5) | |
| # t1008: For verify-mode dispatches, prefer sonnet tier (cheaper, sufficient for | |
| # verification checks). The verify worker can escalate to full implementation if | |
| # it discovers the work is incomplete, but starts cheap. | |
| local resolved_model | |
| if [[ "$verify_mode" == "true" ]]; then | |
| resolved_model=$(resolve_model "sonnet" "$ai_cli" 2>/dev/null) || resolved_model="" | |
| log_info "Verify mode: using sonnet-tier model ($resolved_model) instead of task-specific model" | |
| else | |
| resolved_model=$(resolve_task_model "$task_id" "$tmodel" "${trepo:-.}" "$ai_cli") | |
| fi | |
| # Contest mode intercept (t1011): if model resolves to CONTEST, delegate to | |
| # contest-helper.sh which dispatches the same task to top-3 models in parallel. | |
| # The original task stays in 'running' state while contest entries execute. | |
| if [[ "$resolved_model" == "CONTEST" ]]; then | |
| log_info "Contest mode activated for $task_id — delegating to contest-helper.sh" | |
| local contest_helper="${SCRIPT_DIR}/contest-helper.sh" | |
| if [[ -x "$contest_helper" ]]; then | |
| local contest_id | |
| contest_id=$("$contest_helper" create "$task_id" ${batch_id:+--batch "$batch_id"} 2>/dev/null) | |
| if [[ -n "$contest_id" ]]; then | |
| "$contest_helper" dispatch "$contest_id" 2>/dev/null || { | |
| log_error "Contest dispatch failed for $task_id" | |
| cmd_transition "$task_id" "failed" --error "Contest dispatch failed" | |
| return 1 | |
| } | |
| # Keep original task in running state — pulse Phase 2.5 will check contest completion | |
| db "$SUPERVISOR_DB" "UPDATE tasks SET error = 'contest:${contest_id}' WHERE id = '$(sql_escape "$task_id")';" | |
| log_success "Contest $contest_id dispatched for $task_id" | |
| echo "contest:${contest_id}" | |
| return 0 | |
| else | |
| log_error "Failed to create contest for $task_id — falling back to default model" | |
| resolved_model=$(resolve_model "coding" "$ai_cli") | |
| fi | |
| else | |
| log_warn "contest-helper.sh not found — falling back to default model" | |
| resolved_model=$(resolve_model "coding" "$ai_cli") | |
| fi | |
| fi | |
| # Secondary availability check: verify the resolved model's provider (t233) | |
| # The initial health check uses the "health" tier (typically anthropic). | |
| # If the resolved model uses a different provider (e.g., google/gemini for pro tier), | |
| # we need to verify that provider too. Skip if same provider or if using OpenCode | |
| # (which manages routing internally). | |
| if [[ "$ai_cli" != "opencode" && -n "$resolved_model" && "$resolved_model" == *"/"* ]]; then | |
| local resolved_provider="${resolved_model%%/*}" | |
| local health_provider="${health_model%%/*}" | |
| if [[ "$resolved_provider" != "$health_provider" ]]; then | |
| local availability_helper="${SCRIPT_DIR}/model-availability-helper.sh" | |
| if [[ -x "$availability_helper" ]]; then | |
| local resolved_avail_exit=0 | |
| "$availability_helper" check "$resolved_provider" --quiet || resolved_avail_exit=$? | |
| if [[ "$resolved_avail_exit" -ne 0 ]]; then | |
| case "$resolved_avail_exit" in | |
| 2) | |
| log_warn "Resolved model provider '$resolved_provider' is rate-limited (exit $resolved_avail_exit) for $task_id — deferring dispatch" | |
| ;; | |
| 3) | |
| log_error "Resolved model provider '$resolved_provider' has invalid key/credits (exit $resolved_avail_exit) for $task_id — blocking dispatch" | |
| ;; | |
| *) | |
| log_warn "Resolved model provider '$resolved_provider' unavailable (exit $resolved_avail_exit) for $task_id — deferring dispatch" | |
| ;; | |
| esac | |
| return 3 | |
| fi | |
| fi | |
| fi | |
| fi | |
| local dispatch_type="full" | |
| if [[ "$verify_mode" == "true" ]]; then | |
| dispatch_type="verify" | |
| fi | |
| log_info "Dispatching $task_id via $ai_cli ($dispatch_mode mode, $dispatch_type dispatch)" | |
| log_info "Worktree: $worktree_path" | |
| log_info "Model: $resolved_model" | |
| log_info "Log: $log_file" | |
| # Build and execute dispatch command | |
| # t1008: Use verify dispatch for previously-worked tasks (cheaper, focused) | |
| # Use NUL-delimited read to preserve multi-line prompts as single arguments | |
| local -a cmd_parts=() | |
| if [[ "$verify_mode" == "true" ]]; then | |
| while IFS= read -r -d '' part; do | |
| cmd_parts+=("$part") | |
| done < <(build_verify_dispatch_cmd "$task_id" "$worktree_path" "$log_file" "$ai_cli" "$memory_context" "$resolved_model" "$tdesc" "$verify_reason") | |
| else | |
| while IFS= read -r -d '' part; do | |
| cmd_parts+=("$part") | |
| done < <(build_dispatch_cmd "$task_id" "$worktree_path" "$log_file" "$ai_cli" "$memory_context" "$resolved_model" "$tdesc") | |
| fi | |
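The NUL-delimited read matters because a newline-delimited read would split a multi-line prompt into several array elements; with `read -r -d ''`, each `printf '%s\0'`-emitted part survives as a single argument. A minimal demonstration of the same pattern (bash, as in the script):

```shell
# Stand-in for build_dispatch_cmd: emits argv parts NUL-delimited.
emit_parts() {
  printf '%s\0' "worker-cli" "--prompt" $'line one\nline two'
}
parts=()
while IFS= read -r -d '' part; do
  parts+=("$part")
done < <(emit_parts)
# Three elements — the embedded newline did not split the prompt.
echo "${#parts[@]}"
```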
| # Ensure PID directory exists before dispatch | |
| mkdir -p "$SUPERVISOR_DIR/pids" | |
| # Set FULL_LOOP_HEADLESS for all supervisor-dispatched workers (t174) | |
| # This ensures headless mode even if the AI doesn't parse --headless from the prompt | |
| local headless_env="FULL_LOOP_HEADLESS=true" | |
| # Generate worker-specific MCP config with heavy indexers disabled (t221) | |
| # Saves ~4 CPU cores per worker by preventing osgrep from indexing | |
| local worker_xdg_config="" | |
| worker_xdg_config=$(generate_worker_mcp_config "$task_id" 2>/dev/null) || true | |
| # Write dispatch script to a temp file to avoid bash -c quoting issues | |
| # with multi-line prompts (newlines in printf '%q' break bash -c strings) | |
| local dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch.sh" | |
| { | |
| echo '#!/usr/bin/env bash' | |
| echo "# Startup sentinel (t183): if this line appears in the log, the script started" | |
| echo "echo \"WORKER_STARTED task_id=${task_id} pid=\$\$ timestamp=\$(date -u +%Y-%m-%dT%H:%M:%SZ)\"" | |
| echo "cd '${worktree_path}' || { echo 'WORKER_FAILED: cd to worktree failed: ${worktree_path}'; exit 1; }" | |
| echo "export ${headless_env}" | |
| # Redirect worker to use MCP config with heavy indexers disabled (t221) | |
| if [[ -n "$worker_xdg_config" ]]; then | |
| echo "export XDG_CONFIG_HOME='${worker_xdg_config}'" | |
| fi | |
| # Write each cmd_part as a properly quoted array element | |
| printf 'exec ' | |
| printf '%q ' "${cmd_parts[@]}" | |
| printf '\n' | |
| } >"$dispatch_script" | |
| chmod +x "$dispatch_script" | |
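The `printf '%q '` step is what makes the temp-file approach safe: each argument is written as a shell-parseable word, so when bash later parses the generated `exec` line it reconstructs the exact argv, embedded newlines included. A round-trip check of that idiom:

```shell
# Write a script whose single line re-parses %q-quoted arguments.
args=("echo" $'multi\nline prompt' "plain")
script=$(mktemp)
{
  printf 'printf "%%s\\n" '     # generated script prints each argv word
  printf '%q ' "${args[@]}"     # each arg emitted as one shell word
  printf '\n'
} >"$script"
out=$(bash "$script")
rm -f "$script"
echo "$out"
```

The multi-line argument comes back as one word, which is exactly why `bash -c` with hand-built quoting was abandoned in favor of a generated script.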
| # Wrapper script (t183): captures errors from the dispatch script itself. | |
| # Previous approach used nohup bash -c with &>/dev/null which swallowed | |
| # errors when the dispatch script failed to start (e.g., opencode not found). | |
| # Now errors are appended to the log file for diagnosis. | |
| # t253: Add cleanup handlers to prevent orphaned children when wrapper exits | |
| local wrapper_script="${SUPERVISOR_DIR}/pids/${task_id}-wrapper.sh" | |
| { | |
| echo '#!/usr/bin/env bash' | |
| echo '# t253: Recursive cleanup to kill all descendant processes' | |
| echo '_kill_descendants_recursive() {' | |
| echo ' local parent_pid="$1"' | |
| echo ' local children' | |
| echo ' children=$(pgrep -P "$parent_pid" 2>/dev/null || true)' | |
| echo ' if [[ -n "$children" ]]; then' | |
| echo ' for child in $children; do' | |
| echo ' _kill_descendants_recursive "$child"' | |
| echo ' done' | |
| echo ' fi' | |
| echo ' kill -TERM "$parent_pid" 2>/dev/null || true' | |
| echo '}' | |
| echo '' | |
| echo 'cleanup_children() {' | |
| echo ' local wrapper_pid=$$' | |
| echo ' local children' | |
| echo ' children=$(pgrep -P "$wrapper_pid" 2>/dev/null || true)' | |
| echo ' if [[ -n "$children" ]]; then' | |
| echo ' # Recursively kill all descendants' | |
| echo ' for child in $children; do' | |
| echo ' _kill_descendants_recursive "$child"' | |
| echo ' done' | |
| echo ' sleep 0.5' | |
| echo ' # Force kill any survivors' | |
| echo ' for child in $children; do' | |
| echo ' pkill -9 -P "$child" 2>/dev/null || true' | |
| echo ' kill -9 "$child" 2>/dev/null || true' | |
| echo ' done' | |
| echo ' fi' | |
| echo '}' | |
| echo '# Register cleanup on EXIT, INT, TERM (KILL cannot be trapped)' | |
| echo 'trap cleanup_children EXIT INT TERM' | |
| echo '' | |
| echo "'${dispatch_script}' >> '${log_file}' 2>&1" | |
| echo "rc=\$?" | |
| echo "echo \"EXIT:\${rc}\" >> '${log_file}'" | |
| echo "if [ \$rc -ne 0 ]; then" | |
| echo " echo \"WORKER_DISPATCH_ERROR: dispatch script exited with code \${rc}\" >> '${log_file}'" | |
| echo "fi" | |
| } >"$wrapper_script" | |
| chmod +x "$wrapper_script" | |
| if [[ "$dispatch_mode" == "tabby" ]]; then | |
| # Tabby: attempt to open in a new tab via OSC 1337 escape sequence | |
| log_info "Opening Tabby tab for $task_id..." | |
| printf '\e]1337;NewTab=%s\a' "'${wrapper_script}'" 2>/dev/null || true | |
| # Also start background process as fallback (Tabby may not support OSC 1337) | |
| # t253: Use setsid if available (Linux) for process group isolation | |
| # Use nohup + disown to survive parent (cron) exit | |
| if command -v setsid &>/dev/null; then | |
| nohup setsid bash "${wrapper_script}" &>/dev/null & | |
| else | |
| nohup bash "${wrapper_script}" &>/dev/null & | |
| fi | |
| else | |
| # Headless: background process | |
| # t253: Use setsid if available (Linux) for process group isolation | |
| # Use nohup + disown to survive parent (cron) exit — without this, | |
| # workers die after ~2 minutes when the cron pulse script exits | |
| if command -v setsid &>/dev/null; then | |
| nohup setsid bash "${wrapper_script}" &>/dev/null & | |
| else | |
| nohup bash "${wrapper_script}" &>/dev/null & | |
| fi | |
| fi | |
| local worker_pid=$! | |
| disown "$worker_pid" 2>/dev/null || true | |
| # Store PID for monitoring | |
| echo "$worker_pid" >"$SUPERVISOR_DIR/pids/${task_id}.pid" | |
| # Transition to running | |
| cmd_transition "$task_id" "running" --session "pid:$worker_pid" | |
| # Add dispatched:model label to GitHub issue (t1010) | |
| add_model_label "$task_id" "dispatched" "$resolved_model" "${trepo:-.}" 2>>"$SUPERVISOR_LOG" || true | |
| log_success "Dispatched $task_id (PID: $worker_pid)" | |
| echo "$worker_pid" | |
| return 0 | |
| } | |
| evaluate_worker() { | ||
| local task_id="$1" | ||
| local skip_ai_eval="${2:-false}" | ||
|
|
||
| ensure_db | ||
|
|
||
| local escaped_id | ||
| escaped_id=$(sql_escape "$task_id") | ||
| local task_row | ||
| task_row=$(db -separator '|' "$SUPERVISOR_DB" " | ||
| SELECT status, log_file, retries, max_retries, session_id, pr_url | ||
| FROM tasks WHERE id = '$escaped_id'; | ||
| ") | ||
|
|
||
| if [[ -z "$task_row" ]]; then | ||
| log_error "Task not found: $task_id" | ||
| return 1 | ||
| fi | ||
|
|
||
| local tstatus tlog tretries tmax_retries tsession tpr_url | ||
| IFS='|' read -r tstatus tlog tretries tmax_retries tsession tpr_url <<<"$task_row" | ||
|
|
||
| # Enhanced no_log_file diagnostics (t183) | ||
| # Instead of a bare "failed:no_log_file", gather context about why the log | ||
| # is missing so the supervisor can make better retry/block decisions and | ||
| # self-healing diagnostics have actionable information. | ||
| if [[ -z "$tlog" ]]; then | ||
| # No log path in DB at all — dispatch likely failed before setting log_file | ||
| local diag_detail="no_log_path_in_db" | ||
| local pid_file="$SUPERVISOR_DIR/pids/${task_id}.pid" | ||
| if [[ -f "$pid_file" ]]; then | ||
| local stale_pid | ||
| stale_pid=$(cat "$pid_file" 2>/dev/null || echo "") | ||
| if [[ -n "$stale_pid" ]] && ! kill -0 "$stale_pid" 2>/dev/null; then | ||
| diag_detail="no_log_path_in_db:worker_pid_${stale_pid}_dead" | ||
| elif [[ -n "$stale_pid" ]]; then | ||
| diag_detail="no_log_path_in_db:worker_pid_${stale_pid}_alive" | ||
| fi | ||
| fi | ||
| echo "failed:${diag_detail}" | ||
| return 0 | ||
| fi | ||
|
|
||
| if [[ ! -f "$tlog" ]]; then | ||
| # Log path set in DB but file doesn't exist — worker wrapper never ran | ||
| local diag_detail="log_file_missing" | ||
| local dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch.sh" | ||
| local wrapper_script="${SUPERVISOR_DIR}/pids/${task_id}-wrapper.sh" | ||
| if [[ ! -f "$dispatch_script" && ! -f "$wrapper_script" ]]; then | ||
| diag_detail="log_file_missing:no_dispatch_scripts" | ||
| elif [[ -f "$dispatch_script" && ! -x "$dispatch_script" ]]; then | ||
| diag_detail="log_file_missing:dispatch_script_not_executable" | ||
| fi | ||
| local pid_file="$SUPERVISOR_DIR/pids/${task_id}.pid" | ||
| if [[ -f "$pid_file" ]]; then | ||
| local stale_pid | ||
| stale_pid=$(cat "$pid_file" 2>/dev/null || echo "") | ||
| if [[ -n "$stale_pid" ]] && ! kill -0 "$stale_pid" 2>/dev/null; then | ||
| diag_detail="${diag_detail}:worker_pid_${stale_pid}_dead" | ||
| fi | ||
| else | ||
| diag_detail="${diag_detail}:no_pid_file" | ||
| fi | ||
| echo "failed:${diag_detail}" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Log file exists but may be empty or contain only metadata header (t183) | ||
| local log_size | ||
| log_size=$(wc -c <"$tlog" 2>/dev/null | tr -d ' ') | ||
| if [[ "$log_size" -eq 0 ]]; then | ||
| echo "failed:log_file_empty" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Check if worker never started (only dispatch metadata, no WORKER_STARTED sentinel) | ||
| if [[ "$log_size" -lt 500 ]] && ! grep -q 'WORKER_STARTED' "$tlog" 2>/dev/null; then | ||
| # Log has metadata but worker never started — extract any error from log | ||
| local startup_error="" | ||
| startup_error=$(grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' "$tlog" 2>/dev/null | head -1 | head -c 200 || echo "") | ||
| if [[ -n "$startup_error" ]]; then | ||
| echo "failed:worker_never_started:$(echo "$startup_error" | tr ' ' '_' | tr -cd '[:alnum:]_:-')" | ||
| else | ||
| echo "failed:worker_never_started:no_sentinel" | ||
| fi | ||
| return 0 | ||
| fi | ||
|
|
||
| # --- Tier 1: Deterministic signal detection --- | ||
|
|
||
| # Parse structured metadata from log (bash 3.2 compatible - no associative arrays) | ||
| local meta_output | ||
| meta_output=$(extract_log_metadata "$tlog") | ||
|
|
||
| # Helper: extract a value from key=value metadata output | ||
| _meta_get() { | ||
| local key="$1" default="${2:-}" | ||
| local val | ||
| val=$(echo "$meta_output" | grep "^${key}=" | head -1 | cut -d= -f2-) | ||
| echo "${val:-$default}" | ||
| } | ||
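The helper above is a bash-3.2-safe substitute for an associative array: metadata arrives as `key=value` lines and `cut -d= -f2-` keeps any `=` characters inside the value. A standalone copy of the lookup, run against sample metadata (the values here are illustrative, not real worker output):

```shell
# Sample key=value metadata, as extract_log_metadata would emit it.
meta_output=$'signal=TASK_COMPLETE\nexit_code=0\npr_url=https://example.invalid/pr/1'
meta_get() {
  local key="$1" default="${2:-}"
  local val
  # grep the first matching key; -f2- preserves '=' inside values
  val=$(echo "$meta_output" | grep "^${key}=" | head -1 | cut -d= -f2-)
  echo "${val:-$default}"
}
sig=$(meta_get "signal" "none")
missing=$(meta_get "rate_limit_count" "0")  # absent key falls back to default
echo "$sig $missing"
```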
|
|
||
| local meta_signal meta_pr_url meta_exit_code | ||
| meta_signal=$(_meta_get "signal" "none") | ||
| meta_pr_url=$(_meta_get "pr_url" "") | ||
| meta_exit_code=$(_meta_get "exit_code" "") | ||
|
|
||
| # Seed PR URL from DB (t171): check_pr_status() or a previous pulse may have | ||
| # already found and persisted the PR URL. Use it before expensive gh API calls. | ||
| if [[ -z "$meta_pr_url" && -n "${tpr_url:-}" && "$tpr_url" != "no_pr" && "$tpr_url" != "task_only" ]]; then | ||
| meta_pr_url="$tpr_url" | ||
| fi | ||
|
|
||
| # Resolve repo slug early — needed for PR validation (t195) and fallback detection | ||
| local task_repo task_branch repo_slug_detect | ||
| task_repo=$(sqlite3 "$SUPERVISOR_DB" "SELECT repo FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") | ||
| task_branch=$(sqlite3 "$SUPERVISOR_DB" "SELECT branch FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") | ||
| repo_slug_detect="" | ||
| if [[ -n "$task_repo" ]]; then | ||
| repo_slug_detect=$(detect_repo_slug "$task_repo" 2>/dev/null || echo "") | ||
| fi | ||
|
|
||
| # Validate PR URL belongs to this task (t195, t223): a previous pulse | ||
| # may have stored a PR URL that doesn't actually reference this task ID | ||
| # (e.g., branch reuse, stale data, or log containing another task's PR URL). | ||
| # Validate before using for attribution. If repo slug detection failed, | ||
| # clear the PR URL entirely — unvalidated URLs cause cross-contamination | ||
| # where the wrong PR gets linked to the wrong task (t223). | ||
| if [[ -n "$meta_pr_url" ]]; then | ||
| if [[ -n "$repo_slug_detect" ]]; then | ||
| local validated_url | ||
| validated_url=$(validate_pr_belongs_to_task "$task_id" "$repo_slug_detect" "$meta_pr_url") || validated_url="" | ||
| if [[ -z "$validated_url" ]]; then | ||
| log_warn "evaluate_worker: PR URL for $task_id failed task ID validation — clearing" | ||
| meta_pr_url="" | ||
| fi | ||
| else | ||
| log_warn "evaluate_worker: cannot validate PR URL for $task_id (repo slug detection failed) — clearing to prevent cross-contamination" | ||
| meta_pr_url="" | ||
| fi | ||
| fi | ||
|
|
||
| # Fallback PR URL detection via centralized discover_pr_by_branch() (t232, t161, t195) | ||
| if [[ -z "$meta_pr_url" && -n "$repo_slug_detect" ]]; then | ||
| meta_pr_url=$(discover_pr_by_branch "$task_id" "$repo_slug_detect" "$task_branch") || meta_pr_url="" | ||
| fi | ||
|
|
||
| local meta_rate_limit_count meta_auth_error_count meta_conflict_count | ||
| local meta_timeout_count meta_oom_count meta_backend_error_count | ||
| meta_rate_limit_count=$(_meta_get "rate_limit_count" "0") | ||
| meta_auth_error_count=$(_meta_get "auth_error_count" "0") | ||
| meta_conflict_count=$(_meta_get "conflict_count" "0") | ||
| meta_timeout_count=$(_meta_get "timeout_count" "0") | ||
| meta_oom_count=$(_meta_get "oom_count" "0") | ||
| meta_backend_error_count=$(_meta_get "backend_error_count" "0") | ||
|
|
||
| # FULL_LOOP_COMPLETE = definitive success | ||
| if [[ "$meta_signal" == "FULL_LOOP_COMPLETE" ]]; then | ||
| echo "complete:${meta_pr_url:-no_pr}" | ||
| return 0 | ||
| fi | ||
|
|
||
| # t1008: Verify-mode worker signals | ||
| # VERIFY_COMPLETE = verification confirmed prior work is done | ||
| if [[ "$meta_signal" == "VERIFY_COMPLETE" ]]; then | ||
| log_info "Verify worker confirmed $task_id is complete" | ||
| echo "complete:${meta_pr_url:-verified_complete}" | ||
| return 0 | ||
| fi | ||
|
|
||
| # VERIFY_INCOMPLETE = prior work exists but needs more; worker continued implementation | ||
| if [[ "$meta_signal" == "VERIFY_INCOMPLETE" ]]; then | ||
| if [[ -n "$meta_pr_url" ]]; then | ||
| log_info "Verify worker found incomplete work for $task_id, continued and created PR" | ||
| echo "complete:${meta_pr_url}" | ||
| return 0 | ||
| fi | ||
| # No PR = worker found incomplete work but couldn't finish | ||
| log_info "Verify worker found incomplete work for $task_id but no PR created" | ||
| echo "retry:verify_incomplete_no_pr" | ||
| return 0 | ||
| fi | ||
|
|
||
| # VERIFY_NOT_STARTED = no prior work found; worker should have done full implementation | ||
| if [[ "$meta_signal" == "VERIFY_NOT_STARTED" ]]; then | ||
| if [[ -n "$meta_pr_url" ]]; then | ||
| log_info "Verify worker found no prior work for $task_id, did full implementation" | ||
| echo "complete:${meta_pr_url}" | ||
| return 0 | ||
| fi | ||
| # No PR = verify worker couldn't complete full implementation (expected — it's lightweight) | ||
| log_info "Verify worker found no prior work for $task_id and couldn't complete — re-queue for full dispatch" | ||
| echo "retry:verify_not_started_needs_full" | ||
| return 0 | ||
| fi | ||
|
|
||
| # TASK_COMPLETE with clean exit = partial success (PR phase may have failed) | ||
| # If a PR URL is available (from DB or gh fallback), include it. | ||
| if [[ "$meta_signal" == "TASK_COMPLETE" && "$meta_exit_code" == "0" ]]; then | ||
| echo "complete:${meta_pr_url:-task_only}" | ||
| return 0 | ||
| fi | ||
|
|
||
| # PR URL with clean exit = task completed (PR was created successfully) | ||
| # This takes priority over heuristic error patterns because log content | ||
| # may discuss auth/errors as part of the task itself (e.g., creating an | ||
| # API integration subagent that documents authentication flows) | ||
| if [[ -n "$meta_pr_url" && "$meta_exit_code" == "0" ]]; then | ||
| echo "complete:${meta_pr_url}" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Backend infrastructure error with EXIT:0 (t095-diag-1): CLI wrappers like | ||
| # OpenCode exit 0 even when the backend rejects the request (quota exceeded, | ||
| # backend down). A short log with backend errors means the worker never | ||
| # started - this is NOT content discussion, it's a real failure. | ||
| # Must be checked BEFORE clean_exit_no_signal to avoid wasting retries. | ||
| # (t198): Use content_lines instead of log_lines to exclude REPROMPT METADATA | ||
| # headers that inflate the line count in retry logs (8-line header caused | ||
| # 12-line logs to miss the < 10 threshold). | ||
| if [[ "$meta_exit_code" == "0" && "$meta_signal" == "none" ]]; then | ||
| local meta_content_lines | ||
| meta_content_lines=$(_meta_get "content_lines" "0") | ||
| # Billing/credits errors: block immediately, retrying won't help. | ||
| # OpenCode Zen proxy returns CreditsError when credits exhausted; | ||
| # this is a billing issue, not a transient backend error. | ||
| if [[ "$meta_backend_error_count" -gt 0 && "$meta_content_lines" -lt 10 ]]; then | ||
| if grep -qi 'CreditsError\|Insufficient balance' "$tlog" 2>/dev/null; then | ||
| echo "blocked:billing_credits_exhausted" | ||
| return 0 | ||
| fi | ||
| echo "retry:backend_quota_error" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| # Task obsolete detection (t198): workers that determine a task is already | ||
| # done or obsolete exit cleanly with EXIT:0, no signal, and no PR. Without | ||
| # this check, the supervisor retries them as clean_exit_no_signal, wasting | ||
| # retry attempts on work that will never produce a PR. | ||
| # Uses the final "type":"text" entry (authoritative) to detect explicit | ||
| # "already done" / "no changes needed" language from the worker. | ||
| if [[ "$meta_exit_code" == "0" && "$meta_signal" == "none" ]]; then | ||
| local meta_task_obsolete | ||
| meta_task_obsolete=$(_meta_get "task_obsolete" "false") | ||
| if [[ "$meta_task_obsolete" == "true" ]]; then | ||
| echo "complete:task_obsolete" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| # Clean exit with no completion signal and no PR (checked DB + gh API above) | ||
| # = likely incomplete. The agent finished cleanly but didn't emit a signal | ||
| # and no PR was found. Retry (agent may have run out of context or hit a | ||
| # soft limit). If a PR exists, it was caught at line ~3179 via DB seed (t171) | ||
| # or gh fallback (t161). | ||
| if [[ "$meta_exit_code" == "0" && "$meta_signal" == "none" ]]; then | ||
| echo "retry:clean_exit_no_signal" | ||
| return 0 | ||
| fi | ||
|
|
||
| # --- Tier 2: Heuristic error pattern matching --- | ||
| # ONLY applied when exit code is non-zero or missing. | ||
| # When exit=0, the agent finished cleanly - any "error" strings in the log | ||
| # are content (e.g., subagents documenting auth flows), not real failures. | ||
|
|
||
| if [[ "$meta_exit_code" != "0" ]]; then | ||
| # Backend infrastructure error (quota, API gateway) = transient retry | ||
| # Only checked on non-zero exit: a clean exit with backend error strings in | ||
| # the log is content discussion, not a real infrastructure failure. | ||
| if [[ "$meta_backend_error_count" -gt 0 ]]; then | ||
| echo "retry:backend_infrastructure_error" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Auth errors are always blocking (human must fix credentials) | ||
| if [[ "$meta_auth_error_count" -gt 0 ]]; then | ||
| echo "blocked:auth_error" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Merge conflicts require human resolution | ||
| if [[ "$meta_conflict_count" -gt 0 ]]; then | ||
| echo "blocked:merge_conflict" | ||
| return 0 | ||
| fi | ||
|
|
||
| # OOM is infrastructure - blocking | ||
| if [[ "$meta_oom_count" -gt 0 ]]; then | ||
| echo "blocked:out_of_memory" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Rate limiting is transient - retry with backoff | ||
| if [[ "$meta_rate_limit_count" -gt 0 ]]; then | ||
| echo "retry:rate_limited" | ||
| return 0 | ||
| fi | ||
|
|
||
| # Timeout is transient - retry | ||
| if [[ "$meta_timeout_count" -gt 0 ]]; then | ||
| echo "retry:timeout" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| # Non-zero exit with known code | ||
| if [[ -n "$meta_exit_code" && "$meta_exit_code" != "0" ]]; then | ||
| # Exit code 130 = SIGINT (Ctrl+C), 137 = SIGKILL, 143 = SIGTERM | ||
| case "$meta_exit_code" in | ||
| 130) | ||
| echo "retry:interrupted_sigint" | ||
| return 0 | ||
| ;; | ||
| 137) | ||
| echo "retry:killed_sigkill" | ||
| return 0 | ||
| ;; | ||
| 143) | ||
| echo "retry:terminated_sigterm" | ||
| return 0 | ||
| ;; | ||
| esac | ||
| fi | ||
|
|
||
| # Check if retries exhausted before attempting AI eval | ||
| if [[ "$tretries" -ge "$tmax_retries" ]]; then | ||
| echo "failed:max_retries" | ||
| return 0 | ||
| fi | ||
|
|
||
| # --- Tier 2.5: Git heuristic signals (t175) --- | ||
| # Before expensive AI eval, check for concrete evidence of work in the | ||
| # task's worktree/branch. This resolves most ambiguous outcomes cheaply | ||
| # and prevents false retries when the worker completed but didn't emit | ||
| # a signal (e.g., context exhaustion after creating a PR). | ||
|
|
||
| # Reuse task_repo/task_branch from PR detection above; fetch worktree | ||
| local task_worktree | ||
| task_worktree=$(db "$SUPERVISOR_DB" "SELECT worktree FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "") | ||
|
|
||
| if [[ -n "$task_repo" && -d "$task_repo" ]]; then | ||
| # Use worktree path if available, otherwise fall back to repo | ||
| local git_dir="${task_worktree:-$task_repo}" | ||
| if [[ ! -d "$git_dir" ]]; then | ||
| git_dir="$task_repo" | ||
| fi | ||
|
|
||
| # Check for commits on branch ahead of main/master | ||
| local branch_commits=0 | ||
| if [[ -n "$task_branch" ]]; then | ||
| local base_branch | ||
| base_branch=$(git -C "$git_dir" symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's|refs/remotes/origin/||' || echo "main") | ||
| branch_commits=$(git -C "$git_dir" rev-list --count "${base_branch}..${task_branch}" 2>/dev/null || echo 0) | ||
| fi | ||
|
|
||
| # Check for uncommitted changes in worktree | ||
| local uncommitted_changes=0 | ||
| if [[ -n "$task_worktree" && -d "$task_worktree" ]]; then | ||
| uncommitted_changes=$(git -C "$task_worktree" status --porcelain 2>/dev/null | wc -l | tr -d ' ') | ||
| fi | ||
|
|
||
| # Decision matrix: | ||
| # - Commits + PR URL → complete (worker finished, signal was lost) | ||
| # - Commits + no PR → auto-create PR (t247.2), fallback to task_only | ||
| # - No commits + uncommitted changes → retry:work_in_progress | ||
| # - No commits + no changes → genuine ambiguity (fall through to AI/retry) | ||
|
|
||
| if [[ "$branch_commits" -gt 0 ]]; then | ||
| if [[ -n "$meta_pr_url" ]]; then | ||
| echo "complete:${meta_pr_url}" | ||
| else | ||
| # t247.2: Auto-create PR instead of returning task_only. | ||
| # Saves ~300s per retry by preserving work for review. | ||
| local auto_pr_url="" | ||
| if [[ -n "$repo_slug_detect" && -n "$task_branch" ]]; then | ||
| auto_pr_url=$(auto_create_pr_for_task "$task_id" "$git_dir" "$task_branch" "$repo_slug_detect" 2>>"${SUPERVISOR_LOG:-/dev/null}") || auto_pr_url="" | ||
| fi | ||
| if [[ -n "$auto_pr_url" ]]; then | ||
| echo "complete:${auto_pr_url}" | ||
| else | ||
| echo "complete:task_only" | ||
| fi | ||
| fi | ||
| return 0 | ||
| fi | ||
|
|
||
| if [[ "$uncommitted_changes" -gt 0 ]]; then | ||
| echo "retry:work_in_progress" | ||
| return 0 | ||
| fi | ||
| fi | ||
|
|
||
| # --- Tier 3: AI evaluation for ambiguous outcomes --- | ||
|
|
||
| if [[ "$skip_ai_eval" == "true" ]]; then | ||
| echo "retry:ambiguous_skipped_ai" | ||
| return 0 | ||
| fi | ||
|
|
||
| local ai_verdict | ||
| ai_verdict=$(evaluate_with_ai "$task_id" "$tlog" 2>/dev/null || echo "") | ||
|
|
||
| if [[ -n "$ai_verdict" ]]; then | ||
| echo "$ai_verdict" | ||
| return 0 | ||
| fi | ||
|
|
||
| # AI eval failed or unavailable - default to retry | ||
| echo "retry:ambiguous_ai_unavailable" | ||
| return 0 | ||
| } |
Fix undefined log_file reference in billing error check.
This path uses $log_file (undefined) instead of $tlog, which can block on stdin or misclassify errors.
🔧 Proposed fix

```diff
-  if [[ "$meta_backend_error_count" -gt 0 && "$meta_content_lines" -lt 10 ]]; then
-    if grep -qi 'CreditsError\|Insufficient balance' "$log_file" 2>/dev/null; then
+  if [[ "$meta_backend_error_count" -gt 0 && "$meta_content_lines" -lt 10 ]]; then
+    if grep -qi 'CreditsError\|Insufficient balance' "$tlog" 2>/dev/null; then
       echo "blocked:billing_credits_exhausted"
       return 0
     fi
```

🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/evaluate.sh around lines 675 - 1086, In
evaluate_worker(), the billing/credits detection block incorrectly references an
undefined variable log_file causing grep to read stdin or misclassify errors;
update that grep call to use the log path variable tlog (used throughout the
function) so the billing detection uses the actual task log (refer to
evaluate_worker, the block checking meta_backend_error_count and
meta_content_lines before echoing "blocked:billing_credits_exhausted").
```bash
check_task_already_done() {
  local task_id="${1:-}"
  local project_root="${2:-.}"

  if [[ -z "$task_id" ]]; then
    return 1
  fi

  # Check 1: Is the task already marked [x] in TODO.md?
  # IMPORTANT: TODO.md may contain the same task ID in multiple sections:
  # - Active task list (authoritative — near the top)
  # - Completed plan archive (historical — further down, from earlier iterations)
  # We must check the FIRST occurrence only. If the first match is [x], it's done.
  # If the first match is [ ] or [-], it's NOT done (even if a later [x] exists).
  local todo_file="$project_root/TODO.md"
  if [[ -f "$todo_file" ]]; then
    local first_match=""
    first_match=$(grep -E "^\s*- \[(x| |-)\] ${task_id}[[:space:]]" "$todo_file" 2>/dev/null | head -1) || true
    if [[ -n "$first_match" ]]; then
      # Extract ONLY the checkbox at the start of the line, not [x] anywhere in description
      local checkbox=""
      checkbox=$(printf '%s' "$first_match" | sed -n 's/^[[:space:]]*- \[\(.\)\].*/\1/p')
      if [[ "$checkbox" == "x" ]]; then
        log_info "Pre-dispatch check: $task_id is marked [x] in TODO.md (first occurrence)" >&2
        return 0
      else
        # First occurrence is [ ] or [-] — task is NOT done, skip further checks
        log_info "Pre-dispatch check: $task_id is [ ] in TODO.md (first occurrence — ignoring any later [x] entries)" >&2
        return 1
      fi
    fi
  fi

  # Check 2: Are there merged commits referencing this task ID?
  # IMPORTANT: Use word-boundary matching to prevent t020 matching t020.6.
  # Escaped task_id for regex: dots become literal dots.
  local escaped_task_regex
  escaped_task_regex=$(printf '%s' "$task_id" | sed 's/\./\\./g')
  # grep -w uses word boundaries but dots aren't word chars, so for subtask IDs
  # like t020.1 we need a custom boundary: task_id followed by non-digit or EOL.
  # This prevents t020 from matching t020.1, t020.2, etc.
  local boundary_pattern="${task_id}([^.0-9]|$)"

  local commit_count=0
  commit_count=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
    grep -cE "$boundary_pattern" 2>/dev/null) || true
  if [[ "$commit_count" -gt 0 ]]; then
    # Verify at least one commit looks like a REAL completion:
    # Must have a PR merge reference "(#NNN)" AND the exact task ID.
    # Exclude: "add tNNN", "claim tNNN", "mark tNNN blocked", "queue tNNN"
    local completion_evidence=""
    completion_evidence=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
      grep -E "$boundary_pattern" |
      grep -iE "\(#[0-9]+\)|PR #[0-9]+ merged" |
      grep -ivE "add ${task_id}|claim ${task_id}|mark ${task_id}|queue ${task_id}|blocked" |
      head -1) || true
    if [[ -n "$completion_evidence" ]]; then
      log_info "Pre-dispatch check: $task_id has completion evidence: $completion_evidence" >&2
      return 0
    fi
  fi

  # Check 3: Does a merged PR exist for this task?
  # Only check if gh CLI is available and authenticated (cached check).
  # Use exact task ID in title search to prevent substring matches.
  # IMPORTANT: gh pr list --repo requires OWNER/REPO slug, not a local path (t224).
  if command -v gh &>/dev/null && check_gh_auth 2>/dev/null; then
    local repo_slug=""
    repo_slug=$(detect_repo_slug "$project_root" 2>/dev/null) || true
    if [[ -n "$repo_slug" ]]; then
      local pr_count=0
      pr_count=$(gh pr list --repo "$repo_slug" --state merged --search "\"$task_id\" in:title" --limit 1 --json number --jq 'length' 2>/dev/null) || true
      if [[ "$pr_count" -gt 0 ]]; then
        log_info "Pre-dispatch check: $task_id has a merged PR on GitHub" >&2
        return 0
      fi
    fi
  fi

  return 1
}
```
🧩 Analysis chain

🏁 Script executed:

```bash
# Find and examine the issue-sync.sh file
find . -name "issue-sync.sh" -type f
```

Repository: marcusquinn/aidevops
Length of output: 106

🏁 Script executed:

```bash
# Check the file at the specified lines
if [[ -f ".agents/scripts/supervisor/issue-sync.sh" ]]; then
  wc -l ".agents/scripts/supervisor/issue-sync.sh"
fi
```

Repository: marcusquinn/aidevops
Length of output: 109

🏁 Script executed:

```bash
# Read the relevant section of the code
sed -n '1267,1347p' ".agents/scripts/supervisor/issue-sync.sh"
```

Repository: marcusquinn/aidevops
Length of output: 3650

🏁 Script executed:

```bash
# Create a test git repository to verify git log --grep behavior
cd /tmp && rm -rf test_repo && mkdir test_repo && cd test_repo && git init
# Create some test commits with task IDs
git config user.email "test@test.com"
git config user.name "Test"
echo "test" > file.txt
git add . && git commit -m "Fix t128.10 task"
echo "test2" > file.txt
git add . && git commit -m "Fix t128x10 related issue"
echo "test3" > file.txt
git add . && git commit -m "Fix t020 task"
# Now test if unescaped dot in grep matches both
echo "=== Testing unescaped dot (false positive) ==="
git log --oneline -500 --all --grep="t128.10" 2>/dev/null || true
echo "=== Testing escaped dot (should be more specific) ==="
git log --oneline -500 --all --grep="t128\.10" 2>/dev/null || true
```

Repository: marcusquinn/aidevops
Length of output: 1146

🏁 Script executed:

```bash
# Test the boundary pattern with grep -E as used in the code
echo "=== Testing boundary pattern with unescaped task_id ==="
printf "Fix t128.10 task\nFix t128x10 related\n" | grep -cE "t128.10([^.0-9]|$)" && echo "Unescaped matches: 2"
echo "=== Testing boundary pattern with escaped task_id ==="
printf "Fix t128.10 task\nFix t128x10 related\n" | grep -cE "t128\.10([^.0-9]|$)" && echo "Escaped matches: 1"
```

Repository: marcusquinn/aidevops
Length of output: 217

🏁 Script executed:

```bash
# Also test the last grep pattern in the code that uses ${task_id}
# These are in the exclusion patterns for completion_evidence
task_id="t128.10"
echo "=== Testing exclusion pattern with unescaped task_id ==="
printf "add t128.10\nadd t128x10\n" | grep -ivE "add ${task_id}" || echo "No matches (correct)"
echo "=== Testing with escaped task_id ==="
escaped_task_id=$(printf '%s' "$task_id" | sed 's/\./\\./g')
printf "add t128.10\nadd t128x10\n" | grep -ivE "add ${escaped_task_id}" || echo "Matches line with x"
```

Repository: marcusquinn/aidevops
Length of output: 190
Escape task IDs comprehensively in all regex patterns to prevent false positives.

Dotted task IDs (e.g., `t128.10`) are treated as regex wildcards in `git log --grep` and `grep -E`, causing `t128.10` to match `t128x10`, `t128a10`, etc. The code creates `escaped_task_regex` but never uses it; the unescaped `task_id` is passed to three pattern contexts:

- `boundary_pattern="${task_id}([^.0-9]|$)"` matches wrong tasks in `grep -cE`
- `git log --grep="$task_id"` returns commits for unrelated tasks
- `grep -ivE "add ${task_id}|claim ${task_id}|..."` incorrectly excludes substring matches

This causes the function to incorrectly mark unrelated tasks as "already done", disrupting task dispatch logic.
🔧 Fix: Use escaped_task_regex in all regex contexts

```diff
 local escaped_task_regex
-escaped_task_regex=$(printf '%s' "$task_id" | sed 's/\./\\./g')
+escaped_task_regex=$(printf '%s' "$task_id" | sed 's/[][\\/.*^$+?{}()|]/\\&/g')

-local boundary_pattern="${task_id}([^.0-9]|$)"
+local boundary_pattern="${escaped_task_regex}([^.0-9]|$)"

-commit_count=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
+commit_count=$(git -C "$project_root" log --oneline -500 --all --grep="$escaped_task_regex" 2>/dev/null |

-completion_evidence=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
+completion_evidence=$(git -C "$project_root" log --oneline -500 --all --grep="$escaped_task_regex" 2>/dev/null |

-  grep -ivE "add ${task_id}|claim ${task_id}|mark ${task_id}|queue ${task_id}|blocked" |
+  grep -ivE "add ${escaped_task_regex}|claim ${escaped_task_regex}|mark ${escaped_task_regex}|queue ${escaped_task_regex}|blocked" |
```

🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/issue-sync.sh around lines 1267 - 1347, The
check_task_already_done function is vulnerable to regex mis-matches because
task_id is used raw; replace uses of the raw task_id with the already-built
escaped_task_regex and adjust boundary_pattern to use it (e.g.,
boundary_pattern="${escaped_task_regex}([^.0-9]|$)"), and pass
escaped_task_regex into git grep (git -C "$project_root" log --oneline ...
--grep="$escaped_task_regex") and into the exclusion grep (grep -ivE "add
${escaped_task_regex}|claim ${escaped_task_regex}|mark
${escaped_task_regex}|queue ${escaped_task_regex}|blocked") so dotted IDs like
t128.10 are treated literally across git log, grep -E, and grep -ivE checks;
leave the rest of check_task_already_done logic unchanged.
```bash
cmd_next() {
  local batch_id="${1:-}" limit="${2:-1}"

  ensure_db

  if [[ -n "$batch_id" ]]; then
    local escaped_batch
    escaped_batch=$(sql_escape "$batch_id")

    db -separator $'\t' "$SUPERVISOR_DB" "
      SELECT t.id, t.repo, t.description, t.model
      FROM batch_tasks bt
      JOIN tasks t ON bt.task_id = t.id
      WHERE bt.batch_id = '$escaped_batch'
        AND t.status = 'queued'
        AND t.retries < t.max_retries
      ORDER BY t.retries ASC, bt.position
      LIMIT $limit;
    "
  else
    db -separator $'\t' "$SUPERVISOR_DB" "
      SELECT id, repo, description, model
      FROM tasks
      WHERE status = 'queued'
        AND retries < max_retries
      ORDER BY retries ASC, created_at ASC
      LIMIT $limit;
    "
  fi

  return 0
}
```
Validate limit before interpolating into SQL.
LIMIT $limit is fed from user input without numeric validation, opening the door to SQL injection or query errors.
🔧 Proposed fix (sanitize limit)

```diff
 cmd_next() {
   local batch_id="${1:-}" limit="${2:-1}"
+  if [[ -n "$limit" && ! "$limit" =~ ^[0-9]+$ ]]; then
+    log_error "Invalid limit: $limit (must be a non-negative integer)"
+    return 1
+  fi
+
   ensure_db
```

📝 Committable suggestion
‼️ IMPORTANT

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```bash
cmd_next() {
  local batch_id="${1:-}" limit="${2:-1}"
  if [[ -n "$limit" && ! "$limit" =~ ^[0-9]+$ ]]; then
    log_error "Invalid limit: $limit (must be a non-negative integer)"
    return 1
  fi

  ensure_db

  if [[ -n "$batch_id" ]]; then
    local escaped_batch
    escaped_batch=$(sql_escape "$batch_id")

    db -separator $'\t' "$SUPERVISOR_DB" "
      SELECT t.id, t.repo, t.description, t.model
      FROM batch_tasks bt
      JOIN tasks t ON bt.task_id = t.id
      WHERE bt.batch_id = '$escaped_batch'
        AND t.status = 'queued'
        AND t.retries < t.max_retries
      ORDER BY t.retries ASC, bt.position
      LIMIT $limit;
    "
  else
    db -separator $'\t' "$SUPERVISOR_DB" "
      SELECT id, repo, description, model
      FROM tasks
      WHERE status = 'queued'
        AND retries < max_retries
      ORDER BY retries ASC, created_at ASC
      LIMIT $limit;
    "
  fi

  return 0
}
```
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/state.sh around lines 817 - 848, The cmd_next
function interpolates the unvalidated variable limit directly into the SQL LIMIT
clause; validate and sanitize limit before use by checking it is a positive
integer, applying a sensible default and an upper cap (e.g., max_limit) if out
of range, and then use the sanitized numeric variable (not the raw input) when
building the db query; update the cmd_next function (references: cmd_next,
limit, ensure_db, db, SUPERVISOR_DB) to perform this numeric
validation/normalization and only then substitute the sanitized value into the
SQL.
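The proposed validation can be exercised in isolation (`validate_limit` is a hypothetical helper written for this sketch, not part of the supervisor scripts): accept only non-negative integers before the value reaches a SQL `LIMIT` clause.

```shell
#!/usr/bin/env bash
# Sketch of the limit sanitization: reject anything that is not a
# non-negative integer before it is interpolated into SQL.
validate_limit() {
  local limit="${1:-1}"
  [[ "$limit" =~ ^[0-9]+$ ]] || return 1
  printf '%s\n' "$limit"
}

good=$(validate_limit 25) || good="rejected"
bad=$(validate_limit "1; DROP TABLE tasks") || bad="rejected"
echo "$good"
echo "$bad"
```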
```bash
local threshold=$((cpu_cores * max_load_factor * 100))
if [[ "$load_ratio" -gt "$threshold" ]]; then
  echo "overloaded=true"
else
  echo "overloaded=false"
```
Linux overload detection never triggers due to a bad threshold.
load_ratio is already a percent, but the threshold multiplies by cpu_cores again, so overload is effectively impossible. This prevents adaptive throttling under real load.
🔧 Proposed fix

```diff
-local threshold=$((cpu_cores * max_load_factor * 100))
+local threshold=$((max_load_factor * 100))
 if [[ "$load_ratio" -gt "$threshold" ]]; then
   echo "overloaded=true"
 else
   echo "overloaded=false"
```

📝 Committable suggestion
‼️ IMPORTANT

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```bash
local threshold=$((max_load_factor * 100))
if [[ "$load_ratio" -gt "$threshold" ]]; then
  echo "overloaded=true"
else
  echo "overloaded=false"
```
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/utility.sh around lines 459 - 463, The threshold
calculation is wrong: because load_ratio is already a percent, do not multiply
by cpu_cores again; compute threshold from max_load_factor as a percent (e.g.,
threshold = max_load_factor * 100, converting floats to an integer percent if
needed) and keep the existing comparison against load_ratio; update the
assignment to the threshold variable (refer to threshold, load_ratio, cpu_cores,
max_load_factor) and ensure numeric conversion/rounding is handled so the if [[
"$load_ratio" -gt "$threshold" ]] branch can trigger correctly.
Code Review
This is a significant and valuable refactoring that modularizes the monolithic supervisor-helper.sh script, greatly improving maintainability. The extraction of functions into domain-specific modules is well-executed.
My review focuses on ensuring adherence to the repository's shell scripting style guide and identifying potential issues introduced during the refactoring. I've found a few critical issues that need to be addressed:
- All the new/modified scripts are missing the required `set -euo pipefail` directive.
- The `deploy.sh` script uses `eval`, which is forbidden by the style guide.
- The `evaluate.sh` script calls trap helper functions that are not defined, and the approach to resolving this should consider the repository's guidelines on `source` dependencies versus function duplication.
I've also included a medium-severity suggestion to refactor the very long ensure_db function in database.sh for better readability.
Once these issues are resolved, this will be an excellent contribution to the project's health.
```diff
-# Functions for creating, managing, and querying batches
-
-set -euo pipefail
+# Functions for adding tasks and creating batches
```
The set -euo pipefail directive has been removed. According to the repository style guide (line 10), all scripts must use this setting for robustness and safety. Please add set -euo pipefail to the top of this script and all other new/modified shell scripts in this pull request.
References
- All scripts use `#!/usr/bin/env bash` and `set -euo pipefail` (link)
```bash
local merge_cmd="gh pr merge \"$pr_number\" --repo \"$repo_slug\" --squash"
if [[ "$use_admin_flag" == "true" ]]; then
  merge_cmd="$merge_cmd --admin"
fi

if ! merge_output=$(eval "$merge_cmd" 2>&1); then
```
The use of eval is a security risk and is explicitly forbidden by the repository style guide (line 32). Please refactor this to use a bash array for dynamic command construction to avoid potential command injection vulnerabilities.
```diff
-local merge_cmd="gh pr merge \"$pr_number\" --repo \"$repo_slug\" --squash"
-if [[ "$use_admin_flag" == "true" ]]; then
-  merge_cmd="$merge_cmd --admin"
-fi
-
-if ! merge_output=$(eval "$merge_cmd" 2>&1); then
+local -a merge_cmd_arr=("gh" "pr" "merge" "$pr_number" "--repo" "$repo_slug" "--squash")
+if [[ "$use_admin_flag" == "true" ]]; then
+  merge_cmd_arr+=("--admin")
+fi
+if ! merge_output=$("${merge_cmd_arr[@]}" 2>&1); then
```
References
- No `eval` -- use bash arrays for dynamic command construction (link)
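A short sketch of why the array form is safer (the hostile value is hypothetical): interpolating an untrusted string into `eval` executes any embedded commands, while expanding an argv array passes the same bytes through as a single literal argument.

```shell
#!/usr/bin/env bash
# Hypothetical hostile value embedded in a PR number field.
pr_number='123"; echo INJECTED; echo "'

# eval "echo pr \"$pr_number\""   # would execute the injected echo INJECTED

# Array form: the value stays one literal argument, nothing is executed.
args=(echo "pr $pr_number")
safe_output=$("${args[@]}")
echo "$safe_output"
```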
```bash
_save_cleanup_scope
trap '_run_cleanups' RETURN
push_cleanup "rm -f '${log_tail_file}'"
```
This block uses _save_cleanup_scope, trap '_run_cleanups' RETURN, and push_cleanup. These functions are not defined within this script and appear to be from a shared utility file (like shared-constants.sh). This will cause a runtime error. Per repository guidelines (Rule 2), for standalone scripts, it is often preferable to duplicate simple, self-contained helper functions to maintain script independence and avoid path resolution issues, rather than introducing source dependencies. Please either duplicate these functions within this script if they are simple and self-contained, or ensure the necessary utility script is sourced if they are complex or shared across many non-standalone scripts.
References
- For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions instead of introducing `source` dependencies. This maintains script independence and avoids risks like path resolution issues.
| ensure_db() { | ||
| if [[ ! -d "$SUPERVISOR_DIR" ]]; then | ||
| mkdir -p "$SUPERVISOR_DIR" | ||
| fi | ||
|
|
||
| if [[ ! -f "$SUPERVISOR_DB" ]]; then | ||
| init_db | ||
| return 0 | ||
| fi | ||
|
|
||
| # Check if schema needs upgrade | ||
| local has_tasks | ||
| has_tasks=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='tasks';") | ||
| if [[ "$has_tasks" -eq 0 ]]; then | ||
| init_db | ||
| fi | ||
|
|
||
| # Migrate: add post-PR lifecycle states if CHECK constraint is outdated (t128.8) | ||
| # SQLite doesn't support ALTER CHECK, so we recreate the constraint via a temp table | ||
| # Note: uses dynamic column lists so cannot use safe_migrate() directly (t188) | ||
| local check_sql | ||
| check_sql=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "") | ||
| if [[ -n "$check_sql" ]] && ! echo "$check_sql" | grep -q 'pr_review'; then | ||
| log_info "Migrating database schema for post-PR lifecycle states (t128.8)..." | ||
|
|
||
| # Backup before migration (t188: fail-safe — abort if backup fails) | ||
| local t128_backup | ||
| t128_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-t128.8") | ||
| if [[ $? -ne 0 || -z "$t128_backup" ]]; then | ||
| log_error "Backup failed for t128.8 migration — aborting" | ||
| return 1 | ||
| fi | ||
|
|
||
| # Detect which optional columns exist in the old table to preserve data (t162) | ||
| local has_issue_url_col has_diagnostic_of_col has_triage_result_col | ||
| has_issue_url_col=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='issue_url';" 2>/dev/null || echo "0") | ||
| has_diagnostic_of_col=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='diagnostic_of';" 2>/dev/null || echo "0") | ||
| has_triage_result_col=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='triage_result';" 2>/dev/null || echo "0") | ||
|
|
||
| # Build column lists dynamically based on what exists | ||
| local insert_cols="id, repo, description, status, session_id, worktree, branch, log_file, retries, max_retries, model, error, pr_url, created_at, started_at, completed_at, updated_at" | ||
| local select_cols="$insert_cols" | ||
| [[ "$has_issue_url_col" -gt 0 ]] && { | ||
| insert_cols="$insert_cols, issue_url" | ||
| select_cols="$select_cols, issue_url" | ||
| } | ||
| [[ "$has_diagnostic_of_col" -gt 0 ]] && { | ||
| insert_cols="$insert_cols, diagnostic_of" | ||
| select_cols="$select_cols, diagnostic_of" | ||
| } | ||
| [[ "$has_triage_result_col" -gt 0 ]] && { | ||
| insert_cols="$insert_cols, triage_result" | ||
| select_cols="$select_cols, triage_result" | ||
| } | ||
|
|
||
| db "$SUPERVISOR_DB" <<MIGRATE | ||
| PRAGMA foreign_keys=OFF; | ||
| BEGIN TRANSACTION; | ||
| ALTER TABLE tasks RENAME TO tasks_old; | ||
| CREATE TABLE tasks ( | ||
| id TEXT PRIMARY KEY, | ||
| repo TEXT NOT NULL, | ||
| description TEXT, | ||
| status TEXT NOT NULL DEFAULT 'queued' | ||
| CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')), | ||
| session_id TEXT, | ||
| worktree TEXT, | ||
| branch TEXT, | ||
| log_file TEXT, | ||
| retries INTEGER NOT NULL DEFAULT 0, | ||
| max_retries INTEGER NOT NULL DEFAULT 3, | ||
| model TEXT DEFAULT 'anthropic/claude-opus-4-6', | ||
| error TEXT, | ||
| pr_url TEXT, | ||
| issue_url TEXT, | ||
| diagnostic_of TEXT, | ||
| triage_result TEXT, | ||
| created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')), | ||
| started_at TEXT, | ||
| completed_at TEXT, | ||
| updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')) | ||
| ); | ||
| INSERT INTO tasks ($insert_cols) | ||
| SELECT $select_cols | ||
| FROM tasks_old; | ||
| DROP TABLE tasks_old; | ||
| CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status); | ||
| CREATE INDEX IF NOT EXISTS idx_tasks_repo ON tasks(repo); | ||
| CREATE INDEX IF NOT EXISTS idx_tasks_created ON tasks(created_at); | ||
| CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of); | ||
| COMMIT; | ||
| PRAGMA foreign_keys=ON; | ||
| MIGRATE | ||
|
|
||
| # Verify row counts after migration (t188) | ||
| if ! verify_migration_rowcounts "$SUPERVISOR_DB" "$t128_backup" "tasks"; then | ||
| log_error "t128.8 migration VERIFICATION FAILED — rolling back" | ||
| rollback_sqlite_db "$SUPERVISOR_DB" "$t128_backup" | ||
| return 1 | ||
| fi | ||
| log_success "Database schema migrated for post-PR lifecycle states (verified)" | ||
| fi | ||
|
|
||
    # Backup before ALTER TABLE migrations if any are needed (t162, t188)
    local needs_alter_migration=false
    local has_max_load has_release_on_complete has_diagnostic_of has_issue_url has_max_concurrency
    has_max_load=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='max_load_factor';" 2>/dev/null || echo "0")
    has_release_on_complete=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='release_on_complete';" 2>/dev/null || echo "0")
    has_diagnostic_of=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='diagnostic_of';" 2>/dev/null || echo "0")
    has_issue_url=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='issue_url';" 2>/dev/null || echo "0")
    has_max_concurrency=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='max_concurrency';" 2>/dev/null || echo "0")
    if [[ "$has_max_load" -eq 0 || "$has_release_on_complete" -eq 0 || "$has_diagnostic_of" -eq 0 || "$has_issue_url" -eq 0 || "$has_max_concurrency" -eq 0 ]]; then
        needs_alter_migration=true
    fi
    if [[ "$needs_alter_migration" == "true" ]]; then
        local alter_backup
        alter_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-alter-columns")
        if [[ $? -ne 0 || -z "$alter_backup" ]]; then
            log_warn "Backup failed for ALTER TABLE migrations, proceeding cautiously"
        fi
    fi

    # Migrate: add max_load_factor column to batches if missing (t135.15.4)
    if [[ "$has_max_load" -eq 0 ]]; then
        log_info "Migrating batches table: adding max_load_factor column (t135.15.4)..."
        if ! log_cmd "db-migrate" db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN max_load_factor INTEGER NOT NULL DEFAULT 2;"; then
            log_warn "Failed to add max_load_factor column (may already exist)"
        else
            log_success "Added max_load_factor column to batches"
        fi
    fi

    # Migrate: add max_concurrency column to batches if missing (adaptive scaling cap)
    if [[ "$has_max_concurrency" -eq 0 ]]; then
        log_info "Migrating batches table: adding max_concurrency column..."
        db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN max_concurrency INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
        log_success "Added max_concurrency column to batches (0 = auto-detect from cpu_cores)"
    fi

    # Migrate: add release_on_complete and release_type columns to batches if missing (t128.10)
    if [[ "$has_release_on_complete" -eq 0 ]]; then
        log_info "Migrating batches table: adding release columns (t128.10)..."
        db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN release_on_complete INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
        db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN release_type TEXT NOT NULL DEFAULT 'patch';" 2>/dev/null || true
        log_success "Added release_on_complete and release_type columns to batches"
    fi

    # Migrate: add diagnostic_of column to tasks if missing (t150)
    if [[ "$has_diagnostic_of" -eq 0 ]]; then
        log_info "Migrating tasks table: adding diagnostic_of column (t150)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN diagnostic_of TEXT;" 2>/dev/null || true
        db "$SUPERVISOR_DB" "CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);" 2>/dev/null || true
        log_success "Added diagnostic_of column to tasks"
    fi

    # Migrate: add issue_url column (t149)
    if [[ "$has_issue_url" -eq 0 ]]; then
        log_info "Migrating tasks table: adding issue_url column (t149)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN issue_url TEXT;" 2>/dev/null || true
        log_success "Added issue_url column to tasks"
    fi

    # Migrate: add triage_result column to tasks if missing (t148)
    local has_triage_result
    has_triage_result=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='triage_result';" 2>/dev/null || echo "0")
    if [[ "$has_triage_result" -eq 0 ]]; then
        log_info "Migrating tasks table: adding triage_result column (t148)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN triage_result TEXT;" 2>/dev/null || true
        log_success "Added triage_result column to tasks"
    fi

    # Migrate: add review_triage to CHECK constraint if missing (t148)
    local check_sql_t148
    check_sql_t148=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
    if [[ -n "$check_sql_t148" ]] && ! echo "$check_sql_t148" | grep -q 'review_triage'; then
        log_info "Migrating database schema for review_triage state (t148)..."

        # Backup before migration (t188: fail-safe — abort if backup fails)
        local t148_backup
        t148_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-t148")
        if [[ $? -ne 0 || -z "$t148_backup" ]]; then
            log_error "Backup failed for t148 migration — aborting"
            return 1
        fi

        db "$SUPERVISOR_DB" <<'MIGRATE_T148'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old_t148;
CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  repo TEXT NOT NULL,
  description TEXT,
  status TEXT NOT NULL DEFAULT 'queued'
    CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
  session_id TEXT,
  worktree TEXT,
  branch TEXT,
  log_file TEXT,
  retries INTEGER NOT NULL DEFAULT 0,
  max_retries INTEGER NOT NULL DEFAULT 3,
  model TEXT DEFAULT 'anthropic/claude-opus-4-6',
  error TEXT,
  pr_url TEXT,
  issue_url TEXT,
  diagnostic_of TEXT,
  triage_result TEXT,
  created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
  started_at TEXT,
  completed_at TEXT,
  updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
  log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
  created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
  log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
  created_at, started_at, completed_at, updated_at
FROM tasks_old_t148;
DROP TABLE tasks_old_t148;
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
CREATE INDEX IF NOT EXISTS idx_tasks_repo ON tasks(repo);
CREATE INDEX IF NOT EXISTS idx_tasks_created ON tasks(created_at);
CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
COMMIT;
PRAGMA foreign_keys=ON;
MIGRATE_T148

        # Verify row counts after migration (t188)
        if ! verify_migration_rowcounts "$SUPERVISOR_DB" "$t148_backup" "tasks"; then
            log_error "t148 migration VERIFICATION FAILED — rolling back"
            rollback_sqlite_db "$SUPERVISOR_DB" "$t148_backup"
            return 1
        fi
        log_success "Database schema migrated for review_triage state (verified)"
    fi

    # Migration: add verifying/verified/verify_failed states to CHECK constraint (t180)
    # Check if the current schema already supports verify states
    # NOTE: This migration originally used "INSERT INTO tasks SELECT * FROM tasks_old_t180"
    # which silently fails if column counts don't match. Fixed in t188 to use explicit
    # column lists and row-count verification with automatic rollback.
    local has_verify_states
    has_verify_states=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
    if [[ -n "$has_verify_states" ]] && ! echo "$has_verify_states" | grep -q "verifying"; then
        log_info "Migrating database schema for post-merge verification states (t180)..."

        # Backup before migration (t188: fail-safe — abort if backup fails)
        local t180_backup
        t180_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-t180")
        if [[ $? -ne 0 || -z "$t180_backup" ]]; then
            log_error "Backup failed for t180 migration — aborting"
            return 1
        fi

        db "$SUPERVISOR_DB" <<'MIGRATE_T180'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old_t180;
CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  repo TEXT NOT NULL,
  description TEXT,
  status TEXT NOT NULL DEFAULT 'queued'
    CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
  session_id TEXT,
  worktree TEXT,
  branch TEXT,
  log_file TEXT,
  retries INTEGER NOT NULL DEFAULT 0,
  max_retries INTEGER NOT NULL DEFAULT 3,
  model TEXT DEFAULT 'anthropic/claude-opus-4-6',
  error TEXT,
  pr_url TEXT,
  issue_url TEXT,
  diagnostic_of TEXT,
  triage_result TEXT,
  created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
  started_at TEXT,
  completed_at TEXT,
  updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
  log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
  triage_result, created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
  log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
  triage_result, created_at, started_at, completed_at, updated_at
FROM tasks_old_t180;
DROP TABLE tasks_old_t180;
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
COMMIT;
PRAGMA foreign_keys=ON;
MIGRATE_T180

        # Verify row counts after migration (t188)
        if ! verify_migration_rowcounts "$SUPERVISOR_DB" "$t180_backup" "tasks"; then
            log_error "t180 migration VERIFICATION FAILED — rolling back"
            rollback_sqlite_db "$SUPERVISOR_DB" "$t180_backup"
            return 1
        fi
        log_success "Database schema migrated for post-merge verification states"
    fi

    # Migrate: add escalation_depth and max_escalation columns to tasks (t132.6)
    local has_escalation_depth
    has_escalation_depth=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='escalation_depth';" 2>/dev/null || echo "0")
    if [[ "$has_escalation_depth" -eq 0 ]]; then
        log_info "Migrating tasks table: adding escalation columns (t132.6)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN escalation_depth INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN max_escalation INTEGER NOT NULL DEFAULT 2;" 2>/dev/null || true
        log_success "Added escalation_depth and max_escalation columns to tasks"
    fi

    # Migrate: add skip_quality_gate column to batches (t132.6)
    local has_skip_quality_gate
    has_skip_quality_gate=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='skip_quality_gate';" 2>/dev/null || echo "0")
    if [[ "$has_skip_quality_gate" -eq 0 ]]; then
        log_info "Migrating batches table: adding skip_quality_gate column (t132.6)..."
        db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN skip_quality_gate INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
        log_success "Added skip_quality_gate column to batches"
    fi

    # Migrate: add proof_logs table if missing (t218)
    local has_proof_logs
    has_proof_logs=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='proof_logs';" 2>/dev/null || echo "0")
    if [[ "$has_proof_logs" -eq 0 ]]; then
        log_info "Migrating database: adding proof_logs table (t218)..."
        db "$SUPERVISOR_DB" <<'MIGRATE_T218'
CREATE TABLE IF NOT EXISTS proof_logs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  task_id TEXT NOT NULL,
  event TEXT NOT NULL,
  stage TEXT,
  decision TEXT,
  evidence TEXT,
  decision_maker TEXT,
  pr_url TEXT,
  duration_secs INTEGER,
  metadata TEXT,
  timestamp TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
  FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_proof_logs_task ON proof_logs(task_id);
CREATE INDEX IF NOT EXISTS idx_proof_logs_event ON proof_logs(event);
CREATE INDEX IF NOT EXISTS idx_proof_logs_timestamp ON proof_logs(timestamp);
MIGRATE_T218
        log_success "Added proof_logs table (t218)"
    fi

    # Migrate: add deploying_recovery_attempts column to tasks (t263)
    local has_deploying_recovery
    has_deploying_recovery=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='deploying_recovery_attempts';" 2>/dev/null || echo "0")
    if [[ "$has_deploying_recovery" -eq 0 ]]; then
        log_info "Migrating tasks table: adding deploying_recovery_attempts column (t263)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN deploying_recovery_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
        log_success "Added deploying_recovery_attempts column to tasks (t263)"
    fi

    # Migrate: add rebase_attempts column to tasks (t298)
    local has_rebase_attempts
    has_rebase_attempts=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='rebase_attempts';" 2>/dev/null || echo "0")
    if [[ "$has_rebase_attempts" -eq 0 ]]; then
        log_info "Migrating tasks table: adding rebase_attempts column (t298)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN rebase_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
        log_success "Added rebase_attempts column to tasks (t298)"
    fi

    # Migrate: create contest tables if missing (t1011)
    local has_contests
    has_contests=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='contests';" 2>/dev/null || echo "0")
    if [[ "$has_contests" -eq 0 ]]; then
        log_info "Creating contest tables (t1011)..."
        local contest_helper="${SCRIPT_DIR}/contest-helper.sh"
        if [[ -x "$contest_helper" ]]; then
            "$contest_helper" help >/dev/null 2>&1 || true
            # contest-helper.sh ensure_contest_tables creates them on first use
            # but we can also create them here for immediate availability
            db "$SUPERVISOR_DB" <<'CONTEST_SQL'
CREATE TABLE IF NOT EXISTS contests (
  id TEXT PRIMARY KEY,
  task_id TEXT NOT NULL,
  description TEXT,
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK(status IN ('pending','dispatching','running','evaluating','scoring','complete','failed','cancelled')),
  winner_model TEXT,
  winner_entry_id TEXT,
  winner_score REAL,
  models TEXT NOT NULL,
  batch_id TEXT,
  repo TEXT,
  created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
  completed_at TEXT,
  metadata TEXT
);
CREATE TABLE IF NOT EXISTS contest_entries (
  id TEXT PRIMARY KEY,
  contest_id TEXT NOT NULL,
  model TEXT NOT NULL,
  task_id TEXT,
  worktree TEXT,
  branch TEXT,
  log_file TEXT,
  pr_url TEXT,
  status TEXT NOT NULL DEFAULT 'pending'
    CHECK(status IN ('pending','dispatched','running','complete','failed','cancelled')),
  output_summary TEXT,
  score_correctness REAL DEFAULT 0,
  score_completeness REAL DEFAULT 0,
  score_code_quality REAL DEFAULT 0,
  score_clarity REAL DEFAULT 0,
  weighted_score REAL DEFAULT 0,
  cross_rank_scores TEXT,
  created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
  completed_at TEXT,
  FOREIGN KEY (contest_id) REFERENCES contests(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_contests_task ON contests(task_id);
CREATE INDEX IF NOT EXISTS idx_contests_status ON contests(status);
CREATE INDEX IF NOT EXISTS idx_contest_entries_contest ON contest_entries(contest_id);
CREATE INDEX IF NOT EXISTS idx_contest_entries_status ON contest_entries(status);
CONTEST_SQL
            log_success "Created contest tables (t1011)"
        fi
    fi

    # Migrate: add last_main_sha column to tasks (t1029)
    local has_last_main_sha
    has_last_main_sha=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='last_main_sha';" 2>/dev/null || echo "0")
    if [[ "$has_last_main_sha" -eq 0 ]]; then
        log_info "Migrating tasks table: adding last_main_sha column (t1029)..."
        db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN last_main_sha TEXT;" 2>/dev/null || true
        log_success "Added last_main_sha column to tasks (t1029)"
    fi

    # Ensure WAL mode for existing databases created before t135.3
    local current_mode
    current_mode=$(db "$SUPERVISOR_DB" "PRAGMA journal_mode;" 2>/dev/null || echo "")
    if [[ "$current_mode" != "wal" ]]; then
        log_cmd "db-wal" db "$SUPERVISOR_DB" "PRAGMA journal_mode=WAL;" || log_warn "Failed to enable WAL mode"
    fi

    return 0
}
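The t148/t180 migrations above all follow the same backup → rebuild → verify → rollback shape. A self-contained sketch of that pattern, where plain `cp` and `count(*)` comparisons stand in for the PR's `backup_sqlite_db`, `verify_migration_rowcounts`, and `rollback_sqlite_db` helpers:

```shell
# Sketch only: cp/count steps are illustrative stand-ins for the real helpers.
DB=$(mktemp)
sqlite3 "$DB" "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT);
               INSERT INTO tasks VALUES ('t1','queued'),('t2','running');"

BACKUP="${DB}.bak"
cp "$DB" "$BACKUP"                         # backup before a destructive rebuild

# SQLite cannot edit a CHECK constraint in place, hence rename/recreate/copy.
sqlite3 "$DB" <<'SQL'
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old;
CREATE TABLE tasks (
  id TEXT PRIMARY KEY,
  status TEXT CHECK(status IN ('queued','running','review_triage'))
);
INSERT INTO tasks (id, status) SELECT id, status FROM tasks_old;
DROP TABLE tasks_old;
COMMIT;
SQL

# Verify row counts survived; restore the backup on any mismatch.
live=$(sqlite3 "$DB" "SELECT count(*) FROM tasks;")
old=$(sqlite3 "$BACKUP" "SELECT count(*) FROM tasks;")
if [ "$live" -ne "$old" ]; then
    cp "$BACKUP" "$DB"
    echo "rolled back"
else
    echo "verified: $live rows"
fi
rm -f "$DB" "$BACKUP"
```

The explicit column lists in the `INSERT ... SELECT` matter: as the t180 NOTE in the diff points out, a bare `INSERT INTO tasks SELECT *` silently misbehaves when the old and new column counts differ.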
The `ensure_db` function is over 440 lines long and handles numerous database migrations. For improved readability and maintainability, consider refactoring it into smaller, single-purpose functions, one for each migration (e.g., `_migrate_t128_8`, `_migrate_add_max_load_factor`). The main `ensure_db` function would then become a more manageable sequence of calls to these migration helpers.
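A hypothetical sketch of the suggested refactor: each migration becomes one single-purpose, idempotent helper, and `ensure_db` reduces to a sequence of calls. The `db` and `log_info` stubs stand in for the real helpers in `_common.sh`, and all names here are illustrative, not from the PR:

```shell
db() { sqlite3 "$@"; }              # stub: real helper adds escaping/locking
log_info() { echo "INFO: $*"; }

SUPERVISOR_DB=$(mktemp)
db "$SUPERVISOR_DB" "CREATE TABLE batches (id TEXT PRIMARY KEY);"

# One helper per migration: check the precondition, apply, report.
_migrate_add_max_load_factor() {
    local has_col
    has_col=$(db "$SUPERVISOR_DB" \
        "SELECT count(*) FROM pragma_table_info('batches') WHERE name='max_load_factor';")
    [[ "$has_col" -eq 0 ]] || return 0    # already applied: no-op on re-run
    log_info "Adding batches.max_load_factor"
    db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN max_load_factor INTEGER NOT NULL DEFAULT 2;"
}

ensure_db() {
    _migrate_add_max_load_factor || return 1
    # _migrate_t148, _migrate_t180, ... would follow in the same order as today
}

ensure_db && ensure_db    # second call exercises the idempotence guard
col_count=$(db "$SUPERVISOR_DB" \
    "SELECT count(*) FROM pragma_table_info('batches') WHERE name='max_load_factor';")
echo "max_load_factor columns: $col_count"
rm -f "$SUPERVISOR_DB"
```

Since every existing migration already guards itself with a `pragma_table_info` or `sqlite_master` probe, the extraction is mostly mechanical.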
Summary

Modularize supervisor-helper.sh — move 145 functions from the 16K-line monolith into 13 existing stub modules under `.agents/scripts/supervisor/`.

Changes

- Monolith now retains only `main()`, `show_usage()`, `cmd_contest()`, the case statement router, and constants
- Shared helpers (`log_*`, `db()`, `sql_escape()`, `log_cmd()`) remain in `_common.sh`
- Constants (`VALID_STATES`, `VALID_TRANSITIONS`, etc.) remain in monolith header — modules access them at call time via bash late binding
- Modules with existing real implementations preserved: `todo-sync.sh` (1117 lines), `release.sh` (289 lines)
- `cron.sh`

Module breakdown
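The "bash late binding" the changes rely on can be shown in a two-file demo: a function in a sourced module references `VALID_STATES`, which the caller only defines after sourcing, and the lookup still succeeds because variable resolution happens at call time. File names and the states value are illustrative:

```shell
module=$(mktemp)
cat > "$module" <<'EOF'
list_states() { echo "$VALID_STATES"; }    # variable not defined at source time
EOF

# shellcheck disable=SC1090
source "$module"                           # defines the function; no error
VALID_STATES="queued running complete"     # set afterwards, as in the monolith header
states=$(list_states)                      # resolved now, at call time
echo "$states"
rm -f "$module"
```

This is why the constants can stay in the monolith header while the functions that read them move into modules sourced earlier in the file.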
Verification

- `bash -n` syntax check passes on all 14 modified files
- Verified commands: `help`, `status`, `list`, `cron status`, `proof-log --recent`

Ref #1357
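The `bash -n` check mentioned above parses a script without executing it. A minimal sketch of such a gate, with temp files standing in for the 14 modified modules:

```shell
dir=$(mktemp -d)
printf 'echo ok\n' > "$dir/good.sh"
printf 'if true; then\n' > "$dir/bad.sh"   # unterminated if: parse error

failures=0
for f in "$dir"/*.sh; do
    if ! bash -n "$f" 2>/dev/null; then    # -n: read commands, run nothing
        echo "syntax error: ${f##*/}"
        failures=$((failures + 1))
    fi
done
echo "failures=$failures"
rm -rf "$dir"
```

Because `-n` never executes anything, this is safe to run on scripts with side effects, which is exactly why it suits a pre-merge gate.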