t1031: Modularize supervisor-helper.sh into domain modules#1359

Merged
marcusquinn merged 2 commits into main from feature/t1031
Feb 13, 2026

Conversation


@marcusquinn marcusquinn commented Feb 13, 2026

Summary

Modularize supervisor-helper.sh — move 145 functions from 16K-line monolith into 13 existing stub modules under .agents/scripts/supervisor/.

Changes

  • Extracted 145 functions from the monolith into their domain-specific modules
  • Monolith reduced from 16,290 lines to ~680 lines (95% reduction)
  • Monolith retains only: main(), show_usage(), cmd_contest(), case statement router, and constants
  • Common functions (log_*, db(), sql_escape(), log_cmd()) remain in _common.sh
  • Constants (VALID_STATES, VALID_TRANSITIONS, etc.) remain in monolith header — modules access them at call time via bash late binding
  • Preserved already-populated modules: todo-sync.sh (1117 lines), release.sh (289 lines)
  • Fixed SC2155 ShellCheck warning in cron.sh
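
The late-binding point above is worth illustrating: because bash resolves variable references when a function runs, not when it is defined, functions moved into sourced modules can still read constants defined in the monolith header. A minimal sketch (with hypothetical names, not the PR's actual functions):

```shell
#!/usr/bin/env bash
# Sketch of the late-binding pattern described above: a constant defined in
# the parent script is visible to a function defined in a sourced module,
# because bash resolves $VALID_STATES at call time, not definition time.
set -euo pipefail

# --- stand-in for the monolith header ---
VALID_STATES="queued dispatched evaluating complete"

# --- stand-in for a function that lives in a sourced module (e.g. state.sh) ---
is_valid_state() {
  # $VALID_STATES is looked up here, when the function runs, so it sees
  # whatever value the sourcing script has defined.
  local state="$1"
  [[ " $VALID_STATES " == *" $state "* ]]
}

is_valid_state "queued" && echo "queued: valid"
is_valid_state "bogus" || echo "bogus: invalid"
```

The same mechanism means source order only matters for function definitions, not for the constants those functions read.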

Module breakdown

Module                  Functions  Description
database.sh                     9  DB init, migrate, backup, restore
state.sh                        9  State transitions, status, list, reset, cancel
batch.sh                        2  Task add, batch create
dispatch.sh                    16  Dispatch, model resolution, quality gates
evaluate.sh                    10  Worker evaluation, PR discovery, AI eval
pulse.sh                        3  Pulse cycle, post-PR lifecycle
cleanup.sh                      8  Worktree mgmt, worker cleanup
cron.sh                         4  Cron scheduling, auto-pickup, watch
deploy.sh                      21  PR lifecycle, merge, deploy, verify
self-heal.sh                    6  Diagnostics, model escalation
utility.sh                     22  Proof-logs, monitoring, dashboard, notifications
issue-sync.sh                  20  GitHub issues, labels, claiming, staleness
memory-integration.sh           7  Memory recall, patterns, retrospectives

Verification

  • bash -n syntax check passes on all 14 modified files
  • ShellCheck passes (only pre-existing SC2034 unused-variable warnings from IFS read destructuring)
  • Functional tests pass: help, status, list, cron status, proof-log --recent
  • All commands route correctly through the case statement to module functions
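
The syntax-check step above is easy to reproduce. The sketch below is self-contained (a temp directory stands in for .agents/scripts/supervisor/ so it runs anywhere); in the real repo you would point the loop at the module directory and add `shellcheck -x` alongside it:

```shell
#!/usr/bin/env bash
# Self-contained sketch of the `bash -n` verification pass: syntax-check
# every module in a directory and fail fast on the first error. The temp
# directory and sample module are illustrative stand-ins.
set -euo pipefail

tmpdir="$(mktemp -d)"
trap 'rm -rf "$tmpdir"' EXIT

# Stand-in module with valid syntax.
printf 'cmd_status() { echo "ok"; }\n' > "$tmpdir/state.sh"

check_syntax() {
  local dir="$1" f
  for f in "$dir"/*.sh; do
    bash -n "$f" || { echo "syntax error in $f" >&2; return 1; }
  done
  echo "all modules pass bash -n"
}

check_syntax "$tmpdir"
```

For the static-analysis step, `shellcheck -x` is useful here because it follows `source` statements, so functions are checked in the context of the files that load them.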

Ref #1357

Summary by CodeRabbit

  • New Features

    • Added comprehensive task queuing and batch management system with concurrent execution controls
    • Implemented automated GitHub issue synchronization with model-aware labeling and queue health monitoring
    • Added intelligent task dispatch with dynamic model selection, quality gates, and health checks
    • Introduced automatic worker recovery and self-healing with diagnostic subtasks
    • Added scheduled task discovery and cron-based automation with real-time file watching
    • Implemented complete PR lifecycle management with review triage and automated merging
    • Added worker outcome evaluation with AI-assisted decision making and PR creation
    • Introduced proof-logging, system monitoring, and task notification capabilities
  • Chores

    • Replaced placeholder stubs with fully-functional supervisor infrastructure across multiple components

…to 13 modules (t1031)

Moved real implementations from 16K-line monolith into existing stub modules
under .agents/scripts/supervisor/. The monolith now retains only: main(),
show_usage(), cmd_contest(), and the case statement router (~680 lines).

Module mapping:
- database.sh: DB init, migrate, backup, restore (9 functions)
- state.sh: state transitions, status, list, reset, cancel (9 functions)
- batch.sh: task add, batch create (2 functions)
- dispatch.sh: dispatch, model resolution, quality gates (16 functions)
- evaluate.sh: worker evaluation, PR discovery, AI eval (10 functions)
- pulse.sh: pulse cycle, post-PR lifecycle (3 functions)
- cleanup.sh: worktree mgmt, worker cleanup (8 functions)
- cron.sh: cron scheduling, auto-pickup, watch (4 functions)
- deploy.sh: PR lifecycle, merge, deploy, verify (21 functions)
- self-heal.sh: diagnostics, model escalation (6 functions)
- utility.sh: proof-logs, monitoring, dashboard, notifications (22 functions)
- issue-sync.sh: GitHub issues, labels, claiming, staleness (20 functions)
- memory-integration.sh: memory recall, patterns, retrospectives (7 functions)

Preserved modules already containing real implementations:
- todo-sync.sh (1117 lines, 12 functions)
- release.sh (289 lines, 2 functions)

Log/db/sql_escape functions remain in _common.sh (sourced first).
Constants (VALID_STATES, VALID_TRANSITIONS, etc.) remain in monolith
header — modules access them at call time via bash late binding.
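
The router structure the commit message describes can be sketched as follows. File paths, sourcing order, and command names here are illustrative stand-ins, not the PR's actual code; the commented-out `source` lines show where the real modules would be loaded:

```shell
#!/usr/bin/env bash
# Sketch of the slimmed monolith: source _common.sh first, then the domain
# modules, and keep only a case-statement router that forwards arguments to
# functions the modules define.
set -euo pipefail

# SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# source "$SCRIPT_DIR/supervisor/_common.sh"   # log_*, db(), sql_escape()
# for mod in database state batch dispatch; do
#   source "$SCRIPT_DIR/supervisor/$mod.sh"
# done

# Inline stand-ins so the sketch runs without the real modules:
cmd_status() { echo "status: all quiet"; }
cmd_dispatch() { echo "dispatching task: ${1:-none}"; }

main() {
  local cmd="${1:-help}"
  shift || true
  case "$cmd" in
    status)   cmd_status "$@" ;;
    dispatch) cmd_dispatch "$@" ;;
    *)        echo "usage: supervisor-helper.sh {status|dispatch}" >&2; return 2 ;;
  esac
}

main status      # prints "status: all quiet"
main dispatch t123
```

Because the router only names functions, swapping an implementation between modules never touches the monolith as long as the function name is stable.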
@marcusquinn marcusquinn marked this pull request as ready for review February 13, 2026 13:56

coderabbitai bot commented Feb 13, 2026

Walkthrough

This PR implements a comprehensive supervisor system across 13 shell script modules, transforming placeholder stubs into fully functional automation for task management, worker dispatch, GitHub integration, state transitions, and multi-phase orchestration. The implementation includes database management, proof logging, quality gates, model escalation, and batch processing workflows.

Changes

Cohort / File(s) Summary
Batch & Task Management
.agents/scripts/supervisor/batch.sh
Added cmd_add and cmd_batch functions replacing four placeholder stubs; implements task creation with backward-compatible GH issue handling, description/model derivation from TODO.md, and batch creation with auto-task validation and position linking.
State Machine & Lifecycle
.agents/scripts/supervisor/state.sh
Replaced minimal stubs with full state machine: added cmd_transition, cmd_status, cmd_list, cmd_reset, cmd_cancel, cmd_next, cmd_running_count with option parsing, transition validation, batch-aware logic, and side effects (proof logs, GitHub syncs, VERIFY.md generation).
Database & Persistence
.agents/scripts/supervisor/database.sh
Implemented SQLite-backed storage: added backup_db, safe_migrate, restore_db, ensure_db, init_db with schema initialization, multi-step migrations with per-step backups, WAL mode support, and command wrappers (cmd_init, cmd_backup, cmd_restore, cmd_db).
Worker Dispatch & Routing
.agents/scripts/supervisor/dispatch.sh
Comprehensive dispatch orchestration: added 16 functions including cmd_dispatch, model resolution (resolve_model, resolve_task_model, classify_task_complexity), health checks (check_model_health), quality gates (run_quality_gate, check_output_quality), and worker config generation; supports contest mode and multi-tier escalation.
Worker Evaluation & PR Creation
.agents/scripts/supervisor/evaluate.sh
Implemented worker outcome analysis: added cmd_evaluate, evaluate_worker, evaluate_with_ai with AI-driven verdict logic; PR handling via auto_create_pr_for_task, discover_pr_by_branch, link_pr_to_task, validate_pr_belongs_to_task; log extraction helpers and metadata parsing.
Cleanup & Worktree Management
.agents/scripts/supervisor/cleanup.sh
Full worktree lifecycle: added create_task_worktree, cleanup_task_worktree, cleanup_worker_processes with ownership checks, fallback removal strategies, and process tree termination; added cmd_cleanup, cmd_kill_workers with dry-run and stale-PID handling.
PR Lifecycle & Deployment
.agents/scripts/supervisor/deploy.sh
Extensive PR automation: added 21 functions for end-to-end workflows including cmd_pr_lifecycle, check_pr_status, merge_task_pr, cmd_pr_merge; review triage via check_review_threads, triage_review_feedback, dispatch_review_fix_worker; orphaned PR scanning, conflict resolution, and sibling rebasing; verification via verify_task_deliverables, cmd_verify with VERIFY.md checks.
GitHub Issue Synchronization
.agents/scripts/supervisor/issue-sync.sh
Model-centric GitHub integration: added ensure_status_labels, add_model_label, cmd_labels, sync_issue_status_label, update_queue_health_issue with label-driven state tracking; added staleness detection (check_task_staleness, handle_stale_task), claiming (sync_claim_to_github), blocked notifications, and assignee management.
Cron, Scheduling & Decomposition
.agents/scripts/supervisor/cron.sh
Added cron-based pulse system: cmd_cron (install/uninstall/status), cmd_watch for TODO.md file-watching with fswatch, cmd_auto_pickup with multiple discovery strategies (#auto-dispatch, Dispatch Queue, #plan tasks), and dispatch_decomposition_worker for AI-driven plan decomposition with PID throttling and async worker management.
Memory & Retrospectives
.agents/scripts/supervisor/memory-integration.sh
Modular memory system: added recall_task_memories, store_failure_pattern, store_success_pattern with model-tier awareness and rate-limiting; run_batch_retrospective, run_session_review for post-batch analysis; cmd_retrospective, cmd_recall CLI entry points for memory-driven decisions.
Self-Healing & Diagnostics
.agents/scripts/supervisor/self-heal.sh
Automatic recovery workflows: added is_self_heal_eligible, create_diagnostic_subtask, attempt_self_heal, escalate_model_on_failure, handle_diagnostic_completion, cmd_self_heal; enables diagnostic subtask creation, model escalation, and parent re-queueing on diagnostic completion.
Main Orchestration Loop
.agents/scripts/supervisor/pulse.sh
Unified pulse orchestration: centralized into cmd_pulse with options parsing; implements 12-phase workflow including auto-pickup, deduplication, per-task evaluation, batch dispatch with concurrency guards, post-PR lifecycle, worker health checks, cleanup, memory audits, CodeRabbit interactions, and MODELS.md regeneration; added process_post_pr_lifecycle, extract_parent_id.
Utilities & Monitoring
.agents/scripts/supervisor/utility.sh
System observability & operations: added 19+ functions for proof-logging (write_proof_log, _proof_log_stage_duration, cmd_proof_log), pulse locking (acquire_pulse_lock, release_pulse_lock), system monitoring (check_system_load, get_process_footprint_mb, check_supervisor_memory), dashboard (cmd_dashboard), queue health (cmd_queue_health), respawn management (attempt_respawn_after_batch, cmd_respawn_history), and notifications (send_task_notification, cmd_notify).
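
The pulse locking mentioned above (acquire_pulse_lock / release_pulse_lock) exists to keep concurrent pulse cycles from overlapping. One portable way to implement such a lock is an atomic `mkdir`; this is a generic sketch of the technique, not necessarily how utility.sh implements it:

```shell
#!/usr/bin/env bash
# Generic mkdir-based lock sketch: mkdir is atomic, so exactly one process
# can create the lock directory; a second caller fails until it is released.
# Names and lock path are illustrative.
set -euo pipefail

LOCK_DIR="${TMPDIR:-/tmp}/supervisor-pulse.lock.$$"

acquire_pulse_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "$$" > "$LOCK_DIR/pid"   # record owner for stale-lock diagnosis
    return 0
  fi
  echo "pulse already running (lock held)" >&2
  return 1
}

release_pulse_lock() {
  rm -rf "$LOCK_DIR"
}

acquire_pulse_lock && echo "lock acquired"
release_pulse_lock
```

Compared with `flock`, the mkdir approach works on any POSIX filesystem, at the cost of needing explicit stale-lock handling (e.g. checking whether the recorded PID is still alive).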

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Pulse as Pulse Orchestrator<br/>(pulse.sh)
    participant Batch as Batch/State<br/>(batch.sh, state.sh)
    participant DB as SQLite DB<br/>(database.sh)
    participant Dispatch as Dispatcher<br/>(dispatch.sh)
    participant Worker as Worker Process<br/>(subprocess)
    participant Evaluate as Evaluator<br/>(evaluate.sh)
    participant GitHub as GitHub API<br/>(issue-sync.sh)

    User->>Pulse: cmd_pulse [--batch]
    activate Pulse
    Pulse->>DB: ensure_db() init schema
    activate DB
    DB-->>Pulse: ready
    deactivate DB
    Pulse->>Batch: cmd_auto_pickup() discover tasks
    activate Batch
    Batch->>DB: insert tasks
    DB-->>Batch: task_ids
    Batch-->>Pulse: queued tasks
    deactivate Batch
    Pulse->>Batch: Phase 1: evaluate existing
    Pulse->>Dispatch: Phase 2: dispatch queued tasks
    activate Dispatch
    Dispatch->>Dispatch: resolve_task_model()
    Dispatch->>Dispatch: check_model_health()
    Dispatch->>Worker: spawn worker subprocess
    activate Worker
    Worker-->>Dispatch: worker_pid
    deactivate Worker
    deactivate Dispatch
    Pulse->>Pulse: Phase 3-4: post-PR lifecycle
    loop per completed task
        Pulse->>Evaluate: evaluate_worker(task_id)
        activate Evaluate
        Evaluate->>Worker: check worker logs
        alt worker success
            Evaluate->>Evaluate: discover/create PR
            Evaluate->>GitHub: link_pr_to_task()
            Evaluate->>DB: transition to evaluating→complete
        else worker needs retry
            Evaluate->>DB: transition to retry
        end
        Evaluate-->>Pulse: outcome
        deactivate Evaluate
    end
    Pulse->>GitHub: Phase 6: sync issues, labels
    Pulse->>DB: Phase 5: summary metrics
    Pulse-->>User: pulse complete
    deactivate Pulse
sequenceDiagram
    participant Task as Task (queued)
    participant Dispatch as cmd_dispatch<br/>(dispatch.sh)
    participant Worker as Worker Process
    participant Evaluate as cmd_evaluate<br/>(evaluate.sh)
    participant PR as PR Lifecycle<br/>(deploy.sh)
    participant DB as State Machine<br/>(state.sh)

    Task->>Dispatch: dispatch triggered
    activate Dispatch
    Dispatch->>Dispatch: resolve_task_model() select AI model
    Dispatch->>Dispatch: check_model_health() verify availability
    Dispatch->>Dispatch: create_task_worktree() setup workspace
    Dispatch->>Dispatch: build_dispatch_cmd() construct worker cmd
    Dispatch->>Worker: spawn worker (nohup/disown)
    activate Worker
    Dispatch->>DB: transition dispatched
    DB->>Task: state=dispatched
    deactivate Dispatch
    Worker->>Worker: run task (AI inference)
    Worker->>Worker: write logs, push branch
    Worker-->>Worker: exit with status
    deactivate Worker
    Evaluate->>DB: evaluate task
    activate Evaluate
    Evaluate->>Evaluate: extract_log_metadata() parse logs
    alt auto create PR needed
        Evaluate->>Evaluate: auto_create_pr_for_task()
        Evaluate->>DB: link_pr_to_task() record PR
    end
    Evaluate->>DB: transition evaluating→complete
    deactivate Evaluate
    PR->>DB: check_pr_status() fetch PR
    activate PR
    PR->>PR: check_review_threads() triage feedback
    alt needs fixes
        PR->>Worker: dispatch_review_fix_worker()
    else ready to merge
        PR->>PR: merge_task_pr()
        PR->>PR: run_postflight_for_task()
        PR->>PR: run_deploy_for_task()
    end
    PR->>DB: transition deployed
    deactivate PR
    DB-->>Task: state=deployed/complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Poem

🤖 From stubs to strength, the supervisor comes alive,
Thirteen scripts dance in orchestrated jive—
Dispatch, evaluate, merge, and deploy,
State machines and memory with GitHub's joy! ✨
Where workers thrive and batches survive,
Zero debt, A-grade quality—automation thrives. 🎯

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check — Passed: check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — Passed: the title clearly and specifically summarizes the main change: modularizing a large monolithic supervisor-helper.sh script into 13 domain-specific modules.
  • Docstring Coverage — Passed: docstring coverage is 98.62%, above the required threshold of 80.00%.
  • Merge Conflict Detection — Passed: no merge conflicts detected when merging into main.



@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 0 code smells

[INFO] Recent monitoring activity:
Fri Feb 13 13:57:13 UTC 2026: Code review monitoring started
Fri Feb 13 13:57:14 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 0

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 0
  • VULNERABILITIES: 0

Generated on: Fri Feb 13 13:57:16 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural overhaul by modularizing the core supervisor script into dedicated functional units. This refactoring aims to enhance the system's scalability, maintainability, and extensibility by distributing responsibilities across specialized modules. The changes enable more intelligent task processing, robust error handling, and comprehensive lifecycle management for automated development workflows.

Highlights

  • Modularization of supervisor-helper.sh: The monolithic supervisor-helper.sh script has been refactored into 13 domain-specific modules under .agents/scripts/supervisor/, significantly improving code organization and maintainability.
  • Function Extraction and Code Reduction: 145 functions were extracted from the original script, reducing its size from 16,290 lines to approximately 680 lines, representing a 95% reduction in the core monolith.
  • Core Monolith Responsibilities: The main supervisor-helper.sh now only retains essential components: main(), show_usage(), cmd_contest(), a case statement router for dispatching to modules, and global constants.
  • Enhanced Task and Batch Management: New functionalities include cmd_add for comprehensive task creation with model auto-upgrading, and cmd_batch for creating and managing task batches with configurable concurrency and release options.
  • Robust Cleanup and Worktree Management: Improved worktree lifecycle management with functions for creating, cleaning up, and verifying task-specific Git worktrees, alongside enhanced worker process and PID file cleanup.
  • Advanced Cron and Auto-Pickup Capabilities: The cron module now supports installing/uninstalling pulse schedules, watching TODO.md for real-time task pickup, and a cmd_auto_pickup function to automatically add tasks based on tags or 'Dispatch Queue' sections, including auto-batching and decomposition worker dispatch for #plan tasks.
  • Comprehensive Database Management: The database module provides robust functions for schema initialization, migrations with backup/rollback, and direct SQLite access, ensuring data integrity and upgradeability.
  • Full PR Lifecycle Automation: The deploy module introduces a complete post-PR lifecycle, including CI status checks, AI-assisted review triage, automated merging with bot review dismissal, post-merge deployment, and verification checks.
  • Intelligent Task Dispatch and Model Resolution: The dispatch module now features dynamic model resolution based on task complexity, subagent frontmatter, and pattern-tracker recommendations, alongside pre-dispatch model health checks and worker-specific MCP configurations.
  • Sophisticated Worker Outcome Evaluation: The evaluation process is now multi-tiered, combining deterministic signal detection, heuristic error pattern matching, Git-based work detection, and AI-assisted evaluation for ambiguous outcomes, including auto-creation of PRs for orphaned branches.
  • GitHub Issue and Label Synchronization: The issue-sync module provides granular control over GitHub issues, including status label synchronization, task claiming/unclaiming, and a live queue health dashboard pinned to GitHub issues.
  • Memory Integration and Self-Healing: The memory module enables storing success/failure patterns and batch retrospectives, while the self-heal module introduces automatic diagnostic subtask creation and model escalation for persistent failures.
  • Enhanced Supervisor State and Utility Functions: The state module offers validated state transitions, detailed status views, and task/batch management. The utility module includes proof-logging, system monitoring, adaptive concurrency, and a live TUI dashboard.
Changelog
  • .agents/scripts/supervisor/batch.sh
    • Removed placeholder functions for batch management.
    • Added cmd_add for task creation with advanced options and model intelligence.
    • Added cmd_batch for batch creation and configuration.
  • .agents/scripts/supervisor/cleanup.sh
    • Removed placeholder cleanup functions.
    • Implemented create_task_worktree for dynamic worktree provisioning.
    • Implemented cleanup_task_worktree for robust worktree removal with ownership checks.
    • Added process management utilities _kill_descendants and _list_descendants.
    • Added cmd_kill_workers for orphaned process termination.
    • Added cmd_cleanup for comprehensive system hygiene.
    • Implemented cleanup_after_merge for post-merge housekeeping.
  • .agents/scripts/supervisor/cron.sh
    • Removed placeholder cron functions.
    • Implemented cmd_cron for managing scheduled pulse cycles.
    • Added cmd_watch for real-time TODO.md monitoring.
    • Implemented cmd_auto_pickup for automated task ingestion and batching.
    • Added dispatch_decomposition_worker for dynamic subtask generation.
  • .agents/scripts/supervisor/database.sh
    • Removed placeholder database functions.
    • Implemented backup_db for database backups.
    • Added safe_migrate for robust schema migrations with rollback capabilities.
    • Implemented restore_db for database restoration.
    • Added ensure_db for schema initialization and migration orchestration.
    • Implemented init_db to define the core database schema.
    • Added cmd_init, cmd_backup, cmd_restore, and cmd_db for direct database interaction.
  • .agents/scripts/supervisor/deploy.sh
    • Removed placeholder deployment functions.
    • Implemented run_postflight_for_task for post-merge verification.
    • Added run_deploy_for_task for intelligent deployment strategies.
    • Implemented record_lifecycle_timing for PR pipeline latency analysis.
    • Added cmd_pr_lifecycle for end-to-end PR lifecycle automation.
    • Implemented check_review_threads for GitHub PR review analysis.
    • Added triage_review_feedback for automated review severity classification.
    • Implemented dispatch_review_fix_worker for AI-driven review feedback resolution.
    • Added dismiss_bot_reviews for automated bot review management.
    • Implemented check_pr_status for comprehensive PR status checks.
    • Added scan_orphaned_prs and scan_orphaned_pr_for_task for PR discovery.
    • Implemented cmd_pr_check for manual PR status inspection.
    • Added get_sibling_tasks for subtask relationship management.
    • Implemented resolve_rebase_conflicts for AI-assisted rebase conflict resolution.
    • Added rebase_sibling_pr and rebase_sibling_prs_after_merge for automated branch rebasing.
    • Implemented merge_task_pr for automated PR merging.
    • Added cmd_pr_merge for manual PR merge initiation.
    • Implemented verify_task_deliverables for validating task outcomes.
    • Added run_verify_checks for executing verification routines.
    • Implemented cmd_verify for manual task verification.
  • .agents/scripts/supervisor/dispatch.sh
    • Removed placeholder dispatch functions.
    • Implemented detect_dispatch_mode for environment-aware dispatch.
    • Added resolve_ai_cli for dynamic AI CLI selection.
    • Implemented resolve_model for intelligent model selection based on tiers.
    • Added resolve_model_from_frontmatter for subagent model configuration.
    • Implemented classify_task_complexity for automated task routing.
    • Added resolve_task_model for comprehensive task model determination.
    • Implemented get_next_tier for model escalation paths.
    • Added check_output_quality for heuristic worker output analysis.
    • Implemented run_quality_gate for automated quality assurance and escalation.
    • Added check_model_health for pre-dispatch model availability checks.
    • Implemented generate_worker_mcp_config for optimized worker configurations.
    • Added build_dispatch_cmd for constructing worker commands.
    • Implemented build_verify_dispatch_cmd for lightweight verification prompts.
    • Added cmd_dispatch for orchestrating task dispatch.
    • Implemented cmd_worker_status for monitoring worker processes.
    • Added cmd_reprompt for re-initiating worker sessions.
  • .agents/scripts/supervisor/evaluate.sh
    • Removed placeholder evaluation functions.
    • Implemented extract_log_tail for concise log sampling.
    • Added extract_log_metadata for structured log data extraction.
    • Implemented validate_pr_belongs_to_task for PR attribution integrity.
    • Added parse_pr_url for standardized PR URL parsing.
    • Implemented discover_pr_by_branch for branch-based PR discovery.
    • Added auto_create_pr_for_task for automated PR creation.
    • Implemented link_pr_to_task for centralized PR linking.
    • Added evaluate_worker for multi-tiered worker outcome assessment.
    • Implemented evaluate_with_ai for AI-driven outcome evaluation.
    • Added cmd_evaluate for manual task evaluation.
  • .agents/scripts/supervisor/issue-sync.sh
    • Removed placeholder issue synchronization functions.
    • Implemented ensure_status_labels for GitHub label management.
    • Added model_to_tier for model tier mapping.
    • Implemented add_model_label for tracking model usage on issues.
    • Added cmd_labels for querying model usage labels.
    • Implemented state_to_status_label for state-to-label conversion.
    • Added sync_issue_status_label for GitHub issue status synchronization.
    • Implemented update_queue_health_issue for live queue health dashboards.
    • Added find_task_issue_number for GitHub issue lookup.
    • Implemented get_aidevops_identity for user identification.
    • Added get_task_assignee for TODO.md assignee extraction.
    • Implemented cmd_claim for task claiming.
    • Added cmd_unclaim for task unclaiming.
    • Implemented check_task_already_done for pre-dispatch completion verification.
    • Added was_previously_worked for detecting prior task attempts.
    • Implemented check_task_staleness for identifying outdated tasks.
    • Added handle_stale_task for automated staleness actions.
    • Implemented check_task_claimed for task claim status.
    • Added sync_claim_to_github for GitHub claim synchronization.
    • Implemented create_github_issue for automated GitHub issue creation.
    • Added post_blocked_comment_to_github for automated issue commenting.
  • .agents/scripts/supervisor/memory-integration.sh
    • Removed placeholder memory integration functions.
    • Implemented recall_task_memories for retrieving relevant task memories.
    • Added store_failure_pattern for recording failure patterns.
    • Implemented store_success_pattern for recording successful outcomes.
    • Added run_batch_retrospective for post-batch analysis.
    • Implemented run_session_review for capturing and distilling session learnings.
    • Added cmd_retrospective for manual batch retrospective initiation.
    • Implemented cmd_recall for manual memory retrieval.
  • .agents/scripts/supervisor/pulse.sh
    • Removed placeholder pulse cycle functions.
    • Implemented cmd_pulse as the core supervisor pulse orchestration logic.
    • Added process_post_pr_lifecycle for managing post-PR stages.
    • Implemented extract_parent_id for hierarchical task management.
  • .agents/scripts/supervisor/self-heal.sh
    • Removed placeholder self-healing functions.
    • Implemented is_self_heal_eligible for diagnostic task eligibility.
    • Added create_diagnostic_subtask for automated diagnostic task creation.
    • Implemented attempt_self_heal for triggering self-healing processes.
    • Added escalate_model_on_failure for dynamic model tier escalation.
    • Implemented handle_diagnostic_completion for parent task re-queuing.
    • Added cmd_self_heal for manual self-healing initiation.
  • .agents/scripts/supervisor/state.sh
    • Removed placeholder state management functions.
    • Implemented validate_transition for enforcing state machine rules.
    • Added cmd_transition for managing task state changes.
    • Implemented cmd_status for displaying task and batch status.
    • Added cmd_list for filtered task listings.
    • Implemented cmd_reset for resetting tasks to a queued state.
    • Added cmd_cancel for task and batch cancellation.
    • Implemented check_batch_completion for monitoring batch progress.
    • Added cmd_running_count for active task metrics.
    • Implemented cmd_next for retrieving dispatchable tasks.
  • .agents/scripts/supervisor/utility.sh
    • Removed placeholder utility functions.
    • Implemented write_proof_log for structured audit trail logging.
    • Added _proof_log_stage_duration for calculating pipeline stage durations.
    • Implemented check_gh_auth for GitHub authentication verification.
    • Added acquire_pulse_lock and release_pulse_lock for concurrency control.
    • Implemented get_cpu_cores for system CPU information.
    • Added check_system_load for comprehensive system resource monitoring.
    • Implemented get_process_footprint_mb for process memory measurement.
    • Added check_supervisor_memory for supervisor memory footprint analysis.
    • Implemented log_respawn_event for persistent respawn history.
    • Added attempt_respawn_after_batch for automated supervisor respawn.
    • Implemented cmd_respawn_history for viewing respawn logs.
    • Added calculate_adaptive_concurrency for dynamic concurrency management.
    • Implemented find_project_root for repository root detection.
    • Added detect_repo_slug for GitHub repository slug extraction.
    • Implemented cmd_proof_log for querying and exporting proof-logs.
    • Added cmd_dashboard for a live terminal UI dashboard.
    • Implemented cmd_queue_health for manual queue health updates.
    • Added cmd_mem_check for manual memory checks.
    • Implemented send_task_notification for task event notifications.
    • Added notify_batch_progress for macOS batch progress alerts.
    • Implemented cmd_notify for manual notification triggering.
Activity
  • The pull request author, marcusquinn, has modularized the supervisor-helper.sh script into 13 domain-specific modules, extracting 145 functions and reducing the core monolith by 95%.
  • Verification steps included bash -n syntax checks, ShellCheck analysis, and functional tests for key commands, all of which passed successfully.
  • No human review comments or approvals have been recorded yet.

@marcusquinn marcusquinn merged commit ee8fdcd into main Feb 13, 2026
7 of 10 checks passed
@marcusquinn marcusquinn deleted the feature/t1031 branch February 13, 2026 13:59
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 14

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🤖 Fix all issues with AI agents
In @.agents/scripts/supervisor/batch.sh:
- Around line 49-55: Validate the max_retries value before using it in SQL:
ensure the variable max_retries (set in the --max-retries parsing block and
later used in the SQL interpolation at the other occurrence) matches a
non-negative integer pattern (e.g. regex like ^[0-9]+$) and optionally enforce
an upper bound, reject or normalize invalid input (call log_error and return
non-zero) and convert to a sanitized integer form before embedding into SQL;
apply the same validation for the second occurrence (the block referenced at
lines 150-153) so both parsing sites enforce and sanitize max_retries prior to
SQL interpolation.
- Around line 196-227: The flag parsing currently assigns user-supplied values
directly to concurrency, max_concurrency, tasks, and max_load_factor; add
numeric validation for each before assignment and before any SQL insert: verify
the value matches a non-negative integer (or appropriate numeric/range for
max_load_factor) using a regex or POSIX arithmetic check, log an error with
log_error and return 1 on invalid input, coerce/normalize valid values (e.g.,
strip leading zeros) or set a safe default, and ensure the same validation is
applied again where these variables are used for SQL inserts (sanitize or
parameterize the query) — update the parsing branches for --concurrency,
--max-concurrency, --tasks, and --max-load and the corresponding usage site
around the SQL insert to enforce this.

In @.agents/scripts/supervisor/cron.sh:
- Around line 50-65: The current logic embeds GH_TOKEN directly into the crontab
entry via cron_cmd; instead, if gh_token is detected, write it to a protected
env file (e.g., create a file under SUPERVISOR_DIR like cron_env, chmod 600) and
update cron_cmd to source that env file before invoking ${script_path} (or
simply rely on calling ${script_path} which uses `gh auth token` internally), so
GH_TOKEN is not placed directly into the crontab or process list; update
references around cron.sh, cron_cmd, GH_TOKEN, SUPERVISOR_DIR, script_path,
pulse, and cron_marker accordingly.

In @.agents/scripts/supervisor/database.sh:
- Around line 325-358: The migration rebuilds the tasks table but the INSERT ...
SELECT omits the triage_result column, causing data loss; update the INSERT INTO
tasks (...) column list and the corresponding SELECT ... FROM tasks_old_t148 to
include triage_result in the exact position/order used in the CREATE TABLE so
values from tasks_old_t148.triage_result are preserved when copying rows back
into tasks.

In @.agents/scripts/supervisor/deploy.sh:
- Around line 84-96: The if/then loses the deployment script's exit code because
$? is read after the if compound; change the pattern to capture the exit code
immediately after running the deploy command (assign deploy_output by running
"$deploy_script" "${deploy_args[@]}" 2>&1, then right away set deploy_exit=$?),
then branch on deploy_exit: if deploy_exit -eq 0 handle success (log_success,
write deploy_output to deploy_log and return 0), if deploy_exit -eq 2 log_info
"No agent changes to deploy for $task_id" and return 0, else log_failure/fail
accordingly; update references to deploy_output, deploy_exit, deploy_script,
deploy_args, deploy_log, and task_id in the same block.
- Around line 2216-2225: The current code builds merge_cmd and uses eval, which
risks command injection via pr_number or repo_slug; change to building a command
array (e.g., cmd=(gh pr merge "$pr_number" "--repo" "$repo_slug" "--squash") and
append "--admin" to the array when use_admin_flag is true), then execute the
array without eval and capture output (e.g., merge_output="$( "${cmd[@]}" 2>&1
)" and check its exit status) so you avoid word-splitting and shell
interpretation of user-controlled values; update references to merge_cmd and
merge_output accordingly.
- Around line 725-726: The block that reassigns escaped_id using sed should be
replaced to use the project's sql_escape() helper so you don't shadow or diverge
from the shared escaping behavior; locate the reassignment of escaped_id
(currently using printf/sed on task_id) and remove it, then assign escaped_id by
calling sql_escape with task_id (or reuse the earlier escaped_id) and ensure you
don't redeclare it with a new local that would overwrite the value used later in
the function.
- Around line 490-513: The SQL statements in the no_pr case interpolate the raw
variable task_id (id='$task_id') which risks SQL injection and quoting errors;
change those to use the precomputed escaped_id from sql_escape (replace
id='$task_id' with id='$escaped_id' in both the SELECT that reads no_pr_retries
and the UPDATE that writes error/json_set) so the queries use the properly
escaped identifier (keep no_pr_count logic and json_set usage unchanged).

In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1387-1909: The verify-mode branch incorrectly calls resolve_model
"coding" (which maps to Opus) causing higher-cost models to be used; change the
verify-mode model resolution in the cmd_dispatch function where resolved_model
is set for verify_mode to call resolve_model "sonnet" (or the intended cheaper
tier) and adjust any log text referencing "coding-tier model" to reflect the
sonnet/verify tier so verify dispatches use the cheaper model when building the
verify dispatch command (see resolved_model, verify_mode, resolve_model,
build_verify_dispatch_cmd).
- Around line 569-666: The check_output_quality function hardcodes "main" as the
base branch when calling git (git -C "$tworktree" diff --stat "main..HEAD" and
git -C "$tworktree" diff --name-only "main..HEAD"), which breaks repos that use
master or other defaults; compute the repository's default branch first (e.g.,
using git -C "$tworktree" symbolic-ref refs/remotes/origin/HEAD or git -C
"$tworktree" rev-parse --abbrev-ref origin/HEAD and strip the
refs/remotes/origin/ prefix), fall back to "main" then "master" if detection
fails, store it in a local variable like base_branch, and replace the two
hardcoded "main..HEAD" occurrences with "$base_branch..HEAD" (also use that
variable for both diff --stat and diff --name-only checks in
check_output_quality).

In @.agents/scripts/supervisor/evaluate.sh:
- Around line 675-1086: In evaluate_worker(), the billing/credits detection
block incorrectly references an undefined variable log_file causing grep to read
stdin or misclassify errors; update that grep call to use the log path variable
tlog (used throughout the function) so the billing detection uses the actual
task log (refer to evaluate_worker, the block checking meta_backend_error_count
and meta_content_lines before echoing "blocked:billing_credits_exhausted").

In @.agents/scripts/supervisor/issue-sync.sh:
- Around line 1267-1347: The check_task_already_done function is vulnerable to
regex mis-matches because task_id is used raw; replace uses of the raw task_id
with the already-built escaped_task_regex and adjust boundary_pattern to use it
(e.g., boundary_pattern="${escaped_task_regex}([^.0-9]|$)"), and pass
escaped_task_regex into git grep (git -C "$project_root" log --oneline ...
--grep="$escaped_task_regex") and into the exclusion grep (grep -ivE "add
${escaped_task_regex}|claim ${escaped_task_regex}|mark
${escaped_task_regex}|queue ${escaped_task_regex}|blocked") so dotted IDs like
t128.10 are treated literally across git log, grep -E, and grep -ivE checks;
leave the rest of check_task_already_done logic unchanged.

In @.agents/scripts/supervisor/state.sh:
- Around line 817-848: The cmd_next function interpolates the unvalidated
variable limit directly into the SQL LIMIT clause; validate and sanitize limit
before use by checking it is a positive integer, applying a sensible default and
an upper cap (e.g., max_limit) if out of range, and then use the sanitized
numeric variable (not the raw input) when building the db query; update the
cmd_next function (references: cmd_next, limit, ensure_db, db, SUPERVISOR_DB) to
perform this numeric validation/normalization and only then substitute the
sanitized value into the SQL.

In @.agents/scripts/supervisor/utility.sh:
- Around line 459-463: The threshold calculation is wrong: because load_ratio is
already a percent, do not multiply by cpu_cores again; compute threshold from
max_load_factor as a percent (e.g., threshold = max_load_factor * 100,
converting floats to an integer percent if needed) and keep the existing
comparison against load_ratio; update the assignment to the threshold variable
(refer to threshold, load_ratio, cpu_cores, max_load_factor) and ensure numeric
conversion/rounding is handled so the if [[ "$load_ratio" -gt "$threshold" ]]
branch can trigger correctly.
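The utility.sh threshold item above can be reproduced in isolation. This is a minimal sketch with illustrative values, not the module's actual code:

```shell
# Demo of the utility.sh threshold fix: load_ratio is already expressed as a
# percent of capacity, so the threshold should be max_load_factor * 100.
# Multiplying by cpu_cores again inflates the threshold so the overload
# branch never triggers.
max_load_factor=2
cpu_cores=8
load_ratio=250                                          # e.g. 250% of capacity
wrong_threshold=$((max_load_factor * 100 * cpu_cores))  # 1600: unreachable
right_threshold=$((max_load_factor * 100))              # 200
if [[ "$load_ratio" -gt "$right_threshold" ]]; then
  echo "overload detected: ${load_ratio}% > ${right_threshold}%"
fi
```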
🟡 Minor comments (8)
.agents/scripts/supervisor/utility.sh-991-1015 (1)

991-1015: ⚠️ Potential issue | 🟡 Minor

--json output can be invalid when evidence/metadata contain backslashes or newlines.
Only quotes are escaped today; JSON consumers will break on \, \n, \r, or tabs.

🔧 Suggested hardening (apply similarly in task-mode JSON block)
-				local _esc_evidence="${pevidence:-}"
-				_esc_evidence="${_esc_evidence//\"/\\\"}"
-				local _esc_meta="${pmeta:-}"
-				_esc_meta="${_esc_meta//\"/\\\"}"
+				local _esc_evidence="${pevidence:-}"
+				_esc_evidence="${_esc_evidence//\\/\\\\}"
+				_esc_evidence="${_esc_evidence//$'\n'/\\n}"
+				_esc_evidence="${_esc_evidence//$'\r'/\\r}"
+				_esc_evidence="${_esc_evidence//$'\t'/\\t}"
+				_esc_evidence="${_esc_evidence//\"/\\\"}"
+				local _esc_meta="${pmeta:-}"
+				_esc_meta="${_esc_meta//\\/\\\\}"
+				_esc_meta="${_esc_meta//$'\n'/\\n}"
+				_esc_meta="${_esc_meta//$'\r'/\\r}"
+				_esc_meta="${_esc_meta//$'\t'/\\t}"
+				_esc_meta="${_esc_meta//\"/\\\"}"

Also applies to: 1081-1105

.agents/scripts/supervisor/cron.sh-693-696 (1)

693-696: ⚠️ Potential issue | 🟡 Minor

metadata column update will fail without schema support.
The current DB schema doesn’t define tasks.metadata, so this UPDATE is a no‑op. Either add a migration for metadata or drop the update to avoid silent failures.
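One hedged way to avoid the silent failure, sketched against a throwaway database (assumes the sqlite3 CLI is on PATH; table and column names mirror the comment):

```shell
# Guard the metadata UPDATE on actual schema support via PRAGMA table_info.
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT);"
has_metadata=0
sqlite3 "$db" "PRAGMA table_info(tasks);" | cut -d'|' -f2 | grep -qx 'metadata' \
  && has_metadata=1
if [ "$has_metadata" -eq 1 ]; then
  echo "metadata column present: safe to UPDATE"
else
  echo "metadata column missing: skip the UPDATE or add a migration first"
fi
rm -f "$db"
```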

.agents/scripts/supervisor/pulse.sh-682-689 (1)

682-689: ⚠️ Potential issue | 🟡 Minor

Cancel stale diagnostics when parent reaches verified states.
If a parent hits verified / verify_failed, the diagnostic can remain queued indefinitely. Treat those as terminal too.

🔧 Proposed fix
-          AND p.status IN ('deployed', 'cancelled', 'failed', 'complete', 'merged');
+          AND p.status IN ('deployed', 'verified', 'verify_failed', 'cancelled', 'failed', 'complete', 'merged');
.agents/scripts/supervisor/dispatch.sh-35-50 (1)

35-50: ⚠️ Potential issue | 🟡 Minor

Fix logger typo to avoid “command not found”.
log_warning isn’t defined in the shared helpers; use log_warn.

🔧 Proposed fix
-	if command -v claude &>/dev/null; then
-		log_warning "Using deprecated claude CLI fallback. Install opencode: npm i -g opencode"
+	if command -v claude &>/dev/null; then
+		log_warn "Using deprecated claude CLI fallback. Install opencode: npm i -g opencode"
 		echo "claude"
 		return 0
 	fi
.agents/scripts/supervisor/deploy.sh-2501-2509 (1)

2501-2509: ⚠️ Potential issue | 🟡 Minor

bash "$check_arg" executes an arbitrary path sourced from VERIFY.md — verify trust boundary.

The bash check type runs whatever script path is written in todo/VERIFY.md. If populate_verify_queue is the sole writer, this is internally consistent. But if the file is ever committed by a worker or modified via a PR, an attacker could inject a malicious script path. Consider allowlisting script paths or at minimum documenting the trust assumption.
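An allowlist guard along the lines suggested could look like this sketch. The paths here are hypothetical stand-ins, not the project's actual verification scripts:

```shell
# Sketch: only execute VERIFY.md bash checks whose path is on a known list.
allowed_checks=(".agents/scripts/verify-build.sh" ".agents/scripts/verify-tests.sh")
check_arg=".agents/scripts/verify-tests.sh"
permitted=false
for p in "${allowed_checks[@]}"; do
  [[ "$check_arg" == "$p" ]] && permitted=true
done
if [[ "$permitted" == true ]]; then
  echo "allowed: $check_arg"
else
  echo "rejected: $check_arg"
fi
```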

.agents/scripts/supervisor/deploy.sh-1278-1278 (1)

1278-1278: ⚠️ Potential issue | 🟡 Minor

((dismissed_count++)) returns exit code 1 on first increment (from 0) — unsafe with set -e.

Postfix ++ evaluates the old value; ((0)) has exit status 1 in bash. If set -e (or ERR trap) is ever active in the call chain, the first successful dismissal will cause the function to abort.

🐛 Proposed fix
-			((dismissed_count++))
+			dismissed_count=$((dismissed_count + 1))
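The failure mode is easy to reproduce standalone:

```shell
# Demo: ((n++)) exits with status 1 when n was 0 (the expression evaluates
# to the old value), so under `set -e` the first increment aborts; the
# arithmetic-expansion form always succeeds.
n=0
( set -e; ((n++)); echo "post-increment survived" ) || echo "post-increment aborted the subshell"
n=0
( set -e; n=$((n + 1)); echo "expansion form survived: n=$n" )
```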
.agents/scripts/supervisor/deploy.sh-1161-1172 (1)

1161-1172: ⚠️ Potential issue | 🟡 Minor

Paths embedded in generated scripts via single quotes are vulnerable to quoting breakage.

Lines 1164, 1167 embed ${work_dir} and ${worker_xdg_config} inside single-quoted strings in the generated script. If any path contains a literal ', the resulting script will have a syntax error or, worse, unintended command execution. The printf '%q' pattern already used at line 1170 for cmd_parts is the safer approach.

🛡️ Proposed fix: use printf '%q' for path embedding
-	echo "cd '${work_dir}' || { echo 'WORKER_FAILED: cd to work_dir failed: ${work_dir}'; exit 1; }"
+	printf 'cd %q || { echo %q; exit 1; }\n' "$work_dir" "WORKER_FAILED: cd to work_dir failed: $work_dir"
 	# Redirect worker to use MCP config with heavy indexers disabled (t221)
 	if [[ -n "$worker_xdg_config" ]]; then
-		echo "export XDG_CONFIG_HOME='${worker_xdg_config}'"
+		printf 'export XDG_CONFIG_HOME=%q\n' "$worker_xdg_config"
 	fi
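The round-trip safety of `printf '%q'` can be checked directly (the path is an illustrative example):

```shell
# Demo: embedding a path in single quotes breaks when the path itself
# contains a quote; printf '%q' escapes it so bash can re-read it intact.
work_dir="/tmp/it's here"
quoted=$(printf '%q' "$work_dir")
eval "restored=$quoted"        # simulates the generated script re-reading it
if [[ "$restored" == "$work_dir" ]]; then
  echo "round-trip ok: $restored"
fi
```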
.agents/scripts/supervisor/deploy.sh-1640-1645 (1)

1640-1645: ⚠️ Potential issue | 🟡 Minor

Regex word-boundary match treats . in subtask IDs as wildcard.

jq test("\\b" + $tid + "\\b") uses $tid verbatim as a regex. For subtask IDs like t215.3, the . matches any character, so a PR titled "t21503" would falsely match. The downstream link_pr_to_task likely validates, but escaping the ID would be more precise.

🛡️ Suggestion: escape regex metacharacters in task ID
-		matched_pr_url=$(echo "$all_prs" | jq -r --arg tid "$tid" '
+		local escaped_tid
+		escaped_tid=$(printf '%s' "$tid" | sed 's/[.[\*^$()+?{|\\]/\\&/g')
+		matched_pr_url=$(echo "$all_prs" | jq -r --arg tid "$escaped_tid" '

Alternatively, in the jq filter, use gsub to escape the ID before passing to test.
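The wildcard behavior and the effect of the suggested sed escaping can be demonstrated with plain grep (the IDs are illustrative):

```shell
# Demo: with the raw ID, '.' acts as a regex wildcard and "t21503"
# false-matches; after escaping (the sed expression from the suggestion),
# only the literal dotted ID matches.
tid="t215.3"
escaped_tid=$(printf '%s' "$tid" | sed 's/[.[\*^$()+?{|\\]/\\&/g')
echo "t21503"     | grep -qE "${tid}"         && echo "raw: false match"
echo "t21503"     | grep -qE "${escaped_tid}" || echo "escaped: no false match"
echo "fix t215.3" | grep -qE "${escaped_tid}" && echo "escaped: literal match"
```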

🧹 Nitpick comments (2)
.agents/scripts/supervisor/deploy.sh (2)

172-187: Manual JSON string construction is fragile — consider using jq throughout.

stage_name and stage_time are concatenated into a JSON object via string interpolation. If any value ever contains a " or \, the resulting JSON will be malformed and the jq -n --argjson call on line 195 will silently fail. Since the data is currently all internally generated timing strings, this is low-risk — but a jq pipeline would be more robust.
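A jq-based construction would handle embedded quotes for free. This is a sketch assuming jq is available; the stage name and value are illustrative:

```shell
# Sketch: build a stage-timing entry with jq instead of string splicing.
# jq escapes quotes and backslashes in both key and value automatically.
stage_name='lint "strict"'
stage_time="12.5s"
entry=$(jq -cn --arg k "$stage_name" --arg v "$stage_time" '{($k): $v}')
echo "$entry"
```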


2260-2267: cmd_pr_merge rejects unstable_sonarcloud status, but merge_task_pr supports it with --admin.

The manual pr-merge command only allows ready_to_merge, while the lifecycle flow handles unstable_sonarcloud via --admin. If this is intentional (forcing users through pr-lifecycle for SonarCloud bypasses), a brief comment would clarify intent. Otherwise, consider accepting unstable_sonarcloud here too.

Comment on lines +49 to +55
--max-retries)
[[ $# -lt 2 ]] && {
log_error "--max-retries requires a value"
return 1
}
max_retries="$2"
shift 2

⚠️ Potential issue | 🟠 Major

Validate --max-retries before embedding in SQL.
It’s currently unvalidated and interpolated directly, which can cause SQL errors or injection if malformed input slips in.

🔧 Proposed fix
 	--max-retries)
 		[[ $# -lt 2 ]] && {
 			log_error "--max-retries requires a value"
 			return 1
 		}
 		max_retries="$2"
 		shift 2
 		;;
@@
-	db "$SUPERVISOR_DB" "
+	if ! [[ "$max_retries" =~ ^[0-9]+$ ]]; then
+		log_error "--max-retries must be a non-negative integer"
+		return 1
+	fi
+	db "$SUPERVISOR_DB" "
         INSERT INTO tasks (id, repo, description, model, max_retries)
         VALUES ('$escaped_id', '$escaped_repo', '$escaped_desc', '$escaped_model', $max_retries);
     "

Also applies to: 150-153
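The proposed guard can be sketched as a standalone function (the function name is illustrative):

```shell
# Reject anything that is not a non-negative integer before it reaches SQL.
validate_max_retries() {
  local v="$1"
  if ! [[ "$v" =~ ^[0-9]+$ ]]; then
    echo "error: --max-retries must be a non-negative integer (got: $v)" >&2
    return 1
  fi
  printf '%s\n' "$((10#$v))"   # normalize leading zeros
}
validate_max_retries 3
validate_max_retries 007
validate_max_retries "3; DROP TABLE tasks" || echo "rejected injection attempt"
```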


Comment on lines +196 to +227
--concurrency)
[[ $# -lt 2 ]] && {
log_error "--concurrency requires a value"
return 1
}
concurrency="$2"
shift 2
;;
--max-concurrency)
[[ $# -lt 2 ]] && {
log_error "--max-concurrency requires a value"
return 1
}
max_concurrency="$2"
shift 2
;;
--tasks)
[[ $# -lt 2 ]] && {
log_error "--tasks requires a value"
return 1
}
tasks="$2"
shift 2
;;
--max-load)
[[ $# -lt 2 ]] && {
log_error "--max-load requires a value"
return 1
}
max_load_factor="$2"
shift 2
;;

⚠️ Potential issue | 🟠 Major

Validate numeric batch flags before SQL insert.
concurrency, max_concurrency, and max_load_factor are injected raw; invalid values will break inserts or allow injection.

🔧 Proposed fix
 	while [[ $# -gt 0 ]]; do
@@
 	done
 
+	for v in concurrency max_concurrency max_load_factor; do
+		if ! [[ "${!v}" =~ ^[0-9]+$ ]]; then
+			log_error "--$v must be a non-negative integer"
+			return 1
+		fi
+	done
+
 	if [[ -z "$name" ]]; then
 		log_error "Usage: supervisor-helper.sh batch <name> [--concurrency N] [--max-concurrency N] [--tasks \"t001,t002\"] [--release-on-complete] [--release-type patch|minor|major] [--skip-quality-gate]"
 		return 1
 	fi

Also applies to: 276-279
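The indirect-expansion loop from the proposed fix behaves like this in isolation (values are illustrative; note that if --max-load may legitimately be a float, max_load_factor would need a looser pattern such as `^[0-9]+(\.[0-9]+)?$`):

```shell
# Demo of validating several numeric flags via indirect expansion (${!v}).
concurrency=4
max_concurrency=8
max_load_factor="2x"   # deliberately invalid
invalid=""
for v in concurrency max_concurrency max_load_factor; do
  if ! [[ "${!v}" =~ ^[0-9]+$ ]]; then
    invalid="$v"
    echo "--$v must be a non-negative integer (got: ${!v})"
  fi
done
```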


Comment on lines +50 to +65
# Detect GH_TOKEN from gh CLI if available (t1006)
local gh_token=""
if command -v gh &>/dev/null; then
gh_token=$(gh auth token 2>/dev/null || true)
fi

# Build cron command with environment variables
local env_vars=""
if [[ -n "$user_path" ]]; then
env_vars="PATH=${user_path}"
fi
if [[ -n "$gh_token" ]]; then
env_vars="${env_vars:+${env_vars} }GH_TOKEN=${gh_token}"
fi

local cron_cmd="*/${interval} * * * * ${env_vars:+${env_vars} }${script_path} pulse ${batch_arg} >> ${SUPERVISOR_DIR}/cron.log 2>&1 ${cron_marker}"

⚠️ Potential issue | 🟠 Major

Avoid embedding GH_TOKEN directly in crontab.
This writes a long‑lived secret into a plaintext cron entry (and process list), which is a security foot‑gun. Prefer a protected env file (chmod 600) or rely on gh auth token inside the script.
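A protected env file along those lines might look like this sketch (directory and token are stand-ins; real code would obtain the token via `gh auth token`):

```shell
# Sketch: write the token to a 600-mode env file under SUPERVISOR_DIR and
# have the cron entry source it, so the secret never appears in the crontab
# or process list.
SUPERVISOR_DIR=$(mktemp -d)
gh_token="placeholder-token"
env_file="$SUPERVISOR_DIR/cron_env"
umask 077                                   # files created now get mode 600
printf 'export GH_TOKEN=%q\n' "$gh_token" > "$env_file"
cron_cmd=". $env_file && /path/to/supervisor-helper.sh pulse"
echo "$cron_cmd"
```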


Comment on lines +325 to +358
db "$SUPERVISOR_DB" <<'MIGRATE_T148'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old_t148;
CREATE TABLE tasks (
id TEXT PRIMARY KEY,
repo TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'queued'
CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
session_id TEXT,
worktree TEXT,
branch TEXT,
log_file TEXT,
retries INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
model TEXT DEFAULT 'anthropic/claude-opus-4-6',
error TEXT,
pr_url TEXT,
issue_url TEXT,
diagnostic_of TEXT,
triage_result TEXT,
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
started_at TEXT,
completed_at TEXT,
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
created_at, started_at, completed_at, updated_at
FROM tasks_old_t148;

⚠️ Potential issue | 🟠 Major

Preserve triage_result during the t148 table rebuild.
The INSERT/SELECT lists omit triage_result, so any existing values are dropped when recreating tasks. That’s silent data loss for triage workflows.

🔧 Proposed fix
-INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
-    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
-    created_at, started_at, completed_at, updated_at)
-SELECT id, repo, description, status, session_id, worktree, branch,
-    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
-    created_at, started_at, completed_at, updated_at
+INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
+    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
+    triage_result, created_at, started_at, completed_at, updated_at)
+SELECT id, repo, description, status, session_id, worktree, branch,
+    log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
+    triage_result, created_at, started_at, completed_at, updated_at
 FROM tasks_old_t148;

Comment on lines +84 to +96
local deploy_output
if deploy_output=$("$deploy_script" "${deploy_args[@]}" 2>&1); then
log_success "Targeted deploy complete for $task_id"
echo "$deploy_output" >"$deploy_log" 2>/dev/null || true
return 0
fi

local deploy_exit=$?
if [[ "$deploy_exit" -eq 2 ]]; then
# Exit 2 = nothing to deploy (no changes detected)
log_info "No agent changes to deploy for $task_id"
return 0
fi

⚠️ Potential issue | 🔴 Critical

Bug: $? after an if statement does not capture the command's exit code.

After if deploy_output=$(...); then ... fi, $? is 0 (the exit status of the if compound command when the body is not executed). The actual exit code from the deploy script is lost, so deploy_exit will always be 0 and the exit 2 ("nothing to deploy") branch on line 92 is dead code.

🐛 Proposed fix: capture exit code explicitly
 	local deploy_output
-	if deploy_output=$("$deploy_script" "${deploy_args[@]}" 2>&1); then
+	local deploy_exit=0
+	deploy_output=$("$deploy_script" "${deploy_args[@]}" 2>&1) || deploy_exit=$?
+
+	if [[ "$deploy_exit" -eq 0 ]]; then
 		log_success "Targeted deploy complete for $task_id"
 		echo "$deploy_output" >"$deploy_log" 2>/dev/null || true
 		return 0
 	fi

-	local deploy_exit=$?
 	if [[ "$deploy_exit" -eq 2 ]]; then
 		# Exit 2 = nothing to deploy (no changes detected)
 		log_info "No agent changes to deploy for $task_id"
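The core pitfall is reproducible in isolation (`fail2` is a stand-in for the deploy script):

```shell
# Demo: after `if cmd; then ... fi`, $? reflects the if-compound (0 when the
# body is skipped), not the command, so exit code 2 is lost unless captured
# immediately after the command runs.
fail2() { return 2; }
if out=$(fail2); then :; fi
echo "read after if: $?"     # 0 — the real exit code is gone
rc=0
out=$(fail2) || rc=$?
echo "captured: $rc"         # 2
```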

Comment on lines +1387 to +1909
cmd_dispatch() {
local task_id="" batch_id=""

# First positional arg is task_id
if [[ $# -gt 0 && ! "$1" =~ ^-- ]]; then
task_id="$1"
shift
fi

while [[ $# -gt 0 ]]; do
case "$1" in
--batch)
[[ $# -lt 2 ]] && {
log_error "--batch requires a value"
return 1
}
batch_id="$2"
shift 2
;;
*)
log_error "Unknown option: $1"
return 1
;;
esac
done

if [[ -z "$task_id" ]]; then
log_error "Usage: supervisor-helper.sh dispatch <task_id>"
return 1
fi

ensure_db

# Get task details
local escaped_id
escaped_id=$(sql_escape "$task_id")
local task_row
task_row=$(db -separator $'\t' "$SUPERVISOR_DB" "
SELECT id, repo, description, status, model, retries, max_retries
FROM tasks WHERE id = '$escaped_id';
")

if [[ -z "$task_row" ]]; then
log_error "Task not found: $task_id"
return 1
fi

local tid trepo tdesc tstatus tmodel tretries tmax_retries
IFS=$'\t' read -r tid trepo tdesc tstatus tmodel tretries tmax_retries <<<"$task_row"

# Validate task is in dispatchable state
if [[ "$tstatus" != "queued" ]]; then
log_error "Task $task_id is in '$tstatus' state, must be 'queued' to dispatch"
return 1
fi

# Pre-dispatch verification: check if task was already completed in a prior batch.
# Searches git history for commits referencing this task ID. If a merged PR commit
# exists, the task is already done — cancel it instead of wasting an Opus session.
# This prevents the exact bug from backlog-10 where 6 t135 subtasks were dispatched
# despite being completed months earlier.
if check_task_already_done "$task_id" "${trepo:-.}"; then
log_warn "Task $task_id appears already completed in git history — cancelling"
db "$SUPERVISOR_DB" "UPDATE tasks SET status='cancelled', error='Pre-dispatch: already completed in git history' WHERE id='$(sql_escape "$task_id")';"
return 0
fi

# Pre-dispatch reverification: detect previously-worked tasks (t1008)
# If a task was dispatched before (dead worker, unclaimed, re-queued, quality
# escalation), dispatch a lightweight verify worker instead of full implementation.
# Cost: ~$0.10-0.20 (sonnet) vs ~$1.00 (full session). The verify worker checks
# if deliverables exist and work; if incomplete, it continues implementation.
local verify_mode="" verify_reason=""
if [[ "${SUPERVISOR_SKIP_VERIFY_MODE:-false}" != "true" ]]; then
# Skip verify mode if the last error was verify_not_started_needs_full —
# the verify worker already confirmed no prior work exists, so a full
# implementation dispatch is needed (avoids infinite verify loop).
local last_error=""
last_error=$(db "$SUPERVISOR_DB" "SELECT COALESCE(error, '') FROM tasks WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null) || last_error=""
if [[ "$last_error" == "verify_not_started_needs_full" || "$last_error" == "verify_incomplete_no_pr" ]]; then
log_info "Task $task_id: skipping verify mode (last error: $last_error) — using full dispatch"
else
verify_reason=$(was_previously_worked "$task_id" 2>/dev/null) || true
if [[ -n "$verify_reason" ]]; then
verify_mode="true"
log_info "Task $task_id was previously worked ($verify_reason) — using verify dispatch mode (t1008)"
fi
fi
fi

# Check if task is claimed by someone else via TODO.md assignee: field (t165)
local claimed_by=""
claimed_by=$(check_task_claimed "$task_id" "${trepo:-.}" 2>/dev/null) || true
if [[ -n "$claimed_by" ]]; then
# t1024: Check if the claim is stale (no active worker, claimed >2h ago)
# This prevents tasks from being stuck forever when a worker dies
local stale_threshold_seconds=7200 # 2 hours
local is_stale="false"

# Check if there's an active worker process for this task
local active_session=""
local escaped_id
escaped_id=$(sql_escape "$task_id")
active_session=$(db "$SUPERVISOR_DB" "SELECT session_id FROM tasks WHERE id = '$escaped_id' AND session_id IS NOT NULL AND status IN ('dispatched','running');" 2>/dev/null) || active_session=""

if [[ -z "$active_session" ]]; then
# No active worker — check how long the claim has been held
local todo_file="${trepo:-.}/TODO.md"
local task_line=""
task_line=$(grep -m1 "^[[:space:]]*- \[ \] $task_id " "$todo_file" 2>/dev/null) || task_line=""
if [[ -n "$task_line" ]]; then
local started_ts=""
started_ts=$(echo "$task_line" | sed -n 's/.*started:\([0-9T:Z-]*\).*/\1/p' 2>/dev/null) || started_ts=""
if [[ -n "$started_ts" ]]; then
local started_epoch now_epoch
started_epoch=$(date -j -f "%Y-%m-%dT%H:%M:%SZ" "$started_ts" "+%s" 2>/dev/null) ||
started_epoch=$(date -d "$started_ts" "+%s" 2>/dev/null) || started_epoch=0
now_epoch=$(date "+%s")
if [[ "$started_epoch" -gt 0 ]] && ((now_epoch - started_epoch > stale_threshold_seconds)); then
is_stale="true"
fi
else
# No started: timestamp but claimed — treat as stale if task is queued in DB
local db_status=""
db_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$escaped_id';" 2>/dev/null) || db_status=""
if [[ "$db_status" == "queued" ]]; then
is_stale="true"
fi
fi
fi
fi

if [[ "$is_stale" == "true" ]]; then
log_warn "Task $task_id: stale claim by assignee:$claimed_by (no active worker, >2h) — auto-unclaiming (t1024)"
cmd_unclaim "$task_id" "${trepo:-.}" --force 2>/dev/null || true
else
log_warn "Task $task_id is claimed by assignee:$claimed_by — skipping dispatch"
return 0
fi
fi

# Claim the task before dispatching (t165 — TODO.md primary, GH Issue sync optional)
# CRITICAL: abort dispatch if claim fails (race condition = another worker claimed first)
# Pass trepo so claim works from cron (where $PWD != repo dir)
if ! cmd_claim "$task_id" "${trepo:-.}"; then
log_error "Failed to claim $task_id — aborting dispatch"
return 1
fi

# Authoritative concurrency check with adaptive load awareness (t151, t172)
# This is the single source of truth for concurrency enforcement.
# cmd_next() intentionally does NOT check concurrency to avoid a TOCTOU race
# where the count becomes stale between cmd_next() and cmd_dispatch() calls
# within the same pulse loop.
if [[ -n "$batch_id" ]]; then
local escaped_batch
escaped_batch=$(sql_escape "$batch_id")
local base_concurrency max_load_factor batch_max_concurrency
base_concurrency=$(db "$SUPERVISOR_DB" "SELECT concurrency FROM batches WHERE id = '$escaped_batch';")
max_load_factor=$(db "$SUPERVISOR_DB" "SELECT max_load_factor FROM batches WHERE id = '$escaped_batch';")
batch_max_concurrency=$(db "$SUPERVISOR_DB" "SELECT COALESCE(max_concurrency, 0) FROM batches WHERE id = '$escaped_batch';" 2>/dev/null || echo "0")
local concurrency
concurrency=$(calculate_adaptive_concurrency "${base_concurrency:-4}" "${max_load_factor:-2}" "${batch_max_concurrency:-0}")
local active_count
active_count=$(cmd_running_count "$batch_id")

if [[ "$active_count" -ge "$concurrency" ]]; then
log_warn "Concurrency limit reached ($active_count/$concurrency, base:$base_concurrency, adaptive) for batch $batch_id"
return 2
fi
else
# Global concurrency check with adaptive load awareness (t151)
local base_global_concurrency="${SUPERVISOR_MAX_CONCURRENCY:-4}"
local global_concurrency
global_concurrency=$(calculate_adaptive_concurrency "$base_global_concurrency")
local global_active
global_active=$(cmd_running_count)
if [[ "$global_active" -ge "$global_concurrency" ]]; then
log_warn "Global concurrency limit reached ($global_active/$global_concurrency, base:$base_global_concurrency)"
return 2
fi
fi

# Check max retries
if [[ "$tretries" -ge "$tmax_retries" ]]; then
log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)"
cmd_transition "$task_id" "failed" --error "Max retries exceeded"
return 1
fi

# Resolve AI CLI
local ai_cli
ai_cli=$(resolve_ai_cli) || return 1

# Pre-dispatch model availability check (t233 — replaces simple health check)
# Calls model-availability-helper.sh check before spawning workers.
# Distinct exit codes prevent wasted dispatch attempts:
# exit 0 = healthy, proceed
# exit 1 = provider unavailable, defer dispatch
# exit 2 = rate limited, defer dispatch (retry next pulse)
# exit 3 = API key invalid/credits exhausted, block dispatch
# Previously: 9 wasted failures from ambiguous_ai_unavailable + backend_quota_error
# because the health check collapsed all failures to a single exit code.
local health_model health_exit=0
health_model=$(resolve_model "health" "$ai_cli")
check_model_health "$ai_cli" "$health_model" || health_exit=$?
if [[ "$health_exit" -ne 0 ]]; then
case "$health_exit" in
2)
log_warn "Provider rate-limited for $task_id ($health_model via $ai_cli) — deferring dispatch to next pulse"
return 3 # Return 3 = provider unavailable (distinct from concurrency limit 2)
;;
3)
log_error "API key invalid/credits exhausted for $task_id ($health_model via $ai_cli) — blocking dispatch"
log_error "Human action required: check API key or billing. Task will not auto-retry."
return 3
;;
*)
log_error "Provider unavailable for $task_id ($health_model via $ai_cli) — deferring dispatch"
return 3
;;
esac
fi

# Pre-dispatch GitHub auth check — verify the worker can push before
# creating worktrees and burning compute. Workers spawned via nohup/cron
# may lack SSH keys; gh auth git-credential only works with HTTPS remotes.
if ! check_gh_auth; then
log_error "GitHub auth unavailable for $task_id — check_gh_auth failed"
log_error "Workers need 'gh auth login' or GH_TOKEN set. Skipping dispatch."
return 3
fi

# Verify repo remote uses HTTPS (not SSH) — workers in cron can't use SSH keys
local remote_url
remote_url=$(git -C "${trepo:-.}" remote get-url origin 2>/dev/null || echo "")
if [[ "$remote_url" == git@* || "$remote_url" == ssh://* ]]; then
log_warn "Remote URL is SSH ($remote_url) — switching to HTTPS for worker compatibility"
local https_url
https_url=$(echo "$remote_url" | sed -E 's|^git@github\.com:|https://github.com/|; s|^ssh://git@github\.com/|https://github.com/|; s|\.git$||').git
git -C "${trepo:-.}" remote set-url origin "$https_url" 2>/dev/null || true
log_info "Remote URL updated to $https_url"
fi

# Create worktree
log_info "Creating worktree for $task_id..."
local worktree_path
worktree_path=$(create_task_worktree "$task_id" "$trepo") || {
log_error "Failed to create worktree for $task_id"
cmd_transition "$task_id" "failed" --error "Worktree creation failed"
return 1
}

# Validate worktree path is an actual directory (guards against stdout
# pollution from git commands inside create_task_worktree)
if [[ ! -d "$worktree_path" ]]; then
log_error "Worktree path is not a directory: '$worktree_path'"
log_error "This usually means a git command leaked stdout into the path variable"
cmd_transition "$task_id" "failed" --error "Worktree path invalid: $worktree_path"
return 1
fi

local branch_name="feature/${task_id}"

# Set up log file
local log_dir="$SUPERVISOR_DIR/logs"
mkdir -p "$log_dir"
local log_file
log_file="$log_dir/${task_id}-$(date +%Y%m%d%H%M%S).log"

# Pre-create log file with dispatch metadata (t183)
# If the worker fails to start (opencode not found, permission error, etc.),
# the log file still exists with context for diagnosis instead of no_log_file.
{
echo "=== DISPATCH METADATA (t183) ==="
echo "task_id=$task_id"
echo "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "worktree=$worktree_path"
echo "branch=$branch_name"
echo "model=${resolved_model:-${tmodel:-default}}"
echo "ai_cli=$(resolve_ai_cli 2>/dev/null || echo unknown)"
echo "dispatch_mode=$(detect_dispatch_mode 2>/dev/null || echo unknown)"
echo "dispatch_type=$([[ "$verify_mode" == "true" ]] && echo verify || echo full)"
echo "verify_reason=${verify_reason:-}"
echo "=== END DISPATCH METADATA ==="
echo ""
} >"$log_file" 2>/dev/null || true

# Transition to dispatched
cmd_transition "$task_id" "dispatched" \
--worktree "$worktree_path" \
--branch "$branch_name" \
--log-file "$log_file"

# Detect dispatch mode
local dispatch_mode
dispatch_mode=$(detect_dispatch_mode)

# Recall relevant memories before dispatch (t128.6)
local memory_context=""
memory_context=$(recall_task_memories "$task_id" "$tdesc" 2>/dev/null || echo "")
if [[ -n "$memory_context" ]]; then
log_info "Injecting ${#memory_context} bytes of memory context for $task_id"
fi

# Resolve model via frontmatter + fallback chain (t132.5)
# t1008: For verify-mode dispatches, prefer sonnet tier (cheaper, sufficient for
# verification checks). The verify worker can escalate to full implementation if
# it discovers the work is incomplete, but starts cheap.
local resolved_model
if [[ "$verify_mode" == "true" ]]; then
resolved_model=$(resolve_model "coding" "$ai_cli" 2>/dev/null) || resolved_model=""
log_info "Verify mode: using coding-tier model ($resolved_model) instead of task-specific model"
else
resolved_model=$(resolve_task_model "$task_id" "$tmodel" "${trepo:-.}" "$ai_cli")
fi

# Contest mode intercept (t1011): if model resolves to CONTEST, delegate to
# contest-helper.sh which dispatches the same task to top-3 models in parallel.
# The original task stays in 'running' state while contest entries execute.
if [[ "$resolved_model" == "CONTEST" ]]; then
log_info "Contest mode activated for $task_id — delegating to contest-helper.sh"
local contest_helper="${SCRIPT_DIR}/contest-helper.sh"
if [[ -x "$contest_helper" ]]; then
local contest_id
contest_id=$("$contest_helper" create "$task_id" ${batch_id:+--batch "$batch_id"} 2>/dev/null)
if [[ -n "$contest_id" ]]; then
"$contest_helper" dispatch "$contest_id" 2>/dev/null || {
log_error "Contest dispatch failed for $task_id"
cmd_transition "$task_id" "failed" --error "Contest dispatch failed"
return 1
}
# Keep original task in running state — pulse Phase 2.5 will check contest completion
db "$SUPERVISOR_DB" "UPDATE tasks SET error = 'contest:${contest_id}' WHERE id = '$(sql_escape "$task_id")';"
log_success "Contest $contest_id dispatched for $task_id"
echo "contest:${contest_id}"
return 0
else
log_error "Failed to create contest for $task_id — falling back to default model"
resolved_model=$(resolve_model "coding" "$ai_cli")
fi
else
log_warn "contest-helper.sh not found — falling back to default model"
resolved_model=$(resolve_model "coding" "$ai_cli")
fi
fi

# Secondary availability check: verify the resolved model's provider (t233)
# The initial health check uses the "health" tier (typically anthropic).
# If the resolved model uses a different provider (e.g., google/gemini for pro tier),
# we need to verify that provider too. Skip if same provider or if using OpenCode
# (which manages routing internally).
if [[ "$ai_cli" != "opencode" && -n "$resolved_model" && "$resolved_model" == *"/"* ]]; then
local resolved_provider="${resolved_model%%/*}"
local health_provider="${health_model%%/*}"
if [[ "$resolved_provider" != "$health_provider" ]]; then
local availability_helper="${SCRIPT_DIR}/model-availability-helper.sh"
if [[ -x "$availability_helper" ]]; then
local resolved_avail_exit=0
"$availability_helper" check "$resolved_provider" --quiet || resolved_avail_exit=$?
if [[ "$resolved_avail_exit" -ne 0 ]]; then
case "$resolved_avail_exit" in
2)
log_warn "Resolved model provider '$resolved_provider' is rate-limited (exit $resolved_avail_exit) for $task_id — deferring dispatch"
;;
3)
log_error "Resolved model provider '$resolved_provider' has invalid key/credits (exit $resolved_avail_exit) for $task_id — blocking dispatch"
;;
*)
log_warn "Resolved model provider '$resolved_provider' unavailable (exit $resolved_avail_exit) for $task_id — deferring dispatch"
;;
esac
return 3
fi
fi
fi
fi

local dispatch_type="full"
if [[ "$verify_mode" == "true" ]]; then
dispatch_type="verify"
fi
log_info "Dispatching $task_id via $ai_cli ($dispatch_mode mode, $dispatch_type dispatch)"
log_info "Worktree: $worktree_path"
log_info "Model: $resolved_model"
log_info "Log: $log_file"

# Build and execute dispatch command
# t1008: Use verify dispatch for previously-worked tasks (cheaper, focused)
# Use NUL-delimited read to preserve multi-line prompts as single arguments
local -a cmd_parts=()
if [[ "$verify_mode" == "true" ]]; then
while IFS= read -r -d '' part; do
cmd_parts+=("$part")
done < <(build_verify_dispatch_cmd "$task_id" "$worktree_path" "$log_file" "$ai_cli" "$memory_context" "$resolved_model" "$tdesc" "$verify_reason")
else
while IFS= read -r -d '' part; do
cmd_parts+=("$part")
done < <(build_dispatch_cmd "$task_id" "$worktree_path" "$log_file" "$ai_cli" "$memory_context" "$resolved_model" "$tdesc")
fi

# Ensure PID directory exists before dispatch
mkdir -p "$SUPERVISOR_DIR/pids"

# Set FULL_LOOP_HEADLESS for all supervisor-dispatched workers (t174)
# This ensures headless mode even if the AI doesn't parse --headless from the prompt
local headless_env="FULL_LOOP_HEADLESS=true"

# Generate worker-specific MCP config with heavy indexers disabled (t221)
# Saves ~4 CPU cores per worker by preventing osgrep from indexing
local worker_xdg_config=""
worker_xdg_config=$(generate_worker_mcp_config "$task_id" 2>/dev/null) || true

# Write dispatch script to a temp file to avoid bash -c quoting issues
# with multi-line prompts (newlines in printf '%q' break bash -c strings)
local dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch.sh"
{
echo '#!/usr/bin/env bash'
echo "# Startup sentinel (t183): if this line appears in the log, the script started"
echo "echo 'WORKER_STARTED task_id=${task_id} pid=\$\$ timestamp='\$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "cd '${worktree_path}' || { echo 'WORKER_FAILED: cd to worktree failed: ${worktree_path}'; exit 1; }"
echo "export ${headless_env}"
# Redirect worker to use MCP config with heavy indexers disabled (t221)
if [[ -n "$worker_xdg_config" ]]; then
echo "export XDG_CONFIG_HOME='${worker_xdg_config}'"
fi
# Write each cmd_part as a properly quoted array element
printf 'exec '
printf '%q ' "${cmd_parts[@]}"
printf '\n'
} >"$dispatch_script"
chmod +x "$dispatch_script"

# Wrapper script (t183): captures errors from the dispatch script itself.
# Previous approach used nohup bash -c with &>/dev/null which swallowed
# errors when the dispatch script failed to start (e.g., opencode not found).
# Now errors are appended to the log file for diagnosis.
# t253: Add cleanup handlers to prevent orphaned children when wrapper exits
local wrapper_script="${SUPERVISOR_DIR}/pids/${task_id}-wrapper.sh"
{
echo '#!/usr/bin/env bash'
echo '# t253: Recursive cleanup to kill all descendant processes'
echo '_kill_descendants_recursive() {'
echo ' local parent_pid="$1"'
echo ' local children'
echo ' children=$(pgrep -P "$parent_pid" 2>/dev/null || true)'
echo ' if [[ -n "$children" ]]; then'
echo ' for child in $children; do'
echo ' _kill_descendants_recursive "$child"'
echo ' done'
echo ' fi'
echo ' kill -TERM "$parent_pid" 2>/dev/null || true'
echo '}'
echo ''
echo 'cleanup_children() {'
echo ' local wrapper_pid=$$'
echo ' local children'
echo ' children=$(pgrep -P "$wrapper_pid" 2>/dev/null || true)'
echo ' if [[ -n "$children" ]]; then'
echo ' # Recursively kill all descendants'
echo ' for child in $children; do'
echo ' _kill_descendants_recursive "$child"'
echo ' done'
echo ' sleep 0.5'
echo ' # Force kill any survivors'
echo ' for child in $children; do'
echo ' pkill -9 -P "$child" 2>/dev/null || true'
echo ' kill -9 "$child" 2>/dev/null || true'
echo ' done'
echo ' fi'
echo '}'
echo '# Register cleanup on EXIT, INT, TERM (KILL cannot be trapped)'
echo 'trap cleanup_children EXIT INT TERM'
echo ''
echo "'${dispatch_script}' >> '${log_file}' 2>&1"
echo "rc=\$?"
echo "echo \"EXIT:\${rc}\" >> '${log_file}'"
echo "if [ \$rc -ne 0 ]; then"
echo " echo \"WORKER_DISPATCH_ERROR: dispatch script exited with code \${rc}\" >> '${log_file}'"
echo "fi"
} >"$wrapper_script"
chmod +x "$wrapper_script"

if [[ "$dispatch_mode" == "tabby" ]]; then
# Tabby: attempt to open in a new tab via OSC 1337 escape sequence
log_info "Opening Tabby tab for $task_id..."
printf '\e]1337;NewTab=%s\a' "'${wrapper_script}'" 2>/dev/null || true
# Also start background process as fallback (Tabby may not support OSC 1337)
# t253: Use setsid if available (Linux) for process group isolation
# Use nohup + disown to survive parent (cron) exit
if command -v setsid &>/dev/null; then
nohup setsid bash "${wrapper_script}" &>/dev/null &
else
nohup bash "${wrapper_script}" &>/dev/null &
fi
else
# Headless: background process
# t253: Use setsid if available (Linux) for process group isolation
# Use nohup + disown to survive parent (cron) exit — without this,
# workers die after ~2 minutes when the cron pulse script exits
if command -v setsid &>/dev/null; then
nohup setsid bash "${wrapper_script}" &>/dev/null &
else
nohup bash "${wrapper_script}" &>/dev/null &
fi
fi

local worker_pid=$!
disown "$worker_pid" 2>/dev/null || true

# Store PID for monitoring
echo "$worker_pid" >"$SUPERVISOR_DIR/pids/${task_id}.pid"

# Transition to running
cmd_transition "$task_id" "running" --session "pid:$worker_pid"

# Add dispatched:model label to GitHub issue (t1010)
add_model_label "$task_id" "dispatched" "$resolved_model" "${trepo:-.}" 2>>"$SUPERVISOR_LOG" || true

log_success "Dispatched $task_id (PID: $worker_pid)"
echo "$worker_pid"
return 0
}
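
The cross-platform timestamp parsing in the stale-claim check (BSD `date -j -f` on macOS, GNU `date -d` on Linux) can be isolated as a small helper. A minimal sketch; `iso_to_epoch` and `is_stale_claim` are illustrative names, not helpers from this repo:

```shell
#!/usr/bin/env bash
# Illustrative helpers (not from this repo) mirroring the stale-claim logic
# in cmd_dispatch: parse an ISO-8601 UTC timestamp with whichever date(1)
# flavor is available, then compare against a 2h staleness threshold.
iso_to_epoch() {
	local ts="$1" epoch
	# BSD/macOS date: -j (don't set the clock) with an explicit input format
	epoch=$(date -u -j -f "%Y-%m-%dT%H:%M:%SZ" "$ts" "+%s" 2>/dev/null) ||
		# GNU/Linux date: -d parses ISO-8601 directly
		epoch=$(date -u -d "$ts" "+%s" 2>/dev/null) || epoch=0
	echo "${epoch:-0}"
}

is_stale_claim() {
	local started_epoch now_epoch threshold="${2:-7200}" # 2 hours
	started_epoch=$(iso_to_epoch "$1")
	now_epoch=$(date "+%s")
	# Unparseable timestamps (epoch 0) are never treated as stale here;
	# the real dispatcher falls back to the DB-status check in that case.
	[[ "$started_epoch" -gt 0 ]] && ((now_epoch - started_epoch > threshold))
}
```

The `||` fallback chain is the same pattern used at L1254 above: the BSD invocation fails cleanly on GNU systems (and vice versa), so one code path works on both platforms without probing `uname`.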

⚠️ Potential issue | 🟠 Major

Verify-mode should resolve to a cheaper tier (sonnet), not coding.
The comment says “prefer sonnet tier,” but resolve_model "coding" maps to Opus; this inflates cost and defeats the optimization.

🔧 Proposed fix
 	if [[ "$verify_mode" == "true" ]]; then
-		resolved_model=$(resolve_model "coding" "$ai_cli" 2>/dev/null) || resolved_model=""
-		log_info "Verify mode: using coding-tier model ($resolved_model) instead of task-specific model"
+		resolved_model=$(resolve_model "sonnet" "$ai_cli" 2>/dev/null) || resolved_model=""
+		log_info "Verify mode: using sonnet-tier model ($resolved_model) instead of task-specific model"
 	else
 		resolved_model=$(resolve_task_model "$task_id" "$tmodel" "${trepo:-.}" "$ai_cli")
 	fi
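
A hypothetical tier table makes the cost concern concrete. The tier names and model IDs below are illustrative assumptions, not the repo's actual resolve_model mapping:

```shell
# Hypothetical tier-to-model mapping (illustrative only): if the "coding"
# tier resolves to an Opus-class model, verify-mode dispatches silently
# inherit the most expensive model; a dedicated cheaper tier avoids that.
resolve_model_demo() {
	case "$1" in
		coding) echo "anthropic/claude-opus-4" ;;   # full implementation tier
		sonnet) echo "anthropic/claude-sonnet-4" ;; # cheaper verification tier
		health) echo "anthropic/claude-haiku-4" ;;  # availability probe tier
		*) return 1 ;;
	esac
}

# Under this mapping, a verify-mode dispatch that asks for "coding" pays
# Opus rates for a check that only needs to confirm deliverables exist.
verify_model=$(resolve_model_demo "sonnet")
```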
local worker_xdg_config=""
worker_xdg_config=$(generate_worker_mcp_config "$task_id" 2>/dev/null) || true
# Write dispatch script to a temp file to avoid bash -c quoting issues
# with multi-line prompts (newlines in printf '%q' break bash -c strings)
local dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch.sh"
{
echo '#!/usr/bin/env bash'
echo "# Startup sentinel (t183): if this line appears in the log, the script started"
echo "echo 'WORKER_STARTED task_id=${task_id} pid=\$\$ timestamp='\$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "cd '${worktree_path}' || { echo 'WORKER_FAILED: cd to worktree failed: ${worktree_path}'; exit 1; }"
echo "export ${headless_env}"
# Redirect worker to use MCP config with heavy indexers disabled (t221)
if [[ -n "$worker_xdg_config" ]]; then
echo "export XDG_CONFIG_HOME='${worker_xdg_config}'"
fi
# Write each cmd_part as a properly quoted array element
printf 'exec '
printf '%q ' "${cmd_parts[@]}"
printf '\n'
} >"$dispatch_script"
chmod +x "$dispatch_script"
# Wrapper script (t183): captures errors from the dispatch script itself.
# Previous approach used nohup bash -c with &>/dev/null which swallowed
# errors when the dispatch script failed to start (e.g., opencode not found).
# Now errors are appended to the log file for diagnosis.
# t253: Add cleanup handlers to prevent orphaned children when wrapper exits
local wrapper_script="${SUPERVISOR_DIR}/pids/${task_id}-wrapper.sh"
{
echo '#!/usr/bin/env bash'
echo '# t253: Recursive cleanup to kill all descendant processes'
echo '_kill_descendants_recursive() {'
echo ' local parent_pid="$1"'
echo ' local children'
echo ' children=$(pgrep -P "$parent_pid" 2>/dev/null || true)'
echo ' if [[ -n "$children" ]]; then'
echo ' for child in $children; do'
echo ' _kill_descendants_recursive "$child"'
echo ' done'
echo ' fi'
echo ' kill -TERM "$parent_pid" 2>/dev/null || true'
echo '}'
echo ''
echo 'cleanup_children() {'
echo ' local wrapper_pid=$$'
echo ' local children'
echo ' children=$(pgrep -P "$wrapper_pid" 2>/dev/null || true)'
echo ' if [[ -n "$children" ]]; then'
echo ' # Recursively kill all descendants'
echo ' for child in $children; do'
echo ' _kill_descendants_recursive "$child"'
echo ' done'
echo ' sleep 0.5'
echo ' # Force kill any survivors'
echo ' for child in $children; do'
echo ' pkill -9 -P "$child" 2>/dev/null || true'
echo ' kill -9 "$child" 2>/dev/null || true'
echo ' done'
echo ' fi'
echo '}'
echo '# Register cleanup on EXIT, INT, TERM (KILL cannot be trapped)'
echo 'trap cleanup_children EXIT INT TERM'
echo ''
echo "'${dispatch_script}' >> '${log_file}' 2>&1"
echo "rc=\$?"
echo "echo \"EXIT:\${rc}\" >> '${log_file}'"
echo "if [ \$rc -ne 0 ]; then"
echo " echo \"WORKER_DISPATCH_ERROR: dispatch script exited with code \${rc}\" >> '${log_file}'"
echo "fi"
} >"$wrapper_script"
chmod +x "$wrapper_script"
if [[ "$dispatch_mode" == "tabby" ]]; then
# Tabby: attempt to open in a new tab via OSC 1337 escape sequence
log_info "Opening Tabby tab for $task_id..."
printf '\e]1337;NewTab=%s\a' "'${wrapper_script}'" 2>/dev/null || true
# Also start background process as fallback (Tabby may not support OSC 1337)
# t253: Use setsid if available (Linux) for process group isolation
# Use nohup + disown to survive parent (cron) exit
if command -v setsid &>/dev/null; then
nohup setsid bash "${wrapper_script}" &>/dev/null &
else
nohup bash "${wrapper_script}" &>/dev/null &
fi
else
# Headless: background process
# t253: Use setsid if available (Linux) for process group isolation
# Use nohup + disown to survive parent (cron) exit — without this,
# workers die after ~2 minutes when the cron pulse script exits
if command -v setsid &>/dev/null; then
nohup setsid bash "${wrapper_script}" &>/dev/null &
else
nohup bash "${wrapper_script}" &>/dev/null &
fi
fi
local worker_pid=$!
disown "$worker_pid" 2>/dev/null || true
# Store PID for monitoring
echo "$worker_pid" >"$SUPERVISOR_DIR/pids/${task_id}.pid"
# Transition to running
cmd_transition "$task_id" "running" --session "pid:$worker_pid"
# Add dispatched:model label to GitHub issue (t1010)
add_model_label "$task_id" "dispatched" "$resolved_model" "${trepo:-.}" 2>>"$SUPERVISOR_LOG" || true
log_success "Dispatched $task_id (PID: $worker_pid)"
echo "$worker_pid"
return 0
}
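The dispatch builder above serializes the command array with `printf '%q'` specifically so multi-line prompts survive being written into the generated script. A standalone sketch (not from this PR) of that round trip:

```shell
#!/usr/bin/env bash
# Standalone sketch: printf '%q' quoting lets a multi-line argument
# survive a write-to-file round trip, which is why the builder above
# emits `printf 'exec '` followed by `printf '%q ' "${cmd_parts[@]}"`.
prompt=$'line one\nline two with "quotes" and $dollars'
cmd_parts=(echo --flag "$prompt")

script=$(mktemp)
{
	echo '#!/usr/bin/env bash'
	printf 'exec '
	printf '%q ' "${cmd_parts[@]}"
	printf '\n'
} >"$script"
chmod +x "$script"

# The argument comes back byte-for-byte identical
out=$("$script")
[[ "$out" == "--flag $prompt" ]] && echo "round-trip ok"
```

The same approach via `bash -c` would break: embedded newlines in the `%q`-quoted string are fine inside a script file but awkward to splice into a single-quoted `-c` argument, which is the quoting issue the temp-file approach sidesteps.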
local resolved_model
if [[ "$verify_mode" == "true" ]]; then
resolved_model=$(resolve_model "sonnet" "$ai_cli" 2>/dev/null) || resolved_model=""
log_info "Verify mode: using sonnet-tier model ($resolved_model) instead of task-specific model"
else
resolved_model=$(resolve_task_model "$task_id" "$tmodel" "${trepo:-.}" "$ai_cli")
fi
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/dispatch.sh around lines 1387 - 1909: the
verify-mode branch incorrectly calls resolve_model "coding" (which maps to
Opus), so verify dispatches use a higher-cost model than intended. In
cmd_dispatch, change the verify-mode model resolution (where resolved_model
is set for verify_mode) to call resolve_model "sonnet" (or the intended
cheaper tier), and update any log text referencing a "coding-tier model" to
reflect the sonnet/verify tier, so the cheaper model is used when building
the verify dispatch command (see resolved_model, verify_mode, resolve_model,
build_verify_dispatch_cmd).
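The dispatch path above detaches workers with nohup (plus setsid where available) and disown so they outlive the cron pulse that spawned them. A minimal standalone sketch of the fallback branch, with a hypothetical PID-file path standing in for `pids/<task>.pid`:

```shell
#!/usr/bin/env bash
# Standalone sketch of the detach pattern: nohup shields the child from
# SIGHUP when the parent exits, and disown removes it from the shell's
# job table so the shell won't signal it on exit. On Linux, prefixing
# with setsid (as the code above does) additionally gives the worker
# its own session and process group.
nohup bash -c 'sleep 2' &>/dev/null &
worker_pid=$!
disown "$worker_pid" 2>/dev/null || true

pid_file=$(mktemp) # hypothetical stand-in for pids/<task>.pid
echo "$worker_pid" >"$pid_file"

# The worker remains alive and monitorable via the stored PID
kill -0 "$(cat "$pid_file")" && echo "worker $worker_pid alive"
```

Without this, workers inherit the cron script's session and die when it exits, which is the ~2-minute failure mode the t253 comment describes.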

Comment on lines +675 to +1086
evaluate_worker() {
local task_id="$1"
local skip_ai_eval="${2:-false}"

ensure_db

local escaped_id
escaped_id=$(sql_escape "$task_id")
local task_row
task_row=$(db -separator '|' "$SUPERVISOR_DB" "
SELECT status, log_file, retries, max_retries, session_id, pr_url
FROM tasks WHERE id = '$escaped_id';
")

if [[ -z "$task_row" ]]; then
log_error "Task not found: $task_id"
return 1
fi

local tstatus tlog tretries tmax_retries tsession tpr_url
IFS='|' read -r tstatus tlog tretries tmax_retries tsession tpr_url <<<"$task_row"

# Enhanced no_log_file diagnostics (t183)
# Instead of a bare "failed:no_log_file", gather context about why the log
# is missing so the supervisor can make better retry/block decisions and
# self-healing diagnostics have actionable information.
if [[ -z "$tlog" ]]; then
# No log path in DB at all — dispatch likely failed before setting log_file
local diag_detail="no_log_path_in_db"
local pid_file="$SUPERVISOR_DIR/pids/${task_id}.pid"
if [[ -f "$pid_file" ]]; then
local stale_pid
stale_pid=$(cat "$pid_file" 2>/dev/null || echo "")
if [[ -n "$stale_pid" ]] && ! kill -0 "$stale_pid" 2>/dev/null; then
diag_detail="no_log_path_in_db:worker_pid_${stale_pid}_dead"
elif [[ -n "$stale_pid" ]]; then
diag_detail="no_log_path_in_db:worker_pid_${stale_pid}_alive"
fi
fi
echo "failed:${diag_detail}"
return 0
fi

if [[ ! -f "$tlog" ]]; then
# Log path set in DB but file doesn't exist — worker wrapper never ran
local diag_detail="log_file_missing"
local dispatch_script="${SUPERVISOR_DIR}/pids/${task_id}-dispatch.sh"
local wrapper_script="${SUPERVISOR_DIR}/pids/${task_id}-wrapper.sh"
if [[ ! -f "$dispatch_script" && ! -f "$wrapper_script" ]]; then
diag_detail="log_file_missing:no_dispatch_scripts"
elif [[ -f "$dispatch_script" && ! -x "$dispatch_script" ]]; then
diag_detail="log_file_missing:dispatch_script_not_executable"
fi
local pid_file="$SUPERVISOR_DIR/pids/${task_id}.pid"
if [[ -f "$pid_file" ]]; then
local stale_pid
stale_pid=$(cat "$pid_file" 2>/dev/null || echo "")
if [[ -n "$stale_pid" ]] && ! kill -0 "$stale_pid" 2>/dev/null; then
diag_detail="${diag_detail}:worker_pid_${stale_pid}_dead"
fi
else
diag_detail="${diag_detail}:no_pid_file"
fi
echo "failed:${diag_detail}"
return 0
fi

# Log file exists but may be empty or contain only metadata header (t183)
local log_size
log_size=$(wc -c <"$tlog" 2>/dev/null | tr -d ' ')
if [[ "$log_size" -eq 0 ]]; then
echo "failed:log_file_empty"
return 0
fi

# Check if worker never started (only dispatch metadata, no WORKER_STARTED sentinel)
if [[ "$log_size" -lt 500 ]] && ! grep -q 'WORKER_STARTED' "$tlog" 2>/dev/null; then
# Log has metadata but worker never started — extract any error from log
local startup_error=""
startup_error=$(grep -i 'WORKER_FAILED\|WORKER_DISPATCH_ERROR\|command not found\|No such file\|Permission denied' "$tlog" 2>/dev/null | head -1 | head -c 200 || echo "")
if [[ -n "$startup_error" ]]; then
echo "failed:worker_never_started:$(echo "$startup_error" | tr ' ' '_' | tr -cd '[:alnum:]_:-')"
else
echo "failed:worker_never_started:no_sentinel"
fi
return 0
fi

# --- Tier 1: Deterministic signal detection ---

# Parse structured metadata from log (bash 3.2 compatible - no associative arrays)
local meta_output
meta_output=$(extract_log_metadata "$tlog")

# Helper: extract a value from key=value metadata output
_meta_get() {
local key="$1" default="${2:-}"
local val
val=$(echo "$meta_output" | grep "^${key}=" | head -1 | cut -d= -f2-)
echo "${val:-$default}"
}

local meta_signal meta_pr_url meta_exit_code
meta_signal=$(_meta_get "signal" "none")
meta_pr_url=$(_meta_get "pr_url" "")
meta_exit_code=$(_meta_get "exit_code" "")

# Seed PR URL from DB (t171): check_pr_status() or a previous pulse may have
# already found and persisted the PR URL. Use it before expensive gh API calls.
if [[ -z "$meta_pr_url" && -n "${tpr_url:-}" && "$tpr_url" != "no_pr" && "$tpr_url" != "task_only" ]]; then
meta_pr_url="$tpr_url"
fi

# Resolve repo slug early — needed for PR validation (t195) and fallback detection
local task_repo task_branch repo_slug_detect
task_repo=$(sqlite3 "$SUPERVISOR_DB" "SELECT repo FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")
task_branch=$(sqlite3 "$SUPERVISOR_DB" "SELECT branch FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")
repo_slug_detect=""
if [[ -n "$task_repo" ]]; then
repo_slug_detect=$(detect_repo_slug "$task_repo" 2>/dev/null || echo "")
fi

# Validate PR URL belongs to this task (t195, t223): a previous pulse
# may have stored a PR URL that doesn't actually reference this task ID
# (e.g., branch reuse, stale data, or log containing another task's PR URL).
# Validate before using for attribution. If repo slug detection failed,
# clear the PR URL entirely — unvalidated URLs cause cross-contamination
# where the wrong PR gets linked to the wrong task (t223).
if [[ -n "$meta_pr_url" ]]; then
if [[ -n "$repo_slug_detect" ]]; then
local validated_url
validated_url=$(validate_pr_belongs_to_task "$task_id" "$repo_slug_detect" "$meta_pr_url") || validated_url=""
if [[ -z "$validated_url" ]]; then
log_warn "evaluate_worker: PR URL for $task_id failed task ID validation — clearing"
meta_pr_url=""
fi
else
log_warn "evaluate_worker: cannot validate PR URL for $task_id (repo slug detection failed) — clearing to prevent cross-contamination"
meta_pr_url=""
fi
fi

# Fallback PR URL detection via centralized discover_pr_by_branch() (t232, t161, t195)
if [[ -z "$meta_pr_url" && -n "$repo_slug_detect" ]]; then
meta_pr_url=$(discover_pr_by_branch "$task_id" "$repo_slug_detect" "$task_branch") || meta_pr_url=""
fi

local meta_rate_limit_count meta_auth_error_count meta_conflict_count
local meta_timeout_count meta_oom_count meta_backend_error_count
meta_rate_limit_count=$(_meta_get "rate_limit_count" "0")
meta_auth_error_count=$(_meta_get "auth_error_count" "0")
meta_conflict_count=$(_meta_get "conflict_count" "0")
meta_timeout_count=$(_meta_get "timeout_count" "0")
meta_oom_count=$(_meta_get "oom_count" "0")
meta_backend_error_count=$(_meta_get "backend_error_count" "0")

# FULL_LOOP_COMPLETE = definitive success
if [[ "$meta_signal" == "FULL_LOOP_COMPLETE" ]]; then
echo "complete:${meta_pr_url:-no_pr}"
return 0
fi

# t1008: Verify-mode worker signals
# VERIFY_COMPLETE = verification confirmed prior work is done
if [[ "$meta_signal" == "VERIFY_COMPLETE" ]]; then
log_info "Verify worker confirmed $task_id is complete"
echo "complete:${meta_pr_url:-verified_complete}"
return 0
fi

# VERIFY_INCOMPLETE = prior work exists but needs more; worker continued implementation
if [[ "$meta_signal" == "VERIFY_INCOMPLETE" ]]; then
if [[ -n "$meta_pr_url" ]]; then
log_info "Verify worker found incomplete work for $task_id, continued and created PR"
echo "complete:${meta_pr_url}"
return 0
fi
# No PR = worker found incomplete work but couldn't finish
log_info "Verify worker found incomplete work for $task_id but no PR created"
echo "retry:verify_incomplete_no_pr"
return 0
fi

# VERIFY_NOT_STARTED = no prior work found; worker should have done full implementation
if [[ "$meta_signal" == "VERIFY_NOT_STARTED" ]]; then
if [[ -n "$meta_pr_url" ]]; then
log_info "Verify worker found no prior work for $task_id, did full implementation"
echo "complete:${meta_pr_url}"
return 0
fi
# No PR = verify worker couldn't complete full implementation (expected — it's lightweight)
log_info "Verify worker found no prior work for $task_id and couldn't complete — re-queue for full dispatch"
echo "retry:verify_not_started_needs_full"
return 0
fi

# TASK_COMPLETE with clean exit = partial success (PR phase may have failed)
# If a PR URL is available (from DB or gh fallback), include it.
if [[ "$meta_signal" == "TASK_COMPLETE" && "$meta_exit_code" == "0" ]]; then
echo "complete:${meta_pr_url:-task_only}"
return 0
fi

# PR URL with clean exit = task completed (PR was created successfully)
# This takes priority over heuristic error patterns because log content
# may discuss auth/errors as part of the task itself (e.g., creating an
# API integration subagent that documents authentication flows)
if [[ -n "$meta_pr_url" && "$meta_exit_code" == "0" ]]; then
echo "complete:${meta_pr_url}"
return 0
fi

# Backend infrastructure error with EXIT:0 (t095-diag-1): CLI wrappers like
# OpenCode exit 0 even when the backend rejects the request (quota exceeded,
# backend down). A short log with backend errors means the worker never
# started - this is NOT content discussion, it's a real failure.
# Must be checked BEFORE clean_exit_no_signal to avoid wasting retries.
# (t198): Use content_lines instead of log_lines to exclude REPROMPT METADATA
# headers that inflate the line count in retry logs (8-line header caused
# 12-line logs to miss the < 10 threshold).
if [[ "$meta_exit_code" == "0" && "$meta_signal" == "none" ]]; then
local meta_content_lines
meta_content_lines=$(_meta_get "content_lines" "0")
# Billing/credits errors: block immediately, retrying won't help.
# OpenCode Zen proxy returns CreditsError when credits exhausted;
# this is a billing issue, not a transient backend error.
if [[ "$meta_backend_error_count" -gt 0 && "$meta_content_lines" -lt 10 ]]; then
if grep -qi 'CreditsError\|Insufficient balance' "$log_file" 2>/dev/null; then
echo "blocked:billing_credits_exhausted"
return 0
fi
echo "retry:backend_quota_error"
return 0
fi
fi

# Task obsolete detection (t198): workers that determine a task is already
# done or obsolete exit cleanly with EXIT:0, no signal, and no PR. Without
# this check, the supervisor retries them as clean_exit_no_signal, wasting
# retry attempts on work that will never produce a PR.
# Uses the final "type":"text" entry (authoritative) to detect explicit
# "already done" / "no changes needed" language from the worker.
if [[ "$meta_exit_code" == "0" && "$meta_signal" == "none" ]]; then
local meta_task_obsolete
meta_task_obsolete=$(_meta_get "task_obsolete" "false")
if [[ "$meta_task_obsolete" == "true" ]]; then
echo "complete:task_obsolete"
return 0
fi
fi

# Clean exit with no completion signal and no PR (checked DB + gh API above)
# = likely incomplete. The agent finished cleanly but didn't emit a signal
# and no PR was found. Retry (agent may have run out of context or hit a
# soft limit). If a PR exists, it was caught at line ~3179 via DB seed (t171)
# or gh fallback (t161).
if [[ "$meta_exit_code" == "0" && "$meta_signal" == "none" ]]; then
echo "retry:clean_exit_no_signal"
return 0
fi

# --- Tier 2: Heuristic error pattern matching ---
# ONLY applied when exit code is non-zero or missing.
# When exit=0, the agent finished cleanly - any "error" strings in the log
# are content (e.g., subagents documenting auth flows), not real failures.

if [[ "$meta_exit_code" != "0" ]]; then
# Backend infrastructure error (quota, API gateway) = transient retry
# Only checked on non-zero exit: a clean exit with backend error strings in
# the log is content discussion, not a real infrastructure failure.
if [[ "$meta_backend_error_count" -gt 0 ]]; then
echo "retry:backend_infrastructure_error"
return 0
fi

# Auth errors are always blocking (human must fix credentials)
if [[ "$meta_auth_error_count" -gt 0 ]]; then
echo "blocked:auth_error"
return 0
fi

# Merge conflicts require human resolution
if [[ "$meta_conflict_count" -gt 0 ]]; then
echo "blocked:merge_conflict"
return 0
fi

# OOM is infrastructure - blocking
if [[ "$meta_oom_count" -gt 0 ]]; then
echo "blocked:out_of_memory"
return 0
fi

# Rate limiting is transient - retry with backoff
if [[ "$meta_rate_limit_count" -gt 0 ]]; then
echo "retry:rate_limited"
return 0
fi

# Timeout is transient - retry
if [[ "$meta_timeout_count" -gt 0 ]]; then
echo "retry:timeout"
return 0
fi
fi

# Non-zero exit with known code
if [[ -n "$meta_exit_code" && "$meta_exit_code" != "0" ]]; then
# Exit code 130 = SIGINT (Ctrl+C), 137 = SIGKILL, 143 = SIGTERM
case "$meta_exit_code" in
130)
echo "retry:interrupted_sigint"
return 0
;;
137)
echo "retry:killed_sigkill"
return 0
;;
143)
echo "retry:terminated_sigterm"
return 0
;;
esac
fi

# Check if retries exhausted before attempting AI eval
if [[ "$tretries" -ge "$tmax_retries" ]]; then
echo "failed:max_retries"
return 0
fi

# --- Tier 2.5: Git heuristic signals (t175) ---
# Before expensive AI eval, check for concrete evidence of work in the
# task's worktree/branch. This resolves most ambiguous outcomes cheaply
# and prevents false retries when the worker completed but didn't emit
# a signal (e.g., context exhaustion after creating a PR).

# Reuse task_repo/task_branch from PR detection above; fetch worktree
local task_worktree
task_worktree=$(db "$SUPERVISOR_DB" "SELECT worktree FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "")

if [[ -n "$task_repo" && -d "$task_repo" ]]; then
# Use worktree path if available, otherwise fall back to repo
local git_dir="${task_worktree:-$task_repo}"
if [[ ! -d "$git_dir" ]]; then
git_dir="$task_repo"
fi

# Check for commits on branch ahead of main/master
local branch_commits=0
if [[ -n "$task_branch" ]]; then
local base_branch
base_branch=$(git -C "$git_dir" symbolic-ref refs/remotes/origin/HEAD 2>/dev/null | sed 's|refs/remotes/origin/||' || echo "main")
branch_commits=$(git -C "$git_dir" rev-list --count "${base_branch}..${task_branch}" 2>/dev/null || echo 0)
fi

# Check for uncommitted changes in worktree
local uncommitted_changes=0
if [[ -n "$task_worktree" && -d "$task_worktree" ]]; then
uncommitted_changes=$(git -C "$task_worktree" status --porcelain 2>/dev/null | wc -l | tr -d ' ')
fi

# Decision matrix:
# - Commits + PR URL → complete (worker finished, signal was lost)
# - Commits + no PR → auto-create PR (t247.2), fallback to task_only
# - No commits + uncommitted changes → retry:work_in_progress
# - No commits + no changes → genuine ambiguity (fall through to AI/retry)

if [[ "$branch_commits" -gt 0 ]]; then
if [[ -n "$meta_pr_url" ]]; then
echo "complete:${meta_pr_url}"
else
# t247.2: Auto-create PR instead of returning task_only.
# Saves ~300s per retry by preserving work for review.
local auto_pr_url=""
if [[ -n "$repo_slug_detect" && -n "$task_branch" ]]; then
auto_pr_url=$(auto_create_pr_for_task "$task_id" "$git_dir" "$task_branch" "$repo_slug_detect" 2>>"${SUPERVISOR_LOG:-/dev/null}") || auto_pr_url=""
fi
if [[ -n "$auto_pr_url" ]]; then
echo "complete:${auto_pr_url}"
else
echo "complete:task_only"
fi
fi
return 0
fi

if [[ "$uncommitted_changes" -gt 0 ]]; then
echo "retry:work_in_progress"
return 0
fi
fi

# --- Tier 3: AI evaluation for ambiguous outcomes ---

if [[ "$skip_ai_eval" == "true" ]]; then
echo "retry:ambiguous_skipped_ai"
return 0
fi

local ai_verdict
ai_verdict=$(evaluate_with_ai "$task_id" "$tlog" 2>/dev/null || echo "")

if [[ -n "$ai_verdict" ]]; then
echo "$ai_verdict"
return 0
fi

# AI eval failed or unavailable - default to retry
echo "retry:ambiguous_ai_unavailable"
return 0
}
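The `_meta_get` helper above keeps the evaluator bash-3.2 compatible by treating metadata as newline-delimited key=value text rather than an associative array. A standalone sketch of the pattern, with example metadata assumed (not real `extract_log_metadata` output):

```shell
#!/usr/bin/env bash
# Standalone sketch of the key=value lookup used by _meta_get above.
# meta_output here is example data, not real extract_log_metadata output.
meta_output=$'signal=TASK_COMPLETE\nexit_code=0\npr_url=https://example.com/pr/1'

meta_get() {
	local key="$1" default="${2:-}"
	local val
	# First match wins; cut -f2- preserves any '=' inside the value
	val=$(echo "$meta_output" | grep "^${key}=" | head -1 | cut -d= -f2-)
	echo "${val:-$default}"
}

meta_get signal none        # → TASK_COMPLETE
meta_get rate_limit_count 0 # → 0 (key absent, default returned)
```

Anchoring the grep at `^key=` avoids partial-key collisions (e.g., `exit_code` matching `code`), and the default argument mirrors how the evaluator seeds counters like `rate_limit_count` to 0.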

⚠️ Potential issue | 🟠 Major

Fix undefined log_file reference in billing error check.
This path uses $log_file (undefined) instead of $tlog, which can block on stdin or misclassify errors.

🔧 Proposed fix
-		if [[ "$meta_backend_error_count" -gt 0 && "$meta_content_lines" -lt 10 ]]; then
-			if grep -qi 'CreditsError\|Insufficient balance' "$log_file" 2>/dev/null; then
+		if [[ "$meta_backend_error_count" -gt 0 && "$meta_content_lines" -lt 10 ]]; then
+			if grep -qi 'CreditsError\|Insufficient balance' "$tlog" 2>/dev/null; then
 				echo "blocked:billing_credits_exhausted"
 				return 0
 			fi
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/evaluate.sh around lines 675 - 1086: in
evaluate_worker(), the billing/credits detection block references an
undefined variable log_file, so grep either reads stdin or misclassifies
errors. Update that grep call to use the log path variable tlog (used
throughout the function) so billing detection reads the actual task log
(see the block checking meta_backend_error_count and meta_content_lines
before echoing "blocked:billing_credits_exhausted").

Comment on lines +1267 to +1347
check_task_already_done() {
local task_id="${1:-}"
local project_root="${2:-.}"

if [[ -z "$task_id" ]]; then
return 1
fi

# Check 1: Is the task already marked [x] in TODO.md?
# IMPORTANT: TODO.md may contain the same task ID in multiple sections:
# - Active task list (authoritative — near the top)
# - Completed plan archive (historical — further down, from earlier iterations)
# We must check the FIRST occurrence only. If the first match is [x], it's done.
# If the first match is [ ] or [-], it's NOT done (even if a later [x] exists).
local todo_file="$project_root/TODO.md"
if [[ -f "$todo_file" ]]; then
local first_match=""
first_match=$(grep -E "^\s*- \[(x| |-)\] ${task_id}[[:space:]]" "$todo_file" 2>/dev/null | head -1) || true
if [[ -n "$first_match" ]]; then
# Extract ONLY the checkbox at the start of the line, not [x] anywhere in description
local checkbox=""
checkbox=$(printf '%s' "$first_match" | sed -n 's/^[[:space:]]*- \[\(.\)\].*/\1/p')
if [[ "$checkbox" == "x" ]]; then
log_info "Pre-dispatch check: $task_id is marked [x] in TODO.md (first occurrence)" >&2
return 0
else
# First occurrence is [ ] or [-] — task is NOT done, skip further checks
log_info "Pre-dispatch check: $task_id is [ ] in TODO.md (first occurrence — ignoring any later [x] entries)" >&2
return 1
fi
fi
fi

# Check 2: Are there merged commits referencing this task ID?
# IMPORTANT: Use word-boundary matching to prevent t020 matching t020.6.
# Escaped task_id for regex: dots become literal dots.
local escaped_task_regex
escaped_task_regex=$(printf '%s' "$task_id" | sed 's/\./\\./g')
# grep -w uses word boundaries but dots aren't word chars, so for subtask IDs
# like t020.1 we need a custom boundary: task_id followed by non-digit or EOL.
# This prevents t020 from matching t020.1, t020.2, etc.
local boundary_pattern="${task_id}([^.0-9]|$)"

local commit_count=0
commit_count=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
grep -cE "$boundary_pattern" 2>/dev/null) || true
if [[ "$commit_count" -gt 0 ]]; then
# Verify at least one commit looks like a REAL completion:
# Must have a PR merge reference "(#NNN)" AND the exact task ID.
# Exclude: "add tNNN", "claim tNNN", "mark tNNN blocked", "queue tNNN"
local completion_evidence=""
completion_evidence=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
grep -E "$boundary_pattern" |
grep -iE "\(#[0-9]+\)|PR #[0-9]+ merged" |
grep -ivE "add ${task_id}|claim ${task_id}|mark ${task_id}|queue ${task_id}|blocked" |
head -1) || true
if [[ -n "$completion_evidence" ]]; then
log_info "Pre-dispatch check: $task_id has completion evidence: $completion_evidence" >&2
return 0
fi
fi

# Check 3: Does a merged PR exist for this task?
# Only check if gh CLI is available and authenticated (cached check).
# Use exact task ID in title search to prevent substring matches.
# IMPORTANT: gh pr list --repo requires OWNER/REPO slug, not a local path (t224).
if command -v gh &>/dev/null && check_gh_auth 2>/dev/null; then
local repo_slug=""
repo_slug=$(detect_repo_slug "$project_root" 2>/dev/null) || true
if [[ -n "$repo_slug" ]]; then
local pr_count=0
pr_count=$(gh pr list --repo "$repo_slug" --state merged --search "\"$task_id\" in:title" --limit 1 --json number --jq 'length' 2>/dev/null) || true
if [[ "$pr_count" -gt 0 ]]; then
log_info "Pre-dispatch check: $task_id has a merged PR on GitHub" >&2
return 0
fi
fi
fi

return 1
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Find and examine the issue-sync.sh file
find . -name "issue-sync.sh" -type f

Repository: marcusquinn/aidevops

Length of output: 106


🏁 Script executed:

# Check the file at the specified lines
if [[ -f ".agents/scripts/supervisor/issue-sync.sh" ]]; then
  wc -l ".agents/scripts/supervisor/issue-sync.sh"
fi

Repository: marcusquinn/aidevops

Length of output: 109


🏁 Script executed:

# Read the relevant section of the code
sed -n '1267,1347p' ".agents/scripts/supervisor/issue-sync.sh"

Repository: marcusquinn/aidevops

Length of output: 3650


🏁 Script executed:

# Create a test git repository to verify git log --grep behavior
cd /tmp && rm -rf test_repo && mkdir test_repo && cd test_repo && git init

# Create some test commits with task IDs
git config user.email "test@test.com"
git config user.name "Test"
echo "test" > file.txt
git add . && git commit -m "Fix t128.10 task"
echo "test2" > file.txt
git add . && git commit -m "Fix t128x10 related issue"
echo "test3" > file.txt
git add . && git commit -m "Fix t020 task"

# Now test if unescaped dot in grep matches both
echo "=== Testing unescaped dot (false positive) ==="
git log --oneline -500 --all --grep="t128.10" 2>/dev/null || true

echo "=== Testing escaped dot (should be more specific) ==="
git log --oneline -500 --all --grep="t128\.10" 2>/dev/null || true

Repository: marcusquinn/aidevops

Length of output: 1146


🏁 Script executed:

# Test the boundary pattern with grep -E as used in the code
echo "=== Testing boundary pattern with unescaped task_id ==="
printf "Fix t128.10 task\nFix t128x10 related\n" | grep -cE "t128.10([^.0-9]|$)" && echo "Unescaped matches: 2"

echo "=== Testing boundary pattern with escaped task_id ==="
printf "Fix t128.10 task\nFix t128x10 related\n" | grep -cE "t128\.10([^.0-9]|$)" && echo "Escaped matches: 1"

Repository: marcusquinn/aidevops

Length of output: 217


🏁 Script executed:

# Also test the last grep pattern in the code that uses ${task_id}
# These are in the exclusion patterns for completion_evidence
task_id="t128.10"
echo "=== Testing exclusion pattern with unescaped task_id ==="
printf "add t128.10\nadd t128x10\n" | grep -ivE "add ${task_id}" || echo "No matches (correct)"

echo "=== Testing with escaped task_id ==="
escaped_task_id=$(printf '%s' "$task_id" | sed 's/\./\\./g')
printf "add t128.10\nadd t128x10\n" | grep -ivE "add ${escaped_task_id}" || echo "Matches line with x"

Repository: marcusquinn/aidevops

Length of output: 190


Escape task IDs comprehensively in all regex patterns to prevent false positives.

The dot in dotted task IDs (e.g., t128.10) acts as a regex wildcard in git log --grep and grep -E, so t128.10 also matches t128x10, t128a10, etc. The code creates escaped_task_regex but never uses it; the unescaped task_id is passed to three pattern contexts:

  • boundary_pattern="${task_id}([^.0-9]|$)" — matches wrong tasks in grep -cE
  • git log --grep="$task_id" — returns commits for unrelated tasks
  • grep -ivE "add ${task_id}|claim ${task_id}|..." — incorrectly excludes substring matches

This causes the function to incorrectly mark unrelated tasks as "already done", disrupting task dispatch logic.

🔧 Fix: Use escaped_task_regex in all regex contexts
  local escaped_task_regex
- escaped_task_regex=$(printf '%s' "$task_id" | sed 's/\./\\./g')
+ escaped_task_regex=$(printf '%s' "$task_id" | sed 's/[][\\/.*^$+?{}()|]/\\&/g')
  
- local boundary_pattern="${task_id}([^.0-9]|$)"
+ local boundary_pattern="${escaped_task_regex}([^.0-9]|$)"
  
- commit_count=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
+ commit_count=$(git -C "$project_root" log --oneline -500 --all --grep="$escaped_task_regex" 2>/dev/null |
  
- completion_evidence=$(git -C "$project_root" log --oneline -500 --all --grep="$task_id" 2>/dev/null |
+ completion_evidence=$(git -C "$project_root" log --oneline -500 --all --grep="$escaped_task_regex" 2>/dev/null |
  
- grep -ivE "add ${task_id}|claim ${task_id}|mark ${task_id}|queue ${task_id}|blocked" |
+ grep -ivE "add ${escaped_task_regex}|claim ${escaped_task_regex}|mark ${escaped_task_regex}|queue ${escaped_task_regex}|blocked" |
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/issue-sync.sh around lines 1267-1347: the
check_task_already_done function is vulnerable to regex mis-matches because
task_id is used raw. Replace uses of the raw task_id with the already-built
escaped_task_regex and adjust boundary_pattern to use it (e.g.,
boundary_pattern="${escaped_task_regex}([^.0-9]|$)"), pass
escaped_task_regex into git log (git -C "$project_root" log --oneline ...
--grep="$escaped_task_regex") and into the exclusion grep (grep -ivE "add
${escaped_task_regex}|claim ${escaped_task_regex}|mark
${escaped_task_regex}|queue ${escaped_task_regex}|blocked") so dotted IDs like
t128.10 are treated literally across the git log, grep -E, and grep -ivE checks.
Leave the rest of the check_task_already_done logic unchanged.
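The escaping step can be exercised in isolation. This is an illustrative sketch, assuming a helper name (escape_task_regex) that is not in the repo; only the sed expression and boundary pattern come from the diff above:

```shell
# Illustrative helper: escape ERE metacharacters so a dotted task ID
# like "t128.1" matches literally, not "t128x1" or "t128.10".
escape_task_regex() {
	printf '%s' "$1" | sed 's/[][\\/.*^$+?{}()|]/\\&/g'
}

task_id="t128.1"
escaped=$(escape_task_regex "$task_id")
boundary="${escaped}([^.0-9]|\$)"

# Only the first line matches: "t128.1" followed by a space, not "t128.10".
printf '%s\n' "fix t128.1 bug" "fix t128.10 bug" | grep -cE "$boundary"
```

The boundary alternation `([^.0-9]|$)` is what keeps `t128.1` from matching inside `t128.10` once the dot is escaped.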

Comment on lines +817 to +848
cmd_next() {
local batch_id="${1:-}" limit="${2:-1}"

ensure_db

if [[ -n "$batch_id" ]]; then
local escaped_batch
escaped_batch=$(sql_escape "$batch_id")

db -separator $'\t' "$SUPERVISOR_DB" "
SELECT t.id, t.repo, t.description, t.model
FROM batch_tasks bt
JOIN tasks t ON bt.task_id = t.id
WHERE bt.batch_id = '$escaped_batch'
AND t.status = 'queued'
AND t.retries < t.max_retries
ORDER BY t.retries ASC, bt.position
LIMIT $limit;
"
else
db -separator $'\t' "$SUPERVISOR_DB" "
SELECT id, repo, description, model
FROM tasks
WHERE status = 'queued'
AND retries < max_retries
ORDER BY retries ASC, created_at ASC
LIMIT $limit;
"
fi

return 0
}
⚠️ Potential issue | 🟠 Major

Validate limit before interpolating into SQL.
LIMIT $limit is fed from user input without numeric validation, opening the door to SQL injection or query errors.

🔧 Proposed fix (sanitize limit)
 cmd_next() {
 	local batch_id="${1:-}" limit="${2:-1}"
 
+	if [[ -n "$limit" && ! "$limit" =~ ^[0-9]+$ ]]; then
+		log_error "Invalid limit: $limit (must be a non-negative integer)"
+		return 1
+	fi
+
 	ensure_db
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/state.sh around lines 817-848: the cmd_next
function interpolates the unvalidated variable limit directly into the SQL LIMIT
clause. Validate and sanitize limit before use: check that it is a positive
integer, apply a sensible default and an upper cap (e.g., max_limit) if it is
out of range, and then use the sanitized numeric variable (not the raw input)
when building the db query. Update cmd_next (references: cmd_next, limit,
ensure_db, db, SUPERVISOR_DB) to perform this numeric validation/normalization
and only then substitute the sanitized value into the SQL.
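A standalone sketch of that normalization, assuming a cap name (max_limit) and a helper name (sanitize_limit) that are not in the repo:

```shell
# Illustrative guard: accept only digit strings, normalize, cap at an
# assumed max_limit of 100. Non-numeric input is rejected with status 1.
sanitize_limit() {
	local limit="${1:-1}" max_limit=100
	[[ "$limit" =~ ^[0-9]+$ ]] || return 1
	limit=$(( 10#$limit ))       # force base 10, avoids octal surprises like "08"
	if (( limit > max_limit )); then
		limit=$max_limit
	fi
	printf '%s\n' "$limit"
}

sanitize_limit 5                          # prints 5
sanitize_limit 500                        # capped: prints 100
sanitize_limit "5; DROP TABLE tasks" || echo "rejected"
```

Only after this guard would the numeric value be interpolated into `LIMIT`.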

Comment on lines +459 to +463
local threshold=$((cpu_cores * max_load_factor * 100))
if [[ "$load_ratio" -gt "$threshold" ]]; then
echo "overloaded=true"
else
echo "overloaded=false"
⚠️ Potential issue | 🟠 Major

Linux overload detection never triggers due to a bad threshold.
load_ratio is already a percent, but the threshold multiplies by cpu_cores again, so overload is effectively impossible. This prevents adaptive throttling under real load.

🔧 Proposed fix
-		local threshold=$((cpu_cores * max_load_factor * 100))
+		local threshold=$((max_load_factor * 100))
 		if [[ "$load_ratio" -gt "$threshold" ]]; then
 			echo "overloaded=true"
 		else
 			echo "overloaded=false"
🤖 Prompt for AI Agents
In @.agents/scripts/supervisor/utility.sh around lines 459-463: the threshold
calculation is wrong. Because load_ratio is already a percent, do not multiply
by cpu_cores again; compute the threshold from max_load_factor as a percent
(e.g., threshold = max_load_factor * 100, converting floats to an integer
percent if needed) and keep the existing comparison against load_ratio. Update
the assignment to the threshold variable (refer to threshold, load_ratio,
cpu_cores, max_load_factor) and ensure numeric conversion/rounding is handled so
the if [[ "$load_ratio" -gt "$threshold" ]] branch can trigger correctly.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a significant and valuable refactoring that modularizes the monolithic supervisor-helper.sh script, greatly improving maintainability. The extraction of functions into domain-specific modules is well-executed.

My review focuses on ensuring adherence to the repository's shell scripting style guide and identifying potential issues introduced during the refactoring. I've found a few critical issues that need to be addressed:

  • All the new/modified scripts are missing the required set -euo pipefail directive.
  • The deploy.sh script uses eval, which is forbidden by the style guide.
  • The evaluate.sh script calls trap helper functions that are not defined, and the approach to resolving this should consider the repository's guidelines on source dependencies versus function duplication.

I've also included a medium-severity suggestion to refactor the very long ensure_db function in database.sh for better readability.

Once these issues are resolved, this will be an excellent contribution to the project's health.

# Functions for creating, managing, and querying batches
# Functions for adding tasks and creating batches

set -euo pipefail


critical

The set -euo pipefail directive has been removed. According to the repository style guide (line 10), all scripts must use this setting for robustness and safety. Please add set -euo pipefail to the top of this script and all other new/modified shell scripts in this pull request.

References
  1. All scripts use #!/usr/bin/env bash and set -euo pipefail (link)

Comment on lines +2216 to +2221
local merge_cmd="gh pr merge \"$pr_number\" --repo \"$repo_slug\" --squash"
if [[ "$use_admin_flag" == "true" ]]; then
merge_cmd="$merge_cmd --admin"
fi

if ! merge_output=$(eval "$merge_cmd" 2>&1); then


critical

The use of eval is a security risk and is explicitly forbidden by the repository style guide (line 32). Please refactor this to use a bash array for dynamic command construction to avoid potential command injection vulnerabilities.

Suggested change
local merge_cmd="gh pr merge \"$pr_number\" --repo \"$repo_slug\" --squash"
if [[ "$use_admin_flag" == "true" ]]; then
merge_cmd="$merge_cmd --admin"
fi
if ! merge_output=$(eval "$merge_cmd" 2>&1); then
local -a merge_cmd_arr=("gh" "pr" "merge" "$pr_number" "--repo" "$repo_slug" "--squash")
if [[ "$use_admin_flag" == "true" ]]; then
merge_cmd_arr+=("--admin")
fi
if ! merge_output=$("${merge_cmd_arr[@]}" 2>&1); then
References
  1. No eval -- use bash arrays for dynamic command construction (link)
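The array pattern can be exercised without calling gh at all; this sketch only prints the argv that would run (the gh command itself is never executed here):

```shell
# Illustrative: build the command as an array so each flag and value
# stays a separate word; no eval, no re-parsing of quoted strings.
pr_number="42" repo_slug="owner/repo" use_admin_flag="true"

declare -a merge_cmd=(gh pr merge "$pr_number" --repo "$repo_slug" --squash)
if [[ "$use_admin_flag" == "true" ]]; then
	merge_cmd+=(--admin)
fi

printf '%s\n' "${merge_cmd[*]}"
# Executing it would be: merge_output=$("${merge_cmd[@]}" 2>&1)
```

Because the words are never flattened into a string, a PR number or repo slug containing shell metacharacters cannot be reinterpreted as code.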

Comment on lines +128 to +130
_save_cleanup_scope
trap '_run_cleanups' RETURN
push_cleanup "rm -f '${log_tail_file}'"


critical

This block uses _save_cleanup_scope, trap '_run_cleanups' RETURN, and push_cleanup. These functions are not defined within this script and appear to be from a shared utility file (like shared-constants.sh). This will cause a runtime error. Per repository guidelines (Rule 2), for standalone scripts, it is often preferable to duplicate simple, self-contained helper functions to maintain script independence and avoid path resolution issues, rather than introducing source dependencies. Please either duplicate these functions within this script if they are simple and self-contained, or ensure the necessary utility script is sourced if they are complex or shared across many non-standalone scripts.

References
  1. For standalone shell scripts, it is acceptable to duplicate simple, self-contained helper functions instead of introducing source dependencies. This maintains script independence and avoids risks like path resolution issues.
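If duplication is chosen, a minimal self-contained version of the cleanup pattern might look like the sketch below. The helper names are reimplemented here for illustration only; the real shared versions may differ, and this variant stores paths rather than command strings so it needs no eval:

```shell
# File-scoped cleanup stack: register paths during the function,
# remove them when the function returns via a RETURN trap.
declare -a _CLEANUP_FILES=()

push_cleanup_file() {
	_CLEANUP_FILES+=("$1")
}

_run_cleanups() {
	local f
	for f in "${_CLEANUP_FILES[@]}"; do
		rm -f -- "$f"
	done
	_CLEANUP_FILES=()
}

demo() {
	trap '_run_cleanups' RETURN
	local log_tail_file
	log_tail_file=$(mktemp)
	push_cleanup_file "$log_tail_file"
	echo "working with $log_tail_file"
}
```

A RETURN trap set inside a function is local to that function in bash, so `demo` cleans up after itself without touching callers.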

Comment on lines +140 to +584
ensure_db() {
if [[ ! -d "$SUPERVISOR_DIR" ]]; then
mkdir -p "$SUPERVISOR_DIR"
fi

if [[ ! -f "$SUPERVISOR_DB" ]]; then
init_db
return 0
fi

# Check if schema needs upgrade
local has_tasks
has_tasks=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='tasks';")
if [[ "$has_tasks" -eq 0 ]]; then
init_db
fi

# Migrate: add post-PR lifecycle states if CHECK constraint is outdated (t128.8)
# SQLite doesn't support ALTER CHECK, so we recreate the constraint via a temp table
# Note: uses dynamic column lists so cannot use safe_migrate() directly (t188)
local check_sql
check_sql=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
if [[ -n "$check_sql" ]] && ! echo "$check_sql" | grep -q 'pr_review'; then
log_info "Migrating database schema for post-PR lifecycle states (t128.8)..."

# Backup before migration (t188: fail-safe — abort if backup fails)
local t128_backup
t128_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-t128.8")
if [[ $? -ne 0 || -z "$t128_backup" ]]; then
log_error "Backup failed for t128.8 migration — aborting"
return 1
fi

# Detect which optional columns exist in the old table to preserve data (t162)
local has_issue_url_col has_diagnostic_of_col has_triage_result_col
has_issue_url_col=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='issue_url';" 2>/dev/null || echo "0")
has_diagnostic_of_col=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='diagnostic_of';" 2>/dev/null || echo "0")
has_triage_result_col=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='triage_result';" 2>/dev/null || echo "0")

# Build column lists dynamically based on what exists
local insert_cols="id, repo, description, status, session_id, worktree, branch, log_file, retries, max_retries, model, error, pr_url, created_at, started_at, completed_at, updated_at"
local select_cols="$insert_cols"
[[ "$has_issue_url_col" -gt 0 ]] && {
insert_cols="$insert_cols, issue_url"
select_cols="$select_cols, issue_url"
}
[[ "$has_diagnostic_of_col" -gt 0 ]] && {
insert_cols="$insert_cols, diagnostic_of"
select_cols="$select_cols, diagnostic_of"
}
[[ "$has_triage_result_col" -gt 0 ]] && {
insert_cols="$insert_cols, triage_result"
select_cols="$select_cols, triage_result"
}

db "$SUPERVISOR_DB" <<MIGRATE
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old;
CREATE TABLE tasks (
id TEXT PRIMARY KEY,
repo TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'queued'
CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
session_id TEXT,
worktree TEXT,
branch TEXT,
log_file TEXT,
retries INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
model TEXT DEFAULT 'anthropic/claude-opus-4-6',
error TEXT,
pr_url TEXT,
issue_url TEXT,
diagnostic_of TEXT,
triage_result TEXT,
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
started_at TEXT,
completed_at TEXT,
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks ($insert_cols)
SELECT $select_cols
FROM tasks_old;
DROP TABLE tasks_old;
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
CREATE INDEX IF NOT EXISTS idx_tasks_repo ON tasks(repo);
CREATE INDEX IF NOT EXISTS idx_tasks_created ON tasks(created_at);
CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
COMMIT;
PRAGMA foreign_keys=ON;
MIGRATE

# Verify row counts after migration (t188)
if ! verify_migration_rowcounts "$SUPERVISOR_DB" "$t128_backup" "tasks"; then
log_error "t128.8 migration VERIFICATION FAILED — rolling back"
rollback_sqlite_db "$SUPERVISOR_DB" "$t128_backup"
return 1
fi
log_success "Database schema migrated for post-PR lifecycle states (verified)"
fi

# Backup before ALTER TABLE migrations if any are needed (t162, t188)
local needs_alter_migration=false
local has_max_load has_release_on_complete has_diagnostic_of has_issue_url has_max_concurrency
has_max_load=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='max_load_factor';" 2>/dev/null || echo "0")
has_release_on_complete=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='release_on_complete';" 2>/dev/null || echo "0")
has_diagnostic_of=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='diagnostic_of';" 2>/dev/null || echo "0")
has_issue_url=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='issue_url';" 2>/dev/null || echo "0")
has_max_concurrency=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='max_concurrency';" 2>/dev/null || echo "0")
if [[ "$has_max_load" -eq 0 || "$has_release_on_complete" -eq 0 || "$has_diagnostic_of" -eq 0 || "$has_issue_url" -eq 0 || "$has_max_concurrency" -eq 0 ]]; then
needs_alter_migration=true
fi
if [[ "$needs_alter_migration" == "true" ]]; then
local alter_backup
alter_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-alter-columns")
if [[ $? -ne 0 || -z "$alter_backup" ]]; then
log_warn "Backup failed for ALTER TABLE migrations, proceeding cautiously"
fi
fi

# Migrate: add max_load_factor column to batches if missing (t135.15.4)
if [[ "$has_max_load" -eq 0 ]]; then
log_info "Migrating batches table: adding max_load_factor column (t135.15.4)..."
if ! log_cmd "db-migrate" db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN max_load_factor INTEGER NOT NULL DEFAULT 2;"; then
log_warn "Failed to add max_load_factor column (may already exist)"
else
log_success "Added max_load_factor column to batches"
fi
fi

# Migrate: add max_concurrency column to batches if missing (adaptive scaling cap)
if [[ "$has_max_concurrency" -eq 0 ]]; then
log_info "Migrating batches table: adding max_concurrency column..."
db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN max_concurrency INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
log_success "Added max_concurrency column to batches (0 = auto-detect from cpu_cores)"
fi

# Migrate: add release_on_complete and release_type columns to batches if missing (t128.10)
if [[ "$has_release_on_complete" -eq 0 ]]; then
log_info "Migrating batches table: adding release columns (t128.10)..."
db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN release_on_complete INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN release_type TEXT NOT NULL DEFAULT 'patch';" 2>/dev/null || true
log_success "Added release_on_complete and release_type columns to batches"
fi

# Migrate: add diagnostic_of column to tasks if missing (t150)
if [[ "$has_diagnostic_of" -eq 0 ]]; then
log_info "Migrating tasks table: adding diagnostic_of column (t150)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN diagnostic_of TEXT;" 2>/dev/null || true
db "$SUPERVISOR_DB" "CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);" 2>/dev/null || true
log_success "Added diagnostic_of column to tasks"
fi

# Migrate: add issue_url column (t149)
if [[ "$has_issue_url" -eq 0 ]]; then
log_info "Migrating tasks table: adding issue_url column (t149)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN issue_url TEXT;" 2>/dev/null || true
log_success "Added issue_url column to tasks"
fi

# Migrate: add triage_result column to tasks if missing (t148)
local has_triage_result
has_triage_result=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='triage_result';" 2>/dev/null || echo "0")
if [[ "$has_triage_result" -eq 0 ]]; then
log_info "Migrating tasks table: adding triage_result column (t148)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN triage_result TEXT;" 2>/dev/null || true
log_success "Added triage_result column to tasks"
fi

# Migrate: add review_triage to CHECK constraint if missing (t148)
local check_sql_t148
check_sql_t148=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
if [[ -n "$check_sql_t148" ]] && ! echo "$check_sql_t148" | grep -q 'review_triage'; then
log_info "Migrating database schema for review_triage state (t148)..."

# Backup before migration (t188: fail-safe — abort if backup fails)
local t148_backup
t148_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-t148")
if [[ $? -ne 0 || -z "$t148_backup" ]]; then
log_error "Backup failed for t148 migration — aborting"
return 1
fi

db "$SUPERVISOR_DB" <<'MIGRATE_T148'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old_t148;
CREATE TABLE tasks (
id TEXT PRIMARY KEY,
repo TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'queued'
CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
session_id TEXT,
worktree TEXT,
branch TEXT,
log_file TEXT,
retries INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
model TEXT DEFAULT 'anthropic/claude-opus-4-6',
error TEXT,
pr_url TEXT,
issue_url TEXT,
diagnostic_of TEXT,
triage_result TEXT,
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
started_at TEXT,
completed_at TEXT,
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
created_at, started_at, completed_at, updated_at
FROM tasks_old_t148;
DROP TABLE tasks_old_t148;
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
CREATE INDEX IF NOT EXISTS idx_tasks_repo ON tasks(repo);
CREATE INDEX IF NOT EXISTS idx_tasks_created ON tasks(created_at);
CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
COMMIT;
PRAGMA foreign_keys=ON;
MIGRATE_T148

# Verify row counts after migration (t188)
if ! verify_migration_rowcounts "$SUPERVISOR_DB" "$t148_backup" "tasks"; then
log_error "t148 migration VERIFICATION FAILED — rolling back"
rollback_sqlite_db "$SUPERVISOR_DB" "$t148_backup"
return 1
fi
log_success "Database schema migrated for review_triage state (verified)"
fi

# Migration: add verifying/verified/verify_failed states to CHECK constraint (t180)
# Check if the current schema already supports verify states
# NOTE: This migration originally used "INSERT INTO tasks SELECT * FROM tasks_old_t180"
# which silently fails if column counts don't match. Fixed in t188 to use explicit
# column lists and row-count verification with automatic rollback.
local has_verify_states
has_verify_states=$(db "$SUPERVISOR_DB" "SELECT sql FROM sqlite_master WHERE type='table' AND name='tasks';" 2>/dev/null || echo "")
if [[ -n "$has_verify_states" ]] && ! echo "$has_verify_states" | grep -q "verifying"; then
log_info "Migrating database schema for post-merge verification states (t180)..."

# Backup before migration (t188: fail-safe — abort if backup fails)
local t180_backup
t180_backup=$(backup_sqlite_db "$SUPERVISOR_DB" "pre-migrate-t180")
if [[ $? -ne 0 || -z "$t180_backup" ]]; then
log_error "Backup failed for t180 migration — aborting"
return 1
fi

db "$SUPERVISOR_DB" <<'MIGRATE_T180'
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
ALTER TABLE tasks RENAME TO tasks_old_t180;
CREATE TABLE tasks (
id TEXT PRIMARY KEY,
repo TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'queued'
CHECK(status IN ('queued','dispatched','running','evaluating','retrying','complete','pr_review','review_triage','merging','merged','deploying','deployed','verifying','verified','verify_failed','blocked','failed','cancelled')),
session_id TEXT,
worktree TEXT,
branch TEXT,
log_file TEXT,
retries INTEGER NOT NULL DEFAULT 0,
max_retries INTEGER NOT NULL DEFAULT 3,
model TEXT DEFAULT 'anthropic/claude-opus-4-6',
error TEXT,
pr_url TEXT,
issue_url TEXT,
diagnostic_of TEXT,
triage_result TEXT,
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
started_at TEXT,
completed_at TEXT,
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now'))
);
INSERT INTO tasks (id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
triage_result, created_at, started_at, completed_at, updated_at)
SELECT id, repo, description, status, session_id, worktree, branch,
log_file, retries, max_retries, model, error, pr_url, issue_url, diagnostic_of,
triage_result, created_at, started_at, completed_at, updated_at
FROM tasks_old_t180;
DROP TABLE tasks_old_t180;
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
CREATE INDEX IF NOT EXISTS idx_tasks_diagnostic ON tasks(diagnostic_of);
COMMIT;
PRAGMA foreign_keys=ON;
MIGRATE_T180

# Verify row counts after migration (t188)
if ! verify_migration_rowcounts "$SUPERVISOR_DB" "$t180_backup" "tasks"; then
log_error "t180 migration VERIFICATION FAILED — rolling back"
rollback_sqlite_db "$SUPERVISOR_DB" "$t180_backup"
return 1
fi
log_success "Database schema migrated for post-merge verification states"
fi

# Migrate: add escalation_depth and max_escalation columns to tasks (t132.6)
local has_escalation_depth
has_escalation_depth=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='escalation_depth';" 2>/dev/null || echo "0")
if [[ "$has_escalation_depth" -eq 0 ]]; then
log_info "Migrating tasks table: adding escalation columns (t132.6)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN escalation_depth INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN max_escalation INTEGER NOT NULL DEFAULT 2;" 2>/dev/null || true
log_success "Added escalation_depth and max_escalation columns to tasks"
fi

# Migrate: add skip_quality_gate column to batches (t132.6)
local has_skip_quality_gate
has_skip_quality_gate=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('batches') WHERE name='skip_quality_gate';" 2>/dev/null || echo "0")
if [[ "$has_skip_quality_gate" -eq 0 ]]; then
log_info "Migrating batches table: adding skip_quality_gate column (t132.6)..."
db "$SUPERVISOR_DB" "ALTER TABLE batches ADD COLUMN skip_quality_gate INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
log_success "Added skip_quality_gate column to batches"
fi

# Migrate: add proof_logs table if missing (t218)
local has_proof_logs
has_proof_logs=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='proof_logs';" 2>/dev/null || echo "0")
if [[ "$has_proof_logs" -eq 0 ]]; then
log_info "Migrating database: adding proof_logs table (t218)..."
db "$SUPERVISOR_DB" <<'MIGRATE_T218'
CREATE TABLE IF NOT EXISTS proof_logs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task_id TEXT NOT NULL,
event TEXT NOT NULL,
stage TEXT,
decision TEXT,
evidence TEXT,
decision_maker TEXT,
pr_url TEXT,
duration_secs INTEGER,
metadata TEXT,
timestamp TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
FOREIGN KEY (task_id) REFERENCES tasks(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_proof_logs_task ON proof_logs(task_id);
CREATE INDEX IF NOT EXISTS idx_proof_logs_event ON proof_logs(event);
CREATE INDEX IF NOT EXISTS idx_proof_logs_timestamp ON proof_logs(timestamp);
MIGRATE_T218
log_success "Added proof_logs table (t218)"
fi

# Migrate: add deploying_recovery_attempts column to tasks (t263)
local has_deploying_recovery
has_deploying_recovery=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='deploying_recovery_attempts';" 2>/dev/null || echo "0")
if [[ "$has_deploying_recovery" -eq 0 ]]; then
log_info "Migrating tasks table: adding deploying_recovery_attempts column (t263)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN deploying_recovery_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
log_success "Added deploying_recovery_attempts column to tasks (t263)"
fi

# Migrate: add rebase_attempts column to tasks (t298)
local has_rebase_attempts
has_rebase_attempts=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='rebase_attempts';" 2>/dev/null || echo "0")
if [[ "$has_rebase_attempts" -eq 0 ]]; then
log_info "Migrating tasks table: adding rebase_attempts column (t298)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN rebase_attempts INTEGER NOT NULL DEFAULT 0;" 2>/dev/null || true
log_success "Added rebase_attempts column to tasks (t298)"
fi

# Migrate: create contest tables if missing (t1011)
local has_contests
has_contests=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='contests';" 2>/dev/null || echo "0")
if [[ "$has_contests" -eq 0 ]]; then
log_info "Creating contest tables (t1011)..."
local contest_helper="${SCRIPT_DIR}/contest-helper.sh"
if [[ -x "$contest_helper" ]]; then
"$contest_helper" help >/dev/null 2>&1 || true
# contest-helper.sh ensure_contest_tables creates them on first use
# but we can also create them here for immediate availability
db "$SUPERVISOR_DB" <<'CONTEST_SQL'
CREATE TABLE IF NOT EXISTS contests (
id TEXT PRIMARY KEY,
task_id TEXT NOT NULL,
description TEXT,
status TEXT NOT NULL DEFAULT 'pending'
CHECK(status IN ('pending','dispatching','running','evaluating','scoring','complete','failed','cancelled')),
winner_model TEXT,
winner_entry_id TEXT,
winner_score REAL,
models TEXT NOT NULL,
batch_id TEXT,
repo TEXT,
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
completed_at TEXT,
metadata TEXT
);
CREATE TABLE IF NOT EXISTS contest_entries (
id TEXT PRIMARY KEY,
contest_id TEXT NOT NULL,
model TEXT NOT NULL,
task_id TEXT,
worktree TEXT,
branch TEXT,
log_file TEXT,
pr_url TEXT,
status TEXT NOT NULL DEFAULT 'pending'
CHECK(status IN ('pending','dispatched','running','complete','failed','cancelled')),
output_summary TEXT,
score_correctness REAL DEFAULT 0,
score_completeness REAL DEFAULT 0,
score_code_quality REAL DEFAULT 0,
score_clarity REAL DEFAULT 0,
weighted_score REAL DEFAULT 0,
cross_rank_scores TEXT,
created_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%SZ','now')),
completed_at TEXT,
FOREIGN KEY (contest_id) REFERENCES contests(id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_contests_task ON contests(task_id);
CREATE INDEX IF NOT EXISTS idx_contests_status ON contests(status);
CREATE INDEX IF NOT EXISTS idx_contest_entries_contest ON contest_entries(contest_id);
CREATE INDEX IF NOT EXISTS idx_contest_entries_status ON contest_entries(status);
CONTEST_SQL
log_success "Created contest tables (t1011)"
fi
fi

# Migrate: add last_main_sha column to tasks (t1029)
local has_last_main_sha
has_last_main_sha=$(db "$SUPERVISOR_DB" "SELECT count(*) FROM pragma_table_info('tasks') WHERE name='last_main_sha';" 2>/dev/null || echo "0")
if [[ "$has_last_main_sha" -eq 0 ]]; then
log_info "Migrating tasks table: adding last_main_sha column (t1029)..."
db "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN last_main_sha TEXT;" 2>/dev/null || true
log_success "Added last_main_sha column to tasks (t1029)"
fi

# Ensure WAL mode for existing databases created before t135.3
local current_mode
current_mode=$(db "$SUPERVISOR_DB" "PRAGMA journal_mode;" 2>/dev/null || echo "")
if [[ "$current_mode" != "wal" ]]; then
log_cmd "db-wal" db "$SUPERVISOR_DB" "PRAGMA journal_mode=WAL;" || log_warn "Failed to enable WAL mode"
fi

return 0
}


medium

The ensure_db function is over 440 lines long and handles numerous database migrations. For improved readability and maintainability, consider refactoring this into smaller, single-purpose functions, one for each migration (e.g., _migrate_t128_8, _migrate_add_max_load_factor). The main ensure_db function would then become a more manageable sequence of calls to these migration helpers.
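A hypothetical shape for that refactor, with the migration helpers reduced to stubs so the control flow is visible (all names here are illustrative, not from the repo):

```shell
# Each helper checks its own precondition and is safe to re-run;
# ensure_db becomes an ordered walk over the registered migrations.
_migrate_t128_8()          { echo "t128.8: post-PR lifecycle states"; }
_migrate_t148()            { echo "t148: review_triage state"; }
_migrate_t180()            { echo "t180: verification states"; }
_migrate_add_load_factor() { echo "t135.15.4: max_load_factor column"; }

ensure_db_demo() {
	local migration
	for migration in _migrate_t128_8 _migrate_t148 _migrate_t180 \
		_migrate_add_load_factor; do
		"$migration" || return 1
	done
}

ensure_db_demo
```

Each real helper would carry its own backup/verify/rollback steps, and the ordered list documents migration sequencing in one place.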
