feat: add 3-tier outcome evaluation and re-prompt cycle to supervisor (t128.3) #378

marcusquinn · 2026-02-06T04:52:26Z

Summary

Adds 3-tier outcome evaluation to supervisor-helper.sh: deterministic signal detection, heuristic error pattern matching, and AI eval (Sonnet ~30s) for ambiguous outcomes
Implements reprompt command that re-dispatches workers in their existing worktree with failure context, enabling intelligent retry cycles
Adds evaluate command for manual debugging of stuck tasks
Changes max-retries behavior: tasks are now marked blocked (not failed) after exhausting retries, preserving worktree for human investigation

Changes

New functions

extract_log_metadata() - Structured log parsing (signals, PR URLs, error counts for rate limit/auth/conflict/OOM/timeout)
extract_log_tail() - Safe tail extraction for AI eval context
evaluate_with_ai() - Dispatches cheap Sonnet call to classify ambiguous outcomes
cmd_reprompt - Re-prompts workers in existing worktree with failure context
cmd_evaluate - Manual evaluation command with --no-ai flag
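
As a rough illustration of the kind of parsing extract_log_metadata() is described as doing, the sketch below summarizes a worker log as key=value lines. It is not the PR's implementation: the function name `log_metadata_sketch`, the key names, and the grep patterns are all assumptions chosen for the example.

```shell
# Hypothetical sketch: summarize a worker log as key=value lines.
# Function name, key names, and patterns are illustrative assumptions,
# not the actual extract_log_metadata() from supervisor-helper.sh.
log_metadata_sketch() {
    local log_file="$1"
    [ -r "$log_file" ] || { echo "error=log_not_readable"; return 1; }

    # Last completion signal seen in the log, if any.
    local signal
    signal=$(grep -oE 'FULL_LOOP_COMPLETE|TASK_COMPLETE' "$log_file" | tail -n 1)
    echo "signal=${signal:-none}"

    # First pull-request URL mentioned in the log, if any.
    local pr_url
    pr_url=$(grep -oE 'https://github\.com/[^/ ]+/[^/ ]+/pull/[0-9]+' "$log_file" | head -n 1)
    echo "pr_url=${pr_url:-none}"

    # Number of lines matching each error category (case-insensitive).
    echo "rate_limit_count=$(grep -ciE 'rate.?limit' "$log_file")"
    echo "auth_count=$(grep -ciE 'unauthorized|authentication failed' "$log_file")"
    echo "conflict_count=$(grep -ciE 'merge conflict' "$log_file")"
    echo "oom_count=$(grep -ciE 'out of memory' "$log_file")"
    echo "timeout_count=$(grep -ciE 'timed out|timeout' "$log_file")"
}
```

Emitting flat key=value lines keeps the output easy to load into a shell associative array or to grep, which suits a later heuristic tier that only needs counts.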

Enhanced evaluation tiers

Deterministic: FULL_LOOP_COMPLETE/TASK_COMPLETE signals, EXIT codes (0, 130/SIGINT, 137/SIGKILL, 143/SIGTERM)
Heuristic: Error pattern counting (rate limit, auth, merge conflict, OOM, timeout)
AI eval: Sonnet dispatch with structured VERDICT response format, audit trail in state_log
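
The ordering of the tiers above can be sketched as a single classifier that only falls through to the next tier when the current one has no answer. This is a minimal sketch, not the PR's evaluate_worker(): the function name `classify_outcome_sketch`, the argument order, and every result string except `complete:task_only` are assumptions, as is the choice to treat signal-killed workers as retryable.

```shell
# Hypothetical sketch of the tier ordering; not the PR's evaluate_worker().
# Args: completion signal, exit code, then rate-limit, auth, and conflict counts.
classify_outcome_sketch() {
    local signal="$1" exit_code="$2" rate_limit="$3" auth="$4" conflict="$5"

    # Tier 1: deterministic completion signals win regardless of exit code.
    case "$signal" in
        FULL_LOOP_COMPLETE) echo "complete:full_loop"; return 0 ;;
        TASK_COMPLETE)      echo "complete:task_only"; return 0 ;;
    esac

    # Tier 1 (continued): exit codes produced by SIGINT, SIGKILL, SIGTERM.
    # Mapping these to "retry" is an assumption made for this example.
    case "$exit_code" in
        130) echo "retry:interrupted_sigint"; return 0 ;;
        137) echo "retry:killed_sigkill"; return 0 ;;
        143) echo "retry:terminated_sigterm"; return 0 ;;
    esac

    # Tier 2: heuristic error counts taken from the log metadata.
    if [ "$auth" -gt 0 ]; then echo "blocked:auth_error"; return 0; fi
    if [ "$conflict" -gt 0 ]; then echo "blocked:merge_conflict"; return 0; fi
    if [ "$rate_limit" -gt 0 ]; then echo "retry:rate_limited"; return 0; fi

    # Tier 3: nothing conclusive, so the caller would hand off to AI eval.
    echo "ambiguous:needs_ai_eval"
}
```

For example, `classify_outcome_sketch none 1 3 0 0` prints `retry:rate_limited`, while a clean exit with no signal and no error hits falls through to the ambiguous case.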

Pulse cycle improvements

Retry now uses cmd_reprompt instead of naive state re-queue
Max retries marks tasks as blocked with mail escalation
Re-prompt failures properly cascade to failed state
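
The three pulse behaviors listed above amount to one decision per retryable task. The sketch below shows that decision in isolation, under stated assumptions: `handle_retry_sketch` is a hypothetical name, and the re-prompt step is passed in as an arbitrary command rather than calling the real cmd_reprompt.

```shell
# Hypothetical sketch of the pulse retry decision; not the PR's cmd_pulse.
# Args: current retry count, max retries, then the command that performs
# the re-prompt (a stand-in for cmd_reprompt in this example).
handle_retry_sketch() {
    local retries="$1" max_retries="$2"
    shift 2

    # Retry budget exhausted: mark blocked so the worktree is preserved
    # and a human can be notified.
    if [ "$retries" -ge "$max_retries" ]; then
        echo "blocked"
        return 0
    fi

    # Otherwise attempt the re-prompt; its own output goes to stderr so
    # that stdout carries only the resulting state name.
    if "$@" >&2; then
        echo "dispatched"
    else
        echo "failed"
    fi
}
```

Calling `handle_retry_sketch 1 3 true` prints `dispatched`, `handle_retry_sketch 1 3 false` prints `failed`, and `handle_retry_sketch 3 3 true` prints `blocked` without running the command.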

Testing

Zero ShellCheck violations
help, init, evaluate commands tested manually
State machine transitions validated (retrying -> dispatched via reprompt)
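
A transition check like the one validated above could look roughly like the following. This is only a sketch: the table format and the names `VALID_TRANSITIONS_SKETCH` and `is_valid_transition_sketch` are assumptions, and the table lists just the three transitions out of `retrying` that this thread mentions, not the script's full state machine.

```shell
# Hypothetical transition table in "from:to" form. Only transitions out of
# "retrying" that are mentioned in this PR thread are listed; the real
# table in supervisor-helper.sh is assumed to be larger.
VALID_TRANSITIONS_SKETCH="retrying:dispatched retrying:blocked retrying:failed"

# Return 0 if moving from state $1 to state $2 is in the table, else 1.
is_valid_transition_sketch() {
    case " $VALID_TRANSITIONS_SKETCH " in
        *" $1:$2 "*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

With this table, `is_valid_transition_sketch retrying dispatched` succeeds, while `is_valid_transition_sketch blocked dispatched` is rejected because no such pair is listed.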

Task

Closes t128.3 (Outcome evaluation and re-prompt cycle)

Summary by CodeRabbit

Release Notes

New Features
- Added manual task re-prompting capability to retry tasks with custom prompts
- Introduced manual evaluation command to trigger task assessment and view log metadata
- Integrated AI-assisted task evaluation for improved outcome determination
- Enhanced task state machine with better retry, escalation, and completion handling
- Expanded supervisor command interface with new control and management options

… (t128.3)

Enhance supervisor-helper.sh with outcome evaluation and retry intelligence:
- 3-tier evaluation: deterministic signals, heuristic error patterns, AI eval (Sonnet)
- extract_log_metadata(): structured log parsing (signals, PR URLs, error counts)
- evaluate_with_ai(): dispatch cheap Sonnet call for ambiguous outcomes (~30s)
- cmd_reprompt: re-prompt workers in existing worktree with failure context
- cmd_evaluate: manual evaluation command for debugging stuck tasks
- Max retries marks tasks as blocked (not failed) for human investigation
- Pulse cycle now uses re-prompt instead of naive re-dispatch
- Updated help text and subagent-index.toon with new commands

Zero ShellCheck violations.

coderabbitai · 2026-02-06T04:52:41Z

Walkthrough

The PR extends the supervisor-helper script with log analysis utilities and AI-assisted task evaluation capabilities. New functions extract log metadata and perform multi-tier evaluation (deterministic, heuristic, AI-based). Additional CLI commands enable manual task re-prompting and evaluation, with state machine adjustments to accommodate new fields and outcomes.

Changes

Cohort / File(s)	Summary
Evaluation & Re-Prompt Utilities `.agent/scripts/supervisor-helper.sh`	Added `extract_log_tail()` and `extract_log_metadata()` for log analysis; implemented `evaluate_worker()` with three-tier evaluation (deterministic, heuristic, AI-based) and `evaluate_with_ai()` for Sonnet-based verdict dispatch; introduced `cmd_reprompt()` for task re-prompting with optional custom prompts and `cmd_evaluate()` for manual evaluation triggering; extended `main()` dispatch to support `reprompt` and `evaluate` subcommands; updated state machine to read `session_id` and handle retry/escalation paths.
Public Interface Documentation `.agent/subagent-index.toon`	Updated supervisor-helper.sh description to reflect expanded command surface including new `reprompt`, `evaluate`, `dispatch`, `worker-status`, and `running-count` capabilities.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Main as main()
    participant Eval as evaluate_worker()
    participant Heur as Heuristic Check
    participant AI as evaluate_with_ai()
    participant Claude as Claude API
    participant DB as Database
    participant Log as Log File

    User->>Main: cmd_evaluate task_id [--no-ai]
    Main->>Eval: evaluate_worker(task_id, skip_ai)
    Eval->>Log: extract_log_metadata()
    Log-->>Eval: log size, completion, errors
    Eval->>Heur: deterministic patterns
    alt Deterministic Decision
        Heur-->>Eval: complete/failed
    else Ambiguous
        alt skip_ai == false
            Eval->>AI: evaluate_with_ai()
            AI->>Log: read full log
            AI->>Claude: dispatch evaluation
            Claude-->>AI: VERDICT line
            AI->>DB: store audit entry
            AI-->>Eval: verdict result
        else no-ai flag
            Eval->>Eval: defer to heuristic
        end
    end
    Eval-->>Main: result (complete/retry/blocked/failed)
    Main-->>User: display evaluation report

sequenceDiagram
    actor User
    participant Main as main()
    participant Reprompt as cmd_reprompt()
    participant DB as Database
    participant Worker as worker process
    participant Dispatch as Task Dispatch

    User->>Main: cmd_reprompt task_id [--prompt "text"]
    Main->>Reprompt: cmd_reprompt task_id [--prompt]
    Reprompt->>DB: read task + session_id
    DB-->>Reprompt: task metadata
    Reprompt->>Reprompt: construct AI-assisted prompt
    Reprompt->>DB: increment retry count, check max_retries
    alt retry_count < max_retries
        Reprompt->>DB: update task state (retrying)
        Reprompt->>Reprompt: create new retry log
        Reprompt->>Dispatch: dispatch task in worktree
        Dispatch->>Worker: spawn worker process
        Worker-->>Dispatch: task in progress
    else max retries exceeded
        Reprompt->>DB: transition to blocked + escalation
    end
    Reprompt-->>User: confirm re-prompt + dispatch

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the core additions to the PR: implementing a 3-tier outcome evaluation system and introducing a re-prompt cycle for intelligent task retries in the supervisor.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

github-actions · 2026-02-06T04:52:59Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 14 code smells

[INFO] Recent monitoring activity:
Fri Feb 6 04:52:53 UTC 2026: Code review monitoring started
Fri Feb 6 04:52:54 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 14
Fri Feb 6 04:52:54 UTC 2026: Qlty - 0 issues found, auto-formatting applied
Fri Feb 6 04:52:56 UTC 2026: Codacy analysis completed with auto-fixes

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 14
VULNERABILITIES: 0

Generated on: Fri Feb 6 04:52:58 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-06T04:53:20Z

Quality Gate passed

Issues
5 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In @.agent/scripts/supervisor-helper.sh:
- Around line 1565-1569: The TASK_COMPLETE branch currently only treats
exit_code=="0" as task-only complete; update the logic so that when meta[signal]
== "TASK_COMPLETE" (regardless of meta[exit_code]) the script echoes
"complete:task_only" and returns 0, thereby making TASK_COMPLETE a tier-1 signal
that bypasses the heuristic fallback (modify the conditional around the meta
array check for "TASK_COMPLETE" in the supervisor-helper.sh snippet to ignore
exit_code or add a separate branch that handles meta[signal]=="TASK_COMPLETE"
first).
- Around line 1706-1723: The VERDICT parsing is too restrictive and the verdict
is inserted into SQL unescaped; update the grep/regex used to extract
verdict_line (the variable verdict_line and the pattern matching around
VERDICT:) to allow digits, hyphens and underscores (e.g., include 0-9, - and _
in the token characters or use a more permissive non-whitespace capture for the
detail portion) so strings like VERDICT:retry:rate-limited and
VERDICT:complete:pr_123 match, then after stripping the VERDICT: prefix when
assigning local verdict use the sql_escape helper on that value in the sqlite3
INSERT (replace 'AI eval verdict: $verdict' with a version that uses sql_escape
"$verdict") to prevent SQL injection.
- Around line 1977-1980: The variable dispatched_count is being incremented
during Phase 1 (in the cmd_reprompt block) but is later re-declared (shadowed)
in Phase 2, so Phase 1 increments are lost; to fix, declare and initialize
dispatched_count (e.g., local dispatched_count=0) together with the other Phase
1 counters before any Phase 1 logic (so increments in the cmd_reprompt branch
are applied to the same variable), then remove the duplicate local
dispatched_count declaration in Phase 2 so the final pulse summary uses the
single shared counter.
- Around line 1782-1787: The retry-exceeded path currently transitions the task
to "failed" causing a state mismatch downstream; update the VALID_TRANSITIONS
array to include the "retrying:blocked" transition and change the cmd_reprompt
logic (the cmd_reprompt function where it checks tretries >= tmax_retries) to
call cmd_transition "$task_id" "blocked" (preserving the existing --error
payload) instead of transitioning to "failed" so that cmd_pulse and escalation
logic see the intended "blocked" state and cmd_cleanup won't remove the
worktree.
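
The VERDICT fix suggested above (a more permissive pattern plus escaping before the SQL insert) might be sketched as follows. The names `parse_verdict_sketch` and `sql_escape_sketch` are hypothetical; in particular, `sql_escape_sketch` is only a stand-in that doubles single quotes, and the script's real sql_escape helper may behave differently.

```shell
# Hypothetical stand-in for the script's sql_escape helper: doubles single
# quotes so the value can sit inside a single-quoted SQL string literal.
sql_escape_sketch() {
    printf '%s' "$1" | sed "s/'/''/g"
}

# Extract the last VERDICT line from AI output given as $1 and print it
# without the "VERDICT:" prefix. The detail part accepts letters, digits,
# hyphens, and underscores, so "retry:rate-limited" and "complete:pr_123"
# both match. Returns 1 if no verdict is found.
parse_verdict_sketch() {
    local line
    line=$(printf '%s\n' "$1" | grep -oE 'VERDICT:[A-Za-z_]+:[A-Za-z0-9_-]+' | tail -n 1)
    [ -n "$line" ] || return 1
    printf '%s\n' "${line#VERDICT:}"
}
```

A caller would then build the audit entry from `sql_escape_sketch "$verdict"` rather than interpolating `$verdict` directly, so a verdict containing a quote character cannot terminate the SQL string early.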

coderabbitai · 2026-02-06T05:00:52Z