-
Notifications
You must be signed in to change notification settings - Fork 5
feat: add 3-tier outcome evaluation and re-prompt cycle to supervisor (t128.3) #378
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… (t128.3) Enhance supervisor-helper.sh with outcome evaluation and retry intelligence: - 3-tier evaluation: deterministic signals, heuristic error patterns, AI eval (Sonnet) - extract_log_metadata(): structured log parsing (signals, PR URLs, error counts) - evaluate_with_ai(): dispatch cheap Sonnet call for ambiguous outcomes (~30s) - cmd_reprompt: re-prompt workers in existing worktree with failure context - cmd_evaluate: manual evaluation command for debugging stuck tasks - Max retries marks tasks as blocked (not failed) for human investigation - Pulse cycle now uses re-prompt instead of naive re-dispatch - Updated help text and subagent-index.toon with new commands Zero ShellCheck violations.
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
WalkthroughThe PR extends the supervisor-helper script with log analysis utilities and AI-assisted task evaluation capabilities. New functions extract log metadata and perform multi-tier evaluation (deterministic, heuristic, AI-based). Additional CLI commands enable manual task re-prompting and evaluation, with state machine adjustments to accommodate new fields and outcomes. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Main as main()
participant Eval as evaluate_worker()
participant Heur as Heuristic Check
participant AI as evaluate_with_ai()
participant Claude as Claude API
participant DB as Database
participant Log as Log File
User->>Main: cmd_evaluate task_id [--no-ai]
Main->>Eval: evaluate_worker(task_id, skip_ai)
Eval->>Log: extract_log_metadata()
Log-->>Eval: log size, completion, errors
Eval->>Heur: deterministic patterns
alt Deterministic Decision
Heur-->>Eval: complete/failed
else Ambiguous
alt skip_ai == false
Eval->>AI: evaluate_with_ai()
AI->>Log: read full log
AI->>Claude: dispatch evaluation
Claude-->>AI: VERDICT line
AI->>DB: store audit entry
AI-->>Eval: verdict result
else no-ai flag
Eval->>Eval: defer to heuristic
end
end
Eval-->>Main: result (complete/retry/blocked/failed)
Main-->>User: display evaluation report
sequenceDiagram
actor User
participant Main as main()
participant Reprompt as cmd_reprompt()
participant DB as Database
participant Worker as worker process
participant Dispatch as Task Dispatch
User->>Main: cmd_reprompt task_id [--prompt "text"]
Main->>Reprompt: cmd_reprompt task_id [--prompt]
Reprompt->>DB: read task + session_id
DB-->>Reprompt: task metadata
Reprompt->>Reprompt: construct AI-assisted prompt
Reprompt->>DB: increment retry count, check max_retries
alt retry_count < max_retries
Reprompt->>DB: update task state (retrying)
Reprompt->>Reprompt: create new retry log
Reprompt->>Dispatch: dispatch task in worktree
Dispatch->>Worker: spawn worker process
Worker-->>Dispatch: task in progress
else max retries exceeded
Reprompt->>DB: transition to blocked + escalation
end
Reprompt-->>User: confirm re-prompt + dispatch
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 04:52:58 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @.agent/scripts/supervisor-helper.sh:
- Around line 1565-1569: The TASK_COMPLETE branch currently only treats
exit_code=="0" as task-only complete; update the logic so that when meta[signal]
== "TASK_COMPLETE" (regardless of meta[exit_code]) the script echoes
"complete:task_only" and returns 0, thereby making TASK_COMPLETE a tier-1 signal
that bypasses the heuristic fallback (modify the conditional around the meta
array check for "TASK_COMPLETE" in the supervisor-helper.sh snippet to ignore
exit_code or add a separate branch that handles meta[signal]=="TASK_COMPLETE"
first).
- Around line 1706-1723: The VERDICT parsing is too restrictive and the verdict
is inserted into SQL unescaped; update the grep/regex used to extract
verdict_line (the variable verdict_line and the pattern matching around
VERDICT:) to allow digits, hyphens and underscores (e.g., include 0-9, - and _
in the token characters or use a more permissive non-whitespace capture for the
detail portion) so strings like VERDICT:retry:rate-limited and
VERDICT:complete:pr_123 match, then after stripping the VERDICT: prefix when
assigning local verdict use the sql_escape helper on that value in the sqlite3
INSERT (replace 'AI eval verdict: $verdict' with a version that uses sql_escape
"$verdict") to prevent SQL injection.
- Around line 1977-1980: The variable dispatched_count is being incremented
during Phase 1 (in the cmd_reprompt block) but is later re-declared (shadowed)
in Phase 2, so Phase 1 increments are lost; to fix, declare and initialize
dispatched_count (e.g., local dispatched_count=0) together with the other Phase
1 counters before any Phase 1 logic (so increments in the cmd_reprompt branch
are applied to the same variable), then remove the duplicate local
dispatched_count declaration in Phase 2 so the final pulse summary uses the
single shared counter.
- Around line 1782-1787: The retry-exceeded path currently transitions the task
to "failed" causing a state mismatch downstream; update the VALID_TRANSITIONS
array to include the "retrying:blocked" transition and change the cmd_reprompt
logic (the cmd_reprompt function where it checks tretries >= tmax_retries) to
call cmd_transition "$task_id" "blocked" (preserving the existing --error
payload) instead of transitioning to "failed" so that cmd_pulse and escalation
logic see the intended "blocked" state and cmd_cleanup won't remove the
worktree.
| local rate_limit_count=0 auth_error_count=0 conflict_count=0 timeout_count=0 oom_count=0 | ||
| rate_limit_count=$(grep -ci 'rate.limit\|429\|too many requests' "$log_file" 2>/dev/null || echo 0) | ||
| auth_error_count=$(grep -ci 'permission denied\|unauthorized\|403\|401' "$log_file" 2>/dev/null || echo 0) | ||
| conflict_count=$(grep -ci 'merge conflict\|CONFLICT\|conflict marker' "$log_file" 2>/dev/null || echo 0) | ||
| timeout_count=$(grep -ci 'timeout\|timed out\|ETIMEDOUT' "$log_file" 2>/dev/null || echo 0) | ||
| oom_count=$(grep -ci 'out of memory\|OOM\|heap.*exceeded\|ENOMEM' "$log_file" 2>/dev/null || echo 0) | ||
|
|
||
| echo "rate_limit_count=$rate_limit_count" | ||
| echo "auth_error_count=$auth_error_count" | ||
| echo "conflict_count=$conflict_count" | ||
| echo "timeout_count=$timeout_count" | ||
| echo "oom_count=$oom_count" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heuristic grep patterns are too broad — false positives will cause incorrect evaluations.
The short numeric patterns 429, 401, 403 match in non-HTTP contexts (timestamps like 14:29:xx, port numbers, line counts). The case-insensitive OOM matches substrings in "room", "zoom", "bloom", etc. Since evaluate_worker treats any count > 0 as a definitive signal (blocked:auth_error, blocked:out_of_memory), a single false positive will misclassify the outcome.
Proposed fix: use word boundaries and more specific patterns
- rate_limit_count=$(grep -ci 'rate.limit\|429\|too many requests' "$log_file" 2>/dev/null || echo 0)
- auth_error_count=$(grep -ci 'permission denied\|unauthorized\|403\|401' "$log_file" 2>/dev/null || echo 0)
- conflict_count=$(grep -ci 'merge conflict\|CONFLICT\|conflict marker' "$log_file" 2>/dev/null || echo 0)
- timeout_count=$(grep -ci 'timeout\|timed out\|ETIMEDOUT' "$log_file" 2>/dev/null || echo 0)
- oom_count=$(grep -ci 'out of memory\|OOM\|heap.*exceeded\|ENOMEM' "$log_file" 2>/dev/null || echo 0)
+ rate_limit_count=$(grep -cP -i 'rate.limit|\b429\b|too many requests' "$log_file" 2>/dev/null || echo 0)
+ auth_error_count=$(grep -cP -i 'permission denied|unauthorized|\b403\b|\b401\b' "$log_file" 2>/dev/null || echo 0)
+ conflict_count=$(grep -ci 'merge conflict\|CONFLICT\|conflict marker' "$log_file" 2>/dev/null || echo 0)
+ timeout_count=$(grep -ci 'timeout\|timed out\|ETIMEDOUT' "$log_file" 2>/dev/null || echo 0)
+ oom_count=$(grep -cP -i 'out of memory|\bOOM\b|heap.*exceeded|ENOMEM' "$log_file" 2>/dev/null || echo 0)As per coding guidelines, .agent/scripts/*.sh should focus on reliability and robustness — false positive error detection undermines both.
| # TASK_COMPLETE with clean exit = partial success (PR phase may have failed) | ||
| if [[ "${meta[signal]:-}" == "TASK_COMPLETE" && "${meta[exit_code]:-}" == "0" ]]; then | ||
| echo "complete:task_only" | ||
| return 0 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TASK_COMPLETE with non-zero exit falls through to heuristics that may misclassify.
When the log contains TASK_COMPLETE but exit ≠ 0, the flow falls to the heuristic tier. If the worker encountered transient errors earlier in the run (rate limits, timeouts) that were internally recovered, those counts will still be > 0, causing the heuristic to incorrectly return retry:rate_limited or blocked:auth_error instead of recognizing the task actually completed.
Consider treating TASK_COMPLETE (regardless of exit code) as at least complete:task_only — or as a distinct tier-1 signal that bypasses heuristics.
🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1565 - 1569, The
TASK_COMPLETE branch currently only treats exit_code=="0" as task-only complete;
update the logic so that when meta[signal] == "TASK_COMPLETE" (regardless of
meta[exit_code]) the script echoes "complete:task_only" and returns 0, thereby
making TASK_COMPLETE a tier-1 signal that bypasses the heuristic fallback
(modify the conditional around the meta array check for "TASK_COMPLETE" in the
supervisor-helper.sh snippet to ignore exit_code or add a separate branch that
handles meta[signal]=="TASK_COMPLETE" first).
| # Parse the VERDICT line from AI response | ||
| local verdict_line | ||
| verdict_line=$(echo "$ai_result" | grep -o 'VERDICT:[a-z]*:[a-z_]*' | head -1 || true) | ||
|
|
||
| if [[ -n "$verdict_line" ]]; then | ||
| # Strip VERDICT: prefix and return | ||
| local verdict="${verdict_line#VERDICT:}" | ||
| log_info "AI eval for $task_id: $verdict" | ||
|
|
||
| # Store AI evaluation in state log for audit trail | ||
| sqlite3 "$SUPERVISOR_DB" " | ||
| INSERT INTO state_log (task_id, from_state, to_state, reason) | ||
| VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating', | ||
| 'AI eval verdict: $verdict'); | ||
| " 2>/dev/null || true | ||
|
|
||
| echo "$verdict" | ||
| return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Verdict regex is too restrictive and the value isn't SQL-escaped.
Two issues:
-
Regex
VERDICT:[a-z]*:[a-z_]*rejects hyphens and digits in the detail field. The AI is likely to return verdicts likeVERDICT:retry:rate-limitedorVERDICT:complete:pr_123— both would fail to match, causing the function to return "unparseable" and falling back toretry:ambiguous_ai_unavailableon every ambiguous case. -
SQL safety —
$verdictis interpolated directly into the SQL INSERT at Line 1719 withoutsql_escape. Currently safe due to the regex constraint, but fragile — any future relaxation of the regex creates an injection path.
Proposed fix
- verdict_line=$(echo "$ai_result" | grep -o 'VERDICT:[a-z]*:[a-z_]*' | head -1 || true)
+ verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+' | head -1 || true) sqlite3 "$SUPERVISOR_DB" "
INSERT INTO state_log (task_id, from_state, to_state, reason)
VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating',
- 'AI eval verdict: $verdict');
+ 'AI eval verdict: $(sql_escape "$verdict")');
" 2>/dev/null || true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Parse the VERDICT line from AI response | |
| local verdict_line | |
| verdict_line=$(echo "$ai_result" | grep -o 'VERDICT:[a-z]*:[a-z_]*' | head -1 || true) | |
| if [[ -n "$verdict_line" ]]; then | |
| # Strip VERDICT: prefix and return | |
| local verdict="${verdict_line#VERDICT:}" | |
| log_info "AI eval for $task_id: $verdict" | |
| # Store AI evaluation in state log for audit trail | |
| sqlite3 "$SUPERVISOR_DB" " | |
| INSERT INTO state_log (task_id, from_state, to_state, reason) | |
| VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating', | |
| 'AI eval verdict: $verdict'); | |
| " 2>/dev/null || true | |
| echo "$verdict" | |
| return 0 | |
| # Parse the VERDICT line from AI response | |
| local verdict_line | |
| verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+' | head -1 || true) | |
| if [[ -n "$verdict_line" ]]; then | |
| # Strip VERDICT: prefix and return | |
| local verdict="${verdict_line#VERDICT:}" | |
| log_info "AI eval for $task_id: $verdict" | |
| # Store AI evaluation in state log for audit trail | |
| sqlite3 "$SUPERVISOR_DB" " | |
| INSERT INTO state_log (task_id, from_state, to_state, reason) | |
| VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating', | |
| 'AI eval verdict: $(sql_escape "$verdict")'); | |
| " 2>/dev/null || true | |
| echo "$verdict" | |
| return 0 |
🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1706 - 1723, The VERDICT
parsing is too restrictive and the verdict is inserted into SQL unescaped;
update the grep/regex used to extract verdict_line (the variable verdict_line
and the pattern matching around VERDICT:) to allow digits, hyphens and
underscores (e.g., include 0-9, - and _ in the token characters or use a more
permissive non-whitespace capture for the detail portion) so strings like
VERDICT:retry:rate-limited and VERDICT:complete:pr_123 match, then after
stripping the VERDICT: prefix when assigning local verdict use the sql_escape
helper on that value in the sqlite3 INSERT (replace 'AI eval verdict: $verdict'
with a version that uses sql_escape "$verdict") to prevent SQL injection.
| # Check max retries | ||
| if [[ "$tretries" -ge "$tmax_retries" ]]; then | ||
| log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)" | ||
| cmd_transition "$task_id" "failed" --error "Max retries exceeded during re-prompt" | ||
| return 1 | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
State machine conflict: max-retries path marks "failed" instead of "blocked", defeating the PR's stated intent.
When retries >= max_retries, cmd_reprompt transitions the task from retrying → failed (Line 1785). Back in cmd_pulse (Line 1989), the fallback tries to transition from failed → blocked, but this is not a valid transition (only failed:queued is permitted). The || true suppresses the error silently, so:
- The task ends up
failedinstead ofblocked. cmd_cleanuptargetsfailedtasks, so the worktree gets removed — contradicting the PR objective of preserving the worktree for blocked tasks.- The escalation email (Lines 1991-1996) still fires, but its message says "blocked" while the actual state is "failed".
The fix requires two changes: add retrying:blocked to VALID_TRANSITIONS, and transition to blocked here instead of failed.
Proposed fix
Add the missing transition to the valid transitions array:
"retrying:dispatched"
"retrying:failed"
+ "retrying:blocked"
"retrying:cancelled"Then in cmd_reprompt, transition to blocked instead of failed:
if [[ "$tretries" -ge "$tmax_retries" ]]; then
log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)"
- cmd_transition "$task_id" "failed" --error "Max retries exceeded during re-prompt"
+ cmd_transition "$task_id" "blocked" --error "Max retries exceeded during re-prompt"
return 1
fiThis way cmd_pulse's fallback (Line 1989) won't need to re-transition — the task is already blocked, and the escalation logic can key off the state directly.
As per coding guidelines, .agent/scripts/*.sh must focus on error recovery mechanisms — the current path silently loses the intended blocked state.
🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1782 - 1787, The
retry-exceeded path currently transitions the task to "failed" causing a state
mismatch downstream; update the VALID_TRANSITIONS array to include the
"retrying:blocked" transition and change the cmd_reprompt logic (the
cmd_reprompt function where it checks tretries >= tmax_retries) to call
cmd_transition "$task_id" "blocked" (preserving the existing --error payload)
instead of transitioning to "failed" so that cmd_pulse and escalation logic see
the intended "blocked" state and cmd_cleanup won't remove the worktree.
| # Re-prompt in existing worktree (continues context) | ||
| if cmd_reprompt "$tid" 2>/dev/null; then | ||
| dispatched_count=$((dispatched_count + 1)) | ||
| log_info " $tid: re-prompted successfully" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dispatched_count used in Phase 1 before declaration in Phase 2 — pulse summary undercounts.
dispatched_count is incremented here (Line 1979) during re-prompts in Phase 1, but it's declared with local dispatched_count=0 at Line 2027 in Phase 2. The Phase 2 declaration shadows the Phase 1 writes, so the pulse summary (Line 2080) will never include re-prompt dispatches — it only counts Phase 2 fresh dispatches.
Proposed fix: move the declaration before Phase 1
Move local dispatched_count=0 up alongside the other Phase 1 counters:
local completed_count=0
local failed_count=0
+ local dispatched_count=0
if [[ -n "$running_tasks" ]]; thenAnd remove the duplicate declaration at Line 2027:
# Phase 2: Dispatch queued tasks up to concurrency limit
- local dispatched_count=0🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1977 - 1980, The variable
dispatched_count is being incremented during Phase 1 (in the cmd_reprompt block)
but is later re-declared (shadowed) in Phase 2, so Phase 1 increments are lost;
to fix, declare and initialize dispatched_count (e.g., local dispatched_count=0)
together with the other Phase 1 counters before any Phase 1 logic (so increments
in the cmd_reprompt branch are applied to the same variable), then remove the
duplicate local dispatched_count declaration in Phase 2 so the final pulse
summary uses the single shared counter.



Summary
repromptcommand that re-dispatches workers in their existing worktree with failure context, enabling intelligent retry cyclesevaluatecommand for manual debugging of stuck tasksblocked(notfailed) after exhausting retries, preserving worktree for human investigationChanges
New functions
extract_log_metadata()- Structured log parsing (signals, PR URLs, error counts for rate limit/auth/conflict/OOM/timeout)extract_log_tail()- Safe tail extraction for AI eval contextevaluate_with_ai()- Dispatches cheap Sonnet call to classify ambiguous outcomescmd_reprompt- Re-prompts workers in existing worktree with failure contextcmd_evaluate- Manual evaluation command with--no-aiflagEnhanced evaluation tiers
Pulse cycle improvements
cmd_repromptinstead of naive state re-queueblockedwith mail escalationfailedstateTesting
help,init,evaluatecommands tested manuallyTask
Closes t128.3 (Outcome evaluation and re-prompt cycle)
Summary by CodeRabbit
Release Notes