-
Notifications
You must be signed in to change notification settings - Fork 5
fix: supervisor integration testing fixes (t128.7) #384
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
Caution Review failedThe pull request is closed. WalkthroughAdds emergency worker-process cleanup, model selection and health-checks, and threads model hints through dispatch/pulse/evaluation flows; introduces new helper functions and a Changes
Sequence DiagramsequenceDiagram
participant Client as Client/CLI
participant Supervisor as supervisor-helper.sh
participant Dispatcher as Dispatch Flow
participant Pulse as Pulse Phase
participant Evaluator as evaluate_worker()
participant LogParser as extract_log_metadata()
participant HealthSvc as check_model_health()
participant Worker as Worker Process Tree
Client->>Supervisor: cmd_dispatch / cmd_pulse / kill-workers
Supervisor->>HealthSvc: resolve_model() & check_model_health(model)
HealthSvc-->>Supervisor: model health (ok|unhealthy)
Supervisor->>Dispatcher: build_dispatch_cmd(model) -> start Worker
Dispatcher->>Worker: spawn PID
Pulse->>Evaluator: evaluate_worker(PID, model)
Evaluator->>LogParser: tail logs (last 20 lines)
LogParser-->>Evaluator: metadata + backend_error_count + exit_code
alt backend infra or retryable
Evaluator->>Supervisor: decide RETRY
Supervisor->>Worker: cleanup_worker_processes(PID)
Supervisor->>Worker: _list_descendants / _kill_descendants
else deterministic complete
Evaluator-->>Pulse: TASK_COMPLETE
else unrecoverable
Evaluator-->>Pulse: TASK_FAILED
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 05:59:31 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agent/scripts/supervisor-helper.sh (1)
3446-3464:⚠️ Potential issue | 🟡 Minor
show_usagemissingkill-workerscommand and updatedcleanupdescription.The file header (lines 16-17) documents both
cleanup ... + processesandkill-workers, butshow_usage()still has the old cleanup description and omitskill-workersentirely.📝 Proposed fix
- supervisor-helper.sh cleanup [--dry-run] Clean up completed worktrees + supervisor-helper.sh cleanup [--dry-run] Clean up completed worktrees + processes + supervisor-helper.sh kill-workers [--dry-run] Kill orphaned worker processes (emergency)Also add an
Options for 'kill-workers'section lower in the help text:Options for 'kill-workers': --dry-run Show orphaned processes without killing themAs per coding guidelines,
.agent/scripts/*.sh: "Clear logging and feedback" — the help output is the primary discoverability mechanism for the new emergency command.
🤖 Fix all issues with AI agents
In @.agent/scripts/supervisor-helper.sh:
- Around line 1699-1710: The script deletes "$log_tail_file" before computing
backend_error_count, so the grep against "$log_tail_file" always fails and
backend_error_count is 0; fix by moving the rm -f "$log_tail_file" to after the
backend_error_count assignment (or add an existence check such as [ -f
"$log_tail_file" ] before grepping) so the grep command that sets
backend_error_count (the line using grep -ci 'endpoints
failed\|Antigravity\|gateway.*error\|service unavailable\|503' "$log_tail_file"
...) runs against the actual file and does not get silenced by 2>/dev/null ||
echo 0. Ensure the cleanup removal of "$log_tail_file" happens only after all
counts (rate_limit_count, auth_error_count, conflict_count, timeout_count,
oom_count, backend_error_count) are computed.
🧹 Nitpick comments (2)
.agent/scripts/supervisor-helper.sh (2)
1258-1271: Descendants only receive SIGTERM — no SIGKILL fallback.
cleanup_worker_processesescalates tokill -9for the parent PID, but_kill_descendantsonly sends SIGTERM. A stubborn node/opencode child process that traps or ignores SIGTERM will survive cleanup — exactly the orphan scenario this PR aims to eliminate.♻️ Add SIGKILL fallback for descendants
_kill_descendants() { local parent_pid="$1" local children children=$(pgrep -P "$parent_pid" 2>/dev/null) || true if [[ -n "$children" ]]; then for child in $children; do _kill_descendants "$child" kill "$child" 2>/dev/null || true done + # Brief wait then force-kill any survivors + sleep 0.5 + for child in $children; do + kill -0 "$child" 2>/dev/null && kill -9 "$child" 2>/dev/null || true + done fi return 0 }
1688-1690: Consider a trap for temp file cleanup inextract_log_metadata.If the function exits abnormally between
mktempandrm -f, the temp file leaks. Underset -e, an unguarded failure could trigger this. A local trap would make this bulletproof.♻️ Trap-based temp file cleanup
local log_tail_file log_tail_file=$(mktemp) + trap "rm -f '$log_tail_file'" RETURN tail -20 "$log_file" > "$log_tail_file" 2>/dev/null || trueWith this, the
rm -f "$log_tail_file"on line 1699 can be removed (the trap handles it), which also naturally fixes the ordering bug withbackend_error_count.
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 06:08:31 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 06:12:21 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 06:17:28 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 06:18:33 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 06:31:18 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 06:38:55 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
…racy - Replace associative arrays (bash 4+) with grep-based metadata parsing - Move dispatched_count declaration before Phase 1 (unbound variable fix) - Add cleanup_worker_processes() to kill process trees on task completion - Add kill-workers command for emergency orphan cleanup - Add _list_descendants() for safe process tree protection - Integrate process cleanup into pulse evaluation and cleanup command - Prioritize PR URL over heuristic error patterns (prevents false positives) - Gate heuristic error matching on non-zero exit code only - Reduce error pattern search to last 20 log lines Found during t128.7 integration testing: 80 orphaned opencode processes accumulated from workers whose parent shells exited but child processes (opencode binary, node MCP servers) kept running as PPID=1 orphans. Heuristic evaluation produced false positives when task content discussed authentication/errors (e.g., Bing Webmaster Tools API documentation).
Tasks could get stuck in 'evaluating' state when the AI eval (Tier 3) timed out. The pulse now includes 'evaluating' in its Phase 1 query and re-evaluates stuck tasks with --no-ai to avoid repeated timeouts.
Add backend_error_count to extract_log_metadata() for detecting Antigravity/API gateway failures. These are always transient (provider overloaded) and should retry, not block. Found during t128.7 when concurrency 4 overwhelmed the opencode backend.
- Task notifications: sound + banner on complete (Glass), blocked (Basso), failed (Sosumi) - Batch progress: notification on each pulse with progress (Hero sound on all-complete) - Integrated into send_task_notification() and pulse summary - Only fires on Darwin (macOS), no-op on Linux
- afplay needs direct backgrounding (not subshell) to produce sound - Backend error grep was reading deleted temp file; now searches full log - say for spoken status, afplay for attention sounds
Add Quota protection, over.*usage, quota reset patterns to backend error detection. Opencode rotates providers (Antigravity, Gemini) and all can hit quota limits during high-concurrency batch dispatch.
- Add check_model_health() that probes the model with a trivial prompt before creating worktrees or burning retries (15s timeout, 5m cache) - Pass task model to build_dispatch_cmd() so opencode uses -m flag (prevents Antigravity router from picking exhausted providers) - Return code 3 from dispatch = provider unavailable, stops pulse dispatch loop until next cycle (avoids cascading failures) - Detects: quota exhaustion, endpoint failures, timeouts, empty responses
- Add resolve_model() with tiers: coding (opus), eval (sonnet), health (sonnet)
- Health check uses cheap sonnet probe instead of opus
- AI eval uses resolve_model('eval') instead of hardcoded model
- Pulse Phase 4: periodic cleanup of stale worker processes every cycle
- Model priority: Anthropic SOTA via opencode > claude CLI > zen free
A task can hit a backend error early, recover, and complete with a PR. Backend error check was running first, causing false retries on completed tasks. Now ordered: signals > PR URL > backend errors.
Background say/afplay were getting killed when the calling function returned. nohup detaches them from the process group.
…rms) say/TTS requires the speech synthesis daemon which macOS Sequoia won't start for processes without Accessibility permissions. afplay works reliably for all process contexts. TTS can be re-enabled once Tabby is added to System Settings > Privacy & Security > Accessibility.
027ec21 to
da1193c
Compare
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 16:49:39 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|



Summary
kill-workerscommand for emergency orphan cleanupIntegration Test Results: 12/12 COMPLETE
Bugs Fixed (10)
local -Aassociative arrays with grep-based parsingdispatched_countreferenced before declaration in retry handlerevaluatingstate, re-evaluates without AI on timeoutsayneeds Accessibility perms; simplified toafplaysystem soundsNew Features
cleanup_worker_processes()- kills process trees on task completion_kill_descendants()/_list_descendants()- recursive process tree managementkill-workerscommand - emergency orphan cleanup with safety guardscheck_model_health()- pre-dispatch probe with 5-minute cacheresolve_model()- tiered model selection (coding=opus, eval=sonnet, health=sonnet)notify_batch_progress()- macOS audio alerts on progress milestones-mflag to opencode (prevents router picking exhausted providers)Follow-up Tasks Created
Closes t128.7
Summary by CodeRabbit
Release Notes
New Features
Improvements