t1269: Fix stuck evaluating tasks with crash-resilient SIGTERM rollback#2002
t1269: Fix stuck evaluating tasks with crash-resilient SIGTERM rollback#2002marcusquinn merged 2 commits intomainfrom
Conversation
Phase 1 evaluation runs inline in the pulse loop. When the pulse process is killed mid-evaluation (SIGTERM from cron, OOM, system sleep), the task is stranded in 'evaluating' with no live process. Phase 0.7 catches these but only after 120-600s grace periods, wasting concurrency slots. Analysis of 205 stale recovery events showed: - eval_process_died: 36 events (avg 639s stuck) - worker_failed_before_eval: 34 events (avg 614s stuck) - pulse_killed_after_pr_persist: 33 events (avg 781s stuck) - worker_oom_killed: 24 events (avg 1092s stuck) Fix: Install a SIGTERM/SIGINT trap inside the Phase 1 evaluation loop that tracks which task is currently being evaluated. On pulse death, the trap rolls back the in-progress task from 'evaluating' to its pre-eval state (running/dispatched), so the next pulse can re-evaluate immediately without waiting for the grace period. The rollback SQL uses AND status='evaluating' as a safety guard — if the task was already transitioned to its final state before the signal, the rollback is a no-op.
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a critical reliability improvement to the supervisor's task evaluation mechanism. By adding signal handling for unexpected process terminations, it ensures that tasks are gracefully recovered rather than becoming orphaned in an 'evaluating' state. This change directly addresses a common cause of stalled tasks, leading to faster recovery times and more efficient resource utilization. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughAdds per-pulse tracking of the currently-evaluating task and its pre-evaluation state, and installs a SIGTERM/SIGINT trap to rollback an in-progress evaluation, release the pulse lock, and clean up temporary artifacts if the pulse is killed. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 20 04:13:55 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Code Review
The pull request introduces a robust crash-resilient mechanism for Phase 1 evaluation in pulse.sh. By implementing SIGTERM/SIGINT traps, the script now gracefully rolls back tasks that are in the 'evaluating' state if the pulse process is killed. This significantly reduces the time tasks remain stuck, improving overall throughput and reliability. The changes are well-documented with comments and adhere to the specified shell scripting standards, including the use of sql_escape for database interactions and || true for commands that might fail. The logic for tracking and clearing evaluation variables (_phase1_evaluating_tid, _phase1_pre_eval_state) is consistently applied throughout the evaluation loop, ensuring proper state management. The restoration of the original signal trap after the loop is also correctly handled.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1369-1377: The race occurs because _phase1_evaluating_tid is
assigned after cmd_transition("$tid","evaluating") — if SIGTERM hits between
those lines the cleanup logic sees an empty _phase1_evaluating_tid and skips
rollback; move the assignment of _phase1_evaluating_tid="$tid" to just before
calling cmd_transition so the cleanup handler always knows which task is being
evaluated (the SQL guard "AND status = 'evaluating'" prevents premature
rollbacks if the DB transition hasn't committed), apply the same reorder in the
"already-evaluating" path that uses the same variables.
Auto-dismissed: bot review does not block autonomous pipeline
Close SIGTERM race window by setting _phase1_evaluating_tid before cmd_transition in both code paths (already-evaluating and new-evaluation). The SQL guard (AND status = 'evaluating') in the cleanup handler makes a premature rollback a no-op if the DB transition hasn't committed yet.
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 20 04:40:20 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)
1325-1332: Optional: add eval-checkpoint cleanup to the rollback handler.
_phase1_cleanup_on_signalcorrectly rolls back the DB status, but it doesn't remove theeval-checkpoints/${_phase1_evaluating_tid}.evalfile thatevaluate_with_ai()may have written. After a successful rollback the task leavesevaluatingstate, so Phase 0.7's stale-recovery path (which normally removes checkpoint files) never visits it. The orphaned file persists until the task is evaluated again. If a subsequent evaluation gets stuck for an unrelated reason,_diagnose_stale_root_causewill still returnpulse_killed_mid_eval(false root cause) because the stale checkpoint file matches first.♻️ Proposed addition to the rollback handler
_phase1_cleanup_on_signal() { if [[ -n "$_phase1_evaluating_tid" && -n "$_phase1_pre_eval_state" ]]; then log_warn " Phase 1 (t1269): pulse killed during evaluation of $_phase1_evaluating_tid — rolling back to $_phase1_pre_eval_state" db "$SUPERVISOR_DB" "UPDATE tasks SET status = '$(sql_escape "$_phase1_pre_eval_state")', updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now'), error = 'pulse_killed_mid_eval_rollback_t1269' WHERE id = '$(sql_escape "$_phase1_evaluating_tid")' AND status = 'evaluating';" 2>/dev/null || true + # t1269: Remove eval checkpoint so the next stale-recovery pass + # doesn't misdiagnose the root cause as pulse_killed_mid_eval. + rm -f "${SUPERVISOR_DIR}/eval-checkpoints/${_phase1_evaluating_tid}.eval" 2>/dev/null || true _phase1_evaluating_tid="" _phase1_pre_eval_state="" fi }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 1325 - 1332, The rollback handler _phase1_cleanup_on_signal rolls back DB status but fails to remove the eval-checkpoints/${_phase1_evaluating_tid}.eval file that evaluate_with_ai() may have created, leaving an orphaned checkpoint that confuses stale-recovery; update _phase1_cleanup_on_signal to, after the DB update and before clearing _phase1_evaluating_tid, delete the checkpoint file (e.g., remove eval-checkpoints/"${_phase1_evaluating_tid}".eval safely, ignoring errors), ensuring you only attempt removal when _phase1_evaluating_tid is non-empty and keep existing silence-on-error behavior similar to the db call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1363-1384: Race-condition fix verified: ensure both branches set
_phase1_evaluating_tid before any cmd_transition to close the SIGTERM race
window; keep the current order in the already-evaluating branch (setting
_phase1_pre_eval_state and _phase1_evaluating_tid) and in the new-evaluation
branch (set _phase1_pre_eval_state, then _phase1_evaluating_tid, then call
cmd_transition), and do not move or remove the SQL guard (AND status =
'evaluating') used by the cleanup handler.
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1325-1332: The rollback handler _phase1_cleanup_on_signal rolls
back DB status but fails to remove the
eval-checkpoints/${_phase1_evaluating_tid}.eval file that evaluate_with_ai() may
have created, leaving an orphaned checkpoint that confuses stale-recovery;
update _phase1_cleanup_on_signal to, after the DB update and before clearing
_phase1_evaluating_tid, delete the checkpoint file (e.g., remove
eval-checkpoints/"${_phase1_evaluating_tid}".eval safely, ignoring errors),
ensuring you only attempt removal when _phase1_evaluating_tid is non-empty and
keep existing silence-on-error behavior similar to the db call.



Summary
How it works
evaluatingto its previous stateThe rollback SQL uses
AND status = 'evaluating'as a safety guard — if the task was already transitioned to its final state before the signal arrived, the rollback is a no-op.Impact
Estimated throughput recovery: ~30 hours of wasted concurrency slots across 205 events.
Testing
bash -nsyntax check: passSummary by CodeRabbit