
t1269: Fix stuck evaluating tasks with crash-resilient SIGTERM rollback #2002

Merged
marcusquinn merged 2 commits into main from bugfix/t1269-crash-resilient-evaluation
Feb 20, 2026


@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 20, 2026

Summary

  • Adds SIGTERM/SIGINT trap to Phase 1 evaluation loop that rolls back in-progress evaluations to their pre-eval state on pulse death
  • Eliminates 120-600s stuck evaluating windows caused by pulse process being killed mid-evaluation (SIGTERM, OOM, cron overlap)
  • Analysis of 205 stale recovery events: eval_process_died (36), worker_failed_before_eval (34), pulse_killed_after_pr_persist (33), worker_oom_killed (24)

How it works

  1. Before evaluating each task, records the task ID and its pre-evaluation state
  2. Installs a SIGTERM/SIGINT trap that rolls back the task from evaluating to its previous state
  3. After each task completes evaluation (any outcome), clears the tracking variables
  4. Restores the original signal trap after the evaluation loop

The rollback SQL uses AND status = 'evaluating' as a safety guard — if the task was already transitioned to its final state before the signal arrived, the rollback is a no-op.
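The four steps above can be sketched in a few lines of bash. This is a minimal illustration under stated assumptions, not the code in pulse.sh: the `TASK_STATUS` variable stands in for the row in the tasks table, and `rollback_on_signal`, `run_evaluation_loop`, `evaluating_tid` and `pre_eval_state` are hypothetical names. A real handler would normally exit after cleaning up; here it returns so the result can be inspected.

```bash
#!/usr/bin/env bash
# Sketch only: TASK_STATUS stands in for the tasks table, and all names
# below are hypothetical, not the ones used in pulse.sh.
TASK_STATUS="running"
evaluating_tid=""
pre_eval_state=""

# Trap handler: undo the in-progress evaluation, if there is one.
rollback_on_signal() {
	if [[ -n "$evaluating_tid" && -n "$pre_eval_state" ]]; then
		# Guard: only roll back while the task is still 'evaluating'.
		if [[ "$TASK_STATUS" == "evaluating" ]]; then
			TASK_STATUS="$pre_eval_state"
		fi
		evaluating_tid=""
		pre_eval_state=""
	fi
}

run_evaluation_loop() {
	local tid="$1" simulate_kill="$2" saved_traps
	# Remember whatever TERM/INT traps were already installed.
	saved_traps="$(trap -p TERM INT)"
	# Step 2: install the rollback trap for the duration of the loop.
	trap rollback_on_signal TERM INT

	# Step 1: record the task and its pre-evaluation state.
	pre_eval_state="$TASK_STATUS"
	evaluating_tid="$tid"
	TASK_STATUS="evaluating"

	if [[ "$simulate_kill" == "yes" ]]; then
		# Stand-in for cron or the OOM killer signalling the pulse.
		kill -TERM "$BASHPID"
	else
		TASK_STATUS="complete"
	fi

	# Step 3: clear tracking once the task has been handled.
	evaluating_tid=""
	pre_eval_state=""

	# Step 4: put the original traps back.
	trap - TERM INT
	if [[ -n "$saved_traps" ]]; then
		eval "$saved_traps"
	fi
}

run_evaluation_loop t1 yes
echo "after simulated kill: $TASK_STATUS"
TASK_STATUS="running"
run_evaluation_loop t1 no
echo "after normal run: $TASK_STATUS"
```

In the first call the simulated SIGTERM arrives while the task is `evaluating`, so the handler restores `running`. In the second call no signal arrives and the task ends as `complete`.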

Impact

  • Before: Killed pulse → task stuck in evaluating → Phase 0.7 detects after 120-600s → retry/recovery
  • After: Killed pulse → trap fires → task rolled back to its pre-eval state (running or dispatched) → next pulse (2 min) re-evaluates without waiting out the grace period

Estimated throughput recovery: ~30 hours of wasted concurrency slots across 205 events.

Testing

  • bash -n syntax check: pass
  • ShellCheck: zero violations
  • Self-improvement memory stored for supervisor learning

Summary by CodeRabbit

  • Bug Fixes
    • Improved resilience of the evaluation loop: each evaluation now records its pre-evaluation state. If the process is terminated unexpectedly, the in-progress evaluation is rolled back, locks are released, and temporary artifacts are cleaned up. Shutdown handlers are preserved so normal cleanup behavior remains intact.

Phase 1 evaluation runs inline in the pulse loop. When the pulse process
is killed mid-evaluation (SIGTERM from cron, OOM, system sleep), the task
is stranded in 'evaluating' with no live process. Phase 0.7 catches these
but only after 120-600s grace periods, wasting concurrency slots.

Analysis of 205 stale recovery events showed:
- eval_process_died: 36 events (avg 639s stuck)
- worker_failed_before_eval: 34 events (avg 614s stuck)
- pulse_killed_after_pr_persist: 33 events (avg 781s stuck)
- worker_oom_killed: 24 events (avg 1092s stuck)
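As a rough cross-check of the figures above, the stuck time for the four listed categories can be totalled with shell arithmetic. The counts and averages are copied from the list. The four categories cover 127 of the 205 events, and how the remaining events break down is not stated here, so this gives only a partial total.

```bash
#!/usr/bin/env bash
# Per-category "count avg_seconds" pairs copied from the list above.
total_events=0
total_seconds=0
while read -r count avg; do
	total_events=$((total_events + count))
	total_seconds=$((total_seconds + count * avg))
done <<'EOF'
36 639
34 614
33 781
24 1092
EOF

# Integer arithmetic only; tenths of an hour via scaling by 10.
tenths=$((total_seconds * 10 / 3600))
echo "events: $total_events"
echo "stuck seconds: $total_seconds"
echo "stuck hours: $((tenths / 10)).$((tenths % 10))"
```

This prints 127 events, 95861 stuck seconds, and 26.6 stuck hours for the listed categories alone.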

Fix: Install a SIGTERM/SIGINT trap inside the Phase 1 evaluation loop
that tracks which task is currently being evaluated. On pulse death, the
trap rolls back the in-progress task from 'evaluating' to its pre-eval
state (running/dispatched), so the next pulse can re-evaluate immediately
without waiting for the grace period.

The rollback SQL uses AND status='evaluating' as a safety guard — if the
task was already transitioned to its final state before the signal, the
rollback is a no-op.
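The guard behaves like a compare-and-set: the status changes only if it still holds the expected value. The sketch below imitates that with one file per task in place of a database row. The helper names (`set_status`, `get_status`, `guarded_rollback`) and the file layout are assumptions made for illustration. Unlike a single SQL UPDATE, this check-then-write is not atomic.

```bash
#!/usr/bin/env bash
# Sketch only: one file per task stands in for a row in the tasks table.
STATE_DIR="$(mktemp -d)"

set_status() { printf '%s\n' "$2" >"$STATE_DIR/$1"; }
get_status() { cat "$STATE_DIR/$1"; }

# Equivalent in spirit to an UPDATE ... WHERE id = <tid> AND status = 'evaluating'.
guarded_rollback() {
	local tid="$1" pre_state="$2"
	if [[ "$(get_status "$tid")" == "evaluating" ]]; then
		set_status "$tid" "$pre_state"
		echo "rolled back $tid to $pre_state"
	else
		echo "no-op for $tid (status is $(get_status "$tid"))"
	fi
}

set_status t1 evaluating # signal arrived mid-evaluation
set_status t2 complete   # evaluation finished just before the signal
guarded_rollback t1 running
guarded_rollback t2 running
rm -rf "$STATE_DIR"
```

Task `t1` is rolled back to `running`, while `t2` keeps its final `complete` status because the guard does not match.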
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical reliability improvement to the supervisor's task evaluation mechanism. By adding signal handling for unexpected process terminations, it ensures that tasks are gracefully recovered rather than becoming orphaned in an 'evaluating' state. This change directly addresses a common cause of stalled tasks, leading to faster recovery times and more efficient resource utilization.

Highlights

  • Crash-Resilient Task Evaluation: Implemented a SIGTERM/SIGINT trap within the Phase 1 evaluation loop to prevent tasks from getting stuck in an 'evaluating' state if the supervisor pulse process is unexpectedly terminated.
  • Automated Task Rollback: Tasks that are in-progress during a pulse death are now automatically rolled back to their pre-evaluation state, ensuring they can be re-evaluated by the next pulse without significant delay.
  • Reduced Stale Task Windows: Eliminates the 120-600 second windows where tasks would remain stuck in 'evaluating' after a pulse crash, significantly improving system throughput and recovery.
Changelog
  • .agents/scripts/supervisor/pulse.sh
    • Added a new shell function _phase1_cleanup_on_signal to handle task state rollback upon receiving SIGTERM or SIGINT.
    • Introduced local variables _phase1_evaluating_tid and _phase1_pre_eval_state to track the currently evaluated task ID and its state before evaluation.
    • Installed a trap for TERM and INT signals to execute _phase1_cleanup_on_signal at the beginning of the evaluation loop.
    • Modified the task evaluation logic to record the current task's ID and its pre-evaluation state before transitioning to 'evaluating'.
    • Cleared the task tracking variables (_phase1_evaluating_tid, _phase1_pre_eval_state) after a task successfully completes evaluation or is handled by specific retry/escalation paths.
    • Restored the original signal trap configuration after the main evaluation loop finishes to ensure proper cleanup.

@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

Walkthrough

Adds per-pulse tracking of the currently-evaluating task and its pre-evaluation state, and installs a SIGTERM/SIGINT trap to roll back an in-progress evaluation, release the pulse lock, and clean up temporary artifacts if the pulse is killed.

Changes

Cohort / File(s) | Summary
Evaluation Rollback Mechanism (.agents/scripts/supervisor/pulse.sh) | Introduce _phase1_evaluating_tid and _phase1_pre_eval_state tracking; install a temporary trap around the evaluation loop to roll back an evaluating task on SIGTERM/SIGINT, release the pulse lock, clean temporary artifacts, and restore the previous trap. Populate/clear trackers at evaluation transition points.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

⚙️ In the pulse where tasks ascend,
A safety net now wraps the end,
On signal's strike the state rewinds,
Locks released and temp files find,
A calmer shell where chaos mends. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title check | ✅ Passed | The title clearly and specifically describes the main change: adding crash-resilient SIGTERM rollback to fix stuck evaluating tasks, which is the primary objective of the changeset.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 04:13:52 UTC 2026: Code review monitoring started
Fri Feb 20 04:13:52 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Fri Feb 20 04:13:55 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a robust crash-resilient mechanism for Phase 1 evaluation in pulse.sh. By implementing SIGTERM/SIGINT traps, the script now gracefully rolls back tasks that are in the 'evaluating' state if the pulse process is killed. This significantly reduces the time tasks remain stuck, improving overall throughput and reliability. The changes are well-documented with comments and adhere to the specified shell scripting standards, including the use of sql_escape for database interactions and || true for commands that might fail. The logic for tracking and clearing evaluation variables (_phase1_evaluating_tid, _phase1_pre_eval_state) is consistently applied throughout the evaluation loop, ensuring proper state management. The restoration of the original signal trap after the loop is also correctly handled.

coderabbitai[bot]
coderabbitai bot previously requested changes Feb 20, 2026
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1369-1377: The race occurs because _phase1_evaluating_tid is
assigned after cmd_transition("$tid","evaluating") — if SIGTERM hits between
those lines the cleanup logic sees an empty _phase1_evaluating_tid and skips
rollback; move the assignment of _phase1_evaluating_tid="$tid" to just before
calling cmd_transition so the cleanup handler always knows which task is being
evaluated (the SQL guard "AND status = 'evaluating'" prevents premature
rollbacks if the DB transition hasn't committed), apply the same reorder in the
"already-evaluating" path that uses the same variables.

@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 20, 2026 04:21

Auto-dismissed: bot review does not block autonomous pipeline

Close SIGTERM race window by setting _phase1_evaluating_tid before cmd_transition in both code paths (already-evaluating and new-evaluation). The SQL guard (AND status = 'evaluating') in the cleanup handler makes a premature rollback a no-op if the DB transition hasn't committed yet.
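The effect of the reordering can be illustrated with a small simulation. It is a sketch under assumptions: `TASK_STATUS` stands in for the database row, the names are hypothetical, and the handler is called directly at the point where the signal would land so that the outcome is deterministic.

```bash
#!/usr/bin/env bash
# Sketch only: all names are hypothetical; TASK_STATUS stands in for the DB row.
TASK_STATUS=""
tracked_tid=""
tracked_pre_state=""

# Same shape as the rollback handler: needs both trackers, then applies the guard.
on_signal() {
	if [[ -n "$tracked_tid" && -n "$tracked_pre_state" ]]; then
		if [[ "$TASK_STATUS" == "evaluating" ]]; then
			TASK_STATUS="$tracked_pre_state"
		fi
		tracked_tid=""
		tracked_pre_state=""
	fi
}

# Runs the first of the two steps, then fires the handler in the gap before
# the second step. The process is assumed to die after the signal.
signal_between_steps() {
	local order="$1"
	TASK_STATUS="running"
	tracked_tid=""
	tracked_pre_state="running"
	if [[ "$order" == "transition_first" ]]; then
		TASK_STATUS="evaluating" # transition committed ...
		on_signal                # ... signal lands before the tid is tracked
	else
		tracked_tid="t1"         # tid tracked first ...
		on_signal                # ... signal lands before the transition
	fi
	echo "$order: $TASK_STATUS"
}

signal_between_steps transition_first
signal_between_steps tracking_first
```

With the transition first, the handler sees an empty task ID and skips the rollback, leaving the task stranded in `evaluating`. With tracking first, the handler runs but the guard makes it a no-op, and the task stays in `running`.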
@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 04:40:16 UTC 2026: Code review monitoring started
Fri Feb 20 04:40:17 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Fri Feb 20 04:40:20 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)

1325-1332: Optional: add eval-checkpoint cleanup to the rollback handler.

_phase1_cleanup_on_signal correctly rolls back the DB status, but it doesn't remove the eval-checkpoints/${_phase1_evaluating_tid}.eval file that evaluate_with_ai() may have written. After a successful rollback the task leaves evaluating state, so Phase 0.7's stale-recovery path (which normally removes checkpoint files) never visits it. The orphaned file persists until the task is evaluated again. If a subsequent evaluation gets stuck for an unrelated reason, _diagnose_stale_root_cause will still return pulse_killed_mid_eval (false root cause) because the stale checkpoint file matches first.

♻️ Proposed addition to the rollback handler
 _phase1_cleanup_on_signal() {
 	if [[ -n "$_phase1_evaluating_tid" && -n "$_phase1_pre_eval_state" ]]; then
 		log_warn "  Phase 1 (t1269): pulse killed during evaluation of $_phase1_evaluating_tid — rolling back to $_phase1_pre_eval_state"
 		db "$SUPERVISOR_DB" "UPDATE tasks SET status = '$(sql_escape "$_phase1_pre_eval_state")', updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now'), error = 'pulse_killed_mid_eval_rollback_t1269' WHERE id = '$(sql_escape "$_phase1_evaluating_tid")' AND status = 'evaluating';" 2>/dev/null || true
+		# t1269: Remove eval checkpoint so the next stale-recovery pass
+		# doesn't misdiagnose the root cause as pulse_killed_mid_eval.
+		rm -f "${SUPERVISOR_DIR}/eval-checkpoints/${_phase1_evaluating_tid}.eval" 2>/dev/null || true
 		_phase1_evaluating_tid=""
 		_phase1_pre_eval_state=""
 	fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 1325 - 1332, The rollback
handler _phase1_cleanup_on_signal rolls back DB status but fails to remove the
eval-checkpoints/${_phase1_evaluating_tid}.eval file that evaluate_with_ai() may
have created, leaving an orphaned checkpoint that confuses stale-recovery;
update _phase1_cleanup_on_signal to, after the DB update and before clearing
_phase1_evaluating_tid, delete the checkpoint file (e.g., remove
eval-checkpoints/"${_phase1_evaluating_tid}".eval safely, ignoring errors),
ensuring you only attempt removal when _phase1_evaluating_tid is non-empty and
keep existing silence-on-error behavior similar to the db call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1363-1384: Race-condition fix verified: ensure both branches set
_phase1_evaluating_tid before any cmd_transition to close the SIGTERM race
window; keep the current order in the already-evaluating branch (setting
_phase1_pre_eval_state and _phase1_evaluating_tid) and in the new-evaluation
branch (set _phase1_pre_eval_state, then _phase1_evaluating_tid, then call
cmd_transition), and do not move or remove the SQL guard (AND status =
'evaluating') used by the cleanup handler.


@marcusquinn marcusquinn merged commit 54e05e2 into main Feb 20, 2026
19 checks passed
@marcusquinn marcusquinn deleted the bugfix/t1269-crash-resilient-evaluation branch February 21, 2026 01:56