t1269: Fix stuck evaluating tasks with crash-resilient SIGTERM rollback by marcusquinn · Pull Request #2002 · marcusquinn/aidevops

marcusquinn · 2026-02-20T04:13:23Z

Summary

- Adds SIGTERM/SIGINT trap to Phase 1 evaluation loop that rolls back in-progress evaluations to their pre-eval state on pulse death
- Eliminates 120-600s stuck evaluating windows caused by pulse process being killed mid-evaluation (SIGTERM, OOM, cron overlap)
- Analysis of 205 stale recovery events: eval_process_died (36), worker_failed_before_eval (34), pulse_killed_after_pr_persist (33), worker_oom_killed (24)

How it works

- Before evaluating each task, records the task ID and its pre-evaluation state
- Installs a SIGTERM/SIGINT trap that rolls back the task from evaluating to its previous state
- After each task completes evaluation (any outcome), clears the tracking variables
- Restores the original signal trap after the evaluation loop

The rollback SQL uses AND status = 'evaluating' as a safety guard — if the task was already transitioned to its final state before the signal arrived, the rollback is a no-op.
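The guard described above can be sketched without the real supervisor database. This is a minimal illustration, assuming task state is kept in plain files; `STATE_DIR`, `set_status`, `get_status`, `rollback_if_evaluating` and `run_pulse` are hypothetical names, not functions from pulse.sh. A subshell stands in for the pulse process and sends itself SIGTERM at two different points. Note that only catchable signals benefit from this pattern: SIGKILL, which the Linux OOM killer sends, cannot be trapped, so those cases would presumably still be left to Phase 0.7.

```shell
#!/usr/bin/env bash
# Sketch only: task state lives in plain files instead of the supervisor DB.
# All names below are hypothetical stand-ins, not pulse.sh code.
STATE_DIR="$(mktemp -d)"

set_status() { printf '%s\n' "$2" >"$STATE_DIR/$1"; }
get_status() { cat "$STATE_DIR/$1"; }

# Guarded rollback: acts only while the task is still 'evaluating',
# mirroring the "AND status = 'evaluating'" clause in the rollback SQL.
rollback_if_evaluating() {
	local tid="$1" pre_state="$2"
	if [[ "$(get_status "$tid")" == "evaluating" ]]; then
		set_status "$tid" "$pre_state"
	fi
}

# Evaluate one task in a subshell that stands in for the pulse process.
# $3 selects when the simulated SIGTERM arrives: "mid" or "late".
run_pulse() {
	local tid="$1" pre_state="$2" kill_at="$3"
	(
		trap 'rollback_if_evaluating "$tid" "$pre_state"; exit 143' TERM INT
		set_status "$tid" "evaluating"
		[[ "$kill_at" == "mid" ]] && kill -TERM "$BASHPID"
		set_status "$tid" "complete" # evaluation reached its final state
		[[ "$kill_at" == "late" ]] && kill -TERM "$BASHPID"
	) || true
}

set_status t1 running
run_pulse t1 running mid
status_mid="$(get_status t1)" # killed mid-evaluation: rolled back

set_status t2 running
run_pulse t2 running late
status_late="$(get_status t2)" # killed after the final transition: untouched

echo "mid=$status_mid late=$status_late"
rm -rf "$STATE_DIR"
```

In the "late" case the handler still fires, but the guard sees a status other than evaluating and leaves the final state alone.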

Impact

Before: Killed pulse → task stuck in evaluating → Phase 0.7 detects after 120-600s → retry/recovery
After: Killed pulse → trap fires → task rolled back to its pre-eval state (running/dispatched) → next pulse (2 min) re-evaluates without waiting for the grace period

Estimated throughput recovery: ~30 hours of wasted concurrency slots across 205 events.

Testing

- bash -n syntax check: pass
- ShellCheck: zero violations
- Self-improvement memory stored for supervisor learning

Summary by CodeRabbit

Bug Fixes
- Improved resilience of the evaluation loop: evaluations now track their pre-evaluation state and automatically roll back if the process is terminated unexpectedly, release locks, and clean up temporary artifacts. Shutdown handlers are preserved so normal cleanup behavior remains intact.

Phase 1 evaluation runs inline in the pulse loop. When the pulse process is killed mid-evaluation (SIGTERM from cron, OOM, system sleep), the task is stranded in 'evaluating' with no live process. Phase 0.7 catches these but only after 120-600s grace periods, wasting concurrency slots.

Analysis of 205 stale recovery events showed:
- eval_process_died: 36 events (avg 639s stuck)
- worker_failed_before_eval: 34 events (avg 614s stuck)
- pulse_killed_after_pr_persist: 33 events (avg 781s stuck)
- worker_oom_killed: 24 events (avg 1092s stuck)

Fix: Install a SIGTERM/SIGINT trap inside the Phase 1 evaluation loop that tracks which task is currently being evaluated. On pulse death, the trap rolls back the in-progress task from 'evaluating' to its pre-eval state (running/dispatched), so the next pulse can re-evaluate immediately without waiting for the grace period.

The rollback SQL uses AND status='evaluating' as a safety guard: if the task was already transitioned to its final state before the signal, the rollback is a no-op.
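As a rough cross-check of the ~30 hour estimate, the per-category figures can be multiplied out. The four listed categories cover 127 of the 205 events; the remaining events are not broken down in the PR, so this sketch gives a lower bound for those four categories only rather than reproducing the estimate itself.

```shell
#!/usr/bin/env bash
# Event counts and average stuck seconds, copied from the analysis above.
counts=(36 34 33 24)
avg_secs=(639 614 781 1092)

total_events=0
total_secs=0
for i in "${!counts[@]}"; do
	total_events=$((total_events + counts[i]))
	total_secs=$((total_secs + counts[i] * avg_secs[i]))
done

# Integer arithmetic only: compute tenths of an hour, then print one decimal.
tenths_of_hours=$((total_secs * 10 / 3600))
echo "events=$total_events stuck_seconds=$total_secs"
echo "hours=$((tenths_of_hours / 10)).$((tenths_of_hours % 10))"
```

This prints 127 events, 95861 stuck seconds, and 26.6 hours, which is in the same range as the stated ~30 hours across all 205 events.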

gemini-code-assist · 2026-02-20T04:13:35Z

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical reliability improvement to the supervisor's task evaluation mechanism. By adding signal handling for unexpected process terminations, it ensures that tasks are gracefully recovered rather than becoming orphaned in an 'evaluating' state. This change directly addresses a common cause of stalled tasks, leading to faster recovery times and more efficient resource utilization.

Highlights

Crash-Resilient Task Evaluation: Implemented a SIGTERM/SIGINT trap within the Phase 1 evaluation loop to prevent tasks from getting stuck in an 'evaluating' state if the supervisor pulse process is unexpectedly terminated.
Automated Task Rollback: Tasks that are in-progress during a pulse death are now automatically rolled back to their pre-evaluation state, ensuring they can be re-evaluated by the next pulse without significant delay.
Reduced Stale Task Windows: Eliminates the 120-600 second windows where tasks would remain stuck in 'evaluating' after a pulse crash, significantly improving system throughput and recovery.

Changelog

.agents/scripts/supervisor/pulse.sh
- Added a new shell function _phase1_cleanup_on_signal to handle task state rollback upon receiving SIGTERM or SIGINT.
- Introduced local variables _phase1_evaluating_tid and _phase1_pre_eval_state to track the currently evaluated task ID and its state before evaluation.
- Installed a trap for TERM and INT signals to execute _phase1_cleanup_on_signal at the beginning of the evaluation loop.
- Modified the task evaluation logic to record the current task's ID and its pre-evaluation state before transitioning to 'evaluating'.
- Cleared the task tracking variables (_phase1_evaluating_tid, _phase1_pre_eval_state) after a task successfully completes evaluation or is handled by specific retry/escalation paths.
- Restored the original signal trap configuration after the main evaluation loop finishes to ensure proper cleanup.
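The changelog does not show how pulse.sh saves and restores the previous handler. One common bash approach, sketched here as an assumption rather than the PR's actual code, is to capture the output of `trap -p` before installing the temporary handler and re-run it afterwards. `outer_cleanup`, `loop_cleanup` and `with_temporary_trap` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: temporarily override a signal handler, then put the old one back.
# All function names are hypothetical, not taken from pulse.sh.
outer_cleanup() { echo "outer cleanup"; }
loop_cleanup() { echo "loop cleanup"; }

trap outer_cleanup TERM INT # handler that exists before the loop

with_temporary_trap() {
	local saved_term saved_int
	saved_term="$(trap -p TERM)" # a re-runnable "trap -- '...' SIGTERM" command
	saved_int="$(trap -p INT)"

	trap loop_cleanup TERM INT
	during="$(trap -p TERM)"
	# ... the evaluation loop would run here ...

	# Restore: re-run the saved trap command, or reset if none was set.
	if [[ -n "$saved_term" ]]; then eval "$saved_term"; else trap - TERM; fi
	if [[ -n "$saved_int" ]]; then eval "$saved_int"; else trap - INT; fi
}

before="$(trap -p TERM)"
with_temporary_trap
after="$(trap -p TERM)"
echo "before: $before"
echo "during: $during"
echo "after:  $after"
```

Restoring matters because a handler installed only for the loop would otherwise shadow the outer shutdown handler for the rest of the pulse.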


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

coderabbitai · 2026-02-20T04:13:37Z

Walkthrough

Adds per-pulse tracking of the currently-evaluating task and its pre-evaluation state, and installs a SIGTERM/SIGINT trap to roll back an in-progress evaluation, release the pulse lock, and clean up temporary artifacts if the pulse is killed.

Changes

Cohort / File(s)	Summary
Evaluation Rollback Mechanism `.agents/scripts/supervisor/pulse.sh`	Introduce `_phase1_evaluating_tid` and `_phase1_pre_eval_state` tracking; install a temporary trap around the evaluation loop to roll back an evaluating task on SIGTERM/SIGINT, release the pulse lock, clean temporary artifacts, and restore the previous trap. Populate/clear trackers at evaluation transition points.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

t1183: Fix 3 supervisor pipeline bugs — PR-aware reaping, atomic eval, batch merge #1790: Related change to the supervisor pulse evaluation lifecycle (eval→complete atomicity) that complements the new rollback/trap handling.

Poem

⚙️ In the pulse where tasks ascend,
A safety net now wraps the end,
On signal's strike the state rewinds,
Locks released and temp files find,
A calmer shell where chaos mends. ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding crash-resilient SIGTERM rollback to fix stuck evaluating tasks, which is the primary objective of the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.


github-actions · 2026-02-20T04:13:55Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 04:13:52 UTC 2026: Code review monitoring started
Fri Feb 20 04:13:52 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 30
VULNERABILITIES: 0

Generated on: Fri Feb 20 04:13:55 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

gemini-code-assist

Code Review

The pull request introduces a robust crash-resilient mechanism for Phase 1 evaluation in pulse.sh. By implementing SIGTERM/SIGINT traps, the script now gracefully rolls back tasks that are in the 'evaluating' state if the pulse process is killed. This significantly reduces the time tasks remain stuck, improving overall throughput and reliability. The changes are well-documented with comments and adhere to the specified shell scripting standards, including the use of sql_escape for database interactions and || true for commands that might fail. The logic for tracking and clearing evaluation variables (_phase1_evaluating_tid, _phase1_pre_eval_state) is consistently applied throughout the evaluation loop, ensuring proper state management. The restoration of the original signal trap after the loop is also correctly handled.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1369-1377: The race occurs because _phase1_evaluating_tid is
assigned after cmd_transition("$tid","evaluating") — if SIGTERM hits between
those lines the cleanup logic sees an empty _phase1_evaluating_tid and skips
rollback; move the assignment of _phase1_evaluating_tid="$tid" to just before
calling cmd_transition so the cleanup handler always knows which task is being
evaluated (the SQL guard "AND status = 'evaluating'" prevents premature
rollbacks if the DB transition hasn't committed), apply the same reorder in the
"already-evaluating" path that uses the same variables.

.agents/scripts/supervisor/pulse.sh

Auto-dismissed: bot review does not block autonomous pipeline

Close SIGTERM race window by setting _phase1_evaluating_tid before cmd_transition in both code paths (already-evaluating and new-evaluation). The SQL guard (AND status = 'evaluating') in the cleanup handler makes a premature rollback a no-op if the DB transition hasn't committed yet.
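The effect of the reordering can be illustrated with a small simulation. This is a sketch under stated assumptions: state is a single file rather than the supervisor DB, a subshell stands in for the pulse, and `simulate`, `STATE_FILE` and the order labels are hypothetical. In both orderings the signal is delivered between the two bookkeeping steps.

```shell
#!/usr/bin/env bash
# Sketch of the ordering race; names are hypothetical, not pulse.sh code.
STATE_FILE="$(mktemp)"

# Run one evaluation attempt in a subshell and deliver SIGTERM between the
# two bookkeeping steps. $1 is "track_first" (fixed) or "transition_first" (racy).
simulate() {
	local order="$1"
	echo "running" >"$STATE_FILE"
	(
		evaluating_tid=""
		pre_eval_state=""
		cleanup() {
			# Roll back only if a task is tracked and still 'evaluating'.
			if [[ -n "$evaluating_tid" && "$(cat "$STATE_FILE")" == "evaluating" ]]; then
				echo "$pre_eval_state" >"$STATE_FILE"
			fi
			exit 143
		}
		trap cleanup TERM INT
		pre_eval_state="running"
		if [[ "$order" == "track_first" ]]; then
			evaluating_tid="t1"
			kill -TERM "$BASHPID" # signal lands before the transition
			echo "evaluating" >"$STATE_FILE"
		else
			echo "evaluating" >"$STATE_FILE"
			kill -TERM "$BASHPID" # signal lands before the tid is recorded
			evaluating_tid="t1"
		fi
	) || true
	cat "$STATE_FILE"
}

racy_result="$(simulate transition_first)"
fixed_result="$(simulate track_first)"
echo "transition_first -> $racy_result"
echo "track_first      -> $fixed_result"
rm -f "$STATE_FILE"
```

With the transition first, the handler sees an empty task ID and the task stays in evaluating. With tracking first, the handler runs but the guard makes it a no-op because the status has not yet changed, so the task remains in its pre-eval state.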

github-actions · 2026-02-20T04:40:21Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Fri Feb 20 04:40:16 UTC 2026: Code review monitoring started
Fri Feb 20 04:40:17 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 30
VULNERABILITIES: 0

Generated on: Fri Feb 20 04:40:20 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-20T04:41:05Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

🧹 Nitpick comments (1)

.agents/scripts/supervisor/pulse.sh (1)
1325-1332: Optional: add eval-checkpoint cleanup to the rollback handler.

_phase1_cleanup_on_signal correctly rolls back the DB status, but it doesn't remove the eval-checkpoints/${_phase1_evaluating_tid}.eval file that evaluate_with_ai() may have written. After a successful rollback the task leaves evaluating state, so Phase 0.7's stale-recovery path (which normally removes checkpoint files) never visits it. The orphaned file persists until the task is evaluated again. If a subsequent evaluation gets stuck for an unrelated reason, _diagnose_stale_root_cause will still return pulse_killed_mid_eval (false root cause) because the stale checkpoint file matches first.
♻️ Proposed addition to the rollback handler
 _phase1_cleanup_on_signal() {
 	if [[ -n "$_phase1_evaluating_tid" && -n "$_phase1_pre_eval_state" ]]; then
 		log_warn "  Phase 1 (t1269): pulse killed during evaluation of $_phase1_evaluating_tid — rolling back to $_phase1_pre_eval_state"
 		db "$SUPERVISOR_DB" "UPDATE tasks SET status = '$(sql_escape "$_phase1_pre_eval_state")', updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now'), error = 'pulse_killed_mid_eval_rollback_t1269' WHERE id = '$(sql_escape "$_phase1_evaluating_tid")' AND status = 'evaluating';" 2>/dev/null || true
+		# t1269: Remove eval checkpoint so the next stale-recovery pass
+		# doesn't misdiagnose the root cause as pulse_killed_mid_eval.
+		rm -f "${SUPERVISOR_DIR}/eval-checkpoints/${_phase1_evaluating_tid}.eval" 2>/dev/null || true
 		_phase1_evaluating_tid=""
 		_phase1_pre_eval_state=""
 	fi
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 1325 - 1332, The rollback
handler _phase1_cleanup_on_signal rolls back DB status but fails to remove the
eval-checkpoints/${_phase1_evaluating_tid}.eval file that evaluate_with_ai() may
have created, leaving an orphaned checkpoint that confuses stale-recovery;
update _phase1_cleanup_on_signal to, after the DB update and before clearing
_phase1_evaluating_tid, delete the checkpoint file (e.g., remove
eval-checkpoints/"${_phase1_evaluating_tid}".eval safely, ignoring errors),
ensuring you only attempt removal when _phase1_evaluating_tid is non-empty and
keep existing silence-on-error behavior similar to the db call.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 1363-1384: Race-condition fix verified: ensure both branches set
_phase1_evaluating_tid before any cmd_transition to close the SIGTERM race
window; keep the current order in the already-evaluating branch (setting
_phase1_pre_eval_state and _phase1_evaluating_tid) and in the new-evaluation
branch (set _phase1_pre_eval_state, then _phase1_evaluating_tid, then call
cmd_transition), and do not move or remove the SQL guard (AND status =
'evaluating') used by the cleanup handler.


gemini-code-assist bot reviewed Feb 20, 2026

coderabbitai bot previously requested changes Feb 20, 2026

marcusquinn mentioned this pull request Feb 20, 2026

[Supervisor:marcusquinn] 0 queued, 0 working, 1 in review at 14:46 UTC #1502

Closed

coderabbitai bot reviewed Feb 20, 2026

coderabbitai bot approved these changes Feb 20, 2026

marcusquinn merged commit 54e05e2 into main Feb 20, 2026
19 checks passed

marcusquinn added a commit that referenced this pull request Feb 20, 2026

chore: mark t1269 complete in TODO.md (#2002)

750457d

marcusquinn mentioned this pull request Feb 20, 2026

t1269: Fix stuck evaluating tasks: crash-resilient evaluation with immediate rollback on pulse death #2001

Closed

marcusquinn deleted the bugfix/t1269-crash-resilient-evaluation branch February 21, 2026 01:56

marcusquinn merged 2 commits into main from bugfix/t1269-crash-resilient-evaluation
