t1254: Fix stale-evaluating periodic heartbeat — eliminate false recoveries for long evals by marcusquinn · Pull Request #1958 · marcusquinn/aidevops

marcusquinn · 2026-02-19T11:03:32Z

Fixes the remaining stale-evaluating false recovery gap left after t1251 (PR #1952).

Root Cause

t1251 added a one-shot updated_at heartbeat at eval start, but if the AI eval takes longer than heartbeat_window (eval_timeout+30s = 120s default), Phase 0.7 still triggers false recovery — interrupting a live eval and causing unnecessary re-queuing.

Failure scenario before this fix:

eval_timeout=90s, heartbeat_window=120s
Eval starts, updated_at set at T=0
AI call takes 150s (slow model, large log tail)
At T=121s, Phase 0.7: secs_since_update=121 > 120 → triggers recovery
Original eval still running → duplicate execution / race condition
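
The timeline above can be checked with a few lines of shell arithmetic. This is a minimal sketch rather than code from the PR: `is_stale` is a hypothetical helper, and the two formulas are the ones quoted in this description (eval_timeout+30 before, eval_timeout*2+60 after).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): returns 0 (stale) when the seconds
# since the last updated_at touch exceed the heartbeat window, 1 otherwise.
is_stale() {
	local secs_since_update="$1" heartbeat_window="$2"
	((secs_since_update > heartbeat_window))
}

eval_timeout=90
old_window=$((eval_timeout + 30))     # 120s, the t1251 formula
new_window=$((eval_timeout * 2 + 60)) # 240s, the t1254 formula

# Phase 0.7 checks at T=121s while a 150s AI call is still running.
if is_stale 121 "$old_window"; then
	echo "old window (${old_window}s): false recovery at T=121s"
fi
if ! is_stale 121 "$new_window"; then
	echo "new window (${new_window}s): eval still considered active"
fi
```

A wider window on its own would still misfire for an eval longer than 240s with no intermediate touch, which is why the periodic heartbeat does the main work.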

Fixes

Periodic heartbeat loop (20s interval) runs alongside the AI call in a background subshell, keeping updated_at fresh for the full eval duration regardless of how long it takes
Wider heartbeat_window in _diagnose_stale_root_cause: eval_timeout*2+60 (240s default) instead of eval_timeout+30 (120s) — covers slow models that exceed eval_timeout
Final heartbeat touch after AI call completes, before teardown — covers the post-watchdog race window where Phase 0.7 could fire during teardown
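
The three fixes can be sketched together in a self-contained form. This is an illustration under stated assumptions, not the code in evaluate.sh: `touch_updated_at`, `start_heartbeat`, and `stop_heartbeat` are hypothetical names, a plain file stands in for the tasks table, and the interval is shortened from 20s to 1s so the sketch finishes quickly.

```shell
#!/usr/bin/env bash
# Stand-in for the real DB update (assumption): append a timestamp to a file.
touch_updated_at() {
	date -u +%Y-%m-%dT%H:%M:%SZ >>"$1"
}

# Start a background loop that refreshes the stamp file every $2 seconds
# for as long as the sentinel file exists. Sets HB_SENTINEL and HB_PID.
start_heartbeat() {
	local stamp_file="$1" interval="$2"
	HB_SENTINEL=$(mktemp) || return 1
	(
		while [[ -f "$HB_SENTINEL" ]]; do
			sleep "$interval"
			[[ -f "$HB_SENTINEL" ]] || break
			touch_updated_at "$stamp_file"
		done
	) &
	HB_PID=$!
}

# Stop the loop: remove the sentinel, then kill and reap the subshell.
stop_heartbeat() {
	rm -f "$HB_SENTINEL" 2>/dev/null || true
	kill "$HB_PID" 2>/dev/null || true
	wait "$HB_PID" 2>/dev/null || true
}

stamp_file=$(mktemp)
touch_updated_at "$stamp_file"  # one-shot touch at eval start (t1251)
start_heartbeat "$stamp_file" 1 # 1s for demonstration; the PR uses 20s
sleep 3                         # stands in for the long-running AI call
touch_updated_at "$stamp_file"  # final touch after the call, before teardown
stop_heartbeat
beats=$(wc -l <"$stamp_file" | tr -d ' ')
echo "updated_at was touched ${beats} times"
```

The sentinel gives the loop a way to exit on its own even if the `kill` never arrives, while `kill` plus `wait` keeps teardown from having to sit out a full sleep interval.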

Expected Impact

Eliminates remaining ~35min/task stale recovery overhead for evals that exceed 120s. Combined with t1251's fixes, this should reduce the stale-evaluating rate from 73% toward near-zero.

Files Changed

.agents/scripts/supervisor/evaluate.sh: periodic heartbeat loop + final touch
.agents/scripts/supervisor/pulse.sh: wider heartbeat_window formula

Summary by CodeRabbit

Bug Fixes
- Added periodic heartbeat during AI evaluations to prevent premature recovery actions for long-running tasks.
- Extended the detection window and adjusted timeout behavior to better accommodate slower models.
- Improved cleanup and teardown timing to ensure graceful completion and reduce false timeouts.

…eries (t1254)

Root cause: t1251 added a one-shot updated_at touch at eval start, but if the AI eval takes longer than heartbeat_window (eval_timeout+30s = 120s), Phase 0.7 still triggers false recovery, interrupting a live eval and causing re-queuing.

Fixes:
- Periodic heartbeat loop (20s interval) runs alongside the AI call in a background subshell, keeping updated_at fresh for the full eval duration
- Wider heartbeat_window in _diagnose_stale_root_cause: eval_timeout*2+60 (240s default) instead of eval_timeout+30 (120s), which covers slow models
- Final updated_at touch after AI call completes, before teardown, to cover the post-watchdog race window (Phase 0.7 could fire during teardown)
- Heartbeat sentinel/PID cleanup via push_cleanup for safety on early exit

Expected impact: eliminates remaining ~35min/task stale recovery overhead for evals that exceed 120s (slow models, large log tails)

gemini-code-assist · 2026-02-19T11:03:43Z

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where long-running AI evaluations were prematurely marked as 'stale-evaluating' and recovered, leading to duplicate executions and wasted resources. The changes introduce a more robust heartbeat mechanism to accurately reflect the active status of evaluations, significantly reducing false recovery incidents and improving the efficiency of the evaluation pipeline.

Highlights

Periodic Heartbeat Loop: Implemented a background subshell that runs alongside the AI call, updating updated_at every 20 seconds to prevent false recoveries for long-running evaluations.
Wider Heartbeat Window: Adjusted the heartbeat_window calculation in _diagnose_stale_root_cause from eval_timeout + 30 to eval_timeout * 2 + 60. This new window (defaulting to 240s) better accommodates slow models that might exceed the initial eval_timeout.
Final Heartbeat Touch: Added a final updated_at touch after the AI call completes but before teardown. This extends the grace window, preventing Phase 0.7 from triggering recovery during the post-evaluation cleanup phase.

Changelog

.agents/scripts/supervisor/evaluate.sh
- Added a periodic heartbeat loop using a sentinel file and background process to continuously update the task's updated_at timestamp during AI evaluation.
- Included cleanup logic to stop the periodic heartbeat process and remove its sentinel file upon evaluation completion or timeout.
- Inserted a final updated_at update after the AI evaluation concludes to cover the post-evaluation teardown phase.
.agents/scripts/supervisor/pulse.sh
- Modified the heartbeat_window calculation within _diagnose_stale_root_cause to eval_timeout * 2 + 60, providing a larger grace period for active evaluations.

Activity

No human activity has been recorded on this pull request yet.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, and Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

coderabbitai · 2026-02-19T11:03:50Z

Walkthrough

Adds a periodic heartbeat (every 20s) during AI evaluation that updates updated_at, plus a sentinel file that controls the heartbeat lifecycle. Also adjusts watchdog detection to use a wider dynamic heartbeat window (eval_timeout_cfg * 2 + 60) so that active evaluations are not prematurely considered stale.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Evaluation Heartbeat `.agents/scripts/supervisor/evaluate.sh` | Adds a background heartbeat process that touches `updated_at` every 20s, creates a sentinel file to control loop lifecycle, records the background PID, registers cleanup to stop the heartbeat, and performs a final touch during teardown. |
| Watchdog Window Timing `.agents/scripts/supervisor/pulse.sh` | Expands the evaluation-active detection window by switching from `eval_timeout_cfg + 30` to `eval_timeout_cfg * 2 + 60`, widening the grace period used when comparing `updated_at` against the current time. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as "evaluate.sh"
  participant HB as "heartbeat (background)"
  participant FS as "Filesystem (sentinel / updated_at)"
  participant Pulse as "pulse.sh (watchdog)"
  Eval->>FS: create sentinel & start HB (bg)
  Eval->>HB: spawn background loop (touch updated_at every 20s)
  loop every 20s
    HB->>FS: touch updated_at
  end
  Pulse->>FS: read updated_at
  Pulse->>Pulse: compare with window = eval_timeout_cfg * 2 + 60
  alt within window
    Pulse-->>Pulse: consider evaluation active (no recovery)
  else outside window
    Pulse-->>Pulse: flag stale / trigger recovery
  end
  Eval->>FS: remove sentinel & stop HB (on finish/timeout)
  Eval->>FS: final touch updated_at (post-eval teardown)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

t1183: Fix 3 supervisor pipeline bugs — PR-aware reaping, atomic eval, batch merge #1790: Modifies pulse.sh stale-evaluation detection/heartbeat logic — closely related to the window calculation change.
t1252: add eval watchdog, duration metrics, and hang detection #1955: Adjusts evaluate.sh sentinel/background lifecycle and coordinates with pulse.sh timing — overlaps in heartbeat/watchdog lifecycle handling.

Poem

💓 A tiny loop beats strong and steady,
Every twenty seconds, firm and ready,
A sentinel file keeps time in sight,
The watchdog waits, patient through the night,
Teardown kisses the timestamp goodnight. 🚀

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: fixing a periodic heartbeat mechanism (t1254) to prevent false recovery triggers for long-running AI evaluations. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


github-actions · 2026-02-19T11:04:01Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 29 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 11:03:57 UTC 2026: Code review monitoring started
Thu Feb 19 11:03:57 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 29

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 29
VULNERABILITIES: 0

Generated on: Thu Feb 19 11:04:00 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

gemini-code-assist

Code Review

This pull request introduces a periodic heartbeat mechanism to prevent false recoveries for long-running AI evaluations, which is a solid improvement. The implementation is mostly robust, following existing patterns in the codebase. I've identified a potential bug in the cleanup trap that could cause issues in error scenarios, and suggested a couple of minor refactorings to improve code clarity and maintainability. These suggestions align with established repository patterns for robust cleanup. Overall, the changes are well-aligned with the PR's goal.

gemini-code-assist · 2026-02-19T11:07:17Z

.agents/scripts/supervisor/evaluate.sh

+			done
+		) &
+		_hb_pid=$!
+		push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true"


The kill command in the push_cleanup string is missing an || true guard. If the script is running with set -e (which it is), and the process with PID _hb_pid has already exited when the trap runs, the kill command will fail with a non-zero exit code. This would cause the trap to terminate prematurely, and the subsequent wait command would not be executed, potentially leaving a zombie process.

The manual cleanup block on line 1520 correctly uses || true, and this trap should be just as robust.

Suggested change

- push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true"
+ push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null || true; wait '${_hb_pid}' 2>/dev/null || true"

References

For resource cleanup in shell scripts, use the established project pattern: use _save_cleanup_scope, trap '_run_cleanups' RETURN, and push_cleanup for robust cleanup on any exit path, and also include explicit manual cleanup at the end of the normal execution path as a 'fast-path'.

Fixed in bfb87fb. Added || true to the kill call in the push_cleanup string so a process that has already exited does not cause the trap to terminate prematurely before the wait runs.
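
The failure mode described in this thread can be reproduced in isolation. The sketch below is hedged: how `_run_cleanups` actually executes registered strings is not shown on this page, so `run_cleanup` is a hypothetical stand-in that runs a cleanup string in a fresh bash with `set -e` enabled and prints a marker only if the end of the string is reached.

```shell
#!/usr/bin/env bash
# Start and reap a short-lived process so that its PID no longer exists.
true &
dead_pid=$!
wait "$dead_pid"

# Hypothetical runner (assumption): execute a cleanup string in a fresh bash
# with errexit enabled; print a marker only if every step was reached.
run_cleanup() {
	bash -c "set -e; $1; echo reached-end" 2>/dev/null
}

unguarded="kill '${dead_pid}' 2>/dev/null; wait '${dead_pid}' 2>/dev/null || true"
guarded="kill '${dead_pid}' 2>/dev/null || true; wait '${dead_pid}' 2>/dev/null || true"

unguarded_out=$(run_cleanup "$unguarded") || true
guarded_out=$(run_cleanup "$guarded") || true

echo "unguarded: ${unguarded_out:-aborted before wait}"
echo "guarded:   ${guarded_out}"
```

A fresh `bash -c` is used deliberately: bash ignores `set -e` inside a function or subshell that is invoked on the left of `||`, which would hide the very behavior being demonstrated.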

gemini-code-assist · 2026-02-19T11:07:18Z

.agents/scripts/supervisor/evaluate.sh

+	if [[ -n "$_hb_sentinel" ]]; then
+		rm -f "$_hb_sentinel" 2>/dev/null || true
+	fi
+	if [[ -n "$_hb_pid" ]]; then
+		kill "$_hb_pid" 2>/dev/null || true
+		wait "$_hb_pid" 2>/dev/null || true
+	fi


The manual cleanup logic can be made more concise. Since _hb_pid is only set when _hb_sentinel is successfully created and the background process is launched, checking for _hb_pid is sufficient to guard all cleanup operations for the heartbeat process.

Suggested change

-	if [[ -n "$_hb_sentinel" ]]; then
-		rm -f "$_hb_sentinel" 2>/dev/null || true
-	fi
-	if [[ -n "$_hb_pid" ]]; then
-		kill "$_hb_pid" 2>/dev/null || true
-		wait "$_hb_pid" 2>/dev/null || true
-	fi
+	if [[ -n "$_hb_pid" ]]; then
+		rm -f "$_hb_sentinel" 2>/dev/null || true
+		kill "$_hb_pid" 2>/dev/null || true
+		wait "$_hb_pid" 2>/dev/null || true
+	fi

References

For resource cleanup in shell scripts, use the established project pattern: use _save_cleanup_scope, trap '_run_cleanups' RETURN, and push_cleanup for robust cleanup on any exit path, and also include explicit manual cleanup at the end of the normal execution path as a 'fast-path'.

Fixed in bfb87fb. Consolidated the manual cleanup block to check only $_hb_pid (which is only set when $_hb_sentinel was successfully created), removing the redundant $_hb_sentinel guard as suggested.

gemini-code-assist · 2026-02-19T11:07:18Z

.agents/scripts/supervisor/pulse.sh

+		# heartbeat_window = eval_timeout * 2 + 60: covers slow models that exceed
+		# eval_timeout (e.g., 90s timeout but model responds at 150s), plus 60s
+		# buffer for teardown. The periodic heartbeat keeps updated_at fresh, so
+		# this window only needs to cover the 20s heartbeat interval + teardown.


The comment here presents two different justifications for the heartbeat_window size, which can be confusing. One part suggests a large window is needed to cover long-running evaluations, while the other says only a small window is needed because of the new periodic heartbeat.

To improve clarity, I suggest rephrasing the comment to explain that the large window is a defense-in-depth measure in case the heartbeat process fails, while the periodic heartbeat is the primary mechanism that makes the window less critical under normal operation.

Fixed in bfb87fb. Rewrote the comment to give a single clear justification: the wide window is defense-in-depth for the case where the heartbeat subshell itself is unexpectedly killed or stalled, rather than presenting two conflicting rationales.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.agents/scripts/supervisor/evaluate.sh (1)

1442-1478: ⚠️ Potential issue | 🔴 Critical

Heartbeat cleanup is dropped on early exit due to cleanup scope ordering.

The heartbeat cleanup is registered via push_cleanup at line 1466, then _save_cleanup_scope is called at line 1476, which moves the heartbeat cleanup to the save stack and clears the current scope. The watchdog cleanup is then registered into the now-empty scope at lines 1483 and 1496.

If the eval exits early (error handling, signal, early return), the RETURN trap fires and executes only the current _CLEANUP_CMDS (watchdog cleanup). The heartbeat cleanup remains in _CLEANUP_SAVE_STACK orphaned—_restore_cleanup_scope is never called to restore it. The heartbeat background loop continues indefinitely, refreshing updated_at and masking stale evals.

Move _save_cleanup_scope and trap above the heartbeat registration block, or re-register heartbeat cleanup after the scope is saved.

✅ Corrected ordering

	# t1251: Initial heartbeat — touch updated_at so Phase 0.7's grace period
	# doesn't fire while evaluation is actively running.
	db "$SUPERVISOR_DB" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || true

+	# t1252: Watchdog timer — detect evaluation hangs within 60s instead of
+	# waiting for the full stale timeout (120s grace + eval_timeout).
+	# The watchdog fires a warning log at 60s so trends are visible in logs
+	# before the timeout kills the eval at eval_timeout seconds.
+	# Uses a background subshell that self-terminates when the eval completes.
+	# Cleanup scope ensures sentinel file and watchdog process are released on
+	# any exit path (normal return, early return, or unexpected signal).
+	_save_cleanup_scope
+	trap '_run_cleanups' RETURN
+
	# t1254: Periodic heartbeat — refresh updated_at every 20s during the AI call.
	# The one-shot heartbeat from t1251 only covers eval_timeout+30s (120s window).
	# If the AI eval takes >120s (slow model, large log), Phase 0.7 triggers false
	# recovery even though eval is actively running. A periodic heartbeat keeps
	# updated_at fresh for the full duration of the eval, regardless of how long
	# it takes. The loop self-terminates when the sentinel file is removed.
	local _hb_sentinel
	_hb_sentinel=$(mktemp 2>/dev/null || echo "")
	local _hb_pid=""
	if [[ -n "$_hb_sentinel" ]]; then
		local _hb_task_id="$task_id"
		local _hb_db="$SUPERVISOR_DB"
		(
			while [[ -f "$_hb_sentinel" ]]; do
				sleep 20
				[[ -f "$_hb_sentinel" ]] || break
				db "$_hb_db" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$_hb_task_id")';" 2>/dev/null || true
			done
		) &
		_hb_pid=$!
		push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true"
	fi
-
-	# t1252: Watchdog timer — detect evaluation hangs within 60s instead of
-	# waiting for the full stale timeout (120s grace + eval_timeout).
-	# The watchdog fires a warning log at 60s so trends are visible in logs
-	# before the timeout kills the eval at eval_timeout seconds.
-	# Uses a background subshell that self-terminates when the eval completes.
-	# Cleanup scope ensures sentinel file and watchdog process are released on
-	# any exit path (normal return, early return, or unexpected signal).
-	_save_cleanup_scope
-	trap '_run_cleanups' RETURN
	local watchdog_pid=""

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1478, The
heartbeat cleanup pushed with push_cleanup (the rm/kill/wait for
_hb_sentinel/_hb_pid) is saved into the saved cleanup stack because
_save_cleanup_scope is called after the heartbeat is registered, so an early
RETURN only runs the current _CLEANUP_CMDS (watchdog) and leaves the heartbeat
running; fix by moving the _save_cleanup_scope and trap '_run_cleanups' RETURN
calls to before the heartbeat creation block so both heartbeat and watchdog
cleanups live in the same active scope, or alternatively call push_cleanup for
the heartbeat after _save_cleanup_scope so the heartbeat cleanup is in the
current scope that _run_cleanups executes (reference symbols: push_cleanup,
_save_cleanup_scope, trap '_run_cleanups', _CLEANUP_CMDS, _CLEANUP_SAVE_STACK).

🧹 Nitpick comments (1)

.agents/scripts/supervisor/pulse.sh (1)

195-203: Optional guard for non-numeric SUPERVISOR_EVAL_TIMEOUT.

Line 202 relies on arithmetic expansion; if the env var is accidentally set to a non-integer (e.g., 90s), bash emits an arithmetic error and the heartbeat window can degrade. A small numeric sanitize keeps the stale-eval guardrail robust.

🛠️ Suggested hardening

- local eval_timeout_cfg="${SUPERVISOR_EVAL_TIMEOUT:-90}"
- local heartbeat_window=$((eval_timeout_cfg * 2 + 60))
+ local eval_timeout_cfg="${SUPERVISOR_EVAL_TIMEOUT:-90}"
+ if ! [[ "$eval_timeout_cfg" =~ ^[0-9]+$ ]]; then
+   log_warn "SUPERVISOR_EVAL_TIMEOUT is non-numeric ('$eval_timeout_cfg'); defaulting to 90"
+   eval_timeout_cfg=90
+ fi
+ local heartbeat_window=$((eval_timeout_cfg * 2 + 60))

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 195 - 203, The arithmetic
for heartbeat_window can fail if SUPERVISOR_EVAL_TIMEOUT contains non-digits;
sanitize and default the value before computing heartbeat_window: read
SUPERVISOR_EVAL_TIMEOUT into eval_timeout_cfg, strip non-numeric characters (or
extract digits), and if the result is empty or zero fall back to the default 90,
then compute heartbeat_window using that validated integer; update the logic
around the existing eval_timeout_cfg and heartbeat_window variables in pulse.sh
to perform this validation.
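
A runnable version of this guard might look like the sketch below. It is not the code that was merged: `sanitize_eval_timeout` and `heartbeat_window_for` are hypothetical helper names, the `log_warn` call from the suggestion is omitted to keep the example self-contained, and falling back to 90 for empty, zero, or non-numeric input follows the wording of the prompt above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: echo a validated eval timeout, falling back to 90 when
# the input is empty, zero, or not a plain non-negative integer.
sanitize_eval_timeout() {
	local raw="${1:-}" default=90
	# 10# forces base 10 so a value such as 090 is not parsed as octal.
	if [[ "$raw" =~ ^[0-9]+$ ]] && ((10#$raw > 0)); then
		echo "$((10#$raw))"
	else
		echo "$default"
	fi
}

# Hypothetical helper: compute the widened window from a raw config value.
heartbeat_window_for() {
	local eval_timeout_cfg
	eval_timeout_cfg=$(sanitize_eval_timeout "$1")
	echo $((eval_timeout_cfg * 2 + 60))
}

heartbeat_window_for "90"  # prints 240
heartbeat_window_for "90s" # non-numeric, falls back to 90, prints 240
heartbeat_window_for "150" # prints 360
heartbeat_window_for ""    # empty, falls back to 90, prints 240
```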


github-actions · 2026-02-19T11:15:20Z

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 29 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 11:15:16 UTC 2026: Code review monitoring started
Thu Feb 19 11:15:16 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 29

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 29
VULNERABILITIES: 0

Generated on: Thu Feb 19 11:15:18 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-19T11:16:14Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.agents/scripts/supervisor/evaluate.sh (1)

1442-1477: ⚠️ Potential issue | 🟠 Major

Move _save_cleanup_scope before heartbeat spawn to follow standard cleanup pattern.

The heartbeat push_cleanup at line 1466 is registered before _save_cleanup_scope is called at line 1476. While the code includes explicit manual cleanup of both heartbeat and watchdog (lines 1514–1537) before early return paths, this ordering violates the established pattern throughout the codebase and creates a fragile structure. Any future code path that adds an early return between watchdog setup and manual cleanup would orphan the heartbeat cleanup, leaving the background update loop running indefinitely and masking stale recovery.

Move _save_cleanup_scope and trap to the beginning of the heartbeat section:

🔧 Recommended cleanup scope ordering

 	# t1251: Initial heartbeat — touch updated_at so Phase 0.7's grace period
 	# doesn't fire while evaluation is actively running.
 	db "$SUPERVISOR_DB" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || true

+	# Establish cleanup scope before spawning background helpers (heartbeat + watchdog).
+	_save_cleanup_scope
+	trap '_run_cleanups' RETURN
+
 	# t1254: Periodic heartbeat — refresh updated_at every 20s during the AI call.
 	# The one-shot heartbeat from t1251 only covers eval_timeout+30s (120s window).
 	# If the AI eval takes >120s (slow model, large log), Phase 0.7 triggers false
 	# recovery even though eval is actively running. A periodic heartbeat keeps
 	# updated_at fresh for the full duration of the eval, regardless of how long
 	# it takes. The loop self-terminates when the sentinel file is removed.
 	local _hb_sentinel
 	_hb_sentinel=$(mktemp 2>/dev/null || echo "")
 	local _hb_pid=""
 	if [[ -n "$_hb_sentinel" ]]; then
 		local _hb_task_id="$task_id"
 		local _hb_db="$SUPERVISOR_DB"
 		(
 			while [[ -f "$_hb_sentinel" ]]; do
 				sleep 20
 				[[ -f "$_hb_sentinel" ]] || break
 				db "$_hb_db" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$_hb_task_id")';" 2>/dev/null || true
 			done
 		) &
 		_hb_pid=$!
 		push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null || true; wait '${_hb_pid}' 2>/dev/null || true"
 	fi

 	# t1252: Watchdog timer — detect evaluation hangs within 60s instead of
 	# waiting for the full stale timeout (120s grace + eval_timeout).
 	# The watchdog fires a warning log at 60s so trends are visible in logs
 	# before the timeout kills the eval at eval_timeout seconds.
 	# Uses a background subshell that self-terminates when the eval completes.
 	# Cleanup scope ensures sentinel file and watchdog process are released on
 	# any exit path (normal return, early return, or unexpected signal).
-	_save_cleanup_scope
-	trap '_run_cleanups' RETURN
 	local watchdog_pid=""
 	local watchdog_sentinel

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1477, The
heartbeat's cleanup registration (push_cleanup inside the background heartbeat
block that uses _hb_sentinel and _hb_pid) is being added before the cleanup
scope is saved, which can orphan the heartbeat if an early return is added
later; move the call to _save_cleanup_scope and the trap '_run_cleanups' RETURN
so they occur before spawning the heartbeat subprocess and before any
push_cleanup calls (i.e., call _save_cleanup_scope and set trap prior to
creating _hb_sentinel / starting the background loop that registers
push_cleanup), ensuring all push_cleanup registrations for the heartbeat and
watchdog are captured in the saved scope.


marcusquinn mentioned this pull request Feb 19, 2026

[Supervisor:marcusquinn] 0 queued, 0 working at 11:56 UTC #1502

Open

gemini-code-assist bot reviewed Feb 19, 2026

coderabbitai bot reviewed Feb 19, 2026

coderabbitai bot approved these changes Feb 19, 2026

fix: address review feedback for t1254

bfb87fb

coderabbitai bot reviewed Feb 19, 2026

marcusquinn merged commit 69c396a into main Feb 19, 2026
19 checks passed

marcusquinn deleted the feature/t1254 branch February 19, 2026 11:43

marcusquinn added a commit that referenced this pull request Feb 19, 2026

chore: mark t1254 complete in TODO.md (#1958)

0d5c8f2

marcusquinn mentioned this pull request Feb 19, 2026

t1254: Add stale-evaluating root cause fix based on t1251 investigation findings #1957

Closed

