Skip to content

Comments

t1254: Fix stale-evaluating periodic heartbeat — eliminate false recoveries for long evals#1958

Merged
marcusquinn merged 2 commits intomainfrom
feature/t1254
Feb 19, 2026
Merged

t1254: Fix stale-evaluating periodic heartbeat — eliminate false recoveries for long evals#1958
marcusquinn merged 2 commits intomainfrom
feature/t1254

Conversation

@marcusquinn
Copy link
Owner

@marcusquinn marcusquinn commented Feb 19, 2026

Fixes the remaining stale-evaluating false recovery gap left after t1251 (PR #1952).

Root Cause

t1251 added a one-shot updated_at heartbeat at eval start, but if the AI eval takes longer than heartbeat_window (eval_timeout+30s = 120s default), Phase 0.7 still triggers false recovery — interrupting a live eval and causing unnecessary re-queuing.

Failure scenario before this fix:

  • eval_timeout=90s, heartbeat_window=120s
  • Eval starts, updated_at set at T=0
  • AI call takes 150s (slow model, large log tail)
  • At T=121s, Phase 0.7: secs_since_update=121 > 120 → triggers recovery
  • Original eval still running → duplicate execution / race condition

Fixes

  1. Periodic heartbeat loop (20s interval) runs alongside the AI call in a background subshell, keeping updated_at fresh for the full eval duration regardless of how long it takes
  2. Wider heartbeat_window in _diagnose_stale_root_cause: eval_timeout*2+60 (240s default) instead of eval_timeout+30 (120s) — covers slow models that exceed eval_timeout
  3. Final heartbeat touch after AI call completes, before teardown — covers the post-watchdog race window where Phase 0.7 could fire during teardown

Expected Impact

Eliminates remaining ~35min/task stale recovery overhead for evals that exceed 120s. Combined with t1251's fixes, this should reduce the stale-evaluating rate from 73% toward near-zero.

Files Changed

  • .agents/scripts/supervisor/evaluate.sh: periodic heartbeat loop + final touch
  • .agents/scripts/supervisor/pulse.sh: wider heartbeat_window formula

Summary by CodeRabbit

  • Bug Fixes
    • Added periodic heartbeat during AI evaluations to prevent premature recovery actions for long-running tasks.
    • Extended the detection window and adjusted timeout behavior to better accommodate slower models.
    • Improved cleanup and teardown timing to ensure graceful completion and reduce false timeouts.

…eries (t1254)

Root cause: t1251 added a one-shot updated_at touch at eval start, but if the
AI eval takes longer than heartbeat_window (eval_timeout+30s = 120s), Phase 0.7
still triggers false recovery — interrupting a live eval and causing re-queuing.

Fixes:
- Periodic heartbeat loop (20s interval) runs alongside the AI call in a
  background subshell, keeping updated_at fresh for the full eval duration
- Wider heartbeat_window in _diagnose_stale_root_cause: eval_timeout*2+60
  (240s default) instead of eval_timeout+30 (120s) — covers slow models
- Final updated_at touch after AI call completes, before teardown, to cover
  the post-watchdog race window (Phase 0.7 could fire during teardown)
- Heartbeat sentinel/PID cleanup via push_cleanup for safety on early exit

Expected impact: eliminates remaining ~35min/task stale recovery overhead
for evals that exceed 120s (slow models, large log tails)
@gemini-code-assist
Copy link

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where long-running AI evaluations were prematurely marked as 'stale-evaluating' and recovered, leading to duplicate executions and wasted resources. The changes introduce a more robust heartbeat mechanism to accurately reflect the active status of evaluations, significantly reducing false recovery incidents and improving the efficiency of the evaluation pipeline.

Highlights

  • Periodic Heartbeat Loop: Implemented a background subshell that runs alongside the AI call, updating updated_at every 20 seconds to prevent false recoveries for long-running evaluations.
  • Wider Heartbeat Window: Adjusted the heartbeat_window calculation in _diagnose_stale_root_cause from eval_timeout + 30 to eval_timeout * 2 + 60. This new window (defaulting to 240s) better accommodates slow models that might exceed the initial eval_timeout.
  • Final Heartbeat Touch: Added a final updated_at touch after the AI call completes but before teardown. This extends the grace window, preventing Phase 0.7 from triggering recovery during the post-evaluation cleanup phase.
Changelog
  • .agents/scripts/supervisor/evaluate.sh
    • Added a periodic heartbeat loop using a sentinel file and background process to continuously update the task's updated_at timestamp during AI evaluation.
    • Included cleanup logic to stop the periodic heartbeat process and remove its sentinel file upon evaluation completion or timeout.
    • Inserted a final updated_at update after the AI evaluation concludes to cover the post-evaluation teardown phase.
  • .agents/scripts/supervisor/pulse.sh
    • Modified the heartbeat_window calculation within _diagnose_stale_root_cause to eval_timeout * 2 + 60, providing a larger grace period for active evaluations.
Activity
  • No human activity has been recorded on this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 19, 2026

Walkthrough

Adds a periodic heartbeat (every 20s) during AI evaluation that updates updated_at and a sentinel to control the heartbeat lifecycle; adjusts watchdog detection to use a wider dynamic heartbeat window (eval_timeout_cfg * 2 + 60) so active evaluations are not prematurely considered stale.

Changes

Cohort / File(s) Summary
Evaluation Heartbeat
.agents/scripts/supervisor/evaluate.sh
Adds a background heartbeat process that touches updated_at every 20s, creates a sentinel file to control loop lifecycle, records the background PID, registers cleanup to stop the heartbeat, and performs a final touch during teardown.
Watchdog Window Timing
.agents/scripts/supervisor/pulse.sh
Expands the evaluation-active detection window by switching from eval_timeout_cfg + 30 to eval_timeout_cfg * 2 + 60, widening the grace period used when comparing updated_at against the current time.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as "evaluate.sh"
  participant HB as "heartbeat (background)"
  participant FS as "Filesystem (sentinel / updated_at)"
  participant Pulse as "pulse.sh (watchdog)"
  Eval->>FS: create sentinel & start HB (bg)
  Eval->>HB: spawn background loop (touch updated_at every 20s)
  loop every 20s
    HB->>FS: touch updated_at
  end
  Pulse->>FS: read updated_at
  Pulse->>Pulse: compare with window = eval_timeout_cfg * 2 + 60
  alt within window
    Pulse-->>Pulse: consider evaluation active (no recovery)
  else outside window
    Pulse-->>Pulse: flag stale / trigger recovery
  end
  Eval->>FS: remove sentinel & stop HB (on finish/timeout)
  Eval->>FS: final touch updated_at (post-eval teardown)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

💓 A tiny loop beats strong and steady,
Every twenty seconds, firm and ready,
A sentinel file keeps time in sight,
The watchdog waits, patient through the night,
Teardown kisses the timestamp goodnight. 🚀

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing a periodic heartbeat mechanism (t1254) to prevent false recovery triggers for long-running AI evaluations.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/t1254

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 29 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Thu Feb 19 11:03:57 UTC 2026: Code review monitoring started
Thu Feb 19 11:03:57 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 29

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 29
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 11:04:00 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a periodic heartbeat mechanism to prevent false recoveries for long-running AI evaluations, which is a solid improvement. The implementation is mostly robust, following existing patterns in the codebase. I've identified a potential bug in the cleanup trap that could cause issues in error scenarios, and suggested a couple of minor refactorings to improve code clarity and maintainability. These suggestions align with established repository patterns for robust cleanup. Overall, the changes are well-aligned with the PR's goal.

done
) &
_hb_pid=$!
push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The kill command in the push_cleanup string is missing an || true guard. If the script is running with set -e (which it is), and the process with PID _hb_pid has already exited when the trap runs, the kill command will fail with a non-zero exit code. This would cause the trap to terminate prematurely, and the subsequent wait command would not be executed, potentially leaving a zombie process.

The manual cleanup block on line 1520 correctly uses || true, and this trap should be just as robust.

Suggested change
push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true"
push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null || true; wait '${_hb_pid}' 2>/dev/null || true"
References
  1. For resource cleanup in shell scripts, use the established project pattern: use _save_cleanup_scope, trap '_run_cleanups' RETURN, and push_cleanup for robust cleanup on any exit path, and also include explicit manual cleanup at the end of the normal execution path as a 'fast-path'.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bfb87fb. Added || true to the kill call in the push_cleanup string so a process that has already exited does not cause the trap to terminate prematurely before the wait runs.

Comment on lines 1516 to 1522
if [[ -n "$_hb_sentinel" ]]; then
rm -f "$_hb_sentinel" 2>/dev/null || true
fi
if [[ -n "$_hb_pid" ]]; then
kill "$_hb_pid" 2>/dev/null || true
wait "$_hb_pid" 2>/dev/null || true
fi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The manual cleanup logic can be made more concise. Since _hb_pid is only set when _hb_sentinel is successfully created and the background process is launched, checking for _hb_pid is sufficient to guard all cleanup operations for the heartbeat process.

Suggested change
if [[ -n "$_hb_sentinel" ]]; then
rm -f "$_hb_sentinel" 2>/dev/null || true
fi
if [[ -n "$_hb_pid" ]]; then
kill "$_hb_pid" 2>/dev/null || true
wait "$_hb_pid" 2>/dev/null || true
fi
if [[ -n "$_hb_pid" ]]; then
rm -f "$_hb_sentinel" 2>/dev/null || true
kill "$_hb_pid" 2>/dev/null || true
wait "$_hb_pid" 2>/dev/null || true
fi
References
  1. For resource cleanup in shell scripts, use the established project pattern: use _save_cleanup_scope, trap '_run_cleanups' RETURN, and push_cleanup for robust cleanup on any exit path, and also include explicit manual cleanup at the end of the normal execution path as a 'fast-path'.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bfb87fb. Consolidated the manual cleanup block to check only $_hb_pid (which is only set when $_hb_sentinel was successfully created), removing the redundant $_hb_sentinel guard as suggested.

Comment on lines 198 to 201
# heartbeat_window = eval_timeout * 2 + 60: covers slow models that exceed
# eval_timeout (e.g., 90s timeout but model responds at 150s), plus 60s
# buffer for teardown. The periodic heartbeat keeps updated_at fresh, so
# this window only needs to cover the 20s heartbeat interval + teardown.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment here presents two different justifications for the heartbeat_window size, which can be confusing. One part suggests a large window is needed to cover long-running evaluations, while the other says only a small window is needed because of the new periodic heartbeat.

To improve clarity, I suggest rephrasing the comment to explain that the large window is a defense-in-depth measure in case the heartbeat process fails, while the periodic heartbeat is the primary mechanism that makes the window less critical under normal operation.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bfb87fb. Rewrote the comment to give a single clear justification: the wide window is defense-in-depth for the case where the heartbeat subshell itself is unexpectedly killed or stalled, rather than presenting two conflicting rationales.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/evaluate.sh (1)

1442-1478: ⚠️ Potential issue | 🔴 Critical

Heartbeat cleanup is dropped on early exit due to cleanup scope ordering.

The heartbeat cleanup is registered via push_cleanup at line 1466, then _save_cleanup_scope is called at line 1476, which moves the heartbeat cleanup to the save stack and clears the current scope. The watchdog cleanup is then registered into the now-empty scope at lines 1483 and 1496.

If the eval exits early (error handling, signal, early return), the RETURN trap fires and executes only the current _CLEANUP_CMDS (watchdog cleanup). The heartbeat cleanup remains in _CLEANUP_SAVE_STACK orphaned—_restore_cleanup_scope is never called to restore it. The heartbeat background loop continues indefinitely, refreshing updated_at and masking stale evals.

Move _save_cleanup_scope and trap above the heartbeat registration block, or re-register heartbeat cleanup after the scope is saved.

✅ Corrected ordering
	# t1251: Initial heartbeat — touch updated_at so Phase 0.7's grace period
	# doesn't fire while evaluation is actively running.
	db "$SUPERVISOR_DB" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || true

+	# t1252: Watchdog timer — detect evaluation hangs within 60s instead of
+	# waiting for the full stale timeout (120s grace + eval_timeout).
+	# The watchdog fires a warning log at 60s so trends are visible in logs
+	# before the timeout kills the eval at eval_timeout seconds.
+	# Uses a background subshell that self-terminates when the eval completes.
+	# Cleanup scope ensures sentinel file and watchdog process are released on
+	# any exit path (normal return, early return, or unexpected signal).
+	_save_cleanup_scope
+	trap '_run_cleanups' RETURN
+
	# t1254: Periodic heartbeat — refresh updated_at every 20s during the AI call.
	# The one-shot heartbeat from t1251 only covers eval_timeout+30s (120s window).
	# If the AI eval takes >120s (slow model, large log), Phase 0.7 triggers false
	# recovery even though eval is actively running. A periodic heartbeat keeps
	# updated_at fresh for the full duration of the eval, regardless of how long
	# it takes. The loop self-terminates when the sentinel file is removed.
	local _hb_sentinel
	_hb_sentinel=$(mktemp 2>/dev/null || echo "")
	local _hb_pid=""
	if [[ -n "$_hb_sentinel" ]]; then
		local _hb_task_id="$task_id"
		local _hb_db="$SUPERVISOR_DB"
		(
			while [[ -f "$_hb_sentinel" ]]; do
				sleep 20
				[[ -f "$_hb_sentinel" ]] || break
				db "$_hb_db" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$_hb_task_id")';" 2>/dev/null || true
			done
		) &
		_hb_pid=$!
		push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true"
	fi
-
-	# t1252: Watchdog timer — detect evaluation hangs within 60s instead of
-	# waiting for the full stale timeout (120s grace + eval_timeout).
-	# The watchdog fires a warning log at 60s so trends are visible in logs
-	# before the timeout kills the eval at eval_timeout seconds.
-	# Uses a background subshell that self-terminates when the eval completes.
-	# Cleanup scope ensures sentinel file and watchdog process are released on
-	# any exit path (normal return, early return, or unexpected signal).
-	_save_cleanup_scope
-	trap '_run_cleanups' RETURN
	local watchdog_pid=""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1478, The
heartbeat cleanup pushed with push_cleanup (the rm/kill/wait for
_hb_sentinel/_hb_pid) is saved into the saved cleanup stack because
_save_cleanup_scope is called after the heartbeat is registered, so an early
RETURN only runs the current _CLEANUP_CMDS (watchdog) and leaves the heartbeat
running; fix by moving the _save_cleanup_scope and trap '_run_cleanups' RETURN
calls to before the heartbeat creation block so both heartbeat and watchdog
cleanups live in the same active scope, or alternatively call push_cleanup for
the heartbeat after _save_cleanup_scope so the heartbeat cleanup is in the
current scope that _run_cleanups executes (reference symbols: push_cleanup,
_save_cleanup_scope, trap '_run_cleanups', _CLEANUP_CMDS, _CLEANUP_SAVE_STACK).
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)

195-203: Optional guard for non-numeric SUPERVISOR_EVAL_TIMEOUT.

Line 202 relies on arithmetic expansion; if the env var is accidentally set to a non-integer (e.g., 90s), bash emits an arithmetic error and the heartbeat window can degrade. A small numeric sanitize keeps the stale-eval guardrail robust.

🛠️ Suggested hardening
- local eval_timeout_cfg="${SUPERVISOR_EVAL_TIMEOUT:-90}"
- local heartbeat_window=$((eval_timeout_cfg * 2 + 60))
+ local eval_timeout_cfg="${SUPERVISOR_EVAL_TIMEOUT:-90}"
+ if ! [[ "$eval_timeout_cfg" =~ ^[0-9]+$ ]]; then
+   log_warn "SUPERVISOR_EVAL_TIMEOUT is non-numeric ('$eval_timeout_cfg'); defaulting to 90"
+   eval_timeout_cfg=90
+ fi
+ local heartbeat_window=$((eval_timeout_cfg * 2 + 60))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 195 - 203, The arithmetic
for heartbeat_window can fail if SUPERVISOR_EVAL_TIMEOUT contains non-digits;
sanitize and default the value before computing heartbeat_window: read
SUPERVISOR_EVAL_TIMEOUT into eval_timeout_cfg, strip non-numeric characters (or
extract digits), and if the result is empty or zero fall back to the default 90,
then compute heartbeat_window using that validated integer; update the logic
around the existing eval_timeout_cfg and heartbeat_window variables in pulse.sh
to perform this validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1442-1478: The heartbeat cleanup pushed with push_cleanup (the
rm/kill/wait for _hb_sentinel/_hb_pid) is saved into the saved cleanup stack
because _save_cleanup_scope is called after the heartbeat is registered, so an
early RETURN only runs the current _CLEANUP_CMDS (watchdog) and leaves the
heartbeat running; fix by moving the _save_cleanup_scope and trap
'_run_cleanups' RETURN calls to before the heartbeat creation block so both
heartbeat and watchdog cleanups live in the same active scope, or alternatively
call push_cleanup for the heartbeat after _save_cleanup_scope so the heartbeat
cleanup is in the current scope that _run_cleanups executes (reference symbols:
push_cleanup, _save_cleanup_scope, trap '_run_cleanups', _CLEANUP_CMDS,
_CLEANUP_SAVE_STACK).

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 195-203: The arithmetic for heartbeat_window can fail if
SUPERVISOR_EVAL_TIMEOUT contains non-digits; sanitize and default the value
before computing heartbeat_window: read SUPERVISOR_EVAL_TIMEOUT into
eval_timeout_cfg, strip non-numeric characters (or extract digits), and if the
result is empty or zero fall back to the default 90, then compute
heartbeat_window using that validated integer; update the logic around the
existing eval_timeout_cfg and heartbeat_window variables in pulse.sh to perform
this validation.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 29 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Thu Feb 19 11:15:16 UTC 2026: Code review monitoring started
Thu Feb 19 11:15:16 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 29

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 29
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 11:15:18 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud
Copy link

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/evaluate.sh (1)

1442-1477: ⚠️ Potential issue | 🟠 Major

Move _save_cleanup_scope before heartbeat spawn to follow standard cleanup pattern.

The heartbeat push_cleanup at line 1466 is registered before _save_cleanup_scope is called at line 1476. While the code includes explicit manual cleanup of both heartbeat and watchdog (lines 1514–1537) before early return paths, this ordering violates the established pattern throughout the codebase and creates a fragile structure. Any future code path that adds an early return between watchdog setup and manual cleanup would orphan the heartbeat cleanup, leaving the background update loop running indefinitely and masking stale recovery.

Move _save_cleanup_scope and trap to the beginning of the heartbeat section:

🔧 Recommended cleanup scope ordering
 	# t1251: Initial heartbeat — touch updated_at so Phase 0.7's grace period
 	# doesn't fire while evaluation is actively running.
 	db "$SUPERVISOR_DB" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || true

+	# Establish cleanup scope before spawning background helpers (heartbeat + watchdog).
+	_save_cleanup_scope
+	trap '_run_cleanups' RETURN
+
 	# t1254: Periodic heartbeat — refresh updated_at every 20s during the AI call.
 	# The one-shot heartbeat from t1251 only covers eval_timeout+30s (120s window).
 	# If the AI eval takes >120s (slow model, large log), Phase 0.7 triggers false
 	# recovery even though eval is actively running. A periodic heartbeat keeps
 	# updated_at fresh for the full duration of the eval, regardless of how long
 	# it takes. The loop self-terminates when the sentinel file is removed.
 	local _hb_sentinel
 	_hb_sentinel=$(mktemp 2>/dev/null || echo "")
 	local _hb_pid=""
 	if [[ -n "$_hb_sentinel" ]]; then
 		local _hb_task_id="$task_id"
 		local _hb_db="$SUPERVISOR_DB"
 		(
 			while [[ -f "$_hb_sentinel" ]]; do
 				sleep 20
 				[[ -f "$_hb_sentinel" ]] || break
 				db "$_hb_db" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$_hb_task_id")';" 2>/dev/null || true
 			done
 		) &
 		_hb_pid=$!
 		push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null || true; wait '${_hb_pid}' 2>/dev/null || true"
 	fi

 	# t1252: Watchdog timer — detect evaluation hangs within 60s instead of
 	# waiting for the full stale timeout (120s grace + eval_timeout).
 	# The watchdog fires a warning log at 60s so trends are visible in logs
 	# before the timeout kills the eval at eval_timeout seconds.
 	# Uses a background subshell that self-terminates when the eval completes.
 	# Cleanup scope ensures sentinel file and watchdog process are released on
 	# any exit path (normal return, early return, or unexpected signal).
-	_save_cleanup_scope
-	trap '_run_cleanups' RETURN
 	local watchdog_pid=""
 	local watchdog_sentinel
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1477, The
heartbeat's cleanup registration (push_cleanup inside the background heartbeat
block that uses _hb_sentinel and _hb_pid) is being added before the cleanup
scope is saved, which can orphan the heartbeat if an early return is added
later; move the call to _save_cleanup_scope and the trap '_run_cleanups' RETURN
so they occur before spawning the heartbeat subprocess and before any
push_cleanup calls (i.e., call _save_cleanup_scope and set trap prior to
creating _hb_sentinel / starting the background loop that registers
push_cleanup), ensuring all push_cleanup registrations for the heartbeat and
watchdog are captured in the saved scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1442-1477: The heartbeat's cleanup registration (push_cleanup
inside the background heartbeat block that uses _hb_sentinel and _hb_pid) is
being added before the cleanup scope is saved, which can orphan the heartbeat if
an early return is added later; move the call to _save_cleanup_scope and the
trap '_run_cleanups' RETURN so they occur before spawning the heartbeat
subprocess and before any push_cleanup calls (i.e., call _save_cleanup_scope and
set trap prior to creating _hb_sentinel / starting the background loop that
registers push_cleanup), ensuring all push_cleanup registrations for the
heartbeat and watchdog are captured in the saved scope.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant