t1254: Fix stale-evaluating periodic heartbeat — eliminate false recoveries for long evals#1958
t1254: Fix stale-evaluating periodic heartbeat — eliminate false recoveries for long evals#1958marcusquinn merged 2 commits intomainfrom
Conversation
…eries (t1254) Root cause: t1251 added a one-shot updated_at touch at eval start, but if the AI eval takes longer than heartbeat_window (eval_timeout+30s = 120s), Phase 0.7 still triggers false recovery — interrupting a live eval and causing re-queuing. Fixes: - Periodic heartbeat loop (20s interval) runs alongside the AI call in a background subshell, keeping updated_at fresh for the full eval duration - Wider heartbeat_window in _diagnose_stale_root_cause: eval_timeout*2+60 (240s default) instead of eval_timeout+30 (120s) — covers slow models - Final updated_at touch after AI call completes, before teardown, to cover the post-watchdog race window (Phase 0.7 could fire during teardown) - Heartbeat sentinel/PID cleanup via push_cleanup for safety on early exit Expected impact: eliminates remaining ~35min/task stale recovery overhead for evals that exceed 120s (slow models, large log tails)
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where long-running AI evaluations were prematurely marked as 'stale-evaluating' and recovered, leading to duplicate executions and wasted resources. The changes introduce a more robust heartbeat mechanism to accurately reflect the active status of evaluations, significantly reducing false recovery incidents and improving the efficiency of the evaluation pipeline. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughAdds a periodic heartbeat (every 20s) during AI evaluation that updates Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Eval as "evaluate.sh"
participant HB as "heartbeat (background)"
participant FS as "Filesystem (sentinel / updated_at)"
participant Pulse as "pulse.sh (watchdog)"
Eval->>FS: create sentinel & start HB (bg)
Eval->>HB: spawn background loop (touch updated_at every 20s)
loop every 20s
HB->>FS: touch updated_at
end
Pulse->>FS: read updated_at
Pulse->>Pulse: compare with window = eval_timeout_cfg * 2 + 60
alt within window
Pulse-->>Pulse: consider evaluation active (no recovery)
else outside window
Pulse-->>Pulse: flag stale / trigger recovery
end
Eval->>FS: remove sentinel & stop HB (on finish/timeout)
Eval->>FS: final touch updated_at (post-eval teardown)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Thu Feb 19 11:04:00 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Code Review
This pull request introduces a periodic heartbeat mechanism to prevent false recoveries for long-running AI evaluations, which is a solid improvement. The implementation is mostly robust, following existing patterns in the codebase. I've identified a potential bug in the cleanup trap that could cause issues in error scenarios, and suggested a couple of minor refactorings to improve code clarity and maintainability. These suggestions align with established repository patterns for robust cleanup. Overall, the changes are well-aligned with the PR's goal.
| done | ||
| ) & | ||
| _hb_pid=$! | ||
| push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true" |
There was a problem hiding this comment.
The kill command in the push_cleanup string is missing an || true guard. If the script is running with set -e (which it is), and the process with PID _hb_pid has already exited when the trap runs, the kill command will fail with a non-zero exit code. This would cause the trap to terminate prematurely, and the subsequent wait command would not be executed, potentially leaving a zombie process.
The manual cleanup block on line 1520 correctly uses || true, and this trap should be just as robust.
| push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true" | |
| push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null || true; wait '${_hb_pid}' 2>/dev/null || true" |
References
- For resource cleanup in shell scripts, use the established project pattern: use
_save_cleanup_scope,trap '_run_cleanups' RETURN, andpush_cleanupfor robust cleanup on any exit path, and also include explicit manual cleanup at the end of the normal execution path as a 'fast-path'.
There was a problem hiding this comment.
Fixed in bfb87fb. Added || true to the kill call in the push_cleanup string so a process that has already exited does not cause the trap to terminate prematurely before the wait runs.
| if [[ -n "$_hb_sentinel" ]]; then | ||
| rm -f "$_hb_sentinel" 2>/dev/null || true | ||
| fi | ||
| if [[ -n "$_hb_pid" ]]; then | ||
| kill "$_hb_pid" 2>/dev/null || true | ||
| wait "$_hb_pid" 2>/dev/null || true | ||
| fi |
There was a problem hiding this comment.
The manual cleanup logic can be made more concise. Since _hb_pid is only set when _hb_sentinel is successfully created and the background process is launched, checking for _hb_pid is sufficient to guard all cleanup operations for the heartbeat process.
| if [[ -n "$_hb_sentinel" ]]; then | |
| rm -f "$_hb_sentinel" 2>/dev/null || true | |
| fi | |
| if [[ -n "$_hb_pid" ]]; then | |
| kill "$_hb_pid" 2>/dev/null || true | |
| wait "$_hb_pid" 2>/dev/null || true | |
| fi | |
| if [[ -n "$_hb_pid" ]]; then | |
| rm -f "$_hb_sentinel" 2>/dev/null || true | |
| kill "$_hb_pid" 2>/dev/null || true | |
| wait "$_hb_pid" 2>/dev/null || true | |
| fi |
References
- For resource cleanup in shell scripts, use the established project pattern: use
_save_cleanup_scope,trap '_run_cleanups' RETURN, andpush_cleanupfor robust cleanup on any exit path, and also include explicit manual cleanup at the end of the normal execution path as a 'fast-path'.
There was a problem hiding this comment.
Fixed in bfb87fb. Consolidated the manual cleanup block to check only $_hb_pid (which is only set when $_hb_sentinel was successfully created), removing the redundant $_hb_sentinel guard as suggested.
.agents/scripts/supervisor/pulse.sh
Outdated
| # heartbeat_window = eval_timeout * 2 + 60: covers slow models that exceed | ||
| # eval_timeout (e.g., 90s timeout but model responds at 150s), plus 60s | ||
| # buffer for teardown. The periodic heartbeat keeps updated_at fresh, so | ||
| # this window only needs to cover the 20s heartbeat interval + teardown. |
There was a problem hiding this comment.
The comment here presents two different justifications for the heartbeat_window size, which can be confusing. One part suggests a large window is needed to cover long-running evaluations, while the other says only a small window is needed because of the new periodic heartbeat.
To improve clarity, I suggest rephrasing the comment to explain that the large window is a defense-in-depth measure in case the heartbeat process fails, while the periodic heartbeat is the primary mechanism that makes the window less critical under normal operation.
There was a problem hiding this comment.
Fixed in bfb87fb. Rewrote the comment to give a single clear justification: the wide window is defense-in-depth for the case where the heartbeat subshell itself is unexpectedly killed or stalled, rather than presenting two conflicting rationales.
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/evaluate.sh (1)
1442-1478:⚠️ Potential issue | 🔴 CriticalHeartbeat cleanup is dropped on early exit due to cleanup scope ordering.
The heartbeat cleanup is registered via
push_cleanupat line 1466, then_save_cleanup_scopeis called at line 1476, which moves the heartbeat cleanup to the save stack and clears the current scope. The watchdog cleanup is then registered into the now-empty scope at lines 1483 and 1496.If the eval exits early (error handling, signal, early return), the RETURN trap fires and executes only the current
_CLEANUP_CMDS(watchdog cleanup). The heartbeat cleanup remains in_CLEANUP_SAVE_STACKorphaned—_restore_cleanup_scopeis never called to restore it. The heartbeat background loop continues indefinitely, refreshingupdated_atand masking stale evals.Move
_save_cleanup_scopeandtrapabove the heartbeat registration block, or re-register heartbeat cleanup after the scope is saved.✅ Corrected ordering
# t1251: Initial heartbeat — touch updated_at so Phase 0.7's grace period # doesn't fire while evaluation is actively running. db "$SUPERVISOR_DB" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || true + # t1252: Watchdog timer — detect evaluation hangs within 60s instead of + # waiting for the full stale timeout (120s grace + eval_timeout). + # The watchdog fires a warning log at 60s so trends are visible in logs + # before the timeout kills the eval at eval_timeout seconds. + # Uses a background subshell that self-terminates when the eval completes. + # Cleanup scope ensures sentinel file and watchdog process are released on + # any exit path (normal return, early return, or unexpected signal). + _save_cleanup_scope + trap '_run_cleanups' RETURN + # t1254: Periodic heartbeat — refresh updated_at every 20s during the AI call. # The one-shot heartbeat from t1251 only covers eval_timeout+30s (120s window). # If the AI eval takes >120s (slow model, large log), Phase 0.7 triggers false # recovery even though eval is actively running. A periodic heartbeat keeps # updated_at fresh for the full duration of the eval, regardless of how long # it takes. The loop self-terminates when the sentinel file is removed. local _hb_sentinel _hb_sentinel=$(mktemp 2>/dev/null || echo "") local _hb_pid="" if [[ -n "$_hb_sentinel" ]]; then local _hb_task_id="$task_id" local _hb_db="$SUPERVISOR_DB" ( while [[ -f "$_hb_sentinel" ]]; do sleep 20 [[ -f "$_hb_sentinel" ]] || break db "$_hb_db" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$_hb_task_id")';" 2>/dev/null || true done ) & _hb_pid=$! push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null; wait '${_hb_pid}' 2>/dev/null || true" fi - - # t1252: Watchdog timer — detect evaluation hangs within 60s instead of - # waiting for the full stale timeout (120s grace + eval_timeout). - # The watchdog fires a warning log at 60s so trends are visible in logs - # before the timeout kills the eval at eval_timeout seconds. - # Uses a background subshell that self-terminates when the eval completes. - # Cleanup scope ensures sentinel file and watchdog process are released on - # any exit path (normal return, early return, or unexpected signal). - _save_cleanup_scope - trap '_run_cleanups' RETURN local watchdog_pid=""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1478, The heartbeat cleanup pushed with push_cleanup (the rm/kill/wait for _hb_sentinel/_hb_pid) is saved into the saved cleanup stack because _save_cleanup_scope is called after the heartbeat is registered, so an early RETURN only runs the current _CLEANUP_CMDS (watchdog) and leaves the heartbeat running; fix by moving the _save_cleanup_scope and trap '_run_cleanups' RETURN calls to before the heartbeat creation block so both heartbeat and watchdog cleanups live in the same active scope, or alternatively call push_cleanup for the heartbeat after _save_cleanup_scope so the heartbeat cleanup is in the current scope that _run_cleanups executes (reference symbols: push_cleanup, _save_cleanup_scope, trap '_run_cleanups', _CLEANUP_CMDS, _CLEANUP_SAVE_STACK).
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)
195-203: Optional guard for non-numericSUPERVISOR_EVAL_TIMEOUT.Line 202 relies on arithmetic expansion; if the env var is accidentally set to a non-integer (e.g.,
90s), bash emits an arithmetic error and the heartbeat window can degrade. A small numeric sanitize keeps the stale-eval guardrail robust.🛠️ Suggested hardening
- local eval_timeout_cfg="${SUPERVISOR_EVAL_TIMEOUT:-90}" - local heartbeat_window=$((eval_timeout_cfg * 2 + 60)) + local eval_timeout_cfg="${SUPERVISOR_EVAL_TIMEOUT:-90}" + if ! [[ "$eval_timeout_cfg" =~ ^[0-9]+$ ]]; then + log_warn "SUPERVISOR_EVAL_TIMEOUT is non-numeric ('$eval_timeout_cfg'); defaulting to 90" + eval_timeout_cfg=90 + fi + local heartbeat_window=$((eval_timeout_cfg * 2 + 60))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 195 - 203, The arithmetic for heartbeat_window can fail if SUPERVISOR_EVAL_TIMEOUT contains non-digits; sanitize and default the value before computing heartbeat_window: read SUPERVISOR_EVAL_TIMEOUT into eval_timeout_cfg, strip non-numeric characters (or extract digits), and if the result is empty or zero fall back to the default 90, then compute heartbeat_window using that validated integer; update the logic around the existing eval_timeout_cfg and heartbeat_window variables in pulse.sh to perform this validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1442-1478: The heartbeat cleanup pushed with push_cleanup (the
rm/kill/wait for _hb_sentinel/_hb_pid) is saved into the saved cleanup stack
because _save_cleanup_scope is called after the heartbeat is registered, so an
early RETURN only runs the current _CLEANUP_CMDS (watchdog) and leaves the
heartbeat running; fix by moving the _save_cleanup_scope and trap
'_run_cleanups' RETURN calls to before the heartbeat creation block so both
heartbeat and watchdog cleanups live in the same active scope, or alternatively
call push_cleanup for the heartbeat after _save_cleanup_scope so the heartbeat
cleanup is in the current scope that _run_cleanups executes (reference symbols:
push_cleanup, _save_cleanup_scope, trap '_run_cleanups', _CLEANUP_CMDS,
_CLEANUP_SAVE_STACK).
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 195-203: The arithmetic for heartbeat_window can fail if
SUPERVISOR_EVAL_TIMEOUT contains non-digits; sanitize and default the value
before computing heartbeat_window: read SUPERVISOR_EVAL_TIMEOUT into
eval_timeout_cfg, strip non-numeric characters (or extract digits), and if the
result is empty or zero fall back to the default 90, then compute
heartbeat_window using that validated integer; update the logic around the
existing eval_timeout_cfg and heartbeat_window variables in pulse.sh to perform
this validation.
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Thu Feb 19 11:15:18 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/scripts/supervisor/evaluate.sh (1)
1442-1477:⚠️ Potential issue | 🟠 MajorMove
_save_cleanup_scopebefore heartbeat spawn to follow standard cleanup pattern.The heartbeat
push_cleanupat line 1466 is registered before_save_cleanup_scopeis called at line 1476. While the code includes explicit manual cleanup of both heartbeat and watchdog (lines 1514–1537) before early return paths, this ordering violates the established pattern throughout the codebase and creates a fragile structure. Any future code path that adds an early return between watchdog setup and manual cleanup would orphan the heartbeat cleanup, leaving the background update loop running indefinitely and masking stale recovery.Move
_save_cleanup_scopeandtrapto the beginning of the heartbeat section:🔧 Recommended cleanup scope ordering
# t1251: Initial heartbeat — touch updated_at so Phase 0.7's grace period # doesn't fire while evaluation is actively running. db "$SUPERVISOR_DB" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$task_id")';" 2>/dev/null || true + # Establish cleanup scope before spawning background helpers (heartbeat + watchdog). + _save_cleanup_scope + trap '_run_cleanups' RETURN + # t1254: Periodic heartbeat — refresh updated_at every 20s during the AI call. # The one-shot heartbeat from t1251 only covers eval_timeout+30s (120s window). # If the AI eval takes >120s (slow model, large log), Phase 0.7 triggers false # recovery even though eval is actively running. A periodic heartbeat keeps # updated_at fresh for the full duration of the eval, regardless of how long # it takes. The loop self-terminates when the sentinel file is removed. local _hb_sentinel _hb_sentinel=$(mktemp 2>/dev/null || echo "") local _hb_pid="" if [[ -n "$_hb_sentinel" ]]; then local _hb_task_id="$task_id" local _hb_db="$SUPERVISOR_DB" ( while [[ -f "$_hb_sentinel" ]]; do sleep 20 [[ -f "$_hb_sentinel" ]] || break db "$_hb_db" "UPDATE tasks SET updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now') WHERE id = '$(sql_escape "$_hb_task_id")';" 2>/dev/null || true done ) & _hb_pid=$! push_cleanup "rm -f '${_hb_sentinel}' 2>/dev/null || true; kill '${_hb_pid}' 2>/dev/null || true; wait '${_hb_pid}' 2>/dev/null || true" fi # t1252: Watchdog timer — detect evaluation hangs within 60s instead of # waiting for the full stale timeout (120s grace + eval_timeout). # The watchdog fires a warning log at 60s so trends are visible in logs # before the timeout kills the eval at eval_timeout seconds. # Uses a background subshell that self-terminates when the eval completes. # Cleanup scope ensures sentinel file and watchdog process are released on # any exit path (normal return, early return, or unexpected signal). - _save_cleanup_scope - trap '_run_cleanups' RETURN local watchdog_pid="" local watchdog_sentinel🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/evaluate.sh around lines 1442 - 1477, The heartbeat's cleanup registration (push_cleanup inside the background heartbeat block that uses _hb_sentinel and _hb_pid) is being added before the cleanup scope is saved, which can orphan the heartbeat if an early return is added later; move the call to _save_cleanup_scope and the trap '_run_cleanups' RETURN so they occur before spawning the heartbeat subprocess and before any push_cleanup calls (i.e., call _save_cleanup_scope and set trap prior to creating _hb_sentinel / starting the background loop that registers push_cleanup), ensuring all push_cleanup registrations for the heartbeat and watchdog are captured in the saved scope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1442-1477: The heartbeat's cleanup registration (push_cleanup
inside the background heartbeat block that uses _hb_sentinel and _hb_pid) is
being added before the cleanup scope is saved, which can orphan the heartbeat if
an early return is added later; move the call to _save_cleanup_scope and the
trap '_run_cleanups' RETURN so they occur before spawning the heartbeat
subprocess and before any push_cleanup calls (i.e., call _save_cleanup_scope and
set trap prior to creating _hb_sentinel / starting the background loop that
registers push_cleanup), ensuring all push_cleanup registrations for the
heartbeat and watchdog are captured in the saved scope.



Fixes the remaining stale-evaluating false recovery gap left after t1251 (PR #1952).
Root Cause
t1251 added a one-shot
updated_atheartbeat at eval start, but if the AI eval takes longer thanheartbeat_window(eval_timeout+30s = 120s default), Phase 0.7 still triggers false recovery — interrupting a live eval and causing unnecessary re-queuing.Failure scenario before this fix:
Fixes
updated_atfresh for the full eval duration regardless of how long it takes_diagnose_stale_root_cause:eval_timeout*2+60(240s default) instead ofeval_timeout+30(120s) — covers slow models that exceed eval_timeoutExpected Impact
Eliminates remaining ~35min/task stale recovery overhead for evals that exceed 120s. Combined with t1251's fixes, this should reduce the stale-evaluating rate from 73% toward near-zero.
Files Changed
.agents/scripts/supervisor/evaluate.sh: periodic heartbeat loop + final touch.agents/scripts/supervisor/pulse.sh: wider heartbeat_window formulaSummary by CodeRabbit