t1252: add eval watchdog, duration metrics, and hang detection #1955
marcusquinn merged 2 commits into main
Conversation
Root cause verified (t1251 fixes confirmed): heartbeat + configurable timeout already reduce the stale-evaluating rate. This PR adds the remaining instrumentation:

1. eval_duration_secs column in tasks table — persists how long each AI eval took, enabling trend analysis and regression detection across pulse cycles.
2. Watchdog timer (SUPERVISOR_EVAL_WATCHDOG, default 60s) — a background subshell fires an EVAL_WATCHDOG_FIRED log line if the eval hasn't completed within 60s, giving early visibility into hangs before the full stale timeout fires. The watchdog is cancelled immediately when the eval completes (sentinel-file pattern).
3. Eval start/end timestamp logging — evaluate_with_ai() now logs the start time (ISO 8601) and the duration at completion: EVAL_DURATION task=$id duration_secs=N. The duration is also included in state_log audit entries and pattern tracker tags.
4. Eval duration in stale GC report — cmd_stale_gc_report shows a new 'Eval Duration Analysis' section when eval_duration_secs data is available, including avg/max/min and a count of evals exceeding the 60s watchdog threshold.
5. Eval duration in pattern tracker — record_evaluation_metadata() includes eval_duration_secs in the description and tags for cross-task trend analysis.

Expected benefit: eliminates ~35min of average wasted time per stale recovery by making evaluation hangs visible within 60s instead of after the full grace period.
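The sentinel-file watchdog described in item 2 can be sketched roughly as follows. Only SUPERVISOR_EVAL_WATCHDOG and the EVAL_WATCHDOG_FIRED log line come from the PR; the other variable names and the overall shape are illustrative, not the actual evaluate.sh code:

```shell
# Illustrative sketch of the sentinel-file watchdog pattern (not the PR's code).
task_id="${task_id:-demo}"
watchdog_delay="${SUPERVISOR_EVAL_WATCHDOG:-60}"
sentinel="$(mktemp)"

# Background watchdog: fires only if the sentinel still exists after the delay,
# i.e. the evaluation has not completed yet.
(
  sleep "$watchdog_delay"
  [ -e "$sentinel" ] && echo "EVAL_WATCHDOG_FIRED task=$task_id delay_secs=$watchdog_delay"
) >/dev/null 2>&1 &
watchdog_pid=$!

# ... the AI evaluation would run here ...

rm -f "$sentinel"                         # completion removes the sentinel...
kill "$watchdog_pid" 2>/dev/null || true  # ...and cancels the watchdog early
```

Because the watchdog only observes a file, cancellation is race-free: even if the kill loses the race with the sleep expiring, the missing sentinel keeps the false alarm from being logged.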
Summary of Changes (Gemini Code Assist)
This pull request introduces significant instrumentation to monitor and detect issues with AI evaluation processes. By adding a watchdog timer, recording evaluation durations, and integrating these metrics into logging and reporting, the system can now identify and respond to evaluation hangs much faster than before. This enhancement aims to improve the reliability and visibility of AI task processing, building on previous fixes that reduced the stale-evaluating rate by providing an early detection mechanism for prolonged evaluations.
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Feb 19 09:57:18 UTC 2026 by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces valuable instrumentation for monitoring the AI evaluation process, including a watchdog for hangs and duration metrics. The changes are well-implemented across the database, evaluation script, and reporting. My review includes one high-severity suggestion in evaluate.sh to add a trap for resource cleanup, ensuring the new watchdog mechanism is robust against unexpected script termination, in accordance with the repository's style guide.
Walkthrough
Adds eval_duration_secs to the tasks schema, instruments AI evaluations to measure and persist evaluation duration with a watchdog for long runs, and surfaces eval duration metrics in the stale GC report.
Sequence DiagramsequenceDiagram
participant Eval as Evaluate Script
participant WD as Watchdog
participant DB as Database
participant Report as Pulse Report
Eval->>Eval: record eval_start_epoch
Eval->>WD: spawn watchdog (configurable delay)
Eval->>Eval: run AI evaluation
WD-->>WD: monitor elapsed time
Eval->>Eval: compute eval_end_epoch & eval_duration_secs
Eval->>WD: cancel watchdog
Eval->>DB: persist eval_duration_secs to tasks
Eval->>Eval: emit EVAL_DURATION log/metric
Report->>DB: query recent eval_duration_secs
DB-->>Report: return duration rows
Report->>Report: compute stats (count, avg, max, min, threshold exceed)
Report->>Report: display Eval Duration Analysis
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 867-886: Reset the eval_duration_secs variable at the start of
evaluation to avoid carrying stale values into description/tags: explicitly set
eval_duration_secs="" (or unset) before the DB read and before building
description/record_args so only a freshly-populated value from the current AI
evaluation is included; update the code around the eval_duration_secs assignment
and the description/record_args construction (referencing the eval_duration_secs
variable and the block that builds description and record_args) to ensure it's
cleared on every evaluation start, including retry and skip-AI paths.
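The fix this comment asks for can be sketched as follows. The function name build_eval_tags is hypothetical; in the real evaluate.sh the reset would happen inline at the top of the evaluation path, including retry and skip-AI branches:

```shell
# Hypothetical sketch: reset eval_duration_secs at the start of every
# evaluation so a previous run's value never leaks into description/tags.
build_eval_tags() {
  local task_id="$1"
  local eval_duration_secs=""   # cleared on every evaluation start
  # ... the AI evaluation would set eval_duration_secs here on success ...
  local tags="task=$task_id"
  [ -n "$eval_duration_secs" ] && tags="$tags,eval_duration_secs=$eval_duration_secs"
  echo "$tags"
}

build_eval_tags t42   # → task=t42 (no stale duration appended)
```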
In @.agents/scripts/supervisor/pulse.sh:
- Around line 425-447: The report currently hardcodes the watchdog threshold
(60s) in the SQL CASE expression inside the eval duration block; read the
SUPERVISOR_EVAL_WATCHDOG env var (with a sensible default like 60) into a shell
variable (e.g., eval_watchdog) after has_eval_duration_data is computed,
validate/ensure it is numeric, and substitute that variable into the SQL
(replace the literal 60 in the CASE WHEN eval_duration_secs >= 60 THEN 1 ELSE 0
END with the shell variable) so the count uses the configured
SUPERVISOR_EVAL_WATCHDOG value; keep variable names like has_eval_duration_data
and eval_watchdog to locate the relevant block.
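A minimal version of the suggested parameterization might look like this. The variable name eval_watchdog follows the review comment; the SQL fragment is illustrative rather than the exact query in pulse.sh:

```shell
# Sketch: read the configured watchdog threshold, validate it is numeric,
# and interpolate it into the CASE expression instead of a hardcoded 60.
eval_watchdog="${SUPERVISOR_EVAL_WATCHDOG:-60}"
case "$eval_watchdog" in
  ''|*[!0-9]*) eval_watchdog=60 ;;   # non-numeric input falls back to default
esac
sql="SELECT SUM(CASE WHEN eval_duration_secs >= $eval_watchdog THEN 1 ELSE 0 END) FROM tasks"
echo "$sql"
```

Validating before interpolation matters here: the value comes from the environment, and an empty or non-numeric string would otherwise produce a syntactically invalid query.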
Add trap-based cleanup for watchdog sentinel file and background process in evaluate_with_ai(). Uses the established _save_cleanup_scope + push_cleanup pattern (per styleguide: temp files must have trap cleanup). The trap fires on any exit path — normal return, early return, or signal — ensuring no sentinel file or watchdog process leaks.
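A minimal version of that trap pattern, using bash's RETURN trap (the repo's _save_cleanup_scope/push_cleanup helpers are not reproduced here; everything below is a sketch, not the merged code):

```shell
# Sketch only: a RETURN trap ensures the sentinel file and watchdog process
# are cleaned up on every exit path from the function.
evaluate_with_ai() {
  local watchdog_sentinel watchdog_pid=""
  watchdog_sentinel="$(mktemp)"
  # Fires on normal return, early return, or error exit from the function.
  trap 'rm -f "$watchdog_sentinel"; kill "$watchdog_pid" 2>/dev/null || true' RETURN
  ( sleep 60 ) >/dev/null 2>&1 &   # stand-in for the real watchdog subshell
  watchdog_pid=$!
  # ... evaluation work; any return below still triggers the trap ...
  echo "$watchdog_sentinel"
  return 0
}
```

After the function returns, the sentinel path it printed no longer exists: the trap removed it and killed the background watchdog.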
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1459-1475: The watchdog sleep may be longer-or-equal to
eval_timeout, causing the subshell to never fire EVAL_WATCHDOG_FIRED; before
spawning the subshell that uses SUPERVISOR_EVAL_WATCHDOG, compare the resolved
watchdog_delay (SUPERVISOR_EVAL_WATCHDOG or default 60) with eval_timeout and
log a warning (e.g. via log_warn) if watchdog_delay >= eval_timeout so
misconfiguration is visible, then only spawn the subshell when watchdog_delay <
eval_timeout; inside the subshell drop the unnecessary local and reference
SUPERVISOR_EVAL_WATCHDOG directly in sleep, keeping the existing push_cleanup
usage for watchdog_sentinel and watchdog_pid to preserve cleanup ordering.
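The guard this comment describes can be sketched as a small helper (spawn_watchdog is a hypothetical name; the real code is inline in evaluate_with_ai and also uses the repo's push_cleanup helpers, omitted here):

```shell
# Sketch: only spawn the watchdog if it can fire before the hard eval timeout;
# otherwise warn so the misconfiguration is visible instead of silently inert.
spawn_watchdog() {
  local watchdog_delay="${SUPERVISOR_EVAL_WATCHDOG:-60}"
  local eval_timeout="${1:?eval timeout required}"
  if [ "$watchdog_delay" -ge "$eval_timeout" ]; then
    echo "WARN: watchdog delay ${watchdog_delay}s >= eval timeout ${eval_timeout}s; watchdog disabled" >&2
    return 1
  fi
  ( sleep "$watchdog_delay" && echo "EVAL_WATCHDOG_FIRED" ) >/dev/null 2>&1 &
  echo "$!"   # caller keeps the PID so it can cancel on completion
}
```

Without the guard, a watchdog delay at or above the timeout means the subshell is always killed before its sleep expires, so EVAL_WATCHDOG_FIRED can never appear and the feature silently does nothing.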
- Around line 1502-1512: The subtraction storing eval_duration_secs can produce
negative or zero values when clocks step backwards; before writing to the DB or
emitting the structured metric, validate eval_duration_secs (computed from
eval_end_epoch - eval_start_epoch) and only perform db "UPDATE ..." and log_info
"EVAL_DURATION ..." when eval_duration_secs is greater than zero; otherwise emit
a warning via log_info/log_warn with context (task_id, eval_start_epoch,
eval_end_epoch, eval_timeout, eval_model) and skip the DB write/metric emission
to avoid corrupting trend data. Ensure you reference the existing variables and
helpers (eval_duration_secs, eval_end_epoch, eval_start_epoch, db,
SUPERVISOR_DB, log_info, eval_timeout, eval_model, task_id) so the guard is
applied in the same block that currently performs the DB update and metric log.
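The non-positive-duration guard can be sketched like this (record_eval_duration is a hypothetical helper; the real code performs the db UPDATE and log_info call inline in the same block):

```shell
# Sketch: validate the computed duration before persisting it, so a clock
# stepping backwards cannot write negative values into the trend data.
record_eval_duration() {
  local task_id="$1" eval_start_epoch="$2" eval_end_epoch="$3"
  local eval_duration_secs=$(( eval_end_epoch - eval_start_epoch ))
  if [ "$eval_duration_secs" -le 0 ]; then
    echo "WARN: non-positive eval duration task=$task_id start=$eval_start_epoch end=$eval_end_epoch" >&2
    return 1   # skip the DB write and metric emission entirely
  fi
  # The real code would run the db "UPDATE ..." here before emitting the metric.
  echo "EVAL_DURATION task=$task_id duration_secs=$eval_duration_secs"
}

record_eval_duration t7 100 160   # → EVAL_DURATION task=t7 duration_secs=60
```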
---
Duplicate comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 867-871: Clear the stale eval_duration_secs DB column at the start
of each evaluation to avoid leaking a previous AI run's duration into
description/tags/pattern records: before the existing read of eval_duration_secs
(the block referencing SUPERVISOR_DB, eval_duration_secs and task_id and using
db/sql_escape), execute an update that sets eval_duration_secs = NULL for this
task_id via the db helper so subsequent logic reads a fresh value; keep the same
db/sql_escape usage and only perform this NULL-reset when SUPERVISOR_DB is set.
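The NULL-reset can be sketched with stubbed helpers. Both db and sql_escape below are stand-ins for the repo's real wrappers (assumptions for this sketch only); the point is the guard on SUPERVISOR_DB and the reset before any read:

```shell
# Stubs standing in for the repo's real helpers (assumptions for this sketch):
db() { echo "SQL: $1"; }                             # real helper runs sqlite3
sql_escape() { printf '%s' "$1" | sed "s/'/''/g"; }  # minimal quote doubling

task_id="t42"
# NULL-reset only when a DB is configured, so later reads see a fresh value
# rather than the previous AI run's duration.
if [ -n "${SUPERVISOR_DB:-}" ]; then
  db "UPDATE tasks SET eval_duration_secs = NULL WHERE id = '$(sql_escape "$task_id")'"
fi
```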
Auto-dismissed: bot review does not block autonomous pipeline



Summary
Root Cause Context
t1251 already fixed the primary causes of 73% stale-evaluating rate (heartbeat, configurable timeout, PR fast-path). This PR adds the instrumentation layer requested in t1252 so evaluation hangs are visible within 60s instead of waiting for the full stale grace period.
Changes
evaluate.sh: watchdog timer (background subshell + sentinel file), eval start/end timestamps, duration logging and DB persistence
database.sh: eval_duration_secs column in tasks schema + migration
pulse.sh: eval duration analysis section in cmd_stale_gc_report

Ref #1954