

t1252: add eval watchdog, duration metrics, and hang detection#1955

Merged
marcusquinn merged 2 commits into main from feature/t1252
Feb 19, 2026

Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 19, 2026

Summary

  • Add 60s watchdog timer to detect evaluation hangs early (SUPERVISOR_EVAL_WATCHDOG env var)
  • Add eval_duration_secs column to tasks table for trend analysis
  • Log EVAL_DURATION and EVAL_WATCHDOG_FIRED structured metrics to supervisor log
  • Include eval duration in stale GC report, pattern tracker tags, and state_log audit entries

Root Cause Context

t1251 already fixed the primary causes of 73% stale-evaluating rate (heartbeat, configurable timeout, PR fast-path). This PR adds the instrumentation layer requested in t1252 so evaluation hangs are visible within 60s instead of waiting for the full stale grace period.

Changes

  • evaluate.sh: watchdog timer (background subshell + sentinel file), eval start/end timestamps, duration logging and DB persistence
  • database.sh: eval_duration_secs column in tasks schema + migration
  • pulse.sh: eval duration analysis section in cmd_stale_gc_report

Ref #1954

Summary by CodeRabbit

  • New Features
    • AI evaluation duration is now recorded and persisted for each evaluation.
    • Watchdog detects and logs evaluations that exceed a configurable threshold (default 60s).
    • Evaluation duration is included in logs, metrics, verdicts, and tracking records for better traceability.
    • New Eval Duration Analysis report shows count, average, min/max durations and threshold violations within the selected window.

Root cause verified (t1251 fixes confirmed): heartbeat + configurable timeout
already reduce stale-evaluating rate. This PR adds the remaining instrumentation:

1. eval_duration_secs column in tasks table — persists how long each AI eval
   took, enabling trend analysis and regression detection across pulse cycles.
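
The column addition in item 1 amounts to a conditional migration, roughly as
follows (a sketch assuming sqlite3 and a SUPERVISOR_DB path; the real
database.sh helpers are not shown here):

```shell
#!/usr/bin/env sh
# Sketch: conditional eval_duration_secs migration for the tasks table.
# Assumes sqlite3; the actual database.sh helper functions differ.
SUPERVISOR_DB="${SUPERVISOR_DB:-$(mktemp)}"

sqlite3 "$SUPERVISOR_DB" "CREATE TABLE IF NOT EXISTS tasks (id TEXT PRIMARY KEY);"

# ALTER TABLE fails if the column already exists, so check first.
if ! sqlite3 "$SUPERVISOR_DB" "PRAGMA table_info(tasks);" | grep -q eval_duration_secs; then
  sqlite3 "$SUPERVISOR_DB" "ALTER TABLE tasks ADD COLUMN eval_duration_secs INTEGER;"
fi
```

Running this twice is safe: the second run finds the column via
PRAGMA table_info and skips the ALTER.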

2. Watchdog timer (SUPERVISOR_EVAL_WATCHDOG, default 60s) — a background subshell
   fires an EVAL_WATCHDOG_FIRED log line if the eval hasn't completed within the
   configured threshold, giving early visibility into hangs before the full stale
   timeout fires. The watchdog is cancelled immediately when the eval completes
   (sentinel file pattern).
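
The watchdog in item 2 could be sketched like this (illustrative names; the
demo shortens the delay from the real 60s default so the example completes
quickly):

```shell
#!/usr/bin/env sh
# Sketch: watchdog via background subshell + sentinel file. The sentinel is
# the cancellation signal: the watchdog only logs if the sentinel still
# exists when its sleep expires. Real default is 60s; shortened here so the
# example finishes fast.
WATCHDOG_DELAY="${SUPERVISOR_EVAL_WATCHDOG:-2}"
watchdog_sentinel=$(mktemp)

(
  sleep "$WATCHDOG_DELAY"
  if [ -f "$watchdog_sentinel" ]; then
    echo "EVAL_WATCHDOG_FIRED threshold_secs=$WATCHDOG_DELAY" >&2
  fi
) &
watchdog_pid=$!

sleep 1  # stand-in for the AI evaluation

# Eval finished: remove the sentinel, then reap the watchdog.
rm -f "$watchdog_sentinel"
kill "$watchdog_pid" 2>/dev/null || true
wait "$watchdog_pid" 2>/dev/null || true
```

The sentinel-before-kill ordering matters: even if the kill races with the
sleep expiring, the missing sentinel prevents a spurious fired log.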

3. Eval start/end timestamp logging — evaluate_with_ai() now logs start time
   (ISO 8601) and duration at completion: EVAL_DURATION task=$id duration_secs=N.
   Duration is also included in state_log audit entries and pattern tracker tags.

4. Eval duration in stale GC report — cmd_stale_gc_report shows a new
   'Eval Duration Analysis' section when eval_duration_secs data is available,
   including avg/max/min and count of evals exceeding the 60s watchdog threshold.
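
A sketch of what that analysis query could look like, with the watchdog
threshold read from SUPERVISOR_EVAL_WATCHDOG rather than hardcoded (assumes
sqlite3; the sample rows are illustrative):

```shell
#!/usr/bin/env sh
# Sketch: 'Eval Duration Analysis' stats, threshold taken from the
# SUPERVISOR_EVAL_WATCHDOG env var instead of a literal 60. Sample data only.
SUPERVISOR_DB=$(mktemp)
sqlite3 "$SUPERVISOR_DB" "CREATE TABLE tasks (id TEXT, eval_duration_secs INTEGER);
INSERT INTO tasks VALUES ('t1', 12), ('t2', 95), ('t3', NULL);"

eval_watchdog="${SUPERVISOR_EVAL_WATCHDOG:-60}"
case "$eval_watchdog" in (*[!0-9]*|'') eval_watchdog=60 ;; esac  # numeric guard

stats=$(sqlite3 "$SUPERVISOR_DB" "
  SELECT COUNT(*) || ' evals, avg ' || ROUND(AVG(eval_duration_secs), 1) ||
         's, max ' || MAX(eval_duration_secs) || 's, min ' || MIN(eval_duration_secs) ||
         's, ' || SUM(CASE WHEN eval_duration_secs >= $eval_watchdog THEN 1 ELSE 0 END) ||
         ' over ${eval_watchdog}s'
  FROM tasks
  WHERE eval_duration_secs IS NOT NULL;")
echo "$stats"
```

The numeric guard before the SQL substitution keeps a malformed env value
from breaking (or injecting into) the query.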

5. Eval duration in pattern tracker — record_evaluation_metadata() includes
   eval_duration_secs in description and tags for cross-task trend analysis.

Expected benefit: eliminates ~35min avg wasted per stale recovery by making
evaluation hangs visible within 60s instead of waiting for the full grace period.
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant instrumentation to monitor and detect issues with AI evaluation processes. By adding a watchdog timer, recording evaluation durations, and integrating these metrics into logging and reporting, the system can now identify and respond to evaluation hangs much faster than before. This enhancement aims to improve the reliability and visibility of AI task processing, building upon previous fixes to reduce stale-evaluating rates by providing an early detection mechanism for prolonged evaluations.

Highlights

  • Evaluation Watchdog: A 60-second watchdog timer has been added to detect AI evaluation hangs early, using the SUPERVISOR_EVAL_WATCHDOG environment variable for configuration.
  • Evaluation Duration Metrics: A new eval_duration_secs column was added to the tasks table to record the duration of AI evaluation steps, enabling trend analysis.
  • Structured Logging: Structured metrics EVAL_DURATION and EVAL_WATCHDOG_FIRED are now logged to the supervisor log for better observability and trend analysis.
  • Reporting and Auditing Integration: Evaluation duration is now included in the stale GC report, pattern tracker tags, and state_log audit entries for comprehensive tracking.
Changelog
  • .agents/scripts/supervisor/database.sh
    • Added a migration to include the eval_duration_secs column in the tasks table.
    • Updated the tasks table schema to define the new eval_duration_secs column.
  • .agents/scripts/supervisor/evaluate.sh
    • Modified record_evaluation_metadata to fetch and include eval_duration_secs in pattern tracker tags and descriptions.
    • Implemented a watchdog timer using a background subshell and sentinel file to detect long-running evaluations.
    • Recorded evaluation start and end timestamps to calculate eval_duration_secs.
    • Persisted eval_duration_secs to the tasks table in the database.
    • Logged EVAL_DURATION and EVAL_WATCHDOG_FIRED structured metrics.
    • Updated log_info and db calls to include evaluation duration in state log audit entries and general logs.
  • .agents/scripts/supervisor/pulse.sh
    • Added a new 'Eval Duration Analysis' section to the cmd_stale_gc_report command, displaying statistics on evaluation times.
Activity
  • No human activity has occurred on this pull request yet.


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 09:57:15 UTC 2026: Code review monitoring started
Thu Feb 19 09:57:16 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 09:57:18 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces valuable instrumentation for monitoring the AI evaluation process, including a watchdog for hangs and duration metrics. The changes are well-implemented across the database, evaluation script, and reporting. My review includes one high-severity suggestion in evaluate.sh to add a trap for resource cleanup, ensuring the new watchdog mechanism is robust against unexpected script termination, in accordance with the repository's style guide.

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Walkthrough

Adds eval_duration_secs to the tasks schema, instruments AI evaluations to measure and persist evaluation duration with a watchdog for long runs, and surfaces eval duration metrics in the stale GC report.

Changes

  • Database Schema & Migration — .agents/scripts/supervisor/database.sh
    Adds the eval_duration_secs column to the tasks creation SQL plus a startup migration that conditionally adds the column if missing.
  • Evaluation Timing & Monitoring — .agents/scripts/supervisor/evaluate.sh
    Records eval_start_epoch/eval_end_epoch, computes eval_duration_secs, persists it to tasks, emits EVAL_DURATION logs/metrics, and adds a configurable watchdog background process that logs and records when runs exceed the threshold.
  • Analytics / Reporting — .agents/scripts/supervisor/pulse.sh
    Adds an "Eval Duration Analysis" block to the stale GC report that queries recent eval_duration_secs and prints count, avg, max, min, and watchdog-threshold-exceeded counts when data exists.

Sequence Diagram

sequenceDiagram
    participant Eval as Evaluate Script
    participant WD as Watchdog
    participant DB as Database
    participant Report as Pulse Report

    Eval->>Eval: record eval_start_epoch
    Eval->>WD: spawn watchdog (configurable delay)
    Eval->>Eval: run AI evaluation
    WD-->>WD: monitor elapsed time
    Eval->>Eval: compute eval_end_epoch & eval_duration_secs
    Eval->>WD: cancel watchdog
    Eval->>DB: persist eval_duration_secs to tasks
    Eval->>Eval: emit EVAL_DURATION log/metric
    Report->>DB: query recent eval_duration_secs
    DB-->>Report: return duration rows
    Report->>Report: compute stats (count, avg, max, min, threshold exceed)
    Report->>Report: display Eval Duration Analysis

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

⏱️ A tiny timestamp set at start,
A watchdog listens, plays its part,
Durations logged and stored with care,
Metrics bloom inside the GC's glare,
Small ticks, big insight — DevOps declares! 🚀

🚥 Pre-merge checks — ✅ 3 passed

  • Description Check — Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check — Passed (the title accurately summarizes the three main changes: eval watchdog, duration metrics, and hang detection)
  • Docstring Coverage — Passed (100.00% coverage; required threshold is 80.00%)


coderabbitai bot previously requested changes Feb 19, 2026
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 867-886: Reset the eval_duration_secs variable at the start of
evaluation to avoid carrying stale values into description/tags: explicitly set
eval_duration_secs="" (or unset) before the DB read and before building
description/record_args so only a freshly-populated value from the current AI
evaluation is included; update the code around the eval_duration_secs assignment
and the description/record_args construction (referencing the eval_duration_secs
variable and the block that builds description and record_args) to ensure it's
cleared on every evaluation start, including retry and skip-AI paths.

In @.agents/scripts/supervisor/pulse.sh:
- Around line 425-447: The report currently hardcodes the watchdog threshold
(60s) in the SQL CASE expression inside the eval duration block; read the
SUPERVISOR_EVAL_WATCHDOG env var (with a sensible default like 60) into a shell
variable (e.g., eval_watchdog) after has_eval_duration_data is computed,
validate/ensure it is numeric, and substitute that variable into the SQL
(replace the literal 60 in the CASE WHEN eval_duration_secs >= 60 THEN 1 ELSE 0
END with the shell variable) so the count uses the configured
SUPERVISOR_EVAL_WATCHDOG value; keep variable names like has_eval_duration_data
and eval_watchdog to locate the relevant block.

Add trap-based cleanup for watchdog sentinel file and background process
in evaluate_with_ai(). Uses the established _save_cleanup_scope +
push_cleanup pattern (per styleguide: temp files must have trap cleanup).
The trap fires on any exit path — normal return, early return, or signal —
ensuring no sentinel file or watchdog process leaks.
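
A minimal stand-in for that pattern (the repo's _save_cleanup_scope and
push_cleanup implementations are not shown here, so this sketch approximates
them with a simple command accumulator):

```shell
#!/usr/bin/env sh
# Sketch: trap-based cleanup for the watchdog sentinel and process.
# push_cleanup/_save_cleanup_scope are repo helpers; this stand-in
# accumulates commands and runs them on any exit path (return or signal).
CLEANUP_CMDS=""
push_cleanup() { CLEANUP_CMDS="$1; $CLEANUP_CMDS"; }
run_cleanup() { eval "$CLEANUP_CMDS"; CLEANUP_CMDS=""; }
trap run_cleanup EXIT INT TERM

watchdog_sentinel=$(mktemp)
( sleep 2; [ -f "$watchdog_sentinel" ] && echo "EVAL_WATCHDOG_FIRED" >&2 ) &
watchdog_pid=$!

# Registered in reverse order: the kill runs first, then the rm.
push_cleanup "rm -f '$watchdog_sentinel'"
push_cleanup "kill '$watchdog_pid' 2>/dev/null || true"

# ... the AI evaluation would run here; any exit now triggers cleanup ...
```

Because the trap is bound to EXIT as well as signals, the sentinel and the
background process are reclaimed even on an early return.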
@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 10:08:03 UTC 2026: Code review monitoring started
Thu Feb 19 10:08:04 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 10:08:06 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

coderabbitai bot previously requested changes Feb 19, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1459-1475: The watchdog sleep may be longer than or equal to
eval_timeout, causing the subshell to never fire EVAL_WATCHDOG_FIRED; before
spawning the subshell that uses SUPERVISOR_EVAL_WATCHDOG, compare the resolved
watchdog_delay (SUPERVISOR_EVAL_WATCHDOG or default 60) with eval_timeout and
log a warning (e.g. via log_warn) if watchdog_delay >= eval_timeout so
misconfiguration is visible, then only spawn the subshell when watchdog_delay <
eval_timeout; inside the subshell drop the unnecessary local and reference
SUPERVISOR_EVAL_WATCHDOG directly in sleep, keeping the existing push_cleanup
usage for watchdog_sentinel and watchdog_pid to preserve cleanup ordering.
- Around line 1502-1512: The subtraction storing eval_duration_secs can produce
negative or zero values when clocks step backwards; before writing to the DB or
emitting the structured metric, validate eval_duration_secs (computed from
eval_end_epoch - eval_start_epoch) and only perform db "UPDATE ..." and log_info
"EVAL_DURATION ..." when eval_duration_secs is greater than zero; otherwise emit
a warning via log_info/log_warn with context (task_id, eval_start_epoch,
eval_end_epoch, eval_timeout, eval_model) and skip the DB write/metric emission
to avoid corrupting trend data. Ensure you reference the existing variables and
helpers (eval_duration_secs, eval_end_epoch, eval_start_epoch, db,
SUPERVISOR_DB, log_info, eval_timeout, eval_model, task_id) so the guard is
applied in the same block that currently performs the DB update and metric log.
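
Both guards could be sketched roughly as follows (SUPERVISOR_EVAL_TIMEOUT is
an assumed name for the configurable eval timeout from t1251; the repo's
actual variable may differ):

```shell
#!/usr/bin/env sh
# Sketch of both guards suggested in the comments above. Names mirror the
# review text; SUPERVISOR_EVAL_TIMEOUT is an assumed variable name.
eval_timeout="${SUPERVISOR_EVAL_TIMEOUT:-300}"
watchdog_delay="${SUPERVISOR_EVAL_WATCHDOG:-60}"

# Guard 1: a watchdog that sleeps past the hard timeout can never fire.
if [ "$watchdog_delay" -ge "$eval_timeout" ]; then
  echo "WARN: watchdog delay ${watchdog_delay}s >= eval timeout ${eval_timeout}s; skipping watchdog" >&2
  spawn_watchdog=0
else
  spawn_watchdog=1
fi

# Guard 2: clocks stepping backwards can yield a non-positive duration.
eval_start_epoch=$(date +%s)
sleep 1  # stand-in for the AI evaluation
eval_end_epoch=$(date +%s)
eval_duration_secs=$((eval_end_epoch - eval_start_epoch))

if [ "$eval_duration_secs" -gt 0 ]; then
  echo "EVAL_DURATION duration_secs=$eval_duration_secs"
else
  echo "WARN: non-positive eval duration ($eval_duration_secs); skipping DB write" >&2
fi
```

Skipping the DB write on a non-positive duration keeps one clock step from
polluting the trend data the new column exists to collect.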

---

Duplicate comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 867-871: Clear the stale eval_duration_secs DB column at the start
of each evaluation to avoid leaking a previous AI run's duration into
description/tags/pattern records: before the existing read of eval_duration_secs
(the block referencing SUPERVISOR_DB, eval_duration_secs and task_id and using
db/sql_escape), execute an update that sets eval_duration_secs = NULL for this
task_id via the db helper so subsequent logic reads a fresh value; keep the same
db/sql_escape usage and only perform this NULL-reset when SUPERVISOR_DB is set.

@marcusquinn marcusquinn dismissed stale reviews from coderabbitai[bot] and coderabbitai[bot] February 19, 2026 10:26

Auto-dismissed: bot review does not block autonomous pipeline
