t1252: add eval watchdog, duration metrics, and hang detection #1955
marcusquinn merged 2 commits into main
Conversation
Root cause verified (t1251 fixes confirmed): heartbeat + configurable timeout already reduce the stale-evaluating rate. This PR adds the remaining instrumentation:

1. eval_duration_secs column in tasks table — persists how long each AI eval took, enabling trend analysis and regression detection across pulse cycles.
2. Watchdog timer (SUPERVISOR_EVAL_WATCHDOG, default 60s) — a background subshell fires an EVAL_WATCHDOG_FIRED log line if the eval hasn't completed within 60s, giving early visibility into hangs before the full stale timeout fires. The watchdog is cancelled immediately when the eval completes (sentinel-file pattern).
3. Eval start/end timestamp logging — evaluate_with_ai() now logs the start time (ISO 8601) and the duration at completion: EVAL_DURATION task=$id duration_secs=N. The duration is also included in state_log audit entries and pattern tracker tags.
4. Eval duration in stale GC report — cmd_stale_gc_report shows a new 'Eval Duration Analysis' section when eval_duration_secs data is available, including avg/max/min and a count of evals exceeding the 60s watchdog threshold.
5. Eval duration in pattern tracker — record_evaluation_metadata() includes eval_duration_secs in the description and tags for cross-task trend analysis.

Expected benefit: eliminates ~35min of average wasted time per stale recovery by making evaluation hangs visible within 60s instead of after the full grace period.
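The sentinel-file watchdog described in item 2 can be sketched roughly as follows. Only SUPERVISOR_EVAL_WATCHDOG and the EVAL_WATCHDOG_FIRED log line come from the PR; the other variable names and the overall shape are illustrative, not the actual evaluate.sh code:

```shell
# Illustrative sketch of the sentinel-file watchdog pattern (not the PR's code).
task_id="${task_id:-demo}"
watchdog_delay="${SUPERVISOR_EVAL_WATCHDOG:-60}"
sentinel="$(mktemp)"

# Background watchdog: fires only if the sentinel still exists after the delay,
# i.e. the evaluation has not completed yet.
(
  sleep "$watchdog_delay"
  [ -e "$sentinel" ] && echo "EVAL_WATCHDOG_FIRED task=$task_id delay_secs=$watchdog_delay"
) >/dev/null 2>&1 &
watchdog_pid=$!

# ... the AI evaluation would run here ...

rm -f "$sentinel"                         # completion removes the sentinel...
kill "$watchdog_pid" 2>/dev/null || true  # ...and cancels the watchdog early
```

Because the watchdog only observes a file, cancellation is race-free: even if the kill loses the race with the sleep expiring, the missing sentinel keeps the false alarm from being logged.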
Summary of Changes (Gemini Code Assist)
This pull request introduces significant instrumentation to monitor and detect issues with AI evaluation processes. By adding a watchdog timer, recording evaluation durations, and integrating these metrics into logging and reporting, the system can now identify and respond to evaluation hangs much faster than before. This enhancement aims to improve the reliability and visibility of AI task processing, building on previous fixes that reduced the stale-evaluating rate by providing an early detection mechanism for prolonged evaluations.
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Feb 19 09:57:18 UTC 2026 by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces valuable instrumentation for monitoring the AI evaluation process, including a watchdog for hangs and duration metrics. The changes are well-implemented across the database, evaluation script, and reporting. My review includes one high-severity suggestion in evaluate.sh to add a trap for resource cleanup, ensuring the new watchdog mechanism is robust against unexpected script termination, in accordance with the repository's style guide.
Walkthrough
Adds eval_duration_secs to the tasks schema, instruments AI evaluations to measure and persist evaluation duration with a watchdog for long runs, and surfaces eval duration metrics in the stale GC report.
Sequence DiagramsequenceDiagram
participant Eval as Evaluate Script
participant WD as Watchdog
participant DB as Database
participant Report as Pulse Report
Eval->>Eval: record eval_start_epoch
Eval->>WD: spawn watchdog (configurable delay)
Eval->>Eval: run AI evaluation
WD-->>WD: monitor elapsed time
Eval->>Eval: compute eval_end_epoch & eval_duration_secs
Eval->>WD: cancel watchdog
Eval->>DB: persist eval_duration_secs to tasks
Eval->>Eval: emit EVAL_DURATION log/metric
Report->>DB: query recent eval_duration_secs
DB-->>Report: return duration rows
Report->>Report: compute stats (count, avg, max, min, threshold exceed)
Report->>Report: display Eval Duration Analysis
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 867-886: Reset the eval_duration_secs variable at the start of
evaluation to avoid carrying stale values into description/tags: explicitly set
eval_duration_secs="" (or unset) before the DB read and before building
description/record_args so only a freshly-populated value from the current AI
evaluation is included; update the code around the eval_duration_secs assignment
and the description/record_args construction (referencing the eval_duration_secs
variable and the block that builds description and record_args) to ensure it's
cleared on every evaluation start, including retry and skip-AI paths.
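The fix this comment asks for can be sketched as follows. The function name build_eval_tags is hypothetical; in the real evaluate.sh the reset would happen inline at the top of the evaluation path, including retry and skip-AI branches:

```shell
# Hypothetical sketch: reset eval_duration_secs at the start of every
# evaluation so a previous run's value never leaks into description/tags.
build_eval_tags() {
  local task_id="$1"
  local eval_duration_secs=""   # cleared on every evaluation start
  # ... the AI evaluation would set eval_duration_secs here on success ...
  local tags="task=$task_id"
  [ -n "$eval_duration_secs" ] && tags="$tags,eval_duration_secs=$eval_duration_secs"
  echo "$tags"
}

build_eval_tags t42   # → task=t42 (no stale duration appended)
```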
In @.agents/scripts/supervisor/pulse.sh:
- Around line 425-447: The report currently hardcodes the watchdog threshold
(60s) in the SQL CASE expression inside the eval duration block; read the
SUPERVISOR_EVAL_WATCHDOG env var (with a sensible default like 60) into a shell
variable (e.g., eval_watchdog) after has_eval_duration_data is computed,
validate/ensure it is numeric, and substitute that variable into the SQL
(replace the literal 60 in the CASE WHEN eval_duration_secs >= 60 THEN 1 ELSE 0
END with the shell variable) so the count uses the configured
SUPERVISOR_EVAL_WATCHDOG value; keep variable names like has_eval_duration_data
and eval_watchdog to locate the relevant block.
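A minimal version of the suggested parameterization might look like this. The variable name eval_watchdog follows the review comment; the SQL fragment is illustrative rather than the exact query in pulse.sh:

```shell
# Sketch: read the configured watchdog threshold, validate it is numeric,
# and interpolate it into the CASE expression instead of a hardcoded 60.
eval_watchdog="${SUPERVISOR_EVAL_WATCHDOG:-60}"
case "$eval_watchdog" in
  ''|*[!0-9]*) eval_watchdog=60 ;;   # non-numeric input falls back to default
esac
sql="SELECT SUM(CASE WHEN eval_duration_secs >= $eval_watchdog THEN 1 ELSE 0 END) FROM tasks"
echo "$sql"
```

Validating before interpolation matters here: the value comes from the environment, and an empty or non-numeric string would otherwise produce a syntactically invalid query.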
Add trap-based cleanup for watchdog sentinel file and background process in evaluate_with_ai(). Uses the established _save_cleanup_scope + push_cleanup pattern (per styleguide: temp files must have trap cleanup). The trap fires on any exit path — normal return, early return, or signal — ensuring no sentinel file or watchdog process leaks.
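A minimal version of that trap pattern, using bash's RETURN trap (the repo's _save_cleanup_scope/push_cleanup helpers are not reproduced here; everything below is a sketch, not the merged code):

```shell
# Sketch only: a RETURN trap ensures the sentinel file and watchdog process
# are cleaned up on every exit path from the function.
evaluate_with_ai() {
  local watchdog_sentinel watchdog_pid=""
  watchdog_sentinel="$(mktemp)"
  # Fires on normal return, early return, or error exit from the function.
  trap 'rm -f "$watchdog_sentinel"; kill "$watchdog_pid" 2>/dev/null || true' RETURN
  ( sleep 60 ) >/dev/null 2>&1 &   # stand-in for the real watchdog subshell
  watchdog_pid=$!
  # ... evaluation work; any return below still triggers the trap ...
  echo "$watchdog_sentinel"
  return 0
}
```

After the function returns, the sentinel path it printed no longer exists: the trap removed it and killed the background watchdog.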
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 1459-1475: The watchdog sleep may be longer-or-equal to
eval_timeout, causing the subshell to never fire EVAL_WATCHDOG_FIRED; before
spawning the subshell that uses SUPERVISOR_EVAL_WATCHDOG, compare the resolved
watchdog_delay (SUPERVISOR_EVAL_WATCHDOG or default 60) with eval_timeout and
log a warning (e.g. via log_warn) if watchdog_delay >= eval_timeout so
misconfiguration is visible, then only spawn the subshell when watchdog_delay <
eval_timeout; inside the subshell drop the unnecessary local and reference
SUPERVISOR_EVAL_WATCHDOG directly in sleep, keeping the existing push_cleanup
usage for watchdog_sentinel and watchdog_pid to preserve cleanup ordering.
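The guard this comment describes can be sketched as a small helper (spawn_watchdog is a hypothetical name; the real code is inline in evaluate_with_ai and also uses the repo's push_cleanup helpers, omitted here):

```shell
# Sketch: only spawn the watchdog if it can fire before the hard eval timeout;
# otherwise warn so the misconfiguration is visible instead of silently inert.
spawn_watchdog() {
  local watchdog_delay="${SUPERVISOR_EVAL_WATCHDOG:-60}"
  local eval_timeout="${1:?eval timeout required}"
  if [ "$watchdog_delay" -ge "$eval_timeout" ]; then
    echo "WARN: watchdog delay ${watchdog_delay}s >= eval timeout ${eval_timeout}s; watchdog disabled" >&2
    return 1
  fi
  ( sleep "$watchdog_delay" && echo "EVAL_WATCHDOG_FIRED" ) >/dev/null 2>&1 &
  echo "$!"   # caller keeps the PID so it can cancel on completion
}
```

Without the guard, a watchdog delay at or above the timeout means the subshell is always killed before its sleep expires, so EVAL_WATCHDOG_FIRED can never appear and the feature silently does nothing.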
- Around line 1502-1512: The subtraction storing eval_duration_secs can produce
negative or zero values when clocks step backwards; before writing to the DB or
emitting the structured metric, validate eval_duration_secs (computed from
eval_end_epoch - eval_start_epoch) and only perform db "UPDATE ..." and log_info
"EVAL_DURATION ..." when eval_duration_secs is greater than zero; otherwise emit
a warning via log_info/log_warn with context (task_id, eval_start_epoch,
eval_end_epoch, eval_timeout, eval_model) and skip the DB write/metric emission
to avoid corrupting trend data. Ensure you reference the existing variables and
helpers (eval_duration_secs, eval_end_epoch, eval_start_epoch, db,
SUPERVISOR_DB, log_info, eval_timeout, eval_model, task_id) so the guard is
applied in the same block that currently performs the DB update and metric log.
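The non-positive-duration guard can be sketched like this (record_eval_duration is a hypothetical helper; the real code performs the db UPDATE and log_info call inline in the same block):

```shell
# Sketch: validate the computed duration before persisting it, so a clock
# stepping backwards cannot write negative values into the trend data.
record_eval_duration() {
  local task_id="$1" eval_start_epoch="$2" eval_end_epoch="$3"
  local eval_duration_secs=$(( eval_end_epoch - eval_start_epoch ))
  if [ "$eval_duration_secs" -le 0 ]; then
    echo "WARN: non-positive eval duration task=$task_id start=$eval_start_epoch end=$eval_end_epoch" >&2
    return 1   # skip the DB write and metric emission entirely
  fi
  # The real code would run the db "UPDATE ..." here before emitting the metric.
  echo "EVAL_DURATION task=$task_id duration_secs=$eval_duration_secs"
}

record_eval_duration t7 100 160   # → EVAL_DURATION task=t7 duration_secs=60
```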
---
Duplicate comments:
In @.agents/scripts/supervisor/evaluate.sh:
- Around line 867-871: Clear the stale eval_duration_secs DB column at the start
of each evaluation to avoid leaking a previous AI run's duration into
description/tags/pattern records: before the existing read of eval_duration_secs
(the block referencing SUPERVISOR_DB, eval_duration_secs and task_id and using
db/sql_escape), execute an update that sets eval_duration_secs = NULL for this
task_id via the db helper so subsequent logic reads a fresh value; keep the same
db/sql_escape usage and only perform this NULL-reset when SUPERVISOR_DB is set.
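The NULL-reset can be sketched with stubbed helpers. Both db and sql_escape below are stand-ins for the repo's real wrappers (assumptions for this sketch only); the point is the guard on SUPERVISOR_DB and the reset before any read:

```shell
# Stubs standing in for the repo's real helpers (assumptions for this sketch):
db() { echo "SQL: $1"; }                             # real helper runs sqlite3
sql_escape() { printf '%s' "$1" | sed "s/'/''/g"; }  # minimal quote doubling

task_id="t42"
# NULL-reset only when a DB is configured, so later reads see a fresh value
# rather than the previous AI run's duration.
if [ -n "${SUPERVISOR_DB:-}" ]; then
  db "UPDATE tasks SET eval_duration_secs = NULL WHERE id = '$(sql_escape "$task_id")'"
fi
```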
Auto-dismissed: bot review does not block autonomous pipeline



Summary
Root Cause Context
t1251 already fixed the primary causes of 73% stale-evaluating rate (heartbeat, configurable timeout, PR fast-path). This PR adds the instrumentation layer requested in t1252 so evaluation hangs are visible within 60s instead of waiting for the full stale grace period.
Changes
evaluate.sh: watchdog timer (background subshell + sentinel file), eval start/end timestamps, duration logging and DB persistence
database.sh: eval_duration_secs column in tasks schema + migration
pulse.sh: eval duration analysis section in cmd_stale_gc_report

Ref #1954