feat(server): watcher/scheduler health metrics + ServiceMonitor/PrometheusRule #1047

Merged: buremba merged 1 commit into main from feat/watcher-alerting on May 25, 2026

@buremba (Member) commented May 25, 2026

Why

Follow-up to #1046. That bug was a silent 12-day outage — the app was healthy but watcher-automation produced zero successful ticks and nothing alerted. This adds the metrics + alert rules so it can't happen silently again. Stacked on #1046 (the phase-failure metric hooks runWatcherAutomationTick).

What

Metrics (gateway/metrics/prometheus.ts — note: the module was vestigial, registering metrics that were never incremented):

  • lobu_scheduled_job_runs_total{job,outcome} — per cron tick in TaskScheduler.dispatch (the scheduler heartbeat).
  • lobu_watcher_automation_phase_failures_total{phase} — per failed phase in runWatcherAutomationTick. Needed because the hardened tick swallows phase errors, so the scheduler-level counter shows success even when reconcile/etc. fail internally.
  • lobu_watcher_runs_created_total, lobu_watchers_unrunnable (gauge).
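
As a rough illustration of how label-aware counters like the ones listed above could be kept in memory and rendered for a scrape, here is a minimal sketch. It is not the actual gateway/metrics/prometheus.ts module: only the incrementCounter name and its (name, labels) call shape appear in this PR, while the amount parameter's position, labelKey, renderMetrics, and the storage layout are assumptions made for the example. Label values are assumed to contain no quotes or backslashes, so no escaping is done.

```typescript
// Hypothetical sketch; the real module's internals are not shown in this PR.
type Labels = Record<string, string>;

// One map per metric name; each series is keyed by a canonical label string.
const counters = new Map<string, Map<string, number>>();

// Serialize labels in sorted key order so {a,b} and {b,a} address the same series.
function labelKey(labels: Labels): string {
  return Object.keys(labels)
    .sort()
    .map((k) => `${k}="${labels[k]}"`)
    .join(',');
}

// Add `amount` (default 1) to the series identified by name + labels.
function incrementCounter(name: string, labels: Labels = {}, amount = 1): void {
  const series = counters.get(name) ?? new Map<string, number>();
  const key = labelKey(labels);
  series.set(key, (series.get(key) ?? 0) + amount);
  counters.set(name, series);
}

// Render every series in the Prometheus text exposition format.
function renderMetrics(): string {
  const lines: string[] = [];
  counters.forEach((series, name) => {
    lines.push(`# TYPE ${name} counter`);
    series.forEach((value, key) => {
      lines.push(key ? `${name}{${key}} ${value}` : `${name} ${value}`);
    });
  });
  return lines.join('\n') + '\n';
}
```

Sorting the label keys is the one design choice worth noting: without it, two call sites passing the same labels in a different order would silently create two series.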

Per-pod in-memory counters are the correct Prometheus model — each pod's /metrics is scraped and summed; rate()/increase() handle restart resets. No cross-replica shared state.
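
To make the restart-reset point concrete, the sketch below approximates how an increase over a window can be computed from raw per-pod samples when a pod restart drops its counter back to zero. It is a simplification for illustration only: Prometheus's real increase() also extrapolates to the window boundaries, which is omitted here, and the function names are hypothetical.

```typescript
// Approximate increase over a window of raw counter samples (oldest first).
// Any drop in value is treated as a restart that reset the counter to zero.
function increaseWithResets(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const curr = samples[i];
    // After a reset the counter restarts from 0, so the whole current value is new growth.
    total += curr >= prev ? curr - prev : curr;
  }
  return total;
}

// Sum the per-pod increases, mirroring sum(increase(...)) across scraped pods.
function sumIncreaseAcrossPods(pods: number[][]): number {
  return pods.reduce((acc, samples) => acc + increaseWithResets(samples), 0);
}
```

For example, a pod whose samples read 5, 7, 2, 4 restarted between the second and third scrape; the sketch counts 2 + 2 + 2 = 6 rather than a negative delta.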

Chart (charts/lobu, both default OFF; the prod overlay enables them):

  • ServiceMonitor scraping the app /metrics. Adds app.kubernetes.io/component=api to the app Service so the monitor targets it, not the embeddings Service (both share name+instance labels).
  • PrometheusRule:
    • WatcherAutomationSilent (critical) — dead-man's-switch: fires on the absence of successful ticks (... or on() vector(0)) == 0), the actual failure mode.
    • WatcherAutomationPhaseFailing, LobuScheduledJobFailing (warning).

severity labels route to #devops via the already-live slack-devops AlertmanagerConfig (verified loaded in the running Alertmanager) — no Alertmanager/webhook change needed. Both resources carry release: kube-prometheus-stack (Prometheus's selector) via additionalLabels.
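
The reason the dead-man's-switch needs the `or on() vector(0)` fallback can be modeled in a few lines. This is a sketch of the semantics as I read them, not the rule's actual expression (which is elided above): when no pod exposes the series at all, a sum over it returns an empty result, and comparing an empty result to 0 never fires. All names below are hypothetical.

```typescript
// A query result reduced to at most one value; `undefined` models
// "no series matched", which is what a sum() over nothing returns.
type ScalarResult = number | undefined;

// Sum of successful-tick increases across pods; undefined when no pod reports the series.
function sumSuccessfulTicks(perPodIncrease: number[]): ScalarResult {
  if (perPodIncrease.length === 0) return undefined;
  return perPodIncrease.reduce((a, b) => a + b, 0);
}

// Naive rule, `sum(...) == 0`: an empty result never satisfies the comparison.
function firesNaive(result: ScalarResult): boolean {
  return result !== undefined && result === 0;
}

// Dead-man's switch, `(sum(...) or on() vector(0)) == 0`: falls back to 0 when empty.
function firesDeadMansSwitch(result: ScalarResult): boolean {
  return (result ?? 0) === 0;
}
```

With no series present, firesNaive stays silent while firesDeadMansSwitch fires, which is the case that matters when the success counter has never been incremented.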

Validation

  • tsc clean; watcher suite 25/25 (exercises the metric calls in runWatcherAutomationTick).
  • helm lint + helm template (with metrics enabled) render correctly; ServiceMonitor selector is unique to the app, both resources labeled for the cluster's Prometheus.

Sequencing / not in this PR

  • Chart.yaml version intentionally not hand-bumped (release-please owns it; reconcileStrategy: ChartVersion means the templates deploy on the next release).
  • Enabling metrics.serviceMonitor/prometheusRule in the summaries-prod values is a separate owletto-deploy PR, to merge after the release that ships these templates.

Base is #1046's branch; retarget to main once #1046 merges.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added Prometheus monitoring integration with configurable ServiceMonitor and PrometheusRule resources (requires Prometheus Operator).
    • New metrics for tracking watcher automation health, scheduled job execution outcomes, and system failures.
    • New alerting rules for monitoring critical system events and performance degradation.

coderabbitai Bot commented May 25, 2026

Walkthrough

This PR adds end-to-end Prometheus monitoring to the Lobu application. It introduces metric emission in watcher automation and scheduled job modules, exposes these metrics via Kubernetes ServiceMonitor, and defines alerting rules for health detection, all controlled via Helm chart values.

Changes

Prometheus Monitoring and Health Alerting

| Layer / File(s) | Summary |
|---|---|
| Metrics API and internal gauge management<br>packages/server/src/gateway/metrics/prometheus.ts | Registers new scheduler/watcher-related health metrics (counters for job runs, phase failures, runs created; gauge for unrunnable watchers), adds setGaugeInternal for label-aware gauge updates, and exports incrementCounter to atomically update counter metrics by label set with configurable increment amounts. |
| Watcher automation health metrics emission<br>packages/server/src/watchers/automation.ts | Imports and calls metrics functions after tick execution to increment per-phase failure counters from recorded errors, increment runs-created counter from materialize output, and set unrunnable gauge from materialization results. |
| Scheduled job outcome metrics<br>packages/server/src/scheduled/task-scheduler.ts | Imports metrics helper and wraps task dispatch in try/catch to emit success/error outcome counters, preserving original retry behavior by re-throwing errors. |
| Kubernetes monitoring and alerting configuration<br>charts/lobu/templates/servicemonitor.yaml, charts/lobu/templates/prometheusrule.yaml, charts/lobu/values.yaml, charts/lobu/templates/service.yaml | Adds ServiceMonitor for scraping /metrics on configurable intervals, PrometheusRule with three alerts (WatcherAutomationSilent for no ticks, WatcherAutomationPhaseFailing for phase errors, LobuScheduledJobFailing for job errors over 15m), metrics configuration in values.yaml with feature toggles, and component label to Service for monitoring discovery. |
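
To illustrate the automation.ts row above, here is a minimal sketch of emitting tick-level metrics after a run. The TickResult shape, the MetricSink interface, and recordTickMetrics are hypothetical names introduced for the example; the real runWatcherAutomationTick result type is not shown in this PR. Only the metric names and the 'reconcile' phase come from the description.

```typescript
// Hypothetical shape of a tick result; the real types may differ.
interface TickResult {
  errors: { phase: string }[]; // one entry per phase whose error was swallowed
  runsCreated: number;         // runs produced by the materialize phase
  unrunnable: number;          // watchers that could not be scheduled
}

// Minimal metric sink so the sketch stays self-contained.
interface MetricSink {
  incrementCounter(name: string, labels?: Record<string, string>, amount?: number): void;
  setGauge(name: string, value: number): void;
}

// Translate one tick's outcome into counter increments and a gauge update.
function recordTickMetrics(result: TickResult, metrics: MetricSink): void {
  for (const err of result.errors) {
    metrics.incrementCounter('lobu_watcher_automation_phase_failures_total', {
      phase: err.phase,
    });
  }
  if (result.runsCreated > 0) {
    metrics.incrementCounter('lobu_watcher_runs_created_total', {}, result.runsCreated);
  }
  // A gauge is set to the latest value rather than accumulated.
  metrics.setGauge('lobu_watchers_unrunnable', result.unrunnable);
}
```

Emitting from the recorded errors, rather than from thrown exceptions, matters here because the hardened tick swallows phase errors before they reach the scheduler.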

Sequence Diagram(s)

sequenceDiagram
  participant Application as Application<br/>WatcherAutomation<br/>TaskScheduler
  participant MetricsAPI as Prometheus<br/>Module
  participant ServiceMonitor as ServiceMonitor<br/>Resource
  participant Prometheus as Prometheus<br/>Server
  participant PrometheusRule as PrometheusRule<br/>Alerts
  Application->>MetricsAPI: incrementCounter()/setGauge()
  MetricsAPI->>MetricsAPI: Update in-memory metrics
  ServiceMonitor->>MetricsAPI: Scrape /metrics endpoint
  MetricsAPI-->>ServiceMonitor: Return metric data
  ServiceMonitor->>Prometheus: Forward scraped metrics
  Prometheus->>Prometheus: Store time series
  Prometheus->>PrometheusRule: Evaluate alert rules (15m window)
  PrometheusRule-->>Prometheus: Fire WatcherAutomationSilent<br/>WatcherAutomationPhaseFailing<br/>LobuScheduledJobFailing

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lobu-ai/lobu#1046: The watcher automation metrics in this PR build directly on the prior PR's changes to runWatcherAutomationTick tick orchestration and the addition of unrunnable and runsCreated plumbing in the materialize phase.

Poem

🐰 Prometheus gathers the tale,
Each tick and job now leaves a trail,
Alerts will chirp when silence breaks,
The watcher automation shakes,
Health metrics flow—no more to fail! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat(server): watcher/scheduler health metrics + ServiceMonitor/PrometheusRule' clearly and concisely describes the main changes: adding health metrics and Kubernetes monitoring resources for the watcher and scheduler components. |
| Description check | ✅ Passed | The PR description comprehensively covers all required template sections: a clear 'Why' explaining the context and motivation, a detailed 'What' describing the metrics and chart changes, validation details, and sequencing notes. It exceeds minimum requirements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


@buremba (Member, Author) commented May 25, 2026

bug_free 86, simplicity 88, slop 0, bugs 0, 0 blockers

Read diff; script suites passed (typecheck/unit/integration exit 0). Explored helm template with ServiceMonitor/PrometheusRule enabled -> OK, and imported metrics module to verify new counter/gauge text. Did not boot full server.

Full verdict JSON
{
  "bug_free_confidence": 86,
  "bugs": 0,
  "slop": 0,
  "simplicity": 88,
  "blockers": [],
  "change_type": "feat",
  "behavior_change_risk": "low",
  "tests_adequate": true,
  "suggested_fixes": [],
  "notes": "Read diff; script suites passed (typecheck/unit/integration exit 0). Explored helm template with ServiceMonitor/PrometheusRule enabled -> OK, and imported metrics module to verify new counter/gauge text. Did not boot full server.",
  "categories": {
    "src": 99,
    "tests": 0,
    "docs": 0,
    "config": 105,
    "deps": 0,
    "migrations": 0,
    "ci": 0,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

Base automatically changed from feat/fix-watcher-reconcile-array to main May 25, 2026 21:16
…theusRule

Observability for the silent failure mode behind the 12-day watcher outage
(lobu#1046): the app was healthy but watcher-automation produced zero successful
ticks, with no alert.

Metrics (the Prometheus module was vestigial — registered but never incremented):
- lobu_scheduled_job_runs_total{job,outcome} — incremented per cron tick in the
  TaskScheduler. The scheduler heartbeat.
- lobu_watcher_automation_phase_failures_total{phase} — per failed phase in
  runWatcherAutomationTick. Needed because the hardened tick swallows phase
  errors, so the scheduler-level counter alone can't see internal failures.
- lobu_watcher_runs_created_total / lobu_watchers_unrunnable (gauge).
Per-pod in-memory counters are the correct Prometheus model: each pod's /metrics
is scraped and summed; rate()/increase() handle restart resets. No cross-replica
shared state.

Chart (charts/lobu, off by default; prod overlay enables):
- ServiceMonitor (templates/servicemonitor.yaml) scraping the app /metrics. Adds
  app.kubernetes.io/component=api to the app Service so the monitor targets it
  and not the embeddings Service (both share name+instance labels).
- PrometheusRule (templates/prometheusrule.yaml): WatcherAutomationSilent
  (critical, dead-man's-switch — alerts on ABSENCE of successful ticks, the
  actual failure mode), WatcherAutomationPhaseFailing + LobuScheduledJobFailing
  (warning). severity labels route to #devops via the existing slack-devops
  AlertmanagerConfig — no Alertmanager change needed.
Both gated on .Values.metrics.*; require release=kube-prometheus-stack label
(Prometheus selector) via additionalLabels.

Chart.yaml version intentionally not bumped — release-please owns it; templates
ship on the next release.
@buremba buremba force-pushed the feat/watcher-alerting branch from 9080898 to 7b4d36c Compare May 25, 2026 21:16
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/lobu/templates/prometheusrule.yaml`:
- Around line 49-56: The alert currently uses the expression
sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m]))
by (job) > 0 but the annotation/summary claims the job "threw on every run";
update the annotations in prometheusrule.yaml (the summary and description
fields) to accurately reflect the query (e.g., "had at least one error in the
last 15m" / "The {{`{{ $labels.job }}`}} cron task had one or more error(s) over
the past 15m.") or, if you prefer stricter behavior, tighten the expression
(replace > 0 with a comparison to total runs to detect if errors == total runs)
so the alert text matches the metric logic.

In `@packages/server/src/scheduled/task-scheduler.ts`:
- Around line 250-269: The current try/catch around reg.handler only counts
handler errors; extend the try to begin before the pre-handler cron
seeding/validation code so any failure prior to calling reg.handler is also
caught and triggers incrementCounter('lobu_scheduled_job_runs_total', { job:
data.name, outcome: 'error' }). Keep the success incrementCounter call after
reg.handler completes, preserve rethrowing the caught error, and update
references to reg.handler, incrementCounter and lobu_scheduled_job_runs_total
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 441c69eb-157c-4bc5-8015-83bd3bf81295

📥 Commits

Reviewing files that changed from the base of the PR and between c524b42 and 7b4d36c.

📒 Files selected for processing (7)
  • charts/lobu/templates/prometheusrule.yaml
  • charts/lobu/templates/service.yaml
  • charts/lobu/templates/servicemonitor.yaml
  • charts/lobu/values.yaml
  • packages/server/src/gateway/metrics/prometheus.ts
  • packages/server/src/scheduled/task-scheduler.ts
  • packages/server/src/watchers/automation.ts

Comment on lines +49 to +56
  sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m])) by (job) > 0
for: 15m
labels:
  severity: warning
  service: lobu
annotations:
  summary: "Lobu scheduled job {{`{{ $labels.job }}`}} is failing"
  description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m."

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Alert annotation overstates failure frequency.

Line 49 alerts on any error in 15m (> 0), but Line 56 says the job “threw on every run.” Please align the annotation text (or tighten the expression).

💡 Suggested patch
-            description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m."
+            description: "The {{`{{ $labels.job }}`}} cron task threw at least once over 15m."
📝 Committable suggestion

Suggested change (replacement for lines +49 to +56):
  sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m])) by (job) > 0
for: 15m
labels:
  severity: warning
  service: lobu
annotations:
  summary: "Lobu scheduled job {{`{{ $labels.job }}`}} is failing"
  description: "The {{`{{ $labels.job }}`}} cron task threw at least once over 15m."
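
The gap between the expression and the original annotation can be shown with two predicates over a window's run counts. This is only an illustration of the two readings discussed in the comment; WindowCounts and both function names are hypothetical, and the stricter variant is one possible interpretation of "errors == total runs", not a rule taken from the chart.

```typescript
// Run counts for one job over the alert window.
interface WindowCounts {
  success: number;
  error: number;
}

// Mirrors the expression as written: at least one error in the window.
function firesOnAnyError(c: WindowCounts): boolean {
  return c.error > 0;
}

// Stricter alternative: the job ran at least once and every run errored.
function firesOnEveryRunFailing(c: WindowCounts): boolean {
  const total = c.success + c.error;
  return total > 0 && c.error === total;
}
```

A window with 14 successes and 1 error satisfies the first predicate but not the second, which is why "threw on every run" overstates what the `> 0` comparison detects.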

Comment on lines +250 to +269
    // Per-tick outcome counter — the scheduler heartbeat that backs the
    // "watcher-automation silent / failing" alerts. Counts every dispatched
    // task; alerts filter by job name. Re-throw is preserved so the runs-queue
    // retry path is unchanged.
    try {
      await reg.handler({
        payload: data.payload,
        taskRunId: Number(job.id),
      });
      incrementCounter('lobu_scheduled_job_runs_total', {
        job: data.name,
        outcome: 'success',
      });
    } catch (err) {
      incrementCounter('lobu_scheduled_job_runs_total', {
        job: data.name,
        outcome: 'error',
      });
      throw err;
    }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Count pre-handler failures in lobu_scheduled_job_runs_total as errors.

Line 254 only wraps reg.handler(...). If cron seeding fails earlier (Lines 241-248), the run fails without incrementing outcome: "error", which leaves an alerting blind spot.

Proposed fix
-    if (reg.cron) {
-      const fromTick = data.__scheduledTick
-        ? new Date(data.__scheduledTick)
-        : new Date();
-      // Add 1ms so nextRunAt skips past the current tick when fromTick falls
-      // exactly on a cron boundary.
-      await this.seedNextCronTick(reg, new Date(fromTick.getTime() + 1));
-    }
-
-    // Per-tick outcome counter — the scheduler heartbeat that backs the
-    // "watcher-automation silent / failing" alerts. Counts every dispatched
-    // task; alerts filter by job name. Re-throw is preserved so the runs-queue
-    // retry path is unchanged.
     try {
+      if (reg.cron) {
+        const fromTick = data.__scheduledTick
+          ? new Date(data.__scheduledTick)
+          : new Date();
+        // Add 1ms so nextRunAt skips past the current tick when fromTick falls
+        // exactly on a cron boundary.
+        await this.seedNextCronTick(reg, new Date(fromTick.getTime() + 1));
+      }
+
+      // Per-tick outcome counter — the scheduler heartbeat that backs the
+      // "watcher-automation silent / failing" alerts. Counts every dispatched
+      // task; alerts filter by job name. Re-throw is preserved so the runs-queue
+      // retry path is unchanged.
       await reg.handler({
         payload: data.payload,
         taskRunId: Number(job.id),
       });
       incrementCounter('lobu_scheduled_job_runs_total', {
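
The restructuring proposed above can be sketched in isolation as a small wrapper that reports exactly one outcome per dispatch and rethrows failures. This is an illustrative sketch, not the TaskScheduler code: withOutcomeCounter, Outcome, and the record callback are hypothetical names, and the idea is simply that the cron seeding and the handler both run inside the same `work` function so a failure in either is counted.

```typescript
type Outcome = 'success' | 'error';

// Run `work` and report exactly one outcome for `job`. Any failure is
// rethrown so the caller's retry path behaves as it did before.
async function withOutcomeCounter<T>(
  job: string,
  record: (job: string, outcome: Outcome) => void,
  work: () => Promise<T>,
): Promise<T> {
  try {
    const result = await work();
    record(job, 'success');
    return result;
  } catch (err) {
    record(job, 'error');
    throw err;
  }
}
```

A call site would pass a `work` function that awaits the seeding step first and the handler second; in the real scheduler, `record` would presumably delegate to incrementCounter('lobu_scheduled_job_runs_total', { job, outcome }).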

@buremba buremba merged commit 60c6e73 into main May 25, 2026
39 checks passed
@buremba buremba deleted the feat/watcher-alerting branch May 25, 2026 21:25