feat(server): watcher/scheduler health metrics + ServiceMonitor/PrometheusRule by buremba · Pull Request #1047 · lobu-ai/lobu

buremba · 2026-05-25T18:35:25Z

Why

Follow-up to #1046. That bug was a silent 12-day outage — the app was healthy but watcher-automation produced zero successful ticks and nothing alerted. This adds the metrics + alert rules so it can't happen silently again. Stacked on #1046 (the phase-failure metric hooks runWatcherAutomationTick).

What

Metrics (gateway/metrics/prometheus.ts — note: the module was vestigial, registering metrics that were never incremented):

lobu_scheduled_job_runs_total{job,outcome} — per cron tick in TaskScheduler.dispatch (the scheduler heartbeat).
lobu_watcher_automation_phase_failures_total{phase} — per failed phase in runWatcherAutomationTick. Needed because the hardened tick swallows phase errors, so the scheduler-level counter shows success even when reconcile/etc. fail internally.
lobu_watcher_runs_created_total, lobu_watchers_unrunnable (gauge).
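
The counters and gauge listed above could be backed by a small label-aware in-memory registry. The following is a minimal sketch under stated assumptions, not the actual contents of gateway/metrics/prometheus.ts: `seriesKey`, `setGauge`, `renderMetrics`, and the two `Map` registries are hypothetical names, and only `incrementCounter` is named in this PR (its real signature may differ).

```typescript
// Hypothetical minimal registry; NOT the real gateway/metrics/prometheus.ts.
type Labels = Record<string, string>;

// Serialize a label set with sorted keys so {a,b} and {b,a} hit the same series.
// Label-value escaping is omitted for brevity.
function seriesKey(labels: Labels): string {
  return Object.keys(labels)
    .sort()
    .map((k) => `${k}="${labels[k]}"`)
    .join(",");
}

const counters = new Map<string, Map<string, number>>();
const gauges = new Map<string, Map<string, number>>();

// Add `by` to the counter series identified by name + label set.
function incrementCounter(name: string, labels: Labels = {}, by = 1): void {
  const series = counters.get(name) ?? new Map<string, number>();
  const key = seriesKey(labels);
  series.set(key, (series.get(key) ?? 0) + by);
  counters.set(name, series);
}

// Overwrite the gauge series identified by name + label set.
function setGauge(name: string, value: number, labels: Labels = {}): void {
  const series = gauges.get(name) ?? new Map<string, number>();
  series.set(seriesKey(labels), value);
  gauges.set(name, series);
}

// Render every series in the Prometheus text exposition format.
function renderMetrics(): string {
  const lines: string[] = [];
  const registries = [["counter", counters], ["gauge", gauges]] as const;
  for (const [kind, registry] of registries) {
    for (const [name, series] of registry) {
      lines.push(`# TYPE ${name} ${kind}`);
      for (const [key, value] of series) {
        lines.push(key ? `${name}{${key}} ${value}` : `${name} ${value}`);
      }
    }
  }
  return lines.join("\n") + "\n";
}
```

Sorting the label keys before building the series key is a design choice that keeps call sites from creating duplicate series when they pass the same labels in a different order.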

Per-pod in-memory counters are the correct Prometheus model — each pod's /metrics is scraped and summed; rate()/increase() handle restart resets. No cross-replica shared state.
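
To illustrate why restart resets are harmless, here is a simplified model of how an `increase()`-style calculation treats a counter that drops back toward zero. It is a sketch only: real Prometheus also extrapolates to the window boundaries, which is omitted here, and `increaseWithResets` and `summedIncrease` are hypothetical names.

```typescript
// Simplified model of PromQL increase(): sum the deltas between consecutive
// samples, treating any drop as a counter reset (for example a pod restart).
function increaseWithResets(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const curr = samples[i];
    // After a reset the counter restarted from 0, so the whole current
    // value counts as growth since the previous sample.
    total += curr >= prev ? curr - prev : curr;
  }
  return total;
}

// Each pod's series is evaluated independently, then the results are summed,
// mirroring sum(increase(...)) across replicas.
function summedIncrease(perPodSamples: number[][]): number {
  return perPodSamples.reduce((sum, s) => sum + increaseWithResets(s), 0);
}
```

For example, a pod whose counter reads 5, 8, 2, 4 (restarting between the second and third scrape) contributes 3 + 2 + 2 = 7 under this model, so no shared state between replicas is needed.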

Chart (charts/lobu, both default OFF; the prod overlay enables them):

ServiceMonitor scraping the app /metrics. Adds app.kubernetes.io/component=api to the app Service so the monitor targets it, not the embeddings Service (both share name+instance labels).
PrometheusRule:
- WatcherAutomationSilent (critical): a dead-man's switch that fires on the absence of successful ticks (`(... or on() vector(0)) == 0`), the actual failure mode.
- WatcherAutomationPhaseFailing, LobuScheduledJobFailing (warning).
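
The reason the dead-man's switch needs the `or on() vector(0)` fallback can be modelled in a few lines. This is a toy model of PromQL instant-vector semantics, not the rule's actual expression (which is elided in the description above); the assumption is that the inner expression is a `sum(increase(...))` over the success counter, and all function names are hypothetical.

```typescript
// An instant vector modelled as a list of sample values; empty means "no series".
type InstantVector = number[];

// In PromQL, sum(...) over an empty vector is an empty vector, not 0.
function sum(v: InstantVector): InstantVector {
  return v.length === 0 ? [] : [v.reduce((a, b) => a + b, 0)];
}

// `a or on() vector(0)`: fall back to a single 0 sample when `a` is empty.
function orVectorZero(v: InstantVector): InstantVector {
  return v.length === 0 ? [0] : v;
}

// `v == 0` acts as a filter: only samples equal to 0 survive.
function equalsZero(v: InstantVector): InstantVector {
  return v.filter((x) => x === 0);
}

// An alert fires when its expression returns at least one sample.
function firesWithoutFallback(successIncreases: InstantVector): boolean {
  return equalsZero(sum(successIncreases)).length > 0;
}

function firesWithFallback(successIncreases: InstantVector): boolean {
  return equalsZero(orVectorZero(sum(successIncreases))).length > 0;
}
```

With no success series at all (for instance, the counter was never incremented), `firesWithoutFallback([])` is `false` while `firesWithFallback([])` is `true`, which is the silent case this alert is meant to catch.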

severity labels route to #devops via the already-live slack-devops AlertmanagerConfig (verified loaded in the running Alertmanager) — no Alertmanager/webhook change needed. Both resources carry release: kube-prometheus-stack (Prometheus's selector) via additionalLabels.

Validation

tsc clean; watcher suite 25/25 (exercises the metric calls in runWatcherAutomationTick).
helm lint + helm template (with metrics enabled) render correctly; ServiceMonitor selector is unique to the app, both resources labeled for the cluster's Prometheus.

Sequencing / not in this PR

Chart.yaml version intentionally not hand-bumped (release-please owns it; reconcileStrategy: ChartVersion means the templates deploy on the next release).
Enabling metrics.serviceMonitor/prometheusRule in the summaries-prod values is a separate owletto-deploy PR, to merge after the release that ships these templates.

Base is #1046's branch; retarget to main once #1046 merges.

Summary by CodeRabbit

Release Notes

New Features
- Added Prometheus monitoring integration with configurable ServiceMonitor and PrometheusRule resources (requires Prometheus Operator).
- New metrics for tracking watcher automation health, scheduled job execution outcomes, and system failures.
- New alerting rules for monitoring critical system events and performance degradation.

coderabbitai · 2026-05-25T18:35:30Z

📝 Walkthrough

Walkthrough

This PR adds end-to-end Prometheus monitoring to the Lobu application. It introduces metric emission in watcher automation and scheduled job modules, exposes these metrics via Kubernetes ServiceMonitor, and defines alerting rules for health detection, all controlled via Helm chart values.

Changes

Prometheus Monitoring and Health Alerting

| Layer / File(s) | Summary |
| --- | --- |
| Metrics API and internal gauge management `packages/server/src/gateway/metrics/prometheus.ts` | Registers new scheduler/watcher-related health metrics (counters for job runs, phase failures, runs created; gauge for unrunnable watchers), adds `setGaugeInternal` for label-aware gauge updates, and exports `incrementCounter` to atomically update counter metrics by label set with configurable increment amounts. |
| Watcher automation health metrics emission `packages/server/src/watchers/automation.ts` | Imports and calls metrics functions after tick execution to increment per-phase failure counters from recorded errors, increment runs-created counter from materialize output, and set unrunnable gauge from materialization results. |
| Scheduled job outcome metrics `packages/server/src/scheduled/task-scheduler.ts` | Imports metrics helper and wraps task dispatch in try/catch to emit success/error outcome counters, preserving original retry behavior by re-throwing errors. |
| Kubernetes monitoring and alerting configuration `charts/lobu/templates/servicemonitor.yaml`, `charts/lobu/templates/prometheusrule.yaml`, `charts/lobu/values.yaml`, `charts/lobu/templates/service.yaml` | Adds ServiceMonitor for scraping /metrics on configurable intervals, PrometheusRule with three alerts (WatcherAutomationSilent for no ticks, WatcherAutomationPhaseFailing for phase errors, LobuScheduledJobFailing for job errors over 15m), metrics configuration in values.yaml with feature toggles, and component label to Service for monitoring discovery. |
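
The watcher-automation row describes incrementing per-phase failure counters from errors recorded during the tick. A minimal sketch of that pattern follows, assuming a tick that swallows phase errors; `Phase`, `runTick`, and the local `incrementCounter` stand-in are hypothetical and are not the real `runWatcherAutomationTick` or metrics helper.

```typescript
type Phase = { name: string; run: () => Promise<void> };

// Stand-in for the metrics helper: records increments in a local map so the
// sketch is self-contained. The real helper lives in the metrics module.
const phaseFailures = new Map<string, number>();
function incrementCounter(name: string, labels: { phase: string }): void {
  const key = `${name}{phase="${labels.phase}"}`;
  phaseFailures.set(key, (phaseFailures.get(key) ?? 0) + 1);
}

// Run every phase even if an earlier one throws, record which phases failed,
// then emit one counter increment per failed phase after the tick.
async function runTick(phases: Phase[]): Promise<string[]> {
  const failed: string[] = [];
  for (const phase of phases) {
    try {
      await phase.run();
    } catch {
      // Swallowed on purpose: the tick itself still "succeeds", which is why
      // the scheduler-level counter alone cannot see this failure.
      failed.push(phase.name);
    }
  }
  for (const name of failed) {
    incrementCounter("lobu_watcher_automation_phase_failures_total", { phase: name });
  }
  return failed;
}
```

Because the tick resolves normally even when a phase fails, the scheduler records `outcome="success"` for it, and only the per-phase counter exposes the internal error.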

Sequence Diagram(s)

sequenceDiagram
  participant Application as Application<br/>WatcherAutomation<br/>TaskScheduler
  participant MetricsAPI as Prometheus<br/>Module
  participant ServiceMonitor as ServiceMonitor<br/>Resource
  participant Prometheus as Prometheus<br/>Server
  participant PrometheusRule as PrometheusRule<br/>Alerts
  Application->>MetricsAPI: incrementCounter()/setGauge()
  MetricsAPI->>MetricsAPI: Update in-memory metrics
  ServiceMonitor->>MetricsAPI: Scrape /metrics endpoint
  MetricsAPI-->>ServiceMonitor: Return metric data
  ServiceMonitor->>Prometheus: Forward scraped metrics
  Prometheus->>Prometheus: Store time series
  Prometheus->>PrometheusRule: Evaluate alert rules (15m window)
  PrometheusRule-->>Prometheus: Fire WatcherAutomationSilent<br/>WatcherAutomationPhaseFailing<br/>LobuScheduledJobFailing

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

lobu-ai/lobu#1046: The watcher automation metrics in this PR build directly on the prior PR's changes to runWatcherAutomationTick tick orchestration and the addition of unrunnable and runsCreated plumbing in the materialize phase.

Poem

🐰 Prometheus gathers the tale,
Each tick and job now leaves a trail,
Alerts will chirp when silence breaks,
The watcher automation shakes,
Health metrics flow—no more to fail! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 14.29% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(server): watcher/scheduler health metrics + ServiceMonitor/PrometheusRule' clearly and concisely describes the main changes: adding health metrics and Kubernetes monitoring resources for the watcher and scheduler components. |
| Description check | ✅ Passed | The PR description comprehensively covers all required template sections: a clear 'Why' explaining the context and motivation, a detailed 'What' describing the metrics and chart changes, validation details, and sequencing notes. It exceeds minimum requirements. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


buremba · 2026-05-25T20:37:55Z

bug_free 86, simplicity 88, slop 0, bugs 0, 0 blockers

Read diff; script suites passed (typecheck/unit/integration exit 0). Explored helm template with ServiceMonitor/PrometheusRule enabled -> OK, and imported metrics module to verify new counter/gauge text. Did not boot full server.

Full verdict JSON

{
  "bug_free_confidence": 86,
  "bugs": 0,
  "slop": 0,
  "simplicity": 88,
  "blockers": [],
  "change_type": "feat",
  "behavior_change_risk": "low",
  "tests_adequate": true,
  "suggested_fixes": [],
  "notes": "Read diff; script suites passed (typecheck/unit/integration exit 0). Explored helm template with ServiceMonitor/PrometheusRule enabled -> OK, and imported metrics module to verify new counter/gauge text. Did not boot full server.",
  "categories": {
    "src": 99,
    "tests": 0,
    "docs": 0,
    "config": 105,
    "deps": 0,
    "migrations": 0,
    "ci": 0,
    "generated": 0
  }
}

Local review gate — branch protection can require the pi-review commit status. See docs/REVIEW_SCHEMA.md.

…theusRule

Observability for the silent failure mode behind the 12-day watcher outage (lobu#1046): the app was healthy but watcher-automation produced zero successful ticks, with no alert.

Metrics (the Prometheus module was vestigial: registered but never incremented):
- lobu_scheduled_job_runs_total{job,outcome}: incremented per cron tick in the TaskScheduler. The scheduler heartbeat.
- lobu_watcher_automation_phase_failures_total{phase}: per failed phase in runWatcherAutomationTick. Needed because the hardened tick swallows phase errors, so the scheduler-level counter alone can't see internal failures.
- lobu_watcher_runs_created_total / lobu_watchers_unrunnable (gauge).

Per-pod in-memory counters are the correct Prometheus model: each pod's /metrics is scraped and summed; rate()/increase() handle restart resets. No cross-replica shared state.

Chart (charts/lobu, off by default; prod overlay enables):
- ServiceMonitor (templates/servicemonitor.yaml) scraping the app /metrics. Adds app.kubernetes.io/component=api to the app Service so the monitor targets it and not the embeddings Service (both share name+instance labels).
- PrometheusRule (templates/prometheusrule.yaml): WatcherAutomationSilent (critical, dead-man's switch that alerts on ABSENCE of successful ticks, the actual failure mode), WatcherAutomationPhaseFailing + LobuScheduledJobFailing (warning). severity labels route to #devops via the existing slack-devops AlertmanagerConfig, so no Alertmanager change is needed.

Both gated on .Values.metrics.*; require release=kube-prometheus-stack label (Prometheus selector) via additionalLabels.

Chart.yaml version intentionally not bumped: release-please owns it; templates ship on the next release.

codecov-commenter · 2026-05-25T21:20:39Z


Codecov Report

✅ All modified and coverable lines are covered by tests.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/lobu/templates/prometheusrule.yaml`:
- Around line 49-56: The alert currently uses the expression
sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m]))
by (job) > 0 but the annotation/summary claims the job "threw on every run";
update the annotations in prometheusrule.yaml (the summary and description
fields) to accurately reflect the query (e.g., "had at least one error in the
last 15m" / "The {{`{{ $labels.job }}`}} cron task had one or more error(s) over
the past 15m.") or, if you prefer stricter behavior, tighten the expression
(replace > 0 with a comparison to total runs to detect if errors == total runs)
so the alert text matches the metric logic.

In `@packages/server/src/scheduled/task-scheduler.ts`:
- Around line 250-269: The current try/catch around reg.handler only counts
handler errors; extend the try to begin before the pre-handler cron
seeding/validation code so any failure prior to calling reg.handler is also
caught and triggers incrementCounter('lobu_scheduled_job_runs_total', { job:
data.name, outcome: 'error' }). Keep the success incrementCounter call after
reg.handler completes, preserve rethrowing the caught error, and update
references to reg.handler, incrementCounter and lobu_scheduled_job_runs_total
accordingly.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 441c69eb-157c-4bc5-8015-83bd3bf81295

📥 Commits

Reviewing files that changed from the base of the PR and between c524b42 and 7b4d36c.

📒 Files selected for processing (7)

charts/lobu/templates/prometheusrule.yaml
charts/lobu/templates/service.yaml
charts/lobu/templates/servicemonitor.yaml
charts/lobu/values.yaml
packages/server/src/gateway/metrics/prometheus.ts
packages/server/src/scheduled/task-scheduler.ts
packages/server/src/watchers/automation.ts

coderabbitai · 2026-05-25T21:24:33Z

+            sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m])) by (job) > 0
+          for: 15m
+          labels:
+            severity: warning
+            service: lobu
+          annotations:
+            summary: "Lobu scheduled job {{`{{ $labels.job }}`}} is failing"
+            description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m."


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Alert annotation overstates failure frequency.

Line 49 alerts on any error in 15m (> 0), but Line 56 says the job “threw on every run.” Please align the annotation text (or tighten the expression).

💡 Suggested patch

-            description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m."
+            description: "The {{`{{ $labels.job }}`}} cron task threw at least once over 15m."



coderabbitai · 2026-05-25T21:24:33Z

+    // Per-tick outcome counter — the scheduler heartbeat that backs the
+    // "watcher-automation silent / failing" alerts. Counts every dispatched
+    // task; alerts filter by job name. Re-throw is preserved so the runs-queue
+    // retry path is unchanged.
+    try {
+      await reg.handler({
+        payload: data.payload,
+        taskRunId: Number(job.id),
+      });
+      incrementCounter('lobu_scheduled_job_runs_total', {
+        job: data.name,
+        outcome: 'success',
+      });
+    } catch (err) {
+      incrementCounter('lobu_scheduled_job_runs_total', {
+        job: data.name,
+        outcome: 'error',
+      });
+      throw err;
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Count pre-handler failures in lobu_scheduled_job_runs_total as errors.

Line 254 only wraps reg.handler(...). If cron seeding fails earlier (Line 241-248), the run fails without incrementing outcome: "error", which leaves an alerting blind spot.

Proposed fix

-    if (reg.cron) {
-      const fromTick = data.__scheduledTick
-        ? new Date(data.__scheduledTick)
-        : new Date();
-      // Add 1ms so nextRunAt skips past the current tick when fromTick falls
-      // exactly on a cron boundary.
-      await this.seedNextCronTick(reg, new Date(fromTick.getTime() + 1));
-    }
-
-    // Per-tick outcome counter — the scheduler heartbeat that backs the
-    // "watcher-automation silent / failing" alerts. Counts every dispatched
-    // task; alerts filter by job name. Re-throw is preserved so the runs-queue
-    // retry path is unchanged.
     try {
+      if (reg.cron) {
+        const fromTick = data.__scheduledTick
+          ? new Date(data.__scheduledTick)
+          : new Date();
+        // Add 1ms so nextRunAt skips past the current tick when fromTick falls
+        // exactly on a cron boundary.
+        await this.seedNextCronTick(reg, new Date(fromTick.getTime() + 1));
+      }
+
+      // Per-tick outcome counter — the scheduler heartbeat that backs the
+      // "watcher-automation silent / failing" alerts. Counts every dispatched
+      // task; alerts filter by job name. Re-throw is preserved so the runs-queue
+      // retry path is unchanged.
       await reg.handler({
         payload: data.payload,
         taskRunId: Number(job.id),
       });
       incrementCounter('lobu_scheduled_job_runs_total', {


Base automatically changed from feat/fix-watcher-reconcile-array to main May 25, 2026 21:16

buremba force-pushed the feat/watcher-alerting branch from 9080898 to 7b4d36c Compare May 25, 2026 21:16

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

buremba merged commit 60c6e73 into main May 25, 2026
39 checks passed

buremba deleted the feat/watcher-alerting branch May 25, 2026 21:25

This was referenced May 25, 2026

chore(main): release lobu 9.4.0 #1031

Merged

fix(chart+metrics): ServiceMonitor path /lobu/metrics + rename label job→task #1053

Merged
