feat(server): watcher/scheduler health metrics + ServiceMonitor/PrometheusRule#1047
Conversation
📝 WalkthroughWalkthroughThis PR adds end-to-end Prometheus monitoring to the Lobu application. It introduces metric emission in watcher automation and scheduled job modules, exposes these metrics via Kubernetes ServiceMonitor, and defines alerting rules for health detection, all controlled via Helm chart values. ChangesPrometheus Monitoring and Health Alerting
Sequence Diagram(s)sequenceDiagram
participant Application as Application<br/>WatcherAutomation<br/>TaskScheduler
participant MetricsAPI as Prometheus<br/>Module
participant ServiceMonitor as ServiceMonitor<br/>Resource
participant Prometheus as Prometheus<br/>Server
participant PrometheusRule as PrometheusRule<br/>Alerts
Application->>MetricsAPI: incrementCounter()/setGauge()
MetricsAPI->>MetricsAPI: Update in-memory metrics
ServiceMonitor->>MetricsAPI: Scrape /metrics endpoint
MetricsAPI-->>ServiceMonitor: Return metric data
ServiceMonitor->>Prometheus: Forward scraped metrics
Prometheus->>Prometheus: Store time series
Prometheus->>PrometheusRule: Evaluate alert rules (15m window)
PrometheusRule-->>Prometheus: Fire WatcherAutomationSilent<br/>WatcherAutomationPhaseFailing<br/>LobuScheduledJobFailing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
|
bug_free 86, simplicity 88, slop 0, bugs 0, 0 blockers Read diff; script suites passed (typecheck/unit/integration exit 0). Explored helm template with ServiceMonitor/PrometheusRule enabled -> OK, and imported metrics module to verify new counter/gauge text. Did not boot full server. Full verdict JSON{
"bug_free_confidence": 86,
"bugs": 0,
"slop": 0,
"simplicity": 88,
"blockers": [],
"change_type": "feat",
"behavior_change_risk": "low",
"tests_adequate": true,
"suggested_fixes": [],
"notes": "Read diff; script suites passed (typecheck/unit/integration exit 0). Explored helm template with ServiceMonitor/PrometheusRule enabled -> OK, and imported metrics module to verify new counter/gauge text. Did not boot full server.",
"categories": {
"src": 99,
"tests": 0,
"docs": 0,
"config": 105,
"deps": 0,
"migrations": 0,
"ci": 0,
"generated": 0
}
}Local review gate — branch protection can require the |
…theusRule
Observability for the silent failure mode behind the 12-day watcher outage
(lobu#1046): the app was healthy but watcher-automation produced zero successful
ticks, with no alert.
Metrics (the Prometheus module was vestigial — registered but never incremented):
- lobu_scheduled_job_runs_total{job,outcome} — incremented per cron tick in the
TaskScheduler. The scheduler heartbeat.
- lobu_watcher_automation_phase_failures_total{phase} — per failed phase in
runWatcherAutomationTick. Needed because the hardened tick swallows phase
errors, so the scheduler-level counter alone can't see internal failures.
- lobu_watcher_runs_created_total / lobu_watchers_unrunnable (gauge).
Per-pod in-memory counters are the correct Prometheus model: each pod's /metrics
is scraped and summed; rate()/increase() handle restart resets. No cross-replica
shared state.
Chart (charts/lobu, off by default; prod overlay enables):
- ServiceMonitor (templates/servicemonitor.yaml) scraping the app /metrics. Adds
app.kubernetes.io/component=api to the app Service so the monitor targets it
and not the embeddings Service (both share name+instance labels).
- PrometheusRule (templates/prometheusrule.yaml): WatcherAutomationSilent
(critical, dead-man's-switch — alerts on ABSENCE of successful ticks, the
actual failure mode), WatcherAutomationPhaseFailing + LobuScheduledJobFailing
(warning). severity labels route to #devops via the existing slack-devops
AlertmanagerConfig — no Alertmanager change needed.
Both gated on .Values.metrics.*; require release=kube-prometheus-stack label
(Prometheus selector) via additionalLabels.
Chart.yaml version intentionally not bumped — release-please owns it; templates
ship on the next release.
9080898 to
7b4d36c
Compare
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@charts/lobu/templates/prometheusrule.yaml`:
- Around line 49-56: The alert currently uses the expression
sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m]))
by (job) > 0 but the annotation/summary claims the job "threw on every run";
update the annotations in prometheusrule.yaml (the summary and description
fields) to accurately reflect the query (e.g., "had at least one error in the
last 15m" / "The {{`{{ $labels.job }}`}} cron task had one or more error(s) over
the past 15m.") or, if you prefer stricter behavior, tighten the expression
(replace > 0 with a comparison to total runs to detect if errors == total runs)
so the alert text matches the metric logic.
In `@packages/server/src/scheduled/task-scheduler.ts`:
- Around line 250-269: The current try/catch around reg.handler only counts
handler errors; extend the try to begin before the pre-handler cron
seeding/validation code so any failure prior to calling reg.handler is also
caught and triggers incrementCounter('lobu_scheduled_job_runs_total', { job:
data.name, outcome: 'error' }). Keep the success incrementCounter call after
reg.handler completes, preserve rethrowing the caught error, and update
references to reg.handler, incrementCounter and lobu_scheduled_job_runs_total
accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 441c69eb-157c-4bc5-8015-83bd3bf81295
📒 Files selected for processing (7)
charts/lobu/templates/prometheusrule.yamlcharts/lobu/templates/service.yamlcharts/lobu/templates/servicemonitor.yamlcharts/lobu/values.yamlpackages/server/src/gateway/metrics/prometheus.tspackages/server/src/scheduled/task-scheduler.tspackages/server/src/watchers/automation.ts
| sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m])) by (job) > 0 | ||
| for: 15m | ||
| labels: | ||
| severity: warning | ||
| service: lobu | ||
| annotations: | ||
| summary: "Lobu scheduled job {{`{{ $labels.job }}`}} is failing" | ||
| description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m." |
There was a problem hiding this comment.
Alert annotation overstates failure frequency.
Line 49 alerts on any error in 15m (> 0), but Line 56 says the job “threw on every run.” Please align the annotation text (or tighten the expression).
💡 Suggested patch
- description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m."
+ description: "The {{`{{ $labels.job }}`}} cron task threw at least once over 15m."📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m])) by (job) > 0 | |
| for: 15m | |
| labels: | |
| severity: warning | |
| service: lobu | |
| annotations: | |
| summary: "Lobu scheduled job {{`{{ $labels.job }}`}} is failing" | |
| description: "The {{`{{ $labels.job }}`}} cron task threw on every run over 15m." | |
| sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m])) by (job) > 0 | |
| for: 15m | |
| labels: | |
| severity: warning | |
| service: lobu | |
| annotations: | |
| summary: "Lobu scheduled job {{`{{ $labels.job }}`}} is failing" | |
| description: "The {{`{{ $labels.job }}`}} cron task threw at least once over 15m." |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@charts/lobu/templates/prometheusrule.yaml` around lines 49 - 56, The alert
currently uses the expression
sum(increase(lobu_scheduled_job_runs_total{job=~"watcher-automation|check-stalled-executions",outcome="error"}[15m]))
by (job) > 0 but the annotation/summary claims the job "threw on every run";
update the annotations in prometheusrule.yaml (the summary and description
fields) to accurately reflect the query (e.g., "had at least one error in the
last 15m" / "The {{`{{ $labels.job }}`}} cron task had one or more error(s) over
the past 15m.") or, if you prefer stricter behavior, tighten the expression
(replace > 0 with a comparison to total runs to detect if errors == total runs)
so the alert text matches the metric logic.
| // Per-tick outcome counter — the scheduler heartbeat that backs the | ||
| // "watcher-automation silent / failing" alerts. Counts every dispatched | ||
| // task; alerts filter by job name. Re-throw is preserved so the runs-queue | ||
| // retry path is unchanged. | ||
| try { | ||
| await reg.handler({ | ||
| payload: data.payload, | ||
| taskRunId: Number(job.id), | ||
| }); | ||
| incrementCounter('lobu_scheduled_job_runs_total', { | ||
| job: data.name, | ||
| outcome: 'success', | ||
| }); | ||
| } catch (err) { | ||
| incrementCounter('lobu_scheduled_job_runs_total', { | ||
| job: data.name, | ||
| outcome: 'error', | ||
| }); | ||
| throw err; | ||
| } |
There was a problem hiding this comment.
Count pre-handler failures in lobu_scheduled_job_runs_total as errors.
Line 254 only wraps reg.handler(...). If cron seeding fails earlier (Line 241-248), the run fails without incrementing outcome: "error", which leaves an alerting blind spot.
Proposed fix
- if (reg.cron) {
- const fromTick = data.__scheduledTick
- ? new Date(data.__scheduledTick)
- : new Date();
- // Add 1ms so nextRunAt skips past the current tick when fromTick falls
- // exactly on a cron boundary.
- await this.seedNextCronTick(reg, new Date(fromTick.getTime() + 1));
- }
-
- // Per-tick outcome counter — the scheduler heartbeat that backs the
- // "watcher-automation silent / failing" alerts. Counts every dispatched
- // task; alerts filter by job name. Re-throw is preserved so the runs-queue
- // retry path is unchanged.
try {
+ if (reg.cron) {
+ const fromTick = data.__scheduledTick
+ ? new Date(data.__scheduledTick)
+ : new Date();
+ // Add 1ms so nextRunAt skips past the current tick when fromTick falls
+ // exactly on a cron boundary.
+ await this.seedNextCronTick(reg, new Date(fromTick.getTime() + 1));
+ }
+
+ // Per-tick outcome counter — the scheduler heartbeat that backs the
+ // "watcher-automation silent / failing" alerts. Counts every dispatched
+ // task; alerts filter by job name. Re-throw is preserved so the runs-queue
+ // retry path is unchanged.
await reg.handler({
payload: data.payload,
taskRunId: Number(job.id),
});
incrementCounter('lobu_scheduled_job_runs_total', {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/server/src/scheduled/task-scheduler.ts` around lines 250 - 269, The
current try/catch around reg.handler only counts handler errors; extend the try
to begin before the pre-handler cron seeding/validation code so any failure
prior to calling reg.handler is also caught and triggers
incrementCounter('lobu_scheduled_job_runs_total', { job: data.name, outcome:
'error' }). Keep the success incrementCounter call after reg.handler completes,
preserve rethrowing the caught error, and update references to reg.handler,
incrementCounter and lobu_scheduled_job_runs_total accordingly.
Why
Follow-up to #1046. That bug was a silent 12-day outage — the app was healthy but
watcher-automationproduced zero successful ticks and nothing alerted. This adds the metrics + alert rules so it can't happen silently again. Stacked on #1046 (the phase-failure metric hooksrunWatcherAutomationTick).What
Metrics (
gateway/metrics/prometheus.ts— note: the module was vestigial, registering metrics that were never incremented):lobu_scheduled_job_runs_total{job,outcome}— per cron tick inTaskScheduler.dispatch(the scheduler heartbeat).lobu_watcher_automation_phase_failures_total{phase}— per failed phase inrunWatcherAutomationTick. Needed because the hardened tick swallows phase errors, so the scheduler-level counter shows success even when reconcile/etc. fail internally.lobu_watcher_runs_created_total,lobu_watchers_unrunnable(gauge).Per-pod in-memory counters are the correct Prometheus model — each pod's
/metricsis scraped and summed;rate()/increase()handle restart resets. No cross-replica shared state.Chart (
charts/lobu, both default OFF; the prod overlay enables them):ServiceMonitorscraping the app/metrics. Addsapp.kubernetes.io/component=apito the app Service so the monitor targets it, not the embeddings Service (both share name+instance labels).PrometheusRule:WatcherAutomationSilent(critical) — dead-man's-switch: fires on the absence of successful ticks (... or on() vector(0)) == 0), the actual failure mode.WatcherAutomationPhaseFailing,LobuScheduledJobFailing(warning).severitylabels route to #devops via the already-liveslack-devopsAlertmanagerConfig (verified loaded in the running Alertmanager) — no Alertmanager/webhook change needed. Both resources carryrelease: kube-prometheus-stack(Prometheus's selector) viaadditionalLabels.Validation
runWatcherAutomationTick).helm lint+helm template(with metrics enabled) render correctly; ServiceMonitor selector is unique to the app, both resources labeled for the cluster's Prometheus.Sequencing / not in this PR
Chart.yamlversion intentionally not hand-bumped (release-please owns it;reconcileStrategy: ChartVersionmeans the templates deploy on the next release).metrics.serviceMonitor/prometheusRulein thesummaries-prodvalues is a separate owletto-deploy PR, to merge after the release that ships these templates.Summary by CodeRabbit
Release Notes