feat(chart): real-worker smoke test gates Helm upgrades on actual run completion by buremba · Pull Request #878 · lobu-ai/lobu

buremba · 2026-05-18T15:35:21Z

Summary

The current charts/lobu/templates/smoke-test-job.yaml only curls /api/health. Every recent regression — Phase-5 env flip, runs denormalize/revert, JobEventSchema, today's action_input JSONB-shape bug — passed that check cleanly while workers couldn't process a single chat message in prod.

This PR adds a Phase-3 worker-smoke gate that drives an end-to-end run through the runs-queue + worker-spawn pipeline:

1. New internal endpoint POST /api/internal/smoke/dispatch (auth: shared bearer SMOKE_TEST_TOKEN, constant-time compare). Inserts a synthetic chat_message run into public.runs with platform="smoke" and an explicit smoke-org / smoke-test / smoke-<release>-<revision> namespace. Refuses to dispatch unless the caller passes all three identifiers AND the conversationId carries the smoke- prefix.
2. The runs-queue MessageConsumer (running in the app pod) claims the row and spawns a worker subprocess; the worker runs end-to-end and, on terminal cleanup, writes a row into agent_transcript_snapshot.
3. The smoke Job polls that row over the existing DATABASE_URL. If terminal_status='completed' materialises within workerSmoke.timeoutSeconds, the Job exits 0; otherwise it fails the deploy and Helm rolls back.
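The polling step above can be sketched in TypeScript as follows. This is a minimal illustration under stated assumptions, not the chart's actual script: `pollForCompletion`, `fetchStatus`, and the injected `sleep`/`now` hooks are hypothetical names, the real query against agent_transcript_snapshot is left to the caller, and treating any non-"completed" terminal status as an immediate failure is an assumption.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the chart's code.
type TerminalStatus = string | null; // null = snapshot row not written yet

interface PollOptions {
  fetchStatus: () => Promise<TerminalStatus>; // caller runs the real query
  timeoutMs: number; // stands in for workerSmoke.timeoutSeconds
  intervalMs: number; // stands in for the polling interval
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  now?: () => number; // injectable for tests
}

async function pollForCompletion(opts: PollOptions): Promise<boolean> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((r) => setTimeout(r, ms)));
  const now = opts.now ?? (() => Date.now());
  const deadline = now() + opts.timeoutMs;

  while (true) {
    const status = await opts.fetchStatus();
    if (status === "completed") return true; // gate passes
    if (status !== null) return false; // assumed: other terminal statuses fail the gate
    if (now() >= deadline) return false; // timed out: fail the deploy
    await sleep(opts.intervalMs);
  }
}
```

Injecting `sleep` and `now` keeps the loop deterministic under test; a real caller would omit both and rely on the defaults.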

This makes the "gateway boots fine but workers are broken" regression class un-shippable.

What's intentionally not here

No platform integration. The endpoint synthesises the message directly into the runs queue — no Slack/Telegram bot needed. The path under test is worker spawn + run completion, not platform ingress.
workerSmoke.enabled defaults to FALSE. Per the issue notes, do not flip it to true in the helmrelease until the sibling fix/runs-action-input-jsonb-shape PR has landed and prod has been verified to actually process chats. Otherwise this gate would intentionally break every deploy until that fix lands.
Operator preprovisions the synthetic agent once. Documented inline in values.yaml. The chart can't preprovision a row in someone else's DB.
Mirror lands in owletto fork in same PR cycle. Prod's Flux pulls from lobu-ai/owletto, so the chart change must mirror — done in lobu-ai/owletto#chore/worker-smoke-test-gate.

Files

packages/server/src/gateway/routes/internal/smoke.ts — new internal endpoint
packages/server/src/index.ts — mount before the mcpAuth middleware
packages/server/src/gateway/__tests__/smoke-dispatch.test.ts — auth + validation + insert + idempotency
charts/lobu/values.yaml — new releaseGates.smokeTest.workerSmoke block
charts/lobu/templates/smoke-test-job.yaml — phase-3 dispatch + poll loop, gated by workerSmoke.enabled

Test plan

make typecheck clean
make build-packages clean
bun test packages/server/src/gateway/__tests__/smoke-dispatch.test.ts — 13 tests pass (auth contract, missing-field rejection, prefix enforcement, run insert, idempotency)
helm template charts/lobu (defaults) — renders, no worker-smoke env vars emitted
helm template charts/lobu --set releaseGates.smokeTest.workerSmoke.enabled=true — renders, envFrom + WORKER_SMOKE_* env present, conv-id and dispatch loop present
Post-merge: preprovision smoke agent in a staging org, flip workerSmoke.enabled=true in the staging helmrelease, verify a known-good chart upgrade passes the gate and a deliberately-broken one (e.g. revert the action_input fix) fails it.

Summary by CodeRabbit

New Features
- Optional extended "worker smoke" phase that dispatches synthetic chat runs, polls for completion, and can fail rollouts if not finished within configured timeouts.
- Internal smoke dispatch endpoint to enqueue and manage smoke runs.
Documentation
- New configuration for worker smoke: enable flag, conversation ID prefix, timeouts, polling interval, and required secret keys.
Tests
- Comprehensive tests for dispatch, auth, validation, host/header defenses, idempotency, and enqueueing.

coderabbitai · 2026-05-18T15:35:35Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 83a4483f-7da2-4da3-8180-39d860a56c39

📥 Commits

Reviewing files that changed from the base of the PR and between 884638e and 52058ab.

📒 Files selected for processing (6)

charts/lobu/templates/deployment.yaml
charts/lobu/templates/smoke-test-job.yaml
charts/lobu/values.yaml
packages/server/src/gateway/__tests__/smoke-dispatch.test.ts
packages/server/src/gateway/routes/internal/smoke.ts
packages/server/src/index.ts

🚧 Files skipped from review as they are similar to previous changes (4)

charts/lobu/templates/smoke-test-job.yaml
packages/server/src/gateway/__tests__/smoke-dispatch.test.ts
packages/server/src/index.ts
charts/lobu/values.yaml

📝 Walkthrough

Walkthrough

Adds a worker smoke-test capability: Helm values + Job changes to run an optional worker-smoke phase, a new internal POST /api/internal/smoke/dispatch endpoint (auth, idempotent run enqueue, pg_notify), tests, and job-side polling of agent_transcript_snapshot for terminal completion.

Changes

Worker Smoke Test End-to-End

| Layer / File(s) | Summary |
| --- | --- |
| Helm Job template and worker-smoke script `charts/lobu/templates/smoke-test-job.yaml` | Computes combined job timeout, sets `activeDeadlineSeconds`, conditionally injects `SMOKE_TEST_TOKEN` and `WORKER_SMOKE_*` env vars, and appends phase-3 script that dispatches a smoke run and polls `public.agent_transcript_snapshot` until `terminal_status='completed'` or timeout/failure. |
| Helm values and configuration `charts/lobu/values.yaml` | Adds `releaseGates.smokeTest.workerSmoke` block (default `enabled: false`) documenting enablement, `conversationIdPrefix`, `timeoutSeconds`, `intervalSeconds`, and required Secret keys (`SMOKE_TEST_TOKEN`, `SMOKE_TEST_AGENT_ID`, `SMOKE_TEST_ORG_ID`). |
| Deployment env injection `charts/lobu/templates/deployment.yaml` | Conditionally injects `SMOKE_TEST_ALLOWED_HOST` into the main Deployment container when worker smoke is enabled, set to the in-cluster app Service DNS name. |
| Smoke dispatch endpoint implementation and routing `packages/server/src/gateway/routes/internal/smoke.ts`, `packages/server/src/index.ts` | Adds `createSmokeRoutes()` registering `POST /dispatch` with constant-time Bearer token auth (env-pinned), forwarded-header rejection, `conversationId` validation enforcing `smoke-` prefix, idempotent insert into `public.runs` with `ON CONFLICT` handling, fallback SELECT for existing runId, triggers `pg_notify`, and returns `{ runId, idempotencyKey }`. Route is mounted before `mcpAuth`. |
| Dispatch endpoint test suite `packages/server/src/gateway/__tests__/smoke-dispatch.test.ts` | Adds tests covering harness setup, auth/availability (401/503), forwarded-header defenses (403), input validation (400 cases), successful enqueue and persisted run fields (pinned env agent/org, platform `smoke`, messageText defaulting), and idempotency behavior. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

lobu-ai/lobu#871: Related work that affects whether the worker-smoke phase can observe agent_transcript_snapshot rows by changing the session store behavior used when writing snapshots.

Poem

🐰 I dispatch a tiny chat at dawn,
Tokens clutched like carrots on a lawn,
Postgres whispers when it's done,
Polling waits until it's won,
Rollouts pass and bunnies yawn. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a worker-smoke test gate to Helm upgrades that validates actual run completion, which directly matches the core objective of this PR. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, covering Summary, Files, Test plan sections with checkmarks, and detailed Notes. All required sections from the template are present and complete. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chore/worker-smoke-test-gate

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

codecov-commenter · 2026-05-18T15:37:36Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/lobu/templates/smoke-test-job.yaml`:
- Around line 34-50: The smoke-test Job currently only pulls DATABASE_URL from
.Values.secrets.stringData or the deployment secret (via include
"lobu.secretName") and can accidentally skip Phase 3 when requiredSchema or
DATABASE_URL is missing; update the env/envFrom logic to also read DATABASE_URL
from .Values.database.existingSecret (and/or prefer that secret when present)
and make the Job fail-closed when workerSmoke.enabled/$workerSmokeEnabled is
true and DATABASE_URL is not available (remove or guard the early `exit 0` path
used around requiredSchema checks so the Job errors out instead of exiting
successfully); touch the template bits referencing $secretName, include
"lobu.secretName", .Values.secrets.create,
.Values.secrets.stringData["DATABASE_URL"], .Values.database.existingSecret, and
the requiredSchema/Phase 3 exit logic so the smoke gate cannot be bypassed when
workerSmoke is enabled.

In `@charts/lobu/values.yaml`:
- Around line 291-297: The values.yaml exposes conversationIdPrefix which the
server (packages/server/src/gateway/routes/internal/smoke.ts) hard-validates to
start with "smoke-", causing 400s if operators change it; remove the
conversationIdPrefix knob from charts/lobu/values.yaml (or render it only when
equal to the default "smoke-") so Helm cannot set a non-default value, and/or
add a template-time check that fails the release if conversationIdPrefix !=
"smoke-" to ensure workers always send a prefix accepted by the smoke route.

In `@packages/server/src/gateway/routes/internal/smoke.ts`:
- Around line 100-110: The code calls .trim() on body fields that may be
non-strings which can throw; before trimming the fields from the parsed body
(from c.req.json()), validate that body.agentId, body.organizationId,
body.conversationId, and body.messageText are strings (and not null/undefined)
and reject with c.json({ error: "Invalid JSON body" }, 400) or a more specific
400 when they are not strings; then safely call .trim() (or provide defaults
like "smoke-test ping" for messageText) and assign to agentId, organizationId,
conversationId, messageText—update the SmokeDispatchBody handling so non-string
values do not reach .trim().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a0d241b9-ce0e-4ceb-843f-5314dfcd52cf

📥 Commits

Reviewing files that changed from the base of the PR and between 191d075 and e652ba0.

📒 Files selected for processing (5)

charts/lobu/templates/smoke-test-job.yaml
charts/lobu/values.yaml
packages/server/src/gateway/__tests__/smoke-dispatch.test.ts
packages/server/src/gateway/routes/internal/smoke.ts
packages/server/src/index.ts

coderabbitai · 2026-05-18T15:41:46Z

+    let body: SmokeDispatchBody;
+    try {
+      body = (await c.req.json()) as SmokeDispatchBody;
+    } catch {
+      return c.json({ error: "Invalid JSON body" }, 400);
+    }
+
+    const agentId = body.agentId?.trim();
+    const organizationId = body.organizationId?.trim();
+    const conversationId = body.conversationId?.trim();
+    const messageText = body.messageText?.trim() || "smoke-test ping";


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-string fields before calling .trim().

await c.req.json() can return null or objects with non-string values, and { "agentId": 1 } currently throws here before your 400-path runs. That turns a bad request into a 500.

💡 Proposed fix

-    let body: SmokeDispatchBody;
+    let body: unknown;
     try {
-      body = (await c.req.json()) as SmokeDispatchBody;
+      body = await c.req.json();
     } catch {
       return c.json({ error: "Invalid JSON body" }, 400);
     }
-    const agentId = body.agentId?.trim();
-    const organizationId = body.organizationId?.trim();
-    const conversationId = body.conversationId?.trim();
-    const messageText = body.messageText?.trim() || "smoke-test ping";
+    if (!body || typeof body !== "object" || Array.isArray(body)) {
+      return c.json({ error: "Request body must be a JSON object" }, 400);
+    }
+
+    const parsed = body as SmokeDispatchBody;
+    const agentId =
+      typeof parsed.agentId === "string" ? parsed.agentId.trim() : undefined;
+    const organizationId =
+      typeof parsed.organizationId === "string"
+        ? parsed.organizationId.trim()
+        : undefined;
+    const conversationId =
+      typeof parsed.conversationId === "string"
+        ? parsed.conversationId.trim()
+        : undefined;
+    const messageText =
+      typeof parsed.messageText === "string" && parsed.messageText.trim().length > 0
+        ? parsed.messageText.trim()
+        : "smoke-test ping";

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/gateway/routes/internal/smoke.ts` around lines 100 - 110, The code calls .trim() on body fields that may be non-strings which can throw; before trimming the fields from the parsed body (from c.req.json()), validate that body.agentId, body.organizationId, body.conversationId, and body.messageText are strings (and not null/undefined) and reject with c.json({ error: "Invalid JSON body" }, 400) or a more specific 400 when they are not strings; then safely call .trim() (or provide defaults like "smoke-test ping" for messageText) and assign to agentId, organizationId, conversationId, messageText—update the SmokeDispatchBody handling so non-string values do not reach .trim().
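The guard proposed in this comment can also be factored into a small helper. The sketch below is only illustrative: `trimmedString` and `parseSmokeBody` are hypothetical names that do not appear in the PR, and only two of the body fields are shown to keep it short.

```typescript
// Hypothetical helper: returns a trimmed string, or undefined for any non-string input.
function trimmedString(value: unknown): string | undefined {
  return typeof value === "string" ? value.trim() : undefined;
}

interface ParsedSmokeBody {
  conversationId: string | undefined;
  messageText: string;
}

// Returns null when the payload is not a plain JSON object
// (the route handler would answer 400 in that case).
function parseSmokeBody(body: unknown): ParsedSmokeBody | null {
  if (!body || typeof body !== "object" || Array.isArray(body)) return null;
  const fields = body as Record<string, unknown>;
  return {
    conversationId: trimmedString(fields.conversationId),
    // Empty or non-string text falls back to the default ping message.
    messageText: trimmedString(fields.messageText) || "smoke-test ping",
  };
}
```

Because the `typeof` check runs before `.trim()`, a payload such as `{ "conversationId": 1 }` yields `undefined` for that field instead of throwing, so the existing missing-field 400 path can handle it.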

buremba · 2026-05-18T15:43:55Z

Codex round 1 flagged two real issues — both fixed:

1. Ingress-bypass defense. The route is mounted on the main app which is exposed by the public Ingress. Bearer auth alone isn't enough: a leaked token could be replayed from outside. Fix: reject any request that carries x-forwarded-* / forwarded / x-real-ip headers. The in-cluster smoke Job hits <release>-app via cluster DNS → ClusterIP → Pod, which never traverses ingress, so the headers are absent on legitimate calls. New tests cover all four header variants.

2. Server-side namespace pinning. The first draft accepted caller-supplied agentId/organizationId, which was convention-only isolation. Fix: pin both from process.env.SMOKE_TEST_AGENT_ID / SMOKE_TEST_ORG_ID (loaded into the app pod via the deployment Secret). Body fields with those names are silently ignored. A leaked SMOKE_TEST_TOKEN can now only dispatch against the env-configured smoke namespace — structurally unable to target real tenants. Added test caller-supplied agentId/organizationId in body are silently ignored.
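The header check from item 1 can be sketched roughly as below. This is an assumption-laden illustration rather than the route's actual code: `hasProxyHeaders` is a hypothetical name, and the function works on a plain list of header names so it does not depend on any particular framework API.

```typescript
// Exact header names that indicate the request passed through a proxy or ingress.
const PROXY_HEADER_EXACT = new Set(["forwarded", "x-real-ip"]);

// Hypothetical helper: true if any header name marks the request as proxied.
// Header names are compared case-insensitively.
function hasProxyHeaders(headerNames: readonly string[]): boolean {
  return headerNames.some((name) => {
    const lower = name.toLowerCase();
    return lower.startsWith("x-forwarded-") || PROXY_HEADER_EXACT.has(lower);
  });
}
```

A handler built on this would return 403 whenever `hasProxyHeaders(...)` is true, before looking at the bearer token.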

Chart wiring updated to match: smoke Job no longer passes the identifiers in the body; the in-cluster app pod resolves them from its own envFrom-loaded Secret.
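As a rough sketch of the pinning described in item 2, the identifiers can be resolved from the environment alone, so the request body never enters the picture. `resolveSmokeNamespace` and `SmokeNamespace` are hypothetical names; only the two environment variable names come from this PR, and how the route responds to a missing value is left to the caller.

```typescript
interface SmokeNamespace {
  agentId: string;
  organizationId: string;
}

// Hypothetical resolver: identifiers come only from the environment,
// never from caller-supplied input.
function resolveSmokeNamespace(
  env: Record<string, string | undefined>,
): SmokeNamespace | null {
  const agentId = env.SMOKE_TEST_AGENT_ID?.trim();
  const organizationId = env.SMOKE_TEST_ORG_ID?.trim();
  // Missing or blank configuration: the caller decides how to report it.
  if (!agentId || !organizationId) return null;
  return { agentId, organizationId };
}
```

In the app pod this would be called as `resolveSmokeNamespace(process.env)`. Since the function takes no body argument, a leaked token cannot redirect a dispatch to another tenant through request fields.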

Total tests: 18 pass (was 13). Helm template render still clean both modes.

Latest commit: 884638e8.

… completion

Adds a Phase-3 worker-smoke gate to the post-upgrade smoke Job: the Job POSTs to a new internal endpoint that inserts a synthetic chat_message run, then polls `agent_transcript_snapshot` for that run's terminal status. If the snapshot doesn't materialise with `terminal_status='completed'` inside the configured window, the deploy fails and Helm rolls back to the previous chart.

The current smoke-test-job only curls /api/health: every recent regression (Phase 5 env flip, runs denormalize/revert, JobEventSchema, action_input JSONB shape) passed that check while workers couldn't process a single chat message. Driving an end-to-end run through the same runs-queue + worker-spawn pipeline that real traffic uses catches that class.

Backend: POST /api/internal/smoke/dispatch (auth: bearer SMOKE_TEST_TOKEN, constant-time compare). Inserts a chat_message run with a synthetic platform="smoke" payload. Refuses to dispatch unless agentId, organizationId, and conversationId are explicit AND the conversationId starts with "smoke-". Idempotent on conversationId so repeated calls don't flood the queue.

Chart: new releaseGates.smokeTest.workerSmoke block, disabled by default. Operators preprovision a synthetic agent matching `agentId` and add SMOKE_TEST_TOKEN to the deployment Secret before flipping it on. Job picks up the token via envFrom on the existing Secret, polls the snapshot row over the existing DATABASE_URL.
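For reference, one common way to do a constant-time bearer comparison in Node-compatible runtimes is sketched below. It is not taken from the PR: `isValidBearer` and `sha256` are hypothetical helpers, and hashing both values first is just one way to satisfy the equal-length requirement of `crypto.timingSafeEqual`.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Hash both sides so the buffers always have equal length,
// which timingSafeEqual requires.
function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// Hypothetical check: true only when the header is exactly "Bearer <expectedToken>".
function isValidBearer(
  authHeader: string | undefined,
  expectedToken: string | undefined,
): boolean {
  if (!authHeader || !expectedToken) return false; // an unset token never authenticates
  const prefix = "Bearer ";
  if (!authHeader.startsWith(prefix)) return false;
  const presented = authHeader.slice(prefix.length);
  return timingSafeEqual(sha256(presented), sha256(expectedToken));
}
```

Comparing digests rather than raw strings avoids leaking the token length through an early length check.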

buremba · 2026-05-18T15:48:39Z

Round 2 fix: added Host-header allowlist for stronger ingress-bypass defense.

Layer 1 (already in round 1): x-forwarded-* / forwarded / x-real-ip headers → 403. Catches any standards-compliant reverse proxy.

Layer 2 (new): Host header must match SMOKE_TEST_ALLOWED_HOST (set by the chart to <release>-app). Public ingress traffic carries the operator's external hostname (e.g. app.lobu.ai) in Host; the in-cluster smoke Job hits cluster DNS (<release>-app) so its Host is the service name. This rejects requests that came through ingress even if a non-compliant proxy stripped the x-forwarded-* headers.

Both layers fire before the bearer check. The chart wires SMOKE_TEST_ALLOWED_HOST automatically on the app deployment when workerSmoke.enabled=true, so operators don't need to set it manually. Tests cover the matrix: exact, port-suffixed, FQDN-suffixed (<svc>.<ns>.svc.cluster.local), and public-hostname rejection.
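The Host matching rules listed above (exact, port-suffixed, FQDN-suffixed, public hostname rejected) could look roughly like this. It is a sketch under assumptions, not the merged implementation: `isAllowedHost` is a hypothetical name, and the default cluster domain `cluster.local` is assumed.

```typescript
// Hypothetical matcher: accepts the service name itself, optionally with a port
// and/or an in-cluster DNS suffix, and rejects everything else.
function isAllowedHost(
  hostHeader: string | undefined,
  allowedHost: string | undefined,
): boolean {
  if (!hostHeader || !allowedHost) return false;
  // Drop an optional ":port" suffix and normalise case.
  const host = hostHeader.trim().toLowerCase().replace(/:\d+$/, "");
  const allowed = allowedHost.trim().toLowerCase();
  if (host === allowed) return true;
  // Accept "<svc>.<ns>.svc" and "<svc>.<ns>.svc.cluster.local" (assumed cluster domain).
  if (!host.startsWith(allowed + ".")) return false;
  const rest = host.slice(allowed.length + 1);
  return /^[a-z0-9-]+\.svc(\.cluster\.local)?$/.test(rest);
}
```

Requiring the `.svc` label means a public name that merely begins with the service name, such as `<release>-app.example.com`, is still rejected.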

Total tests: 23 pass (was 18).

Latest commit: 52058aba.

coderabbitai Bot reviewed May 18, 2026

buremba force-pushed the chore/worker-smoke-test-gate branch from e652ba0 to 884638e · May 18, 2026 15:43

buremba force-pushed the chore/worker-smoke-test-gate branch from 884638e to 52058ab · May 18, 2026 15:48

buremba merged commit 48ee1ed into main May 18, 2026
25 checks passed

buremba deleted the chore/worker-smoke-test-gate branch May 18, 2026 15:52

buremba mentioned this pull request May 18, 2026

chore(main): release lobu 7.2.0 #863

Merged

coderabbitai Bot mentioned this pull request May 18, 2026

chore(chart): align lobu chart to owletto fork (pre-consolidation, byte-identical) #882

Merged
