feat(chart): real-worker smoke test gates Helm upgrades on actual run completion #878

Merged: buremba merged 1 commit into main from chore/worker-smoke-test-gate on May 18, 2026

@buremba (Member) commented May 18, 2026

Summary

The current charts/lobu/templates/smoke-test-job.yaml only curls /api/health. Every recent regression — Phase-5 env flip, runs denormalize/revert, JobEventSchema, today's action_input JSONB-shape bug — passed that check cleanly while workers couldn't process a single chat message in prod.

This PR adds a Phase-3 worker-smoke gate that drives an end-to-end run through the runs-queue + worker-spawn pipeline:

  1. New internal endpoint POST /api/internal/smoke/dispatch (auth: shared bearer SMOKE_TEST_TOKEN, constant-time compare). Inserts a synthetic chat_message run into public.runs with platform="smoke" and an explicit smoke-org / smoke-test / smoke-<release>-<revision> namespace. Refuses to dispatch unless the caller passes all three identifiers AND the conversationId carries the smoke- prefix.
  2. The runs-queue MessageConsumer (running in the app pod) claims the row, spawns a worker subprocess, the worker runs end-to-end, and on terminal cleanup writes a row into agent_transcript_snapshot.
  3. The smoke Job polls that row over the existing DATABASE_URL. If terminal_status='completed' materialises within workerSmoke.timeoutSeconds, exit 0; otherwise fail the deploy and Helm rolls back.
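The polling decision in step 3 can be sketched in TypeScript as below. This is a minimal illustration, not the Job's actual script (which lives in the Helm template and is not shown here). The names `decidePoll`, `pollUntilDone`, `readStatus`, and `sleep` are hypothetical, and treating any non-`completed` terminal status as an immediate failure is an assumption.

```typescript
// Outcome of a single poll of the snapshot row.
type PollDecision = "pass" | "fail" | "wait";

// Decide what the smoke Job should do after one poll.
// `status` is the terminal_status read from agent_transcript_snapshot,
// or null when no row exists yet.
function decidePoll(
  status: string | null,
  elapsedSeconds: number,
  timeoutSeconds: number,
): PollDecision {
  if (status === "completed") return "pass";
  // Assumption: any other terminal status means the run ended badly.
  if (status !== null) return "fail";
  return elapsedSeconds >= timeoutSeconds ? "fail" : "wait";
}

// Hypothetical driver: `readStatus` stands in for the SQL query over
// DATABASE_URL and `sleep` for the wait between polls.
async function pollUntilDone(
  readStatus: () => Promise<string | null>,
  sleep: (seconds: number) => Promise<void>,
  timeoutSeconds: number,
  intervalSeconds: number,
): Promise<boolean> {
  // Guard against a zero interval so elapsed time always advances.
  const step = Math.max(intervalSeconds, 1);
  let elapsed = 0;
  for (;;) {
    const decision = decidePoll(await readStatus(), elapsed, timeoutSeconds);
    if (decision !== "wait") return decision === "pass";
    await sleep(step);
    elapsed += step;
  }
}
```

A `true` result would map to exit 0 and a `false` result to a non-zero exit, which is what fails the deploy.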

This makes the "gateway boots fine but workers are broken" regression class un-shippable.

What's intentionally not here

  • No platform integration. The endpoint synthesises the message directly into the runs queue — no Slack/Telegram bot needed. The path under test is worker spawn + run completion, not platform ingress.
  • workerSmoke.enabled defaults to FALSE. Per the issue notes, do not flip it to true in the helmrelease until the sibling fix/runs-action-input-jsonb-shape PR has landed and prod has been verified to actually process chats. Until that fix lands, the gate would (correctly) fail every deploy.
  • Operator preprovisions the synthetic agent once. Documented inline in values.yaml. The chart can't preprovision a row in someone else's DB.
  • Mirror lands in owletto fork in same PR cycle. Prod's Flux pulls from lobu-ai/owletto, so the chart change must mirror — done in lobu-ai/owletto#chore/worker-smoke-test-gate.

Files

  • packages/server/src/gateway/routes/internal/smoke.ts — new internal endpoint
  • packages/server/src/index.ts — mount before the mcpAuth middleware
  • packages/server/src/gateway/__tests__/smoke-dispatch.test.ts — auth + validation + insert + idempotency
  • charts/lobu/values.yaml — new releaseGates.smokeTest.workerSmoke block
  • charts/lobu/templates/smoke-test-job.yaml — phase-3 dispatch + poll loop, gated by workerSmoke.enabled

Test plan

  • make typecheck clean
  • make build-packages clean
  • bun test packages/server/src/gateway/__tests__/smoke-dispatch.test.ts — 13 tests pass (auth contract, missing-field rejection, prefix enforcement, run insert, idempotency)
  • helm template charts/lobu (defaults) — renders, no worker-smoke env vars emitted
  • helm template charts/lobu --set releaseGates.smokeTest.workerSmoke.enabled=true — renders, envFrom + WORKER_SMOKE_* env present, conv-id and dispatch loop present
  • Post-merge: preprovision smoke agent in a staging org, flip workerSmoke.enabled=true in the staging helmrelease, verify a known-good chart upgrade passes the gate and a deliberately-broken one (e.g. revert the action_input fix) fails it.

Summary by CodeRabbit

  • New Features

    • Optional extended "worker smoke" phase that dispatches synthetic chat runs, polls for completion, and can fail rollouts if not finished within configured timeouts.
    • Internal smoke dispatch endpoint to enqueue and manage smoke runs.
  • Documentation

    • New configuration for worker smoke: enable flag, conversation ID prefix, timeouts, polling interval, and required secret keys.
  • Tests

    • Comprehensive tests for dispatch, auth, validation, host/header defenses, idempotency, and enqueueing.

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 83a4483f-7da2-4da3-8180-39d860a56c39

📥 Commits

Reviewing files that changed from the base of the PR and between 884638e and 52058ab.

📒 Files selected for processing (6)
  • charts/lobu/templates/deployment.yaml
  • charts/lobu/templates/smoke-test-job.yaml
  • charts/lobu/values.yaml
  • packages/server/src/gateway/__tests__/smoke-dispatch.test.ts
  • packages/server/src/gateway/routes/internal/smoke.ts
  • packages/server/src/index.ts
🚧 Files skipped from review as they are similar to previous changes (4)
  • charts/lobu/templates/smoke-test-job.yaml
  • packages/server/src/gateway/__tests__/smoke-dispatch.test.ts
  • packages/server/src/index.ts
  • charts/lobu/values.yaml

Walkthrough

Adds a worker smoke-test capability: Helm values + Job changes to run an optional worker-smoke phase, a new internal POST /api/internal/smoke/dispatch endpoint (auth, idempotent run enqueue, pg_notify), tests, and job-side polling of agent_transcript_snapshot for terminal completion.

Changes

Worker Smoke Test End-to-End

| Layer / File(s) | Summary |
| --- | --- |
| Helm Job template and worker-smoke script: charts/lobu/templates/smoke-test-job.yaml | Computes combined job timeout, sets activeDeadlineSeconds, conditionally injects SMOKE_TEST_TOKEN and WORKER_SMOKE_* env vars, and appends phase-3 script that dispatches a smoke run and polls public.agent_transcript_snapshot until terminal_status='completed' or timeout/failure. |
| Helm values and configuration: charts/lobu/values.yaml | Adds releaseGates.smokeTest.workerSmoke block (default enabled: false) documenting enablement, conversationIdPrefix, timeoutSeconds, intervalSeconds, and required Secret keys (SMOKE_TEST_TOKEN, SMOKE_TEST_AGENT_ID, SMOKE_TEST_ORG_ID). |
| Deployment env injection: charts/lobu/templates/deployment.yaml | Conditionally injects SMOKE_TEST_ALLOWED_HOST into the main Deployment container when worker smoke is enabled, set to the in-cluster app Service DNS name. |
| Smoke dispatch endpoint implementation and routing: packages/server/src/gateway/routes/internal/smoke.ts, packages/server/src/index.ts | Adds createSmokeRoutes() registering POST /dispatch with constant-time Bearer token auth (env-pinned), forwarded-header rejection, conversationId validation enforcing smoke- prefix, idempotent insert into public.runs with ON CONFLICT handling, fallback SELECT for existing runId, triggers pg_notify, and returns { runId, idempotencyKey }. Route is mounted before mcpAuth. |
| Dispatch endpoint test suite: packages/server/src/gateway/__tests__/smoke-dispatch.test.ts | Adds tests covering harness setup, auth/availability (401/503), forwarded-header defenses (403), input validation (400 cases), successful enqueue and persisted run fields (pinned env agent/org, platform smoke, messageText defaulting), and idempotency behavior. |
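
The "constant-time Bearer token auth" and the 401/503 split mentioned in the table could look roughly like the sketch below. It is an illustration only: `constantTimeEqual` and `checkBearer` are hypothetical names, the comparison is hand-rolled to keep the example dependency-free (a Node service would more likely use `crypto.timingSafeEqual`), and mapping an unset SMOKE_TEST_TOKEN to 503 is an assumption based on the "auth/availability (401/503)" test description.

```typescript
// Compare two strings without returning early on the first mismatch.
// Hand-rolled for illustration only.
function constantTimeEqual(a: string, b: string): boolean {
  const length = Math.max(a.length, b.length);
  // A length mismatch is folded into the accumulator instead of short-circuiting.
  let diff = a.length ^ b.length;
  for (let i = 0; i < length; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` turns that into 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

// Returns the HTTP status the auth step would produce, or null to continue.
function checkBearer(
  authorization: string | undefined,
  expectedToken: string | undefined,
): 401 | 503 | null {
  // Assumption: with no token configured the endpoint reports itself unavailable.
  if (!expectedToken) return 503;
  const prefix = "Bearer ";
  if (!authorization || authorization.slice(0, prefix.length) !== prefix) {
    return 401;
  }
  const presented = authorization.slice(prefix.length);
  return constantTimeEqual(presented, expectedToken) ? null : 401;
}
```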

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lobu-ai/lobu#871: Related work that affects whether the worker-smoke phase can observe agent_transcript_snapshot rows by changing the session store behavior used when writing snapshots.

Poem

🐰 I dispatch a tiny chat at dawn,
Tokens clutched like carrots on a lawn,
Postgres whispers when it's done,
Polling waits until it's won,
Rollouts pass and bunnies yawn. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: adding a worker-smoke test gate to Helm upgrades that validates actual run completion, which directly matches the core objective of this PR. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, covering Summary, Files, Test plan sections with checkmarks, and detailed Notes. All required sections from the template are present and complete. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



@codecov-commenter

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/lobu/templates/smoke-test-job.yaml`:
- Around line 34-50: The smoke-test Job currently only pulls DATABASE_URL from
.Values.secrets.stringData or the deployment secret (via include
"lobu.secretName") and can accidentally skip Phase 3 when requiredSchema or
DATABASE_URL is missing; update the env/envFrom logic to also read DATABASE_URL
from .Values.database.existingSecret (and/or prefer that secret when present)
and make the Job fail-closed when workerSmoke.enabled/$workerSmokeEnabled is
true and DATABASE_URL is not available (remove or guard the early `exit 0` path
used around requiredSchema checks so the Job errors out instead of exiting
successfully); touch the template bits referencing $secretName, include
"lobu.secretName", .Values.secrets.create,
.Values.secrets.stringData["DATABASE_URL"], .Values.database.existingSecret, and
the requiredSchema/Phase 3 exit logic so the smoke gate cannot be bypassed when
workerSmoke is enabled.

In `@charts/lobu/values.yaml`:
- Around line 291-297: The values.yaml exposes conversationIdPrefix which the
server (packages/server/src/gateway/routes/internal/smoke.ts) hard-validates to
start with "smoke-", causing 400s if operators change it; remove the
conversationIdPrefix knob from charts/lobu/values.yaml (or render it only when
equal to the default "smoke-") so Helm cannot set a non-default value, and/or
add a template-time check that fails the release if conversationIdPrefix !=
"smoke-" to ensure workers always send a prefix accepted by the smoke route.

In `@packages/server/src/gateway/routes/internal/smoke.ts`:
- Around line 100-110: The code calls .trim() on body fields that may be
non-strings which can throw; before trimming the fields from the parsed body
(from c.req.json()), validate that body.agentId, body.organizationId,
body.conversationId, and body.messageText are strings (and not null/undefined)
and reject with c.json({ error: "Invalid JSON body" }, 400) or a more specific
400 when they are not strings; then safely call .trim() (or provide defaults
like "smoke-test ping" for messageText) and assign to agentId, organizationId,
conversationId, messageText—update the SmokeDispatchBody handling so non-string
values do not reach .trim().

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a0d241b9-ce0e-4ceb-843f-5314dfcd52cf

📥 Commits

Reviewing files that changed from the base of the PR and between 191d075 and e652ba0.

📒 Files selected for processing (5)
  • charts/lobu/templates/smoke-test-job.yaml
  • charts/lobu/values.yaml
  • packages/server/src/gateway/__tests__/smoke-dispatch.test.ts
  • packages/server/src/gateway/routes/internal/smoke.ts
  • packages/server/src/index.ts

Comment thread charts/lobu/templates/smoke-test-job.yaml
Comment thread charts/lobu/values.yaml Outdated
Comment on lines +100 to +110
    let body: SmokeDispatchBody;
    try {
      body = (await c.req.json()) as SmokeDispatchBody;
    } catch {
      return c.json({ error: "Invalid JSON body" }, 400);
    }

    const agentId = body.agentId?.trim();
    const organizationId = body.organizationId?.trim();
    const conversationId = body.conversationId?.trim();
    const messageText = body.messageText?.trim() || "smoke-test ping";

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject non-string fields before calling .trim().

await c.req.json() can return null or objects with non-string values, and { "agentId": 1 } currently throws here before your 400-path runs. That turns a bad request into a 500.

💡 Proposed fix
-    let body: SmokeDispatchBody;
+    let body: unknown;
     try {
-      body = (await c.req.json()) as SmokeDispatchBody;
+      body = await c.req.json();
     } catch {
       return c.json({ error: "Invalid JSON body" }, 400);
     }
 
-    const agentId = body.agentId?.trim();
-    const organizationId = body.organizationId?.trim();
-    const conversationId = body.conversationId?.trim();
-    const messageText = body.messageText?.trim() || "smoke-test ping";
+    if (!body || typeof body !== "object" || Array.isArray(body)) {
+      return c.json({ error: "Request body must be a JSON object" }, 400);
+    }
+
+    const parsed = body as SmokeDispatchBody;
+    const agentId =
+      typeof parsed.agentId === "string" ? parsed.agentId.trim() : undefined;
+    const organizationId =
+      typeof parsed.organizationId === "string"
+        ? parsed.organizationId.trim()
+        : undefined;
+    const conversationId =
+      typeof parsed.conversationId === "string"
+        ? parsed.conversationId.trim()
+        : undefined;
+    const messageText =
+      typeof parsed.messageText === "string" && parsed.messageText.trim().length > 0
+        ? parsed.messageText.trim()
+        : "smoke-test ping";

@buremba buremba force-pushed the chore/worker-smoke-test-gate branch from e652ba0 to 884638e Compare May 18, 2026 15:43
@buremba (Member, Author) commented May 18, 2026

Codex round 1 flagged two real issues — both fixed:

1. Ingress-bypass defense. The route is mounted on the main app which is exposed by the public Ingress. Bearer auth alone isn't enough: a leaked token could be replayed from outside. Fix: reject any request that carries x-forwarded-* / forwarded / x-real-ip headers. The in-cluster smoke Job hits <release>-app via cluster DNS → ClusterIP → Pod, which never traverses ingress, so the headers are absent on legitimate calls. New tests cover all four header variants.

2. Server-side namespace pinning. The first draft accepted caller-supplied agentId/organizationId, which was convention-only isolation. Fix: pin both from process.env.SMOKE_TEST_AGENT_ID / SMOKE_TEST_ORG_ID (loaded into the app pod via the deployment Secret). Body fields with those names are silently ignored. A leaked SMOKE_TEST_TOKEN can now only dispatch against the env-configured smoke namespace, so it is structurally unable to target real tenants. Added the test "caller-supplied agentId/organizationId in body are silently ignored".
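
As a rough sketch of the header check in item 1, assuming the route sees request headers as a plain name-to-value map (the real route's request type is not shown in this thread), the rejection rule might look like this. `cameThroughProxy` and `PROXY_HEADER_NAMES` are hypothetical names.

```typescript
// Exact header names a reverse proxy or ingress controller typically adds.
const PROXY_HEADER_NAMES = ["forwarded", "x-real-ip"];
const FORWARDED_PREFIX = "x-forwarded-";

// True when any header suggests the request traversed a proxy.
// Header names are matched case-insensitively.
function cameThroughProxy(headers: Record<string, string>): boolean {
  return Object.keys(headers).some((name) => {
    const lower = name.toLowerCase();
    return (
      lower.slice(0, FORWARDED_PREFIX.length) === FORWARDED_PREFIX ||
      PROXY_HEADER_NAMES.indexOf(lower) !== -1
    );
  });
}
```

A request for which this returns `true` would be answered with 403 before any other processing, matching the "forwarded-header defenses (403)" tests.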

Chart wiring updated to match: smoke Job no longer passes the identifiers in the body; the in-cluster app pod resolves them from its own envFrom-loaded Secret.
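
The pinning described above can be illustrated with a small sketch. The function and interface names are hypothetical, and returning `null` when either variable is unset is an assumption about how a misconfigured pod would be handled.

```typescript
// The two environment variables named in the comment above.
interface SmokeEnv {
  SMOKE_TEST_AGENT_ID?: string;
  SMOKE_TEST_ORG_ID?: string;
}

interface SmokeTarget {
  agentId: string;
  organizationId: string;
}

// Resolve the run's target from server-side configuration only.
// `_body` is accepted purely to show that caller-supplied identifiers
// are never read.
function resolveSmokeTarget(
  env: SmokeEnv,
  _body: Record<string, unknown>,
): SmokeTarget | null {
  const agentId = (env.SMOKE_TEST_AGENT_ID || "").trim();
  const organizationId = (env.SMOKE_TEST_ORG_ID || "").trim();
  // Assumption: without both values the endpoint cannot dispatch at all.
  if (!agentId || !organizationId) return null;
  return { agentId, organizationId };
}
```

Because the body is never consulted, a request carrying a real tenant's agentId still resolves to the smoke namespace.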

Total tests: 18 pass (was 13). Helm template render is still clean in both modes.

Latest commit: 884638e8.

… completion

Adds a Phase-3 worker-smoke gate to the post-upgrade smoke Job: the Job
POSTs to a new internal endpoint that inserts a synthetic chat_message
run, then polls `agent_transcript_snapshot` for that run's terminal
status. If the snapshot doesn't materialise with
`terminal_status='completed'` inside the configured window, the deploy
fails and Helm rolls back to the previous chart.

The current smoke-test-job only curls /api/health — every recent regression
(Phase 5 env flip, runs denormalize/revert, JobEventSchema, action_input
JSONB shape) passed that check while workers couldn't process a single
chat message. Driving an end-to-end run through the same runs-queue +
worker-spawn pipeline that real traffic uses catches that class.

Backend: POST /api/internal/smoke/dispatch (auth: bearer
SMOKE_TEST_TOKEN, constant-time compare). Inserts a chat_message run
with a synthetic platform="smoke" payload. Refuses to dispatch unless
agentId, organizationId, and conversationId are explicit AND the
conversationId starts with "smoke-". Idempotent on conversationId so
repeated calls don't flood the queue.

Chart: new releaseGates.smokeTest.workerSmoke block — disabled by
default. Operators preprovision a synthetic agent matching `agentId`
and add SMOKE_TEST_TOKEN to the deployment Secret before flipping it
on. Job picks up the token via envFrom on the existing Secret, polls
the snapshot row over the existing DATABASE_URL.
@buremba buremba force-pushed the chore/worker-smoke-test-gate branch from 884638e to 52058ab Compare May 18, 2026 15:48
@buremba (Member, Author) commented May 18, 2026

Round 2 fix: added Host-header allowlist for stronger ingress-bypass defense.

Layer 1 (already in round 1): x-forwarded-* / forwarded / x-real-ip headers → 403. Catches any standards-compliant reverse proxy.

Layer 2 (new): Host header must match SMOKE_TEST_ALLOWED_HOST (set by the chart to <release>-app). Public ingress traffic carries the operator's external hostname (e.g. app.lobu.ai) in Host; the in-cluster smoke Job hits cluster DNS (<release>-app) so its Host is the service name. This rejects requests that came through ingress even if a non-compliant proxy stripped the x-forwarded-* headers.

Both layers fire before the bearer check. The chart wires SMOKE_TEST_ALLOWED_HOST automatically on the app deployment when workerSmoke.enabled=true, so operators don't need to set it manually. Tests cover the matrix: exact, port-suffixed, FQDN-suffixed (<svc>.<ns>.svc.cluster.local), and public-hostname rejection.
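
The Host matrix listed above (exact, port-suffixed, FQDN-suffixed, public hostname) can be sketched as follows. This is an assumption-laden illustration rather than the merged code: `isAllowedHost` is a hypothetical name, the cluster domain is assumed to be `cluster.local`, and shorter in-cluster forms such as `<svc>.<ns>` are deliberately not handled.

```typescript
// Check a Host header against the value of SMOKE_TEST_ALLOWED_HOST.
function isAllowedHost(
  hostHeader: string | undefined,
  allowedHost: string,
): boolean {
  if (!hostHeader || !allowedHost) return false;
  // Drop an optional :port suffix and normalise case.
  const host = hostHeader.trim().toLowerCase().replace(/:\d+$/, "");
  const allowed = allowedHost.trim().toLowerCase();
  if (host === allowed) return true;

  // Otherwise accept only the in-cluster FQDN form <svc>.<ns>.svc.cluster.local.
  const suffix = ".svc.cluster.local";
  if (host.slice(0, allowed.length + 1) !== allowed + ".") return false;
  if (host.slice(-suffix.length) !== suffix) return false;
  const namespace = host.slice(allowed.length + 1, host.length - suffix.length);
  // The namespace must be a single non-empty DNS label.
  return /^[a-z0-9-]+$/.test(namespace);
}
```

With `allowedHost` set to `lobu-app`, the values `lobu-app`, `lobu-app:8080`, and `lobu-app.prod.svc.cluster.local` are accepted, while `app.lobu.ai` is rejected.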

Total tests: 23 pass (was 18).

Latest commit: 52058aba.

@buremba buremba merged commit 48ee1ed into main May 18, 2026
25 checks passed
@buremba buremba deleted the chore/worker-smoke-test-gate branch May 18, 2026 15:52