fix: close monitoring + deploy gaps from post-incident audit by buremba · Pull Request #775 · lobu-ai/lobu

buremba · 2026-05-16T21:47:33Z

Why

Post-incident audit after #771 + #772 landed. Prod is healthy but three real gaps remained that would let the next similar incident slip past monitoring or extend its blast radius:

Gap	Observed today	Risk
Sentry has zero coverage of logger errors	1,914 errors / 5 min in stdout, 0 Sentry issues in 24h	Background-job failures invisible to monitoring; today's noise masks tomorrow's signal
'website' connector orphan feeds	~380 error logs/min from CheckDueFeeds	Log spam, hides real errors; not user-facing
Single-replica Recreate strategy	every deploy = ~30s "no available server"	Cloudflare 502s on every rollout

What

1. pino → Sentry forwarding (`utils/logger.ts`)

Adds a pino destination that mirrors stdout AND forwards level >= error to Sentry. In-process dedupe by (msg, err.type, top stack frame) with a 60s window — Sentry has its own grouping but every captureException still costs an HTTP call, and a repeating error at 380/min would saturate the budget.

2. Orphan-feed cleanup (`queue-helpers.ts` + `20260517020000_softdelete_orphan_feeds.sql`)

Code: createSyncRunWithClient now warns + skips when no active connector_definition exists for (connector_key, organization_id). Pre-fix it threw, producing the error stream.
Migration: one-shot soft-delete of feeds matching the orphan criteria (no pinned_version, active connection, no active definition in the org). Conservative criteria — only touches feeds that the current code would also fail on.

3. Graceful drain (`charts/lobu/templates/deployment.yaml` + `values.yaml`)

preStop lifecycle hook: 15s sleep (configurable via app.preStopDelaySeconds). When k8s decides to terminate, this lets Service endpoint deregistration propagate through kube-proxy + Cloudflare LB before the process gets SIGTERM. In-flight requests routed during the deregistration lag still hit a live process.
terminationGracePeriodSeconds: 45 (configurable). Covers preStop + the graceful shutdown in server.ts (~5s for HTTP server close, task scheduler stop, DB pool drain).
Strategy stays Recreate (workspaces PVC is RWO). Comment explains the path to RollingUpdate: needs RWX storage or workspaces split off the gateway deployment. Not done here because it's an ops-repo storage-class decision, not in-tree code.

Validation

make typecheck clean
make build-packages builds
Migration applies on a fresh PG (validated locally via docker pgvector/pgvector:pg18)
db/schema.sql diff is intent-only (one new schema_migrations row)
Manual: after deploy, send a logger.error(new Error('test'), ...) from a debug endpoint, confirm a Sentry issue appears
Manual: trigger a rollout, observe whether the "no available server" window shrinks below the prior ~30s

Out of scope (acknowledged)

True elimination of the deploy gap needs RWX storage for workspaces or moving workspaces off the gateway pod entirely. That's an ops-repo decision (storage class, cost). Current commit reduces the gap but doesn't close it. Tracked for follow-up.
The orphan-feed migration uses conservative criteria; it should be safe but operators with non-standard feed setups may want to inspect SELECT count(*) FROM feeds WHERE ... before/after.

Summary by CodeRabbit

New Features
- Enhanced error monitoring with deduplication to reduce duplicate reports.
- Improved graceful pod shutdown with optional pre-stop delay and configurable termination timeout.
Bug Fixes
- Automatic cleanup (soft-delete) of orphaned feeds when connectors are missing.
Chores
- Added a one‑time DB migration and schema update to apply the orphan-feed cleanup.

Three gaps identified by the audit after #771/#772 landed (1914 errors/5min in stdout, zero Sentry issues, single-replica Recreate deploy strategy = ~30s "no available server" window per rollout): * **Sentry forwarding from pino** (utils/logger.ts). The existing Sentry capture middleware only fires on HTTP 500 responses; `logger.error()` from background jobs (CheckDueFeeds, runs queue, scheduled tasks) was invisible to monitoring. Now a pino destination inspects each log line and forwards `level >= error` to `Sentry.captureException` (with stack reconstruction) or `captureMessage`. In-process dedupe by (msg, err.type, top stack frame) with a 60s window prevents the 380/min repeating-error pattern from saturating Sentry — Sentry has its own grouping but every captureException still incurs an HTTP call. * **Orphan-feed cleanup + graceful skip** (queue-helpers.ts + 20260517020000_softdelete_orphan_feeds.sql). The 'website' connector was archived/uninstalled in some orgs but their feeds weren't soft-deleted. `createSyncRunWithClient` threw on every CheckDueFeeds tick when no active connector_definition existed for the (key, org) pair — produced the 380/min error stream that masked real issues. Code change: warn + skip instead of throw, so future orphans don't spam logs. Migration: one-shot soft-delete of feeds matching the orphan criteria (no pinned_version, active connection, no active connector_definition for the org). * **PreStop drain + grace period** (charts/lobu/templates/ deployment.yaml, values.yaml). Single-replica gateway with `strategy: Recreate` (workspaces PVC is RWO) means every deploy has a ~30s window where the old pod is gone but the new one isn't ready — Cloudflare returns "no available server" to live traffic during this window. The 2026-05-16 incident felt extreme because the pod was actually crash-looping; the same shape happens briefly on every healthy deploy. PreStop hook (15s sleep, configurable via app.preStopDelaySeconds) gives Service endpoint deregistration time to propagate before SIGTERM, so requests in-flight during the drain still hit a live process. terminationGracePeriodSeconds bumped to 45 to cover preStop + graceful shutdown (~5s). Full elimination needs RWX storage for workspaces so we can roll instead of recreate — documented in the deployment.yaml strategy comment.

coderabbitai · 2026-05-16T21:47:45Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3f1ca546-7afa-4e9a-a38e-bf73df7e78d8

📥 Commits

Reviewing files that changed from the base of the PR and between 1636fb0 and 34a977d.

📒 Files selected for processing (1)

packages/web

📝 Walkthrough

Walkthrough

Adds configurable Helm pod termination timing and optional preStop hook, forwards error/fatal logs to Sentry with in-process deduplication and a marker to avoid duplicate reports, and soft-deletes orphan feeds with a migration and runtime orphan-handling change.

Changes

Graceful pod shutdown configuration

Layer / File(s)	Summary
Pod shutdown lifecycle and Helm values `charts/lobu/templates/deployment.yaml`, `charts/lobu/values.yaml`	Deployment template emits `terminationGracePeriodSeconds` templated from `Values.app.terminationGracePeriodSeconds` and conditionally renders a `lifecycle.preStop.exec` sleep when `app.preStopDelaySeconds > 0`. Helm values add `app.preStopDelaySeconds` (default `0`) and `app.terminationGracePeriodSeconds` (default `45`) with explanatory comments.

Sentry error observability

Layer / File(s)	Summary
Sentry forwarding, deduplication, and stream `packages/server/src/utils/logger.ts`, `packages/server/src/server.ts`	Logger imports `@sentry/node`, enforces Pino `err` serializer, implements an in-memory dedupe map and capture path for parsed `error`/`fatal` records via a custom destination stream that still writes to stdout. `server.ts` marks errors with `sentryReported: true` when already sent to Sentry.

Orphan feed cleanup and handling

Layer / File(s)	Summary
Soft-delete orphan feeds and record migration `db/migrations/20260517020000_softdelete_orphan_feeds.sql`, `db/schema.sql`	New migration sets `deleted_at = now()` for active, unpinned feeds whose connection exists but which lack an active `connector_definitions` entry for `(organization_id, connector_key)`. `schema.sql` records the new migration version; down migration is a no-op with manual recovery notes.
Queue-helpers graceful orphan treatment `packages/server/src/utils/queue-helpers.ts`	`createSyncRunWithClient` now soft-deletes feeds detected as orphans, logs a structured warning including `feedId`, `connector_key`, and `organization_id`, and returns `null` (skip) instead of throwing.

Sequence Diagram(s)

sequenceDiagram
  participant App as Application (server.ts)
  participant Logger as Pino (sentryAwareStream)
  participant Stdout as Stdout
  participant Sentry as `@sentry/node`
  App->>Logger: log(error/fatal, {err|error, msg}) (may set sentryReported)
  Logger->>Stdout: write JSON line
  Logger->>Logger: parse JSON -> dedupe check
  Logger->>Sentry: captureException / captureMessage (if not deduped && not sentryReported)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

fix(reliability): gate boot on schema, surface err, split readiness #767

Poem

🐰 In charts and logs the small changes creep,

Pods nap before they fall to sleep,
Errors hop to Sentry, once, not twice,
Orphan feeds tuck in, soft-deleted nice,
A rabbit hums: "All systems rest and keep."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: close monitoring + deploy gaps from post-incident audit' is concise, clear, and accurately summarizes the main objectives of the changeset: addressing monitoring and deployment gaps identified in a post-incident audit.
Description check	✅ Passed	The pull request description provides comprehensive coverage of all required sections: clear why/what/validation structure, specific test plan checkboxes marked as run, and detailed notes including follow-up considerations and out-of-scope items.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/monitoring-and-deploy-gaps

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov-commenter · 2026-05-16T21:49:58Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3aefea639

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-16T21:50:15Z

+      const reconstructed = new Error(errObj.message);
+      if (errObj.stack) reconstructed.stack = errObj.stack;
+      Sentry.captureException(reconstructed, {
+        extra: parsed,


Sanitize log metadata before sending it to Sentry

When an error log includes application payloads, this forwards the entire parsed log line to Sentry as extra. For example, packages/server/src/tools/admin/manage_classifiers.ts:901 logs { error, args }, where args includes user-provided classification values/reasoning; with this change any failure in that path now exports those fields to Sentry instead of only stdout. Please redact or whitelist metadata before attaching it to Sentry events.

Useful? React with 👍 / 👎.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

packages/server/src/utils/logger.ts (1)

83-88: ⚡ Quick win

Preserve error type for better Sentry grouping.

The reconstructed Error loses its original type (e.g., TypeError, SyntaxError). Sentry uses error.name for grouping, so different error types with similar messages will be incorrectly grouped together.

♻️ Proposed fix

       // Reconstruct an Error so Sentry's grouping works on the stack.
       const reconstructed = new Error(errObj.message);
       if (errObj.stack) reconstructed.stack = errObj.stack;
+      if (errObj.type) reconstructed.name = errObj.type;
       Sentry.captureException(reconstructed, {

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/utils/logger.ts` around lines 83 - 88, The reconstructed
Error currently loses the original error type/name used by Sentry for grouping;
when building the synthetic error before calling Sentry.captureException (using
variables errObj, reconstructed, parsed, level), set the reconstructed error's
name to the original type (e.g., reconstructed.name = errObj.name ||
reconstructed.name) so Sentry preserves error.name/grouping and include the same
reconstructed.stack and extra/tags as before.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/lobu/templates/deployment.yaml`:
- Line 50: The current use of Helm's default() treats 0 as empty and overrides
explicit zeros; change both occurrences (preStopDelaySeconds and
terminationGracePeriodSeconds) to check presence instead of using default(),
e.g. replace {{ .Values.app.terminationGracePeriodSeconds | default 45 }} and {{
.Values.app.preStopDelaySeconds | default 15 }} with an explicit conditional
using hasKey/.Values.app presence (for example: if hasKey .Values.app
"terminationGracePeriodSeconds" then render
.Values.app.terminationGracePeriodSeconds else render 45) so that a
user-provided 0 is honored.

---

Nitpick comments:
In `@packages/server/src/utils/logger.ts`:
- Around line 83-88: The reconstructed Error currently loses the original error
type/name used by Sentry for grouping; when building the synthetic error before
calling Sentry.captureException (using variables errObj, reconstructed, parsed,
level), set the reconstructed error's name to the original type (e.g.,
reconstructed.name = errObj.name || reconstructed.name) so Sentry preserves
error.name/grouping and include the same reconstructed.stack and extra/tags as
before.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e8f6fe82-8615-4ea0-84f2-8113bb28f282

📥 Commits

Reviewing files that changed from the base of the PR and between 653566f and b3aefea.

📒 Files selected for processing (6)

charts/lobu/templates/deployment.yaml
charts/lobu/values.yaml
db/migrations/20260517020000_softdelete_orphan_feeds.sql
db/schema.sql
packages/server/src/utils/logger.ts
packages/server/src/utils/queue-helpers.ts

coderabbitai · 2026-05-16T21:51:31Z

+      # task scheduler, embedded gateway, DB pool, HTTP server). Default
+      # k8s grace period is 30s; bump so the preStop sleep doesn't eat
+      # the whole window.
+      terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds | default 45 }}


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does Helm's defaultfunction treat numeric0 as empty (therefore returning the fallback value)?

💡 Result:

Yes. Helm’s default function treats the numeric value 0 as “empty”, so it will return the fallback value instead of 0. This comes from Sprig’s default semantics: the definition of “empty” includes numeric 0, and for “non-empty” values the given value is used, otherwise the default is returned [1]. Helm documents default as default DEFAULT_VALUE GIVEN_VALUE and that it returns the default “if [the given value] is empty” [2]. Helm issues report that when a value is set to 0 (integer), default behaves as if it were unset/empty and the fallback is used [3][4]. Practical implication: if 0 is a valid value in your chart, don’t use default directly; instead use an explicit check (for example, distinguish “not set” vs “set to 0”) as suggested by Helm community discussions [3][4].

Citations:

1: http://masterminds.github.io/sprig/defaults.html

2: https://helm.sh/pt/docs/chart_template_guide/functions_and_pipelines

3: default template function treats 0 value as unspecified helm/helm#13036

4: Wrong interpretation of Zero in templates files when having a default value helm/helm#3164

🏁 Script executed:

fd -type f -name 'values.yaml' charts/

Repository: lobu-ai/lobu

Length of output: 227

🏁 Script executed:

# Read the values.yaml to check for terminationGracePeriodSeconds and preStopDelaySeconds cat -n charts/lobu/values.yaml | grep -A 5 -B 5 "terminationGracePeriodSeconds\|preStopDelaySeconds"

Repository: lobu-ai/lobu

Length of output: 857

🏁 Script executed:

cat -n charts/lobu/templates/deployment.yaml | sed -n '45,55p; 118,125p'

Repository: lobu-ai/lobu

Length of output: 1255

default here overrides explicit 0 and breaks configurability.

On Line 50 and Line 122, Helm default will treat 0 as empty and fall back to 45/15. That means operators cannot intentionally set preStopDelaySeconds: 0 (disable delay) or terminationGracePeriodSeconds: 0.

Suggested fix

- terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds | default 45 }} + terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds }} ... - command: ["sh", "-c", "sleep {{ .Values.app.preStopDelaySeconds | default 15 }}"] + command: ["sh", "-c", "sleep {{ .Values.app.preStopDelaySeconds }}"]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@charts/lobu/templates/deployment.yaml` at line 50, The current use of Helm's default() treats 0 as empty and overrides explicit zeros; change both occurrences (preStopDelaySeconds and terminationGracePeriodSeconds) to check presence instead of using default(), e.g. replace {{ .Values.app.terminationGracePeriodSeconds | default 45 }} and {{ .Values.app.preStopDelaySeconds | default 15 }} with an explicit conditional using hasKey/.Values.app presence (for example: if hasKey .Values.app "terminationGracePeriodSeconds" then render .Values.app.terminationGracePeriodSeconds else render 45) so that a user-provided 0 is honored.

Six findings, all addressed: * **#1 preStop on Recreate makes the gap WORSE.** Pi caught the reversal — under `strategy: Recreate` the new pod doesn't start until the old one fully terminates, so adding a preStop sleep EXTENDS the no-available-server window rather than shrinking it. Default `preStopDelaySeconds` to 0 and only emit the lifecycle hook when > 0. Document that ops repos should set it only with RollingUpdate (which needs RWX storage on workspaces — out of scope here). * **#2 migration too broad.** Tightened WHERE to require `feeds.status='active'` and `connections.status='active'` so we match exactly the set CheckDueFeeds would process. Paused / pending_auth / error feeds are left alone — they don't contribute to the error spam and may be intentionally recoverable. * **#3 Sentry double-report.** `server.ts onError` already calls `Sentry.captureException` then `logger.error`; the new logger forwarder would have sent the same event again. Added `sentryReported: true` marker on that log line; logger transport skips forwarding when it sees the marker. * **#4 dedupe fingerprint too coarse.** Added `err.message` to the fingerprint so distinct messages from the same catch site (same Error type, same top stack frame) get distinct fingerprints within the 60s window. * **#5 existing string-error call sites are pre-existing tech debt** — left as-is for now. Documented in the PR body as a follow-up; the new forwarder handles `err` / `error` Error-object payloads well, and call sites with stringified errors are a separate cleanup pass. * **#6 warn-skip leaves future orphans being polled forever.** Now `createSyncRunWithClient` soft-deletes the feed in-place when no active connector_definition is found, so CheckDueFeeds stops selecting it. Operators recover by clearing `deleted_at` after reinstalling the definition.

buremba · 2026-05-16T21:52:51Z

pi review — 6 findings, all addressed

1636fb0:

preStop on Recreate would make the gap WORSE. Pi caught a real reversal. Under `strategy: Recreate` the new pod waits for the old one to terminate, so adding a preStop sleep EXTENDS the no-available-server window. Default `preStopDelaySeconds` is now 0; the lifecycle hook is only emitted when > 0. Operators only set it with RollingUpdate (which needs RWX storage on workspaces — separate ops decision).
Migration too broad. Tightened WHERE to require `feeds.status='active' AND connections.status='active'`, matching exactly the set CheckDueFeeds processes. Paused / pending_auth / error feeds untouched.
Sentry double-report. `server.ts onError` captures the exception then logs it; the new forwarder would re-send. Added `sentryReported: true` marker on those log lines; forwarder skips when seen.
Dedupe fingerprint too coarse. Added `err.message` to the fingerprint so distinct messages from the same catch site don't suppress each other within the 60s window.
Existing call sites with stringified errors (`logger.error({ error: errorMessage(err) }, ...)`): pre-existing tech debt, not regressed by this PR. The forwarder handles Error-object payloads cleanly; the string-error sites lose stack/type info but still produce captureMessage events that Sentry can group. Tracked as a follow-up cleanup pass.
Warn-skip would leave orphans being polled forever. `createSyncRunWithClient` now soft-deletes the feed in-place when no active definition is found, so CheckDueFeeds stops selecting it. Operators recover by clearing `deleted_at` after reinstalling the definition.

Build + typecheck clean.

Main's #775 added a migration at 20260517020000; bumping mine to 20260517030000 so file order is unambiguous.

…essaging (#773) * feat(chrome-extension): MV3 connector scaffold + dynamic device capability authorization - @lobu/core capabilities.ts: platform-keyed allowlist registry + authorizeCapabilities(). Device workers can only advertise capabilities whitelisted for their reported platform; anything outside the allowlist is silently dropped before connector matching. - worker-api.ts: intersect advertised capabilities with the registry for user-scoped workers; trusted-fleet (no platform) path unchanged. - apps/chrome/: MV3 service-worker + sidepanel-iframe scaffold. Pairs via standard RFC 8628 OAuth device-authorization flow against the existing gateway endpoints (same as the Mac app — no new endpoint needed). Baseline capabilities: browser.tabs, browser.scripting, browser.debugger. Optional history/bookmarks via runtime chrome.permissions.request(). - bridge.js: typed postMessage broker between sidepanel iframe and service worker. Named-ops allowlist, origin-checked, deny-by-default — defends the privileged extension surface from the (trusted today) embedded owletto-web app. Verified end-to-end against local PGlite: device_workers row appears with platform=chrome-extension and the authorized capability set after pairing. * feat(chrome-extension): permissions UI, /embedded route, chrome.tabs connector Three follow-ups to the initial Chrome-extension scaffold: apps/chrome (extension): - permissions.html / permissions.js: opt-in / revoke UI for the optional Chrome permissions (history, bookmarks). Toggling updates the granted set; the next /api/workers/poll re-advertises capabilities automatically (background.js recomputes via chrome.permissions.contains each cycle). - sidepanel topbar adds a "Permissions" link to the new page. - background.js gains a minimal run executor (executeRun): on each poll, if the gateway hands back a chrome.tabs run, it queries the live tab list, streams one batch of `tab_snapshot` events, and marks the run complete. Non-chrome.tabs runs fail fast with a marker — heartbeat loop, multi-batch streaming, and action runs are explicit v2 backlog items in SCOPE.md. packages/web (owletto-web): - /embedded route: chromeless layout (no sidebar, no nav, no command palette) targeted by the extension's sidepanel iframe. Reads worker_id from the URL fragment. Reuses the existing isStandalonePage path in __root.tsx. packages/server (gateway): - CSP frame-ancestors picks up `chrome-extension:` for /embedded only — every other HTML response keeps the existing 'self' + lobu.ai allowlist so the rest of the app can't be embedded by arbitrary extensions. packages/connectors: - chrome_tabs.ts: requiredCapability='browser.tabs', runtime.platforms=['chrome-extension']. Lists open tabs as `tab_snapshot` events. Cloud sync/execute throw (bridge-only) — actual work lives in the extension's service worker. Typecheck clean. End-to-end smoke: device pairs, advertises authorized capability set, auto-wires the chrome.tabs connector via existing device- reconcile loop, executor handles the run shape end-to-end (single batch). * chore: bump web submodule — /embedded route for Chrome extension sidepanel Submodule pointer follows the matching commit on feat/chrome-extension-embedded-route in lobu-ai/owletto-web. Branch will need to be pushed before the parent feat/chrome-extension-connector branch goes up — current SHA is local-only. * feat(chrome-extension): user-configurable gateway URL Lobu/Owletto is self-hosted — the extension can't ship with a fixed origin. First-run pairing now asks for the user's gateway URL before the OAuth dance: - config.js: drop GATEWAY_URL constant. Add getGatewayUrl() / getEmbeddedAppUrl() helpers that read from chrome.storage. The hardcoded http://localhost:8787 default is now just the pairing-screen pre-fill. - pairing.html / pairing.js: new setup step before the OAuth flow. URL is validated by fetching the well-known OAuth discovery doc; the extension requests chrome.permissions.request({origins}) for the entered host so subsequent fetches don't hit CORS. - background.js, sidepanel.js, bridge.js: read gateway URL from storage at call time. Falls back to opening pairing.html when missing. - permissions.html: shows the configured server with a "Change" action that clears creds + URL and re-runs setup. - manifest.json: drop static host_permissions entries for the (placeholder) owletto.ai hosts. Origin grants now happen at runtime via optional_host_permissions: ["<all_urls>"] — also cleaner for Web Store review. Typecheck clean. The previous local-dev pairing path still works: leave the default http://localhost:8787, click Continue, then Get pairing code. * feat(devices): native-messaging SSO from Mac bridge to Chrome extension When the Owletto Mac app is installed and signed in, Owletto for Chrome auto-pairs on first run with zero second login — no URL typing, no OAuth device-code dance. Gateway: - worker-api.ts: mintDeviceChildToken — POST /api/me/devices/mint-child-token. Auth: caller's bearer. Body: {platform, label?}. Generates a new worker_id, mints a fresh PAT in the caller's personal org with the device_worker:run scope, returns {worker_id, access_token, gateway_url}. Platform is gated against @lobu/core/capabilities (chrome-extension only for now). Extension: - manifest.json: re-add `nativeMessaging` permission. - pairing.js: on load, attempt chrome.runtime.connectNative("ai.owletto.bridge") with a 2.5s timeout. On success: request host-permission for the returned gateway origin, persist {gatewayUrl, workerId, accessToken}, close. On timeout/error: fall back to existing URL-setup → OAuth flow. Mac bridge: - ChromeBridgeHost.swift (NEW): two responsibilities. runHostIfRequested() serves a single native-messaging request when launched with --owletto-bridge (4-byte length-prefixed JSON on stdin/stdout); reads the user's stored OAuth bearer from KeychainTokenStore, calls /api/me/devices/mint-child-token, writes the response. installManifests() drops ai.owletto.bridge.json into Chrome / Brave / Arc / Edge NativeMessagingHosts directories, idempotent, with allowed_origins from LOBU_OWLETTO_CHROME_EXTENSION_ID env override. - LobuApp.swift init(): calls runHostIfRequested() before SwiftUI scene builds, then installManifests(). Outstanding (next slice): - ChromeBridgeHost.swift needs to be dragged into the Xcode project's Lobu target manually — until then LobuApp.swift's init() won't compile. Avoiding hand-editing project.pbxproj from this session. - Web Store extension ID will be hardcoded once published; today only the env override drives the manifest's allowed_origins. Typecheck clean. * fix(mac): wire ChromeBridgeHost.swift into Xcode target + compile fixes - Lobu.xcodeproj/project.pbxproj: add ChromeBridgeHost.swift to the PBXFileReference, PBXBuildFile, PBXGroup, and PBXSourcesBuildPhase sections. IDs follow the existing sequential B*0015 slot. Xcode now picks up the file in the Lobu target without a manual drag-and-drop. - ChromeBridgeHost.swift: switch Result<…, String> to Result<…, BridgeError> (Swift requires the error type to conform to Error), and fix the keychain store class name (KeychainCredentialStore, not KeychainTokenStore). Verified: `xcodebuild -project apps/mac/Lobu.xcodeproj -scheme Lobu -configuration Debug build` → BUILD SUCCEEDED. * fix(mac-bridge): readFrame consumption, URLSession main-thread deadlock, DEBUG creds env Three end-to-end blockers found while exercising the Mac bridge via a synthetic Chrome native-messaging spawn: 1. ChromeBridgeHost.readFrame consumed bytes via FileHandle.availableData then used .prefix(4) to "peek" the length header. availableData is read-once — any payload bytes that arrived in the same chunk as the header were dropped, and the subsequent payload-read loop blocked on EOF. Rewrote to accumulate into a single buffer and slice out header + payload from it. 2. mintChildToken called URLSession.shared.dataTask().resume() + sem.wait() from the main thread. URLSession.shared's default delegate queue is the main queue, so the completion handler couldn't fire — classic deadlock. Switched to a dedicated ephemeral URLSession with its own OperationQueue. 3. Production builds read OAuth creds from keychain (correct). For debug/ad-hoc-signed dev builds TCC blocks unsigned binaries from reading keychain items silently — the bridge subprocess hangs forever waiting for a prompt that can't render. Added a DEBUG-only env-var fallback (LOBU_BRIDGE_TEST_BASE_URL / LOBU_BRIDGE_TEST_TOKEN) so the bridge flow can be exercised without code signing. #if DEBUG-gated; never read in RELEASE. Plus granular [bridge] stderr logging to make future plumbing issues easy to diagnose. Verified end-to-end against the local gateway: 4-byte length-prefixed JSON frame in, mint-child-token call returns 200, response frame out. * chore(mac-bridge): gate diagnostic stderr logs behind #if DEBUG * chore: bump web submodule pointer to merged main SHA * fix(chrome-extension): security + correctness review fixes Addresses three HIGH-impact issues + one MEDIUM correctness bug surfaced in code review: 1. (HIGH) Capability authorization was bypassed when claiming runs. pollWorkerJob's user-scoped claim SQL filtered on `advertisedCapabilities` (the raw declared set), so a device claiming `os.shell` would have it dropped for storage but could still match and claim a connector run requiring it. Switched the two device branches in the claim SQL to `authorizedCapabilities` (the post- allowlist set). 2. (HIGH) `mintDeviceChildToken` only checked `c.var.user?.id`, so any authenticated session that mcpAuth accepts could escalate into a worker token. Now requires the caller's bearer to already carry the `device_worker:run` scope — i.e. only an existing device-worker token (or a PAT minted for that purpose) can mint a sibling. Browser sessions can't silently escalate; users wanting to pair Chrome from a browser go through OAuth device-authorization, not this endpoint. 3. (MEDIUM) Native-messaging readFrame trusted the wire-level 32-bit length. A buggy or malicious allowed extension could hang or memory-pressure the host on a crafted header. Added a 64 KB cap on the payload length (our pair request is ~50 bytes; this is generous). 4. (MEDIUM) Extension's `executeRun` sent the wrong field names to /api/workers/complete (`items_count`, `error`) — server expects `items_collected`, `error_message`. Tab-snapshot runs were completing as 0-item successes and failures dropped their diagnostic. SCOPE.md documents three additional review notes as follow-ups: the native host's argv check needs a stronger same-user defense, child PATs aren't yet bound to their `worker_id`, and the extension still advertises capabilities ahead of executors. Typecheck + xcode build clean. * fix(devices): platform is set-once at the worker level; restore readFrame cap Re-review surfaced two remaining gaps from the previous security pass: 1. (HIGH) Capability authorization trusted self-reported `platform` on every poll. A compromised chrome-extension PAT could post `platform: "macos"` and unlock the macOS allowlist — claim `os.shell` runs the device can't actually execute. Fix has three parts: - Poll handler reads the stored platform for (user_id, worker_id); if it exists and differs from the posted one, rejects 403 `platform_mismatch`. - UPSERT on conflict now COALESCE-preserves device_workers.platform (defense-in-depth against a race between SELECT and UPSERT). - Capability authorization uses the stored platform when present, not the body-posted one. - `mintDeviceChildToken` pre-creates the device_workers row with `platform` set, so the very first poll from a freshly-minted child already finds the platform locked. 2. (MEDIUM, re-introduced by an earlier edit revert) The 64 KB cap on the native-messaging frame length had been silently dropped. Restored `maxFramePayloadBytes` + bounds check in ChromeBridgeHost.readFrame. Typecheck + xcode build clean. The non-blocking review notes remain documented in apps/chrome/SCOPE.md. * fix(devices): bind PATs minted via mint-child-token to their worker_id Closes the last platform-binding bypass: a chrome-extension PAT could post a fresh, never-seen `worker_id` plus `platform: "macos"` and register a new macOS-platform device_workers row under the same user. The pre-existing-row platform lock didn't help because the row didn't exist yet. Fix: `personal_access_tokens` now has a `worker_id` column. PATs minted via /api/me/devices/mint-child-token carry the worker_id they're bound to. The worker-poll handler reads the bound id from `mcpAuthInfo` and rejects requests whose body posts a different one (403 `worker_id_mismatch`). Legacy paths (OAuth-issued bearers for the Mac/iOS bridges, the existing CLI PATs) keep `worker_id = NULL`; the poll handler treats NULL as "no binding" so they can register any worker_id they pick on first poll — same as today. - db/migrations/20260517020000_pat_worker_id_binding.sql: nullable column + partial index. - db/schema.sql: matching column + index entries. - auth/oauth/types.ts: AuthInfo.workerId optional. - auth/tokens.ts: create() accepts workerId option; verify() returns it. - worker-api.ts (pollWorkerJob): rejects on mismatch before the capability-authorization step. - worker-api.ts (mintDeviceChildToken): passes workerId to create(). * fix(worker-api): empty-string worker_id bypass on bound PATs The bound-token check skipped when body's worker_id was empty (falsy). Compare unconditionally; also require non-empty worker_id from any user-scoped caller. Otherwise a bound chrome-extension PAT could post worker_id='' and have the binding check + platform lock skipped, registering under (user_id, '') with attacker-chosen platform. * fix(capabilities): include legacy Mac caps in macos allowlist A Mac bridge advertises 'screentime', 'local_directory', 'healthkit', 'photos', 'whatsapp_local' from AppState.swift:caps. The macos allowlist only had os.* + browser.*, so these would all get dropped server-side and Mac connectors would stop matching their device. Add a LEGACY_MAC_CAPABILITIES export that the macos entry includes — keeps existing Mac functionality intact while still excluding cross-platform claims (a Mac PAT can't suddenly claim 'browser.history' via authorization, only via actually having the chrome-extension already covers it under its own platform binding). * fix(security+regression): worker_id binding hardening + Mac caps + migration format Three follow-ups from the security re-review: 1. (HIGH) `worker_id: ''` bypassed the bound-token check (empty string was falsy in the previous guard). pollWorkerJob now requires a non-empty `worker_id` from any user-scoped caller and compares the bound id unconditionally — no truthiness short-circuit. 2. (HIGH regression) macos capability allowlist dropped 'screentime', 'local_directory', 'healthkit', 'photos', 'whatsapp_local' — the existing Mac bridge advertises these and its connectors would have silently stopped matching. Adds MAC_DEVICE_CAPABILITIES alongside OS_/BROWSER_CAPABILITIES. 3. CI: migration was missing `-- migrate:down`; lint-format job failed. Added the down section + matching teardown. 4. Submodule bump: pulls in lobu-ai/owletto#138 — adds /embedded to the all-routes smoke test (was missing in #137). * chore: re-time PAT-worker-id migration past collision with main Main's #775 added a migration at 20260517020000; bumping mine to 20260517030000 so file order is unambiguous. * fix(schema): add 20260517030000 entry to schema_migrations

chatgpt-codex-connector Bot reviewed May 16, 2026

View reviewed changes

coderabbitai Bot reviewed May 16, 2026

View reviewed changes

chore: bump web submodule to current main for drift check

34a977d

buremba merged commit bdea9d5 into main May 16, 2026
18 of 21 checks passed

buremba deleted the fix/monitoring-and-deploy-gaps branch May 16, 2026 22:02

buremba mentioned this pull request May 16, 2026

chore(main): release lobu 7.1.0 #724

Merged

buremba added a commit that referenced this pull request May 16, 2026

chore: re-time PAT-worker-id migration past collision with main

bd7e723

Main's #775 added a migration at 20260517020000; bumping mine to 20260517030000 so file order is unambiguous.

buremba mentioned this pull request May 16, 2026

feat(chart): auto-pick RollingUpdate when workspaces is RWX #776

Merged

3 tasks

This was referenced May 18, 2026

feat(server,chart): flip snapshot default + drop workspaces PVC (Phase 5) #871

Merged

chore(chart): align lobu chart to owletto fork (pre-consolidation, byte-identical) #882

Merged

fix(start-local): close 7 PGlite/Postgres parity-hygiene risks #943

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: close monitoring + deploy gaps from post-incident audit#775

fix: close monitoring + deploy gaps from post-incident audit#775
buremba merged 3 commits into
mainfrom
fix/monitoring-and-deploy-gaps

buremba commented May 16, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 16, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov-commenter commented May 16, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 16, 2026

Uh oh!

buremba commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

buremba commented May 16, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

1. pino → Sentry forwarding (utils/logger.ts)

2. Orphan-feed cleanup (queue-helpers.ts + 20260517020000_softdelete_orphan_feeds.sql)

3. Graceful drain (charts/lobu/templates/deployment.yaml + values.yaml)

Validation

Out of scope (acknowledged)

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

codecov-commenter commented May 16, 2026

Codecov Report

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 16, 2026

Choose a reason for hiding this comment

Uh oh!

buremba commented May 16, 2026

pi review — 6 findings, all addressed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

buremba commented May 16, 2026 •

edited by coderabbitai Bot

Loading

1. pino → Sentry forwarding (`utils/logger.ts`)

2. Orphan-feed cleanup (`queue-helpers.ts` + `20260517020000_softdelete_orphan_feeds.sql`)

3. Graceful drain (`charts/lobu/templates/deployment.yaml` + `values.yaml`)

coderabbitai Bot commented May 16, 2026 •

edited

Loading