Skip to content

fix: close monitoring + deploy gaps from post-incident audit#775

Merged
buremba merged 3 commits into
mainfrom
fix/monitoring-and-deploy-gaps
May 16, 2026
Merged

fix: close monitoring + deploy gaps from post-incident audit#775
buremba merged 3 commits into
mainfrom
fix/monitoring-and-deploy-gaps

Conversation

@buremba
Copy link
Copy Markdown
Member

@buremba buremba commented May 16, 2026

Why

Post-incident audit after #771 + #772 landed. Prod is healthy but three real gaps remained that would let the next similar incident slip past monitoring or extend its blast radius:

Gap Observed today Risk
Sentry has zero coverage of logger errors 1,914 errors / 5 min in stdout, 0 Sentry issues in 24h Background-job failures invisible to monitoring; today's noise masks tomorrow's signal
'website' connector orphan feeds ~380 error logs/min from CheckDueFeeds Log spam, hides real errors; not user-facing
Single-replica Recreate strategy every deploy = ~30s "no available server" Cloudflare 502s on every rollout

What

1. pino → Sentry forwarding (utils/logger.ts)

Adds a pino destination that mirrors stdout AND forwards level >= error to Sentry. In-process dedupe by (msg, err.type, top stack frame) with a 60s window — Sentry has its own grouping but every captureException still costs an HTTP call, and a repeating error at 380/min would saturate the budget.

2. Orphan-feed cleanup (queue-helpers.ts + 20260517020000_softdelete_orphan_feeds.sql)

  • Code: createSyncRunWithClient now warns + skips when no active connector_definition exists for (connector_key, organization_id). Pre-fix it threw, producing the error stream.
  • Migration: one-shot soft-delete of feeds matching the orphan criteria (no pinned_version, active connection, no active definition in the org). Conservative criteria — only touches feeds that the current code would also fail on.

3. Graceful drain (charts/lobu/templates/deployment.yaml + values.yaml)

  • preStop lifecycle hook: 15s sleep (configurable via app.preStopDelaySeconds). When k8s decides to terminate, this lets Service endpoint deregistration propagate through kube-proxy + Cloudflare LB before the process gets SIGTERM. In-flight requests routed during the deregistration lag still hit a live process.
  • terminationGracePeriodSeconds: 45 (configurable). Covers preStop + the graceful shutdown in server.ts (~5s for HTTP server close, task scheduler stop, DB pool drain).
  • Strategy stays Recreate (workspaces PVC is RWO). Comment explains the path to RollingUpdate: needs RWX storage or workspaces split off the gateway deployment. Not done here because it's an ops-repo storage-class decision, not in-tree code.

Validation

  • make typecheck clean
  • make build-packages builds
  • Migration applies on a fresh PG (validated locally via docker pgvector/pgvector:pg18)
  • db/schema.sql diff is intent-only (one new schema_migrations row)
  • Manual: after deploy, send a logger.error(new Error('test'), ...) from a debug endpoint, confirm a Sentry issue appears
  • Manual: trigger a rollout, observe whether the "no available server" window shrinks below the prior ~30s

Out of scope (acknowledged)

  • True elimination of the deploy gap needs RWX storage for workspaces or moving workspaces off the gateway pod entirely. That's an ops-repo decision (storage class, cost). Current commit reduces the gap but doesn't close it. Tracked for follow-up.
  • The orphan-feed migration uses conservative criteria; it should be safe but operators with non-standard feed setups may want to inspect SELECT count(*) FROM feeds WHERE ... before/after.

Summary by CodeRabbit

  • New Features

    • Enhanced error monitoring with deduplication to reduce duplicate reports.
    • Improved graceful pod shutdown with optional pre-stop delay and configurable termination timeout.
  • Bug Fixes

    • Automatic cleanup (soft-delete) of orphaned feeds when connectors are missing.
  • Chores

    • Added a one‑time DB migration and schema update to apply the orphan-feed cleanup.

Review Change Stack

Three gaps identified by the audit after #771/#772 landed (1914
errors/5min in stdout, zero Sentry issues, single-replica Recreate
deploy strategy = ~30s "no available server" window per rollout):

* **Sentry forwarding from pino** (utils/logger.ts). The existing
  Sentry capture middleware only fires on HTTP 500 responses;
  `logger.error()` from background jobs (CheckDueFeeds, runs queue,
  scheduled tasks) was invisible to monitoring. Now a pino destination
  inspects each log line and forwards `level >= error` to
  `Sentry.captureException` (with stack reconstruction) or
  `captureMessage`. In-process dedupe by (msg, err.type, top stack
  frame) with a 60s window prevents the 380/min repeating-error
  pattern from saturating Sentry — Sentry has its own grouping but
  every captureException still incurs an HTTP call.

* **Orphan-feed cleanup + graceful skip** (queue-helpers.ts +
  20260517020000_softdelete_orphan_feeds.sql). The 'website'
  connector was archived/uninstalled in some orgs but their feeds
  weren't soft-deleted. `createSyncRunWithClient` threw on every
  CheckDueFeeds tick when no active connector_definition existed for
  the (key, org) pair — produced the 380/min error stream that
  masked real issues. Code change: warn + skip instead of throw, so
  future orphans don't spam logs. Migration: one-shot soft-delete of
  feeds matching the orphan criteria (no pinned_version, active
  connection, no active connector_definition for the org).

* **PreStop drain + grace period** (charts/lobu/templates/
  deployment.yaml, values.yaml). Single-replica gateway with
  `strategy: Recreate` (workspaces PVC is RWO) means every deploy
  has a ~30s window where the old pod is gone but the new one isn't
  ready — Cloudflare returns "no available server" to live traffic
  during this window. The 2026-05-16 incident felt extreme because
  the pod was actually crash-looping; the same shape happens briefly
  on every healthy deploy. PreStop hook (15s sleep, configurable via
  app.preStopDelaySeconds) gives Service endpoint deregistration time
  to propagate before SIGTERM, so requests in-flight during the drain
  still hit a live process. terminationGracePeriodSeconds bumped to
  45 to cover preStop + graceful shutdown (~5s). Full elimination
  needs RWX storage for workspaces so we can roll instead of recreate
  — documented in the deployment.yaml strategy comment.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 3f1ca546-7afa-4e9a-a38e-bf73df7e78d8

📥 Commits

Reviewing files that changed from the base of the PR and between 1636fb0 and 34a977d.

📒 Files selected for processing (1)
  • packages/web

📝 Walkthrough

Walkthrough

Adds configurable Helm pod termination timing and optional preStop hook, forwards error/fatal logs to Sentry with in-process deduplication and a marker to avoid duplicate reports, and soft-deletes orphan feeds with a migration and runtime orphan-handling change.

Changes

Graceful pod shutdown configuration

Layer / File(s) Summary
Pod shutdown lifecycle and Helm values
charts/lobu/templates/deployment.yaml, charts/lobu/values.yaml
Deployment template emits terminationGracePeriodSeconds templated from Values.app.terminationGracePeriodSeconds and conditionally renders a lifecycle.preStop.exec sleep when app.preStopDelaySeconds > 0. Helm values add app.preStopDelaySeconds (default 0) and app.terminationGracePeriodSeconds (default 45) with explanatory comments.

Sentry error observability

Layer / File(s) Summary
Sentry forwarding, deduplication, and stream
packages/server/src/utils/logger.ts, packages/server/src/server.ts
Logger imports @sentry/node, enforces Pino err serializer, implements an in-memory dedupe map and capture path for parsed error/fatal records via a custom destination stream that still writes to stdout. server.ts marks errors with sentryReported: true when already sent to Sentry.

Orphan feed cleanup and handling

Layer / File(s) Summary
Soft-delete orphan feeds and record migration
db/migrations/20260517020000_softdelete_orphan_feeds.sql, db/schema.sql
New migration sets deleted_at = now() for active, unpinned feeds whose connection exists but which lack an active connector_definitions entry for (organization_id, connector_key). schema.sql records the new migration version; down migration is a no-op with manual recovery notes.
Queue-helpers graceful orphan treatment
packages/server/src/utils/queue-helpers.ts
createSyncRunWithClient now soft-deletes feeds detected as orphans, logs a structured warning including feedId, connector_key, and organization_id, and returns null (skip) instead of throwing.

Sequence Diagram(s)

sequenceDiagram
  participant App as Application (server.ts)
  participant Logger as Pino (sentryAwareStream)
  participant Stdout as Stdout
  participant Sentry as `@sentry/node`
  App->>Logger: log(error/fatal, {err|error, msg}) (may set sentryReported)
  Logger->>Stdout: write JSON line
  Logger->>Logger: parse JSON -> dedupe check
  Logger->>Sentry: captureException / captureMessage (if not deduped && not sentryReported)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐰 In charts and logs the small changes creep,

Pods nap before they fall to sleep,
Errors hop to Sentry, once, not twice,
Orphan feeds tuck in, soft-deleted nice,
A rabbit hums: "All systems rest and keep."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: close monitoring + deploy gaps from post-incident audit' is concise, clear, and accurately summarizes the main objectives of the changeset: addressing monitoring and deployment gaps identified in a post-incident audit.
Description check ✅ Passed The pull request description provides comprehensive coverage of all required sections: clear why/what/validation structure, specific test plan checkboxes marked as run, and detailed notes including follow-up considerations and out-of-scope items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/monitoring-and-deploy-gaps

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3aefea639

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const reconstructed = new Error(errObj.message);
if (errObj.stack) reconstructed.stack = errObj.stack;
Sentry.captureException(reconstructed, {
extra: parsed,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Sanitize log metadata before sending it to Sentry

When an error log includes application payloads, this forwards the entire parsed log line to Sentry as extra. For example, packages/server/src/tools/admin/manage_classifiers.ts:901 logs { error, args }, where args includes user-provided classification values/reasoning; with this change any failure in that path now exports those fields to Sentry instead of only stdout. Please redact or whitelist metadata before attaching it to Sentry events.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
packages/server/src/utils/logger.ts (1)

83-88: ⚡ Quick win

Preserve error type for better Sentry grouping.

The reconstructed Error loses its original type (e.g., TypeError, SyntaxError). Sentry uses error.name for grouping, so different error types with similar messages will be incorrectly grouped together.

♻️ Proposed fix
       // Reconstruct an Error so Sentry's grouping works on the stack.
       const reconstructed = new Error(errObj.message);
       if (errObj.stack) reconstructed.stack = errObj.stack;
+      if (errObj.type) reconstructed.name = errObj.type;
       Sentry.captureException(reconstructed, {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@packages/server/src/utils/logger.ts` around lines 83 - 88, The reconstructed
Error currently loses the original error type/name used by Sentry for grouping;
when building the synthetic error before calling Sentry.captureException (using
variables errObj, reconstructed, parsed, level), set the reconstructed error's
name to the original type (e.g., reconstructed.name = errObj.name ||
reconstructed.name) so Sentry preserves error.name/grouping and include the same
reconstructed.stack and extra/tags as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@charts/lobu/templates/deployment.yaml`:
- Line 50: The current use of Helm's default() treats 0 as empty and overrides
explicit zeros; change both occurrences (preStopDelaySeconds and
terminationGracePeriodSeconds) to check presence instead of using default(),
e.g. replace {{ .Values.app.terminationGracePeriodSeconds | default 45 }} and {{
.Values.app.preStopDelaySeconds | default 15 }} with an explicit conditional
using hasKey/.Values.app presence (for example: if hasKey .Values.app
"terminationGracePeriodSeconds" then render
.Values.app.terminationGracePeriodSeconds else render 45) so that a
user-provided 0 is honored.

---

Nitpick comments:
In `@packages/server/src/utils/logger.ts`:
- Around line 83-88: The reconstructed Error currently loses the original error
type/name used by Sentry for grouping; when building the synthetic error before
calling Sentry.captureException (using variables errObj, reconstructed, parsed,
level), set the reconstructed error's name to the original type (e.g.,
reconstructed.name = errObj.name || reconstructed.name) so Sentry preserves
error.name/grouping and include the same reconstructed.stack and extra/tags as
before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e8f6fe82-8615-4ea0-84f2-8113bb28f282

📥 Commits

Reviewing files that changed from the base of the PR and between 653566f and b3aefea.

📒 Files selected for processing (6)
  • charts/lobu/templates/deployment.yaml
  • charts/lobu/values.yaml
  • db/migrations/20260517020000_softdelete_orphan_feeds.sql
  • db/schema.sql
  • packages/server/src/utils/logger.ts
  • packages/server/src/utils/queue-helpers.ts

# task scheduler, embedded gateway, DB pool, HTTP server). Default
# k8s grace period is 30s; bump so the preStop sleep doesn't eat
# the whole window.
terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds | default 45 }}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does Helm's defaultfunction treat numeric0 as empty (therefore returning the fallback value)?

💡 Result:

Yes. Helm’s default function treats the numeric value 0 as “empty”, so it will return the fallback value instead of 0. This comes from Sprig’s default semantics: the definition of “empty” includes numeric 0, and for “non-empty” values the given value is used, otherwise the default is returned [1]. Helm documents default as default DEFAULT_VALUE GIVEN_VALUE and that it returns the default “if [the given value] is empty” [2]. Helm issues report that when a value is set to 0 (integer), default behaves as if it were unset/empty and the fallback is used [3][4]. Practical implication: if 0 is a valid value in your chart, don’t use default directly; instead use an explicit check (for example, distinguish “not set” vs “set to 0”) as suggested by Helm community discussions [3][4].

Citations:


🏁 Script executed:

fd -type f -name 'values.yaml' charts/

Repository: lobu-ai/lobu

Length of output: 227


🏁 Script executed:

# Read the values.yaml to check for terminationGracePeriodSeconds and preStopDelaySeconds
cat -n charts/lobu/values.yaml | grep -A 5 -B 5 "terminationGracePeriodSeconds\|preStopDelaySeconds"

Repository: lobu-ai/lobu

Length of output: 857


🏁 Script executed:

cat -n charts/lobu/templates/deployment.yaml | sed -n '45,55p; 118,125p'

Repository: lobu-ai/lobu

Length of output: 1255


default here overrides explicit 0 and breaks configurability.

On Line 50 and Line 122, Helm default will treat 0 as empty and fall back to 45/15. That means operators cannot intentionally set preStopDelaySeconds: 0 (disable delay) or terminationGracePeriodSeconds: 0.

Suggested fix
-      terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds | default 45 }}
+      terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds }}
...
-                command: ["sh", "-c", "sleep {{ .Values.app.preStopDelaySeconds | default 15 }}"]
+                command: ["sh", "-c", "sleep {{ .Values.app.preStopDelaySeconds }}"]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@charts/lobu/templates/deployment.yaml` at line 50, The current use of Helm's
default() treats 0 as empty and overrides explicit zeros; change both
occurrences (preStopDelaySeconds and terminationGracePeriodSeconds) to check
presence instead of using default(), e.g. replace {{
.Values.app.terminationGracePeriodSeconds | default 45 }} and {{
.Values.app.preStopDelaySeconds | default 15 }} with an explicit conditional
using hasKey/.Values.app presence (for example: if hasKey .Values.app
"terminationGracePeriodSeconds" then render
.Values.app.terminationGracePeriodSeconds else render 45) so that a
user-provided 0 is honored.

Six findings, all addressed:

* **#1 preStop on Recreate makes the gap WORSE.** Pi caught the
  reversal — under `strategy: Recreate` the new pod doesn't start
  until the old one fully terminates, so adding a preStop sleep
  EXTENDS the no-available-server window rather than shrinking it.
  Default `preStopDelaySeconds` to 0 and only emit the lifecycle
  hook when > 0. Document that ops repos should set it only with
  RollingUpdate (which needs RWX storage on workspaces — out of
  scope here).
* **#2 migration too broad.** Tightened WHERE to require
  `feeds.status='active'` and `connections.status='active'` so we
  match exactly the set CheckDueFeeds would process. Paused /
  pending_auth / error feeds are left alone — they don't
  contribute to the error spam and may be intentionally recoverable.
* **#3 Sentry double-report.** `server.ts onError` already calls
  `Sentry.captureException` then `logger.error`; the new logger
  forwarder would have sent the same event again. Added
  `sentryReported: true` marker on that log line; logger transport
  skips forwarding when it sees the marker.
* **#4 dedupe fingerprint too coarse.** Added `err.message` to the
  fingerprint so distinct messages from the same catch site (same
  Error type, same top stack frame) get distinct fingerprints
  within the 60s window.
* **#5 existing string-error call sites are pre-existing tech debt**
  — left as-is for now. Documented in the PR body as a follow-up;
  the new forwarder handles `err` / `error` Error-object payloads
  well, and call sites with stringified errors are a separate
  cleanup pass.
* **#6 warn-skip leaves future orphans being polled forever.** Now
  `createSyncRunWithClient` soft-deletes the feed in-place when no
  active connector_definition is found, so CheckDueFeeds stops
  selecting it. Operators recover by clearing `deleted_at` after
  reinstalling the definition.
@buremba
Copy link
Copy Markdown
Member Author

buremba commented May 16, 2026

pi review — 6 findings, all addressed

1636fb0:

  1. preStop on Recreate would make the gap WORSE. Pi caught a real reversal. Under `strategy: Recreate` the new pod waits for the old one to terminate, so adding a preStop sleep EXTENDS the no-available-server window. Default `preStopDelaySeconds` is now 0; the lifecycle hook is only emitted when > 0. Operators only set it with RollingUpdate (which needs RWX storage on workspaces — separate ops decision).
  2. Migration too broad. Tightened WHERE to require `feeds.status='active' AND connections.status='active'`, matching exactly the set CheckDueFeeds processes. Paused / pending_auth / error feeds untouched.
  3. Sentry double-report. `server.ts onError` captures the exception then logs it; the new forwarder would re-send. Added `sentryReported: true` marker on those log lines; forwarder skips when seen.
  4. Dedupe fingerprint too coarse. Added `err.message` to the fingerprint so distinct messages from the same catch site don't suppress each other within the 60s window.
  5. Existing call sites with stringified errors (`logger.error({ error: errorMessage(err) }, ...)`): pre-existing tech debt, not regressed by this PR. The forwarder handles Error-object payloads cleanly; the string-error sites lose stack/type info but still produce captureMessage events that Sentry can group. Tracked as a follow-up cleanup pass.
  6. Warn-skip would leave orphans being polled forever. `createSyncRunWithClient` now soft-deletes the feed in-place when no active definition is found, so CheckDueFeeds stops selecting it. Operators recover by clearing `deleted_at` after reinstalling the definition.

Build + typecheck clean.

@buremba buremba merged commit bdea9d5 into main May 16, 2026
18 of 21 checks passed
@buremba buremba deleted the fix/monitoring-and-deploy-gaps branch May 16, 2026 22:02
buremba added a commit that referenced this pull request May 16, 2026
Main's #775 added a migration at 20260517020000; bumping mine to
20260517030000 so file order is unambiguous.
buremba added a commit that referenced this pull request May 16, 2026
…essaging (#773)

* feat(chrome-extension): MV3 connector scaffold + dynamic device capability authorization

- @lobu/core capabilities.ts: platform-keyed allowlist registry +
  authorizeCapabilities(). Device workers can only advertise capabilities
  whitelisted for their reported platform; anything outside the allowlist
  is silently dropped before connector matching.
- worker-api.ts: intersect advertised capabilities with the registry for
  user-scoped workers; trusted-fleet (no platform) path unchanged.
- apps/chrome/: MV3 service-worker + sidepanel-iframe scaffold. Pairs via
  standard RFC 8628 OAuth device-authorization flow against the existing
  gateway endpoints (same as the Mac app — no new endpoint needed).
  Baseline capabilities: browser.tabs, browser.scripting, browser.debugger.
  Optional history/bookmarks via runtime chrome.permissions.request().
- bridge.js: typed postMessage broker between sidepanel iframe and service
  worker. Named-ops allowlist, origin-checked, deny-by-default — defends
  the privileged extension surface from the (trusted today) embedded
  owletto-web app.

Verified end-to-end against local PGlite: device_workers row appears with
platform=chrome-extension and the authorized capability set after pairing.

* feat(chrome-extension): permissions UI, /embedded route, chrome.tabs connector

Three follow-ups to the initial Chrome-extension scaffold:

apps/chrome (extension):
- permissions.html / permissions.js: opt-in / revoke UI for the optional
  Chrome permissions (history, bookmarks). Toggling updates the granted set;
  the next /api/workers/poll re-advertises capabilities automatically
  (background.js recomputes via chrome.permissions.contains each cycle).
- sidepanel topbar adds a "Permissions" link to the new page.
- background.js gains a minimal run executor (executeRun): on each poll,
  if the gateway hands back a chrome.tabs run, it queries the live tab list,
  streams one batch of `tab_snapshot` events, and marks the run complete.
  Non-chrome.tabs runs fail fast with a marker — heartbeat loop, multi-batch
  streaming, and action runs are explicit v2 backlog items in SCOPE.md.

packages/web (owletto-web):
- /embedded route: chromeless layout (no sidebar, no nav, no command
  palette) targeted by the extension's sidepanel iframe. Reads worker_id
  from the URL fragment. Reuses the existing isStandalonePage path in
  __root.tsx.

packages/server (gateway):
- CSP frame-ancestors picks up `chrome-extension:` for /embedded only —
  every other HTML response keeps the existing 'self' + lobu.ai allowlist
  so the rest of the app can't be embedded by arbitrary extensions.

packages/connectors:
- chrome_tabs.ts: requiredCapability='browser.tabs',
  runtime.platforms=['chrome-extension']. Lists open tabs as `tab_snapshot`
  events. Cloud sync/execute throw (bridge-only) — actual work lives in the
  extension's service worker.

Typecheck clean. End-to-end smoke: device pairs, advertises authorized
capability set, auto-wires the chrome.tabs connector via existing device-
reconcile loop, executor handles the run shape end-to-end (single batch).

* chore: bump web submodule — /embedded route for Chrome extension sidepanel

Submodule pointer follows the matching commit on
feat/chrome-extension-embedded-route in lobu-ai/owletto-web. Branch will
need to be pushed before the parent feat/chrome-extension-connector
branch goes up — current SHA is local-only.

* feat(chrome-extension): user-configurable gateway URL

Lobu/Owletto is self-hosted — the extension can't ship with a fixed
origin. First-run pairing now asks for the user's gateway URL before the
OAuth dance:

- config.js: drop GATEWAY_URL constant. Add getGatewayUrl() /
  getEmbeddedAppUrl() helpers that read from chrome.storage. The hardcoded
  http://localhost:8787 default is now just the pairing-screen pre-fill.
- pairing.html / pairing.js: new setup step before the OAuth flow. URL is
  validated by fetching the well-known OAuth discovery doc; the extension
  requests chrome.permissions.request({origins}) for the entered host so
  subsequent fetches don't hit CORS.
- background.js, sidepanel.js, bridge.js: read gateway URL from storage at
  call time. Falls back to opening pairing.html when missing.
- permissions.html: shows the configured server with a "Change" action
  that clears creds + URL and re-runs setup.
- manifest.json: drop static host_permissions entries for the (placeholder)
  owletto.ai hosts. Origin grants now happen at runtime via
  optional_host_permissions: ["<all_urls>"] — also cleaner for Web Store
  review.

Typecheck clean. The previous local-dev pairing path still works: leave
the default http://localhost:8787, click Continue, then Get pairing code.

* feat(devices): native-messaging SSO from Mac bridge to Chrome extension

When the Owletto Mac app is installed and signed in, Owletto for Chrome
auto-pairs on first run with zero second login — no URL typing, no OAuth
device-code dance.

Gateway:
- worker-api.ts: mintDeviceChildToken — POST /api/me/devices/mint-child-token.
  Auth: caller's bearer. Body: {platform, label?}. Generates a new
  worker_id, mints a fresh PAT in the caller's personal org with the
  device_worker:run scope, returns {worker_id, access_token, gateway_url}.
  Platform is gated against @lobu/core/capabilities (chrome-extension only
  for now).

Extension:
- manifest.json: re-add `nativeMessaging` permission.
- pairing.js: on load, attempt chrome.runtime.connectNative("ai.owletto.bridge")
  with a 2.5s timeout. On success: request host-permission for the
  returned gateway origin, persist {gatewayUrl, workerId, accessToken},
  close. On timeout/error: fall back to existing URL-setup → OAuth flow.

Mac bridge:
- ChromeBridgeHost.swift (NEW): two responsibilities. runHostIfRequested()
  serves a single native-messaging request when launched with
  --owletto-bridge (4-byte length-prefixed JSON on stdin/stdout); reads the
  user's stored OAuth bearer from KeychainTokenStore, calls
  /api/me/devices/mint-child-token, writes the response. installManifests()
  drops ai.owletto.bridge.json into Chrome / Brave / Arc / Edge
  NativeMessagingHosts directories, idempotent, with allowed_origins from
  LOBU_OWLETTO_CHROME_EXTENSION_ID env override.
- LobuApp.swift init(): calls runHostIfRequested() before SwiftUI scene
  builds, then installManifests().

Outstanding (next slice):
- ChromeBridgeHost.swift needs to be dragged into the Xcode project's
  Lobu target manually — until then LobuApp.swift's init() won't compile.
  Avoiding hand-editing project.pbxproj from this session.
- Web Store extension ID will be hardcoded once published; today only the
  env override drives the manifest's allowed_origins.

Typecheck clean.

* fix(mac): wire ChromeBridgeHost.swift into Xcode target + compile fixes

- Lobu.xcodeproj/project.pbxproj: add ChromeBridgeHost.swift to the
  PBXFileReference, PBXBuildFile, PBXGroup, and PBXSourcesBuildPhase
  sections. IDs follow the existing sequential B*0015 slot. Xcode now
  picks up the file in the Lobu target without a manual drag-and-drop.
- ChromeBridgeHost.swift: switch Result<…, String> to Result<…, BridgeError>
  (Swift requires the error type to conform to Error), and fix the
  keychain store class name (KeychainCredentialStore, not
  KeychainTokenStore).

Verified: `xcodebuild -project apps/mac/Lobu.xcodeproj -scheme Lobu
-configuration Debug build` → BUILD SUCCEEDED.

* fix(mac-bridge): readFrame consumption, URLSession main-thread deadlock, DEBUG creds env

Three end-to-end blockers found while exercising the Mac bridge via a
synthetic Chrome native-messaging spawn:

1. ChromeBridgeHost.readFrame consumed bytes via FileHandle.availableData
   then used .prefix(4) to "peek" the length header. availableData is
   read-once — any payload bytes that arrived in the same chunk as the
   header were dropped, and the subsequent payload-read loop blocked on
   EOF. Rewrote to accumulate into a single buffer and slice out
   header + payload from it.

2. mintChildToken called URLSession.shared.dataTask().resume() + sem.wait()
   from the main thread. URLSession.shared's default delegate queue is the
   main queue, so the completion handler couldn't fire — classic deadlock.
   Switched to a dedicated ephemeral URLSession with its own OperationQueue.

3. Production builds read OAuth creds from keychain (correct). For
   debug/ad-hoc-signed dev builds TCC blocks unsigned binaries from
   reading keychain items silently — the bridge subprocess hangs forever
   waiting for a prompt that can't render. Added a DEBUG-only env-var
   fallback (LOBU_BRIDGE_TEST_BASE_URL / LOBU_BRIDGE_TEST_TOKEN) so the
   bridge flow can be exercised without code signing. #if DEBUG-gated;
   never read in RELEASE.

Plus granular [bridge] stderr logging to make future plumbing issues easy
to diagnose.

Verified end-to-end against the local gateway: 4-byte length-prefixed
JSON frame in, mint-child-token call returns 200, response frame out.

* chore(mac-bridge): gate diagnostic stderr logs behind #if DEBUG

* chore: bump web submodule pointer to merged main SHA

* fix(chrome-extension): security + correctness review fixes

Addresses three HIGH-impact issues + one MEDIUM correctness bug surfaced
in code review:

1. (HIGH) Capability authorization was bypassed when claiming runs.
   pollWorkerJob's user-scoped claim SQL filtered on
   `advertisedCapabilities` (the raw declared set), so a device claiming
   `os.shell` would have it dropped for storage but could still match
   and claim a connector run requiring it. Switched the two device
   branches in the claim SQL to `authorizedCapabilities` (the post-
   allowlist set).

2. (HIGH) `mintDeviceChildToken` only checked `c.var.user?.id`, so any
   authenticated session that mcpAuth accepts could escalate into a
   worker token. Now requires the caller's bearer to already carry the
   `device_worker:run` scope — i.e. only an existing device-worker
   token (or a PAT minted for that purpose) can mint a sibling. Browser
   sessions can't silently escalate; users wanting to pair Chrome from
   a browser go through OAuth device-authorization, not this endpoint.

3. (MEDIUM) Native-messaging readFrame trusted the wire-level 32-bit
   length. A buggy or malicious allowed extension could hang or
   memory-pressure the host on a crafted header. Added a 64 KB cap on
   the payload length (our pair request is ~50 bytes; this is generous).

4. (MEDIUM) Extension's `executeRun` sent the wrong field names to
   /api/workers/complete (`items_count`, `error`) — server expects
   `items_collected`, `error_message`. Tab-snapshot runs were
   completing as 0-item successes and failures dropped their diagnostic.

SCOPE.md documents three additional review notes as follow-ups: the
native host's argv check needs a stronger same-user defense, child
PATs aren't yet bound to their `worker_id`, and the extension still
advertises capabilities ahead of executors.

Typecheck + xcode build clean.

* fix(devices): platform is set-once at the worker level; restore readFrame cap

Re-review surfaced two remaining gaps from the previous security pass:

1. (HIGH) Capability authorization trusted self-reported `platform` on
   every poll. A compromised chrome-extension PAT could post
   `platform: "macos"` and unlock the macOS allowlist — claim
   `os.shell` runs the device can't actually execute. Fix has three
   parts:
   - Poll handler reads the stored platform for (user_id, worker_id);
     if it exists and differs from the posted one, rejects 403
     `platform_mismatch`.
   - UPSERT on conflict now COALESCE-preserves device_workers.platform
     (defense-in-depth against a race between SELECT and UPSERT).
   - Capability authorization uses the stored platform when present,
     not the body-posted one.
   - `mintDeviceChildToken` pre-creates the device_workers row with
     `platform` set, so the very first poll from a freshly-minted
     child already finds the platform locked.

2. (MEDIUM, re-introduced by an earlier edit revert) The 64 KB cap on
   the native-messaging frame length had been silently dropped. Restored
   `maxFramePayloadBytes` + bounds check in ChromeBridgeHost.readFrame.

Typecheck + xcode build clean. The non-blocking review notes remain
documented in apps/chrome/SCOPE.md.

* fix(devices): bind PATs minted via mint-child-token to their worker_id

Closes the last platform-binding bypass: a chrome-extension PAT could
post a fresh, never-seen `worker_id` plus `platform: "macos"` and
register a new macOS-platform device_workers row under the same user.
The pre-existing-row platform lock didn't help because the row didn't
exist yet.

Fix: `personal_access_tokens` now has a `worker_id` column. PATs minted
via /api/me/devices/mint-child-token carry the worker_id they're bound
to. The worker-poll handler reads the bound id from `mcpAuthInfo` and
rejects requests whose body posts a different one (403
`worker_id_mismatch`).

Legacy paths (OAuth-issued bearers for the Mac/iOS bridges, the
existing CLI PATs) keep `worker_id = NULL`; the poll handler treats
NULL as "no binding" so they can register any worker_id they pick on
first poll — same as today.

- db/migrations/20260517020000_pat_worker_id_binding.sql: nullable
  column + partial index.
- db/schema.sql: matching column + index entries.
- auth/oauth/types.ts: AuthInfo.workerId optional.
- auth/tokens.ts: create() accepts workerId option; verify() returns it.
- worker-api.ts (pollWorkerJob): rejects on mismatch before the
  capability-authorization step.
- worker-api.ts (mintDeviceChildToken): passes workerId to create().

* fix(worker-api): empty-string worker_id bypass on bound PATs

The bound-token check skipped when body's worker_id was empty (falsy).
Compare unconditionally; also require non-empty worker_id from any
user-scoped caller. Otherwise a bound chrome-extension PAT could post
worker_id='' and have the binding check + platform lock skipped,
registering under (user_id, '') with attacker-chosen platform.

* fix(capabilities): include legacy Mac caps in macos allowlist

A Mac bridge advertises 'screentime', 'local_directory', 'healthkit',
'photos', 'whatsapp_local' from AppState.swift:caps. The macos allowlist
only had os.* + browser.*, so these would all get dropped server-side
and Mac connectors would stop matching their device. Add a
LEGACY_MAC_CAPABILITIES export that the macos entry includes — keeps
existing Mac functionality intact while still excluding cross-platform
claims (a Mac PAT can't suddenly claim 'browser.history' via
authorization, only via actually having the chrome-extension already
covers it under its own platform binding).

* fix(security+regression): worker_id binding hardening + Mac caps + migration format

Three follow-ups from the security re-review:

1. (HIGH) `worker_id: ''` bypassed the bound-token check (empty string
   was falsy in the previous guard). pollWorkerJob now requires a
   non-empty `worker_id` from any user-scoped caller and compares the
   bound id unconditionally — no truthiness short-circuit.

2. (HIGH regression) macos capability allowlist dropped 'screentime',
   'local_directory', 'healthkit', 'photos', 'whatsapp_local' — the
   existing Mac bridge advertises these and its connectors would have
   silently stopped matching. Adds MAC_DEVICE_CAPABILITIES alongside
   OS_/BROWSER_CAPABILITIES.

3. CI: migration was missing `-- migrate:down`; lint-format job
   failed. Added the down section + matching teardown.

4. Submodule bump: pulls in lobu-ai/owletto#138 — adds /embedded
   to the all-routes smoke test (was missing in #137).

* chore: re-time PAT-worker-id migration past collision with main

Main's #775 added a migration at 20260517020000; bumping mine to
20260517030000 so file order is unambiguous.

* fix(schema): add 20260517030000 entry to schema_migrations
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants