fix: close monitoring + deploy gaps from post-incident audit#775
Conversation
Three gaps identified by the audit after #771/#772 landed (1914 errors/5min in stdout, zero Sentry issues, single-replica Recreate deploy strategy = ~30s "no available server" window per rollout): * **Sentry forwarding from pino** (utils/logger.ts). The existing Sentry capture middleware only fires on HTTP 500 responses; `logger.error()` from background jobs (CheckDueFeeds, runs queue, scheduled tasks) was invisible to monitoring. Now a pino destination inspects each log line and forwards `level >= error` to `Sentry.captureException` (with stack reconstruction) or `captureMessage`. In-process dedupe by (msg, err.type, top stack frame) with a 60s window prevents the 380/min repeating-error pattern from saturating Sentry — Sentry has its own grouping but every captureException still incurs an HTTP call. * **Orphan-feed cleanup + graceful skip** (queue-helpers.ts + 20260517020000_softdelete_orphan_feeds.sql). The 'website' connector was archived/uninstalled in some orgs but their feeds weren't soft-deleted. `createSyncRunWithClient` threw on every CheckDueFeeds tick when no active connector_definition existed for the (key, org) pair — produced the 380/min error stream that masked real issues. Code change: warn + skip instead of throw, so future orphans don't spam logs. Migration: one-shot soft-delete of feeds matching the orphan criteria (no pinned_version, active connection, no active connector_definition for the org). * **PreStop drain + grace period** (charts/lobu/templates/ deployment.yaml, values.yaml). Single-replica gateway with `strategy: Recreate` (workspaces PVC is RWO) means every deploy has a ~30s window where the old pod is gone but the new one isn't ready — Cloudflare returns "no available server" to live traffic during this window. The 2026-05-16 incident felt extreme because the pod was actually crash-looping; the same shape happens briefly on every healthy deploy. PreStop hook (15s sleep, configurable via app.preStopDelaySeconds) gives Service endpoint deregistration time to propagate before SIGTERM, so requests in-flight during the drain still hit a live process. terminationGracePeriodSeconds bumped to 45 to cover preStop + graceful shutdown (~5s). Full elimination needs RWX storage for workspaces so we can roll instead of recreate — documented in the deployment.yaml strategy comment.
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds configurable Helm pod termination timing and optional preStop hook, forwards error/fatal logs to Sentry with in-process deduplication and a marker to avoid duplicate reports, and soft-deletes orphan feeds with a migration and runtime orphan-handling change. ChangesGraceful pod shutdown configuration
Sentry error observability
Orphan feed cleanup and handling
Sequence Diagram(s)sequenceDiagram
participant App as Application (server.ts)
participant Logger as Pino (sentryAwareStream)
participant Stdout as Stdout
participant Sentry as `@sentry/node`
App->>Logger: log(error/fatal, {err|error, msg}) (may set sentryReported)
Logger->>Stdout: write JSON line
Logger->>Logger: parse JSON -> dedupe check
Logger->>Sentry: captureException / captureMessage (if not deduped && not sentryReported)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRsPoem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b3aefea639
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const reconstructed = new Error(errObj.message); | ||
| if (errObj.stack) reconstructed.stack = errObj.stack; | ||
| Sentry.captureException(reconstructed, { | ||
| extra: parsed, |
There was a problem hiding this comment.
Sanitize log metadata before sending it to Sentry
When an error log includes application payloads, this forwards the entire parsed log line to Sentry as extra. For example, packages/server/src/tools/admin/manage_classifiers.ts:901 logs { error, args }, where args includes user-provided classification values/reasoning; with this change any failure in that path now exports those fields to Sentry instead of only stdout. Please redact or whitelist metadata before attaching it to Sentry events.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
packages/server/src/utils/logger.ts (1)
83-88: ⚡ Quick winPreserve error type for better Sentry grouping.
The reconstructed Error loses its original type (e.g.,
TypeError,SyntaxError). Sentry useserror.namefor grouping, so different error types with similar messages will be incorrectly grouped together.♻️ Proposed fix
// Reconstruct an Error so Sentry's grouping works on the stack. const reconstructed = new Error(errObj.message); if (errObj.stack) reconstructed.stack = errObj.stack; + if (errObj.type) reconstructed.name = errObj.type; Sentry.captureException(reconstructed, {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/server/src/utils/logger.ts` around lines 83 - 88, The reconstructed Error currently loses the original error type/name used by Sentry for grouping; when building the synthetic error before calling Sentry.captureException (using variables errObj, reconstructed, parsed, level), set the reconstructed error's name to the original type (e.g., reconstructed.name = errObj.name || reconstructed.name) so Sentry preserves error.name/grouping and include the same reconstructed.stack and extra/tags as before.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@charts/lobu/templates/deployment.yaml`:
- Line 50: The current use of Helm's default() treats 0 as empty and overrides
explicit zeros; change both occurrences (preStopDelaySeconds and
terminationGracePeriodSeconds) to check presence instead of using default(),
e.g. replace {{ .Values.app.terminationGracePeriodSeconds | default 45 }} and {{
.Values.app.preStopDelaySeconds | default 15 }} with an explicit conditional
using hasKey/.Values.app presence (for example: if hasKey .Values.app
"terminationGracePeriodSeconds" then render
.Values.app.terminationGracePeriodSeconds else render 45) so that a
user-provided 0 is honored.
---
Nitpick comments:
In `@packages/server/src/utils/logger.ts`:
- Around line 83-88: The reconstructed Error currently loses the original error
type/name used by Sentry for grouping; when building the synthetic error before
calling Sentry.captureException (using variables errObj, reconstructed, parsed,
level), set the reconstructed error's name to the original type (e.g.,
reconstructed.name = errObj.name || reconstructed.name) so Sentry preserves
error.name/grouping and include the same reconstructed.stack and extra/tags as
before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: e8f6fe82-8615-4ea0-84f2-8113bb28f282
📒 Files selected for processing (6)
charts/lobu/templates/deployment.yamlcharts/lobu/values.yamldb/migrations/20260517020000_softdelete_orphan_feeds.sqldb/schema.sqlpackages/server/src/utils/logger.tspackages/server/src/utils/queue-helpers.ts
| # task scheduler, embedded gateway, DB pool, HTTP server). Default | ||
| # k8s grace period is 30s; bump so the preStop sleep doesn't eat | ||
| # the whole window. | ||
| terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds | default 45 }} |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does Helm's defaultfunction treat numeric0 as empty (therefore returning the fallback value)?
💡 Result:
Yes. Helm’s default function treats the numeric value 0 as “empty”, so it will return the fallback value instead of 0. This comes from Sprig’s default semantics: the definition of “empty” includes numeric 0, and for “non-empty” values the given value is used, otherwise the default is returned [1]. Helm documents default as default DEFAULT_VALUE GIVEN_VALUE and that it returns the default “if [the given value] is empty” [2]. Helm issues report that when a value is set to 0 (integer), default behaves as if it were unset/empty and the fallback is used [3][4]. Practical implication: if 0 is a valid value in your chart, don’t use default directly; instead use an explicit check (for example, distinguish “not set” vs “set to 0”) as suggested by Helm community discussions [3][4].
Citations:
- 1: http://masterminds.github.io/sprig/defaults.html
- 2: https://helm.sh/pt/docs/chart_template_guide/functions_and_pipelines
- 3:
defaulttemplate function treats 0 value as unspecified helm/helm#13036 - 4: Wrong interpretation of Zero in templates files when having a default value helm/helm#3164
🏁 Script executed:
fd -type f -name 'values.yaml' charts/Repository: lobu-ai/lobu
Length of output: 227
🏁 Script executed:
# Read the values.yaml to check for terminationGracePeriodSeconds and preStopDelaySeconds
cat -n charts/lobu/values.yaml | grep -A 5 -B 5 "terminationGracePeriodSeconds\|preStopDelaySeconds"Repository: lobu-ai/lobu
Length of output: 857
🏁 Script executed:
cat -n charts/lobu/templates/deployment.yaml | sed -n '45,55p; 118,125p'Repository: lobu-ai/lobu
Length of output: 1255
default here overrides explicit 0 and breaks configurability.
On Line 50 and Line 122, Helm default will treat 0 as empty and fall back to 45/15. That means operators cannot intentionally set preStopDelaySeconds: 0 (disable delay) or terminationGracePeriodSeconds: 0.
Suggested fix
- terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds | default 45 }}
+ terminationGracePeriodSeconds: {{ .Values.app.terminationGracePeriodSeconds }}
...
- command: ["sh", "-c", "sleep {{ .Values.app.preStopDelaySeconds | default 15 }}"]
+ command: ["sh", "-c", "sleep {{ .Values.app.preStopDelaySeconds }}"]🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@charts/lobu/templates/deployment.yaml` at line 50, The current use of Helm's
default() treats 0 as empty and overrides explicit zeros; change both
occurrences (preStopDelaySeconds and terminationGracePeriodSeconds) to check
presence instead of using default(), e.g. replace {{
.Values.app.terminationGracePeriodSeconds | default 45 }} and {{
.Values.app.preStopDelaySeconds | default 15 }} with an explicit conditional
using hasKey/.Values.app presence (for example: if hasKey .Values.app
"terminationGracePeriodSeconds" then render
.Values.app.terminationGracePeriodSeconds else render 45) so that a
user-provided 0 is honored.
Six findings, all addressed: * **#1 preStop on Recreate makes the gap WORSE.** Pi caught the reversal — under `strategy: Recreate` the new pod doesn't start until the old one fully terminates, so adding a preStop sleep EXTENDS the no-available-server window rather than shrinking it. Default `preStopDelaySeconds` to 0 and only emit the lifecycle hook when > 0. Document that ops repos should set it only with RollingUpdate (which needs RWX storage on workspaces — out of scope here). * **#2 migration too broad.** Tightened WHERE to require `feeds.status='active'` and `connections.status='active'` so we match exactly the set CheckDueFeeds would process. Paused / pending_auth / error feeds are left alone — they don't contribute to the error spam and may be intentionally recoverable. * **#3 Sentry double-report.** `server.ts onError` already calls `Sentry.captureException` then `logger.error`; the new logger forwarder would have sent the same event again. Added `sentryReported: true` marker on that log line; logger transport skips forwarding when it sees the marker. * **#4 dedupe fingerprint too coarse.** Added `err.message` to the fingerprint so distinct messages from the same catch site (same Error type, same top stack frame) get distinct fingerprints within the 60s window. * **#5 existing string-error call sites are pre-existing tech debt** — left as-is for now. Documented in the PR body as a follow-up; the new forwarder handles `err` / `error` Error-object payloads well, and call sites with stringified errors are a separate cleanup pass. * **#6 warn-skip leaves future orphans being polled forever.** Now `createSyncRunWithClient` soft-deletes the feed in-place when no active connector_definition is found, so CheckDueFeeds stops selecting it. Operators recover by clearing `deleted_at` after reinstalling the definition.
pi review — 6 findings, all addressed
Build + typecheck clean. |
Main's #775 added a migration at 20260517020000; bumping mine to 20260517030000 so file order is unambiguous.
…essaging (#773) * feat(chrome-extension): MV3 connector scaffold + dynamic device capability authorization - @lobu/core capabilities.ts: platform-keyed allowlist registry + authorizeCapabilities(). Device workers can only advertise capabilities whitelisted for their reported platform; anything outside the allowlist is silently dropped before connector matching. - worker-api.ts: intersect advertised capabilities with the registry for user-scoped workers; trusted-fleet (no platform) path unchanged. - apps/chrome/: MV3 service-worker + sidepanel-iframe scaffold. Pairs via standard RFC 8628 OAuth device-authorization flow against the existing gateway endpoints (same as the Mac app — no new endpoint needed). Baseline capabilities: browser.tabs, browser.scripting, browser.debugger. Optional history/bookmarks via runtime chrome.permissions.request(). - bridge.js: typed postMessage broker between sidepanel iframe and service worker. Named-ops allowlist, origin-checked, deny-by-default — defends the privileged extension surface from the (trusted today) embedded owletto-web app. Verified end-to-end against local PGlite: device_workers row appears with platform=chrome-extension and the authorized capability set after pairing. * feat(chrome-extension): permissions UI, /embedded route, chrome.tabs connector Three follow-ups to the initial Chrome-extension scaffold: apps/chrome (extension): - permissions.html / permissions.js: opt-in / revoke UI for the optional Chrome permissions (history, bookmarks). Toggling updates the granted set; the next /api/workers/poll re-advertises capabilities automatically (background.js recomputes via chrome.permissions.contains each cycle). - sidepanel topbar adds a "Permissions" link to the new page. - background.js gains a minimal run executor (executeRun): on each poll, if the gateway hands back a chrome.tabs run, it queries the live tab list, streams one batch of `tab_snapshot` events, and marks the run complete. Non-chrome.tabs runs fail fast with a marker — heartbeat loop, multi-batch streaming, and action runs are explicit v2 backlog items in SCOPE.md. packages/web (owletto-web): - /embedded route: chromeless layout (no sidebar, no nav, no command palette) targeted by the extension's sidepanel iframe. Reads worker_id from the URL fragment. Reuses the existing isStandalonePage path in __root.tsx. packages/server (gateway): - CSP frame-ancestors picks up `chrome-extension:` for /embedded only — every other HTML response keeps the existing 'self' + lobu.ai allowlist so the rest of the app can't be embedded by arbitrary extensions. packages/connectors: - chrome_tabs.ts: requiredCapability='browser.tabs', runtime.platforms=['chrome-extension']. Lists open tabs as `tab_snapshot` events. Cloud sync/execute throw (bridge-only) — actual work lives in the extension's service worker. Typecheck clean. End-to-end smoke: device pairs, advertises authorized capability set, auto-wires the chrome.tabs connector via existing device- reconcile loop, executor handles the run shape end-to-end (single batch). * chore: bump web submodule — /embedded route for Chrome extension sidepanel Submodule pointer follows the matching commit on feat/chrome-extension-embedded-route in lobu-ai/owletto-web. Branch will need to be pushed before the parent feat/chrome-extension-connector branch goes up — current SHA is local-only. * feat(chrome-extension): user-configurable gateway URL Lobu/Owletto is self-hosted — the extension can't ship with a fixed origin. First-run pairing now asks for the user's gateway URL before the OAuth dance: - config.js: drop GATEWAY_URL constant. Add getGatewayUrl() / getEmbeddedAppUrl() helpers that read from chrome.storage. The hardcoded http://localhost:8787 default is now just the pairing-screen pre-fill. - pairing.html / pairing.js: new setup step before the OAuth flow. URL is validated by fetching the well-known OAuth discovery doc; the extension requests chrome.permissions.request({origins}) for the entered host so subsequent fetches don't hit CORS. - background.js, sidepanel.js, bridge.js: read gateway URL from storage at call time. Falls back to opening pairing.html when missing. - permissions.html: shows the configured server with a "Change" action that clears creds + URL and re-runs setup. - manifest.json: drop static host_permissions entries for the (placeholder) owletto.ai hosts. Origin grants now happen at runtime via optional_host_permissions: ["<all_urls>"] — also cleaner for Web Store review. Typecheck clean. The previous local-dev pairing path still works: leave the default http://localhost:8787, click Continue, then Get pairing code. * feat(devices): native-messaging SSO from Mac bridge to Chrome extension When the Owletto Mac app is installed and signed in, Owletto for Chrome auto-pairs on first run with zero second login — no URL typing, no OAuth device-code dance. Gateway: - worker-api.ts: mintDeviceChildToken — POST /api/me/devices/mint-child-token. Auth: caller's bearer. Body: {platform, label?}. Generates a new worker_id, mints a fresh PAT in the caller's personal org with the device_worker:run scope, returns {worker_id, access_token, gateway_url}. Platform is gated against @lobu/core/capabilities (chrome-extension only for now). Extension: - manifest.json: re-add `nativeMessaging` permission. - pairing.js: on load, attempt chrome.runtime.connectNative("ai.owletto.bridge") with a 2.5s timeout. On success: request host-permission for the returned gateway origin, persist {gatewayUrl, workerId, accessToken}, close. On timeout/error: fall back to existing URL-setup → OAuth flow. Mac bridge: - ChromeBridgeHost.swift (NEW): two responsibilities. runHostIfRequested() serves a single native-messaging request when launched with --owletto-bridge (4-byte length-prefixed JSON on stdin/stdout); reads the user's stored OAuth bearer from KeychainTokenStore, calls /api/me/devices/mint-child-token, writes the response. installManifests() drops ai.owletto.bridge.json into Chrome / Brave / Arc / Edge NativeMessagingHosts directories, idempotent, with allowed_origins from LOBU_OWLETTO_CHROME_EXTENSION_ID env override. - LobuApp.swift init(): calls runHostIfRequested() before SwiftUI scene builds, then installManifests(). Outstanding (next slice): - ChromeBridgeHost.swift needs to be dragged into the Xcode project's Lobu target manually — until then LobuApp.swift's init() won't compile. Avoiding hand-editing project.pbxproj from this session. - Web Store extension ID will be hardcoded once published; today only the env override drives the manifest's allowed_origins. Typecheck clean. * fix(mac): wire ChromeBridgeHost.swift into Xcode target + compile fixes - Lobu.xcodeproj/project.pbxproj: add ChromeBridgeHost.swift to the PBXFileReference, PBXBuildFile, PBXGroup, and PBXSourcesBuildPhase sections. IDs follow the existing sequential B*0015 slot. Xcode now picks up the file in the Lobu target without a manual drag-and-drop. - ChromeBridgeHost.swift: switch Result<…, String> to Result<…, BridgeError> (Swift requires the error type to conform to Error), and fix the keychain store class name (KeychainCredentialStore, not KeychainTokenStore). Verified: `xcodebuild -project apps/mac/Lobu.xcodeproj -scheme Lobu -configuration Debug build` → BUILD SUCCEEDED. * fix(mac-bridge): readFrame consumption, URLSession main-thread deadlock, DEBUG creds env Three end-to-end blockers found while exercising the Mac bridge via a synthetic Chrome native-messaging spawn: 1. ChromeBridgeHost.readFrame consumed bytes via FileHandle.availableData then used .prefix(4) to "peek" the length header. availableData is read-once — any payload bytes that arrived in the same chunk as the header were dropped, and the subsequent payload-read loop blocked on EOF. Rewrote to accumulate into a single buffer and slice out header + payload from it. 2. mintChildToken called URLSession.shared.dataTask().resume() + sem.wait() from the main thread. URLSession.shared's default delegate queue is the main queue, so the completion handler couldn't fire — classic deadlock. Switched to a dedicated ephemeral URLSession with its own OperationQueue. 3. Production builds read OAuth creds from keychain (correct). For debug/ad-hoc-signed dev builds TCC blocks unsigned binaries from reading keychain items silently — the bridge subprocess hangs forever waiting for a prompt that can't render. Added a DEBUG-only env-var fallback (LOBU_BRIDGE_TEST_BASE_URL / LOBU_BRIDGE_TEST_TOKEN) so the bridge flow can be exercised without code signing. #if DEBUG-gated; never read in RELEASE. Plus granular [bridge] stderr logging to make future plumbing issues easy to diagnose. Verified end-to-end against the local gateway: 4-byte length-prefixed JSON frame in, mint-child-token call returns 200, response frame out. * chore(mac-bridge): gate diagnostic stderr logs behind #if DEBUG * chore: bump web submodule pointer to merged main SHA * fix(chrome-extension): security + correctness review fixes Addresses three HIGH-impact issues + one MEDIUM correctness bug surfaced in code review: 1. (HIGH) Capability authorization was bypassed when claiming runs. pollWorkerJob's user-scoped claim SQL filtered on `advertisedCapabilities` (the raw declared set), so a device claiming `os.shell` would have it dropped for storage but could still match and claim a connector run requiring it. Switched the two device branches in the claim SQL to `authorizedCapabilities` (the post- allowlist set). 2. (HIGH) `mintDeviceChildToken` only checked `c.var.user?.id`, so any authenticated session that mcpAuth accepts could escalate into a worker token. Now requires the caller's bearer to already carry the `device_worker:run` scope — i.e. only an existing device-worker token (or a PAT minted for that purpose) can mint a sibling. Browser sessions can't silently escalate; users wanting to pair Chrome from a browser go through OAuth device-authorization, not this endpoint. 3. (MEDIUM) Native-messaging readFrame trusted the wire-level 32-bit length. A buggy or malicious allowed extension could hang or memory-pressure the host on a crafted header. Added a 64 KB cap on the payload length (our pair request is ~50 bytes; this is generous). 4. (MEDIUM) Extension's `executeRun` sent the wrong field names to /api/workers/complete (`items_count`, `error`) — server expects `items_collected`, `error_message`. Tab-snapshot runs were completing as 0-item successes and failures dropped their diagnostic. SCOPE.md documents three additional review notes as follow-ups: the native host's argv check needs a stronger same-user defense, child PATs aren't yet bound to their `worker_id`, and the extension still advertises capabilities ahead of executors. Typecheck + xcode build clean. * fix(devices): platform is set-once at the worker level; restore readFrame cap Re-review surfaced two remaining gaps from the previous security pass: 1. (HIGH) Capability authorization trusted self-reported `platform` on every poll. A compromised chrome-extension PAT could post `platform: "macos"` and unlock the macOS allowlist — claim `os.shell` runs the device can't actually execute. Fix has three parts: - Poll handler reads the stored platform for (user_id, worker_id); if it exists and differs from the posted one, rejects 403 `platform_mismatch`. - UPSERT on conflict now COALESCE-preserves device_workers.platform (defense-in-depth against a race between SELECT and UPSERT). - Capability authorization uses the stored platform when present, not the body-posted one. - `mintDeviceChildToken` pre-creates the device_workers row with `platform` set, so the very first poll from a freshly-minted child already finds the platform locked. 2. (MEDIUM, re-introduced by an earlier edit revert) The 64 KB cap on the native-messaging frame length had been silently dropped. Restored `maxFramePayloadBytes` + bounds check in ChromeBridgeHost.readFrame. Typecheck + xcode build clean. The non-blocking review notes remain documented in apps/chrome/SCOPE.md. * fix(devices): bind PATs minted via mint-child-token to their worker_id Closes the last platform-binding bypass: a chrome-extension PAT could post a fresh, never-seen `worker_id` plus `platform: "macos"` and register a new macOS-platform device_workers row under the same user. The pre-existing-row platform lock didn't help because the row didn't exist yet. Fix: `personal_access_tokens` now has a `worker_id` column. PATs minted via /api/me/devices/mint-child-token carry the worker_id they're bound to. The worker-poll handler reads the bound id from `mcpAuthInfo` and rejects requests whose body posts a different one (403 `worker_id_mismatch`). Legacy paths (OAuth-issued bearers for the Mac/iOS bridges, the existing CLI PATs) keep `worker_id = NULL`; the poll handler treats NULL as "no binding" so they can register any worker_id they pick on first poll — same as today. - db/migrations/20260517020000_pat_worker_id_binding.sql: nullable column + partial index. - db/schema.sql: matching column + index entries. - auth/oauth/types.ts: AuthInfo.workerId optional. - auth/tokens.ts: create() accepts workerId option; verify() returns it. - worker-api.ts (pollWorkerJob): rejects on mismatch before the capability-authorization step. - worker-api.ts (mintDeviceChildToken): passes workerId to create(). * fix(worker-api): empty-string worker_id bypass on bound PATs The bound-token check skipped when body's worker_id was empty (falsy). Compare unconditionally; also require non-empty worker_id from any user-scoped caller. Otherwise a bound chrome-extension PAT could post worker_id='' and have the binding check + platform lock skipped, registering under (user_id, '') with attacker-chosen platform. * fix(capabilities): include legacy Mac caps in macos allowlist A Mac bridge advertises 'screentime', 'local_directory', 'healthkit', 'photos', 'whatsapp_local' from AppState.swift:caps. The macos allowlist only had os.* + browser.*, so these would all get dropped server-side and Mac connectors would stop matching their device. Add a LEGACY_MAC_CAPABILITIES export that the macos entry includes — keeps existing Mac functionality intact while still excluding cross-platform claims (a Mac PAT can't suddenly claim 'browser.history' via authorization, only via actually having the chrome-extension already covers it under its own platform binding). * fix(security+regression): worker_id binding hardening + Mac caps + migration format Three follow-ups from the security re-review: 1. (HIGH) `worker_id: ''` bypassed the bound-token check (empty string was falsy in the previous guard). pollWorkerJob now requires a non-empty `worker_id` from any user-scoped caller and compares the bound id unconditionally — no truthiness short-circuit. 2. (HIGH regression) macos capability allowlist dropped 'screentime', 'local_directory', 'healthkit', 'photos', 'whatsapp_local' — the existing Mac bridge advertises these and its connectors would have silently stopped matching. Adds MAC_DEVICE_CAPABILITIES alongside OS_/BROWSER_CAPABILITIES. 3. CI: migration was missing `-- migrate:down`; lint-format job failed. Added the down section + matching teardown. 4. Submodule bump: pulls in lobu-ai/owletto#138 — adds /embedded to the all-routes smoke test (was missing in #137). * chore: re-time PAT-worker-id migration past collision with main Main's #775 added a migration at 20260517020000; bumping mine to 20260517030000 so file order is unambiguous. * fix(schema): add 20260517030000 entry to schema_migrations
Why
Post-incident audit after #771 + #772 landed. Prod is healthy but three real gaps remained that would let the next similar incident slip past monitoring or extend its blast radius:
What
1. pino → Sentry forwarding (
utils/logger.ts)Adds a pino destination that mirrors stdout AND forwards
level >= errorto Sentry. In-process dedupe by(msg, err.type, top stack frame)with a 60s window — Sentry has its own grouping but everycaptureExceptionstill costs an HTTP call, and a repeating error at 380/min would saturate the budget.2. Orphan-feed cleanup (
queue-helpers.ts+20260517020000_softdelete_orphan_feeds.sql)createSyncRunWithClientnow warns + skips when no active connector_definition exists for(connector_key, organization_id). Pre-fix it threw, producing the error stream.pinned_version, active connection, no active definition in the org). Conservative criteria — only touches feeds that the current code would also fail on.3. Graceful drain (
charts/lobu/templates/deployment.yaml+values.yaml)preStoplifecycle hook: 15s sleep (configurable viaapp.preStopDelaySeconds). When k8s decides to terminate, this lets Service endpoint deregistration propagate through kube-proxy + Cloudflare LB before the process gets SIGTERM. In-flight requests routed during the deregistration lag still hit a live process.terminationGracePeriodSeconds: 45(configurable). Covers preStop + the graceful shutdown inserver.ts(~5s for HTTP server close, task scheduler stop, DB pool drain).Recreate(workspaces PVC is RWO). Comment explains the path to RollingUpdate: needs RWX storage or workspaces split off the gateway deployment. Not done here because it's an ops-repo storage-class decision, not in-tree code.Validation
make typecheckcleanmake build-packagesbuildsdb/schema.sqldiff is intent-only (one newschema_migrationsrow)logger.error(new Error('test'), ...)from a debug endpoint, confirm a Sentry issue appearsOut of scope (acknowledged)
SELECT count(*) FROM feeds WHERE ...before/after.Summary by CodeRabbit
New Features
Bug Fixes
Chores