fix(chrome-extension): keep MV3 service worker alive while SSE bridge expects to be open#29900
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9d0f1ebc9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| * Production Chrome enforces a minimum `periodInMinutes` of 0.5 | ||
| * (30 s); smaller values are silently rounded up. The alarm firing | ||
| * itself is what counts as "extension API activity" and resets the | ||
| * 30 s SW idle timer, so the alarm period needs to be at-or-below | ||
| * the idle threshold for the worker to stay alive across popup |
There was a problem hiding this comment.
Handle pre-Chrome-120 alarm floor for keepalive
KEEPALIVE_PERIOD_MIN assumes production alarms can fire every 30s, but this extension still declares minimum_chrome_version: "116" in manifest.json; on Chrome 116–119, repeating alarms are floored to 1 minute, so the worker can still idle out after ~30s and the SSE bridge will keep dropping for those users. Either raise the minimum Chrome version to 120+ or add a fallback keepalive strategy for 116–119.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 7c3b66c33d. The 30 s alarm period is what makes the keepalive actually work, so I bumped minimum_chrome_version from 116 → 120 in manifest.json and updated the docstring to call out the dependency. Pre-120 the floor was 1 minute, which is too coarse to outpace the 30 s SW idle timer — there's no graceful fallback that achieves the same goal on those builds, and Chrome 120 shipped December 2023, so the user impact of bumping is minimal.
| chrome.runtime.onInstalled.addListener(() => { | ||
| void ensureKeepaliveAlarm(); | ||
| }); |
There was a problem hiding this comment.
Gate install-time keepalive alarm on connection intent
Creating the keepalive alarm unconditionally in onInstalled starts a perpetual 30s wake cycle even for users who have never connected (or intentionally remain disconnected), because bootstrap() returns early when autoConnect is false and never clears this alarm. This adds continuous background wakeups and battery/CPU overhead for idle installs; only create the alarm when autoConnect is true (or after a successful connect).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 7c3b66c33d. onInstalled now mirrors onStartup — it reads AUTO_CONNECT_KEY and only ensures the alarm if the user already had auto-connect set (e.g. an update over a connected install). For a fresh install with no prior connect, the alarm gets created on the first user-initiated Connect, so we don't burn a wake-up every 30 s on installs that never connect.
| // Clear the keep-alive alarm now that the user has explicitly | ||
| // stopped — let the worker idle out instead of waking every 30 s. | ||
| void clearKeepaliveAlarm(); |
There was a problem hiding this comment.
Clear keepalive alarm on all disconnect/logout paths
The new alarm cleanup is wired only to pause/disconnect, but other user-facing stop flows (cloud-logout and self-hosted-disconnect) also transition to shouldConnect = false and clear autoConnect without clearing the alarm. After those flows, the extension can stay on a 30s wake loop despite being logged out/disconnected; alarm teardown should be centralized for every path that ends connection intent.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 7c3b66c33d. Both cloud-logout and self-hosted-disconnect handlers now await clearKeepaliveAlarm() before the storage writes, alongside the existing pause/disconnect cleanup.
There was a problem hiding this comment.
🚩 SSE onClose auth-error path sets shouldConnect=false but doesn't clear alarm
In the onClose callback at worker.ts:620-631, when an auth error is received, shouldConnect is set to false and sseConnection is nulled, but clearKeepaliveAlarm() is not called. The alarm will continue firing every 30 seconds. The alarm listener at line 964 checks shouldConnect first, so it won't attempt reconnection — the alarm is functionally inert but still wakes the worker. This is similar to the cloud-logout/self-hosted-disconnect omission reported as BUG-0004, but this path is slightly different: the user hasn't explicitly stopped, and a re-auth flow may follow. Clearing the alarm here would be correct but is a less clear-cut omission since the user may re-authenticate shortly.
(Refers to lines 620-631)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 7c3b66c33d. The SSE onClose auth-error branch now void clearKeepaliveAlarm() alongside setting shouldConnect = false. Auth-required is a hard stop until the user re-signs-in, so there's no value in keeping the worker awake.
There was a problem hiding this comment.
Already addressed in 7c3b66c — both cloud-logout and self-hosted-disconnect handlers call await clearKeepaliveAlarm() immediately after disconnect() and before persisting setAutoConnect(false). SSE auth-error close is also covered (cleared on auth_required health transition).
| // common trigger is the popup closing — see Marina's report from | ||
| // May 7, 2026 (Chrome ext "connection drops every time I close | ||
| // this... a few seconds, the connection kind of drops"). |
There was a problem hiding this comment.
🔴 Real person name "Marina" in code comment violates AGENTS.md Generic Examples rule
The comment at lines 905–907 references a real person by name ("Marina's report from May 7, 2026"). AGENTS.md §Generic Examples states: "Never include personal user data — real names, emails, phone numbers, account IDs, or other identifying details of specific people — anywhere in the codebase. This covers code, tests, fixtures, documentation, comments, commit messages, and AGENTS.md files." Should use a generic placeholder like "a user's report" instead.
| // common trigger is the popup closing — see Marina's report from | |
| // May 7, 2026 (Chrome ext "connection drops every time I close | |
| // this... a few seconds, the connection kind of drops"). | |
| // common trigger is the popup closing — see the user report from | |
| // May 7, 2026 (Chrome ext "connection drops every time I close | |
| // this... a few seconds, the connection kind of drops"). |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 7c3b66c33d. Stripped the named user reference from the introductory comment — replaced with a generic description of the popup-close idle path.
There was a problem hiding this comment.
Addressed in 9265d62 — stripped the explanatory comment block from worker.ts entirely (and the chrome-globals.d.ts JSDoc additions), per Vargas's preference to keep the "why" content in the PR description rather than the source. The reporter-name violations no longer have a comment to live in.
| // - Chrome 116+ extends SW lifetime on active fetches in theory, | ||
| // but our SSE's 7 s heartbeat comments did not reliably keep the | ||
| // worker alive in practice (Marina's reproducer). MV3 also caps | ||
| // active-fetch SW lifetime at 5 minutes. |
There was a problem hiding this comment.
🔴 Second real person name "Marina" in code comment violates AGENTS.md Generic Examples rule
Line 920 references "Marina's reproducer", another instance of using a real person's name in a code comment. Same AGENTS.md §Generic Examples violation as the first instance. Should use a generic placeholder like "the user's reproducer".
| // active-fetch SW lifetime at 5 minutes. | |
| // worker alive in practice (the reporter's reproducer). MV3 also caps |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 7c3b66c33d along with the other reference further up. The "alternatives considered" comment no longer attributes the SSE-heartbeat finding to a named user.
There was a problem hiding this comment.
Addressed in 9265d62 — stripped the explanatory comment block from worker.ts entirely (and the chrome-globals.d.ts JSDoc additions), per Vargas's preference to keep the "why" content in the PR description rather than the source. The reporter-name violations no longer have a comment to live in.
| chrome.runtime.onInstalled.addListener(() => { | ||
| void ensureKeepaliveAlarm(); | ||
| }); |
There was a problem hiding this comment.
🟡 onInstalled unconditionally creates keepalive alarm, wasting resources on fresh installs
The onInstalled listener at line 976 calls ensureKeepaliveAlarm() unconditionally — on fresh installs, Chrome updates, and extension updates — even when autoConnect is false. This contradicts the stated design at line 928 ("The alarm is created on first connect and cleared on pause so idle (paused) installs don't pay a wake-up cost every 30 s"). On a fresh install, autoConnect is unset and shouldConnect is false, so the alarm fires every 30 seconds doing nothing. No code path clears it until the user explicitly connects and later pauses. Compare with the onStartup listener at lines 984–990 which correctly gates on autoConnect.
| chrome.runtime.onInstalled.addListener(() => { | |
| void ensureKeepaliveAlarm(); | |
| }); | |
| chrome.runtime.onInstalled.addListener(() => { | |
| void chrome.storage.local.get(AUTO_CONNECT_KEY).then((result) => { | |
| if (result[AUTO_CONNECT_KEY] === true) { | |
| void ensureKeepaliveAlarm(); | |
| } | |
| }); | |
| }); |
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 7c3b66c33d (same change as the Codex finding above). onInstalled is now gated on AUTO_CONNECT_KEY === true so a fresh install with no connection intent doesn't burn a wake-up every 30 s.
There was a problem hiding this comment.
Already addressed in 7c3b66c — onInstalled reads AUTO_CONNECT_KEY from chrome.storage.local and only calls ensureKeepaliveAlarm() when it's true, matching onStartup. Fresh installs with no prior connect won't burn wake-ups. (You can see the gated form on lines 994–1000 of the latest worker.ts.)
| // Clear the keep-alive alarm now that the user has explicitly | ||
| // stopped — let the worker idle out instead of waking every 30 s. | ||
| void clearKeepaliveAlarm(); |
There was a problem hiding this comment.
🟡 Keepalive alarm not cleared in cloud-logout and self-hosted-disconnect handlers
The PR adds clearKeepaliveAlarm() to the pause handler (worker.ts:1051) and to connect-failure / bootstrap-failure paths, but does not add it to the cloud-logout (worker.ts:1337-1349) or self-hosted-disconnect (worker.ts:1352-1362) handlers. Both handlers set shouldConnect = false and setAutoConnect(false) — the same teardown as pause — but omit the alarm clear. After logout or self-hosted disconnect, the alarm continues firing every 30 seconds, waking the service worker unnecessarily for the lifetime of the browser session.
Prompt for agents
The PR adds clearKeepaliveAlarm() to the pause handler (worker.ts:1049-1051) but misses two other stop-connection paths that perform the same state transitions (shouldConnect=false, setAutoConnect(false), disconnect):
1. cloud-logout handler at worker.ts:1337-1349: Add void clearKeepaliveAlarm() after setConnectionHealth("paused") (around line 1341), matching the pattern in the pause handler.
2. self-hosted-disconnect handler at worker.ts:1352-1362: Same addition after setConnectionHealth("paused") (around line 1356).
Without this, the keepalive alarm continues firing every 30 seconds after the user logs out or disconnects their self-hosted setup, unnecessarily waking the service worker.
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in 7c3b66c33d (same change as the Codex finding above). cloud-logout and self-hosted-disconnect both clear the alarm before the storage writes, alongside the existing pause/disconnect cleanup.
There was a problem hiding this comment.
Already addressed in 7c3b66c — both cloud-logout and self-hosted-disconnect handlers call await clearKeepaliveAlarm() immediately after disconnect() and before persisting setAutoConnect(false). SSE auth-error close is also covered (cleared on auth_required health transition).
d9d0f1e to
e793fbe
Compare
… expects to be open
Adds chrome.alarms-based keepalive so the SSE bridge survives popup close,
addressing the SW idle-suspension issue surfaced when host_browser was
extended to web/iOS turns.
- chrome.alarms keepalive at the production minimum period (0.5 min)
- Defensive belt: alarm handler kicks reconnect if SSE is closed but
shouldConnect remains true
- Pin minimum_chrome_version to 120 (the 30s alarm floor that the
keepalive depends on shipped in Chrome 120; pre-120 floor is 1 min
which is too coarse to outpace the 30s SW idle timeout)
- Add 'alarms' permission to manifest
- Lifecycle wiring:
- bootstrap / connect (success) ensure the alarm
- pause / disconnect / cloud-logout / self-hosted-disconnect / failed
connect / SSE auth-error onClose all clear the alarm
- onInstalled / onStartup gate ensure on existing auto-connect intent
e793fbe to
7c3b66c
Compare
…ive section Drop the section banner explainer, period docblock, alarm-handler 'defensive belt' comment, onStartup explainer, and inline call-site comments. Move the 'why' content (MV3 lifecycle, native alternatives considered, follow-ups) to the PR description per house style. Also strip the JSDoc descriptions from chrome-globals.d.ts on alarms / onInstalled / onStartup — the names speak for themselves. Net: 70 deletions, 0 additions; no behavior change. Resolves the two AGENTS.md generic-examples bot findings (no more user names in source).
| "description": "Bridges the Vellum assistant to your live browser session via Chrome DevTools Protocol (CDP) through chrome.debugger.", | ||
| "homepage_url": "https://www.vellum.ai", | ||
| "permissions": [ | ||
| "alarms", |
There was a problem hiding this comment.
cc: @noanflaherty are we as sensitive to new perms now that we are post launch?
What
The Vellum Chrome extension's MV3 service worker holds the SSE bridge to the cloud assistant (or local gateway). When the popup is closed, the worker idles and Chrome unloads it after ~30 s — the bridge dies and the cloud can't reach the extension until the next time the popup wakes the worker.
This PR adds a
chrome.alarmskeep-alive that fires every 30 s while we expect the connection to be live, resetting the SW idle timer so the bridge stays up across popup closures.Reproduction (before this PR)
host_browsertool call.bootstrap()→ SSE reconnects → it works again.Why now
PR #29829 (May 6) expanded
host_browserdispatch to web/iOS turns, so the cloud assistant routinely tries to reach the extension cross-client while the popup is closed. The SW idle suspension was always there — it just rarely mattered until the assistant started using the bridge during normal browsing.What changed
manifest.json: added"alarms"topermissions. Bumpedminimum_chrome_version116 → 120 (production-honored alarm minimum is 0.5 min, the tightest period that resets the SW idle timer; pre-120 the floor was 1 min, too coarse).background/worker.ts: keep-alive section with idempotentensureKeepaliveAlarm()/clearKeepaliveAlarm(), anonAlarmlistener (kicks reconnect ifshouldConnect&& SSE closed),onInstalled/onStartupregistrations gated on theautoConnectstorage flag, and ensure/clear wired into the connect/pause/cloud-logout/self-hosted-disconnect/bootstrap-fail/auth-error paths.types/chrome-globals.d.ts: narrow ambient types forchrome.alarms,chrome.runtime.onInstalled,chrome.runtime.onStartup(we don't depend on@types/chrome).Why
chrome.alarmsover native alternativesfetch()with 7 s heartbeat comments did not reliably keep the worker alive across popup closures. MV3 also caps active-fetch SW lifetime at 5 minutes regardless.chrome.runtime.connectports: only keep the SW alive while the popup is open — doesn't help when it closes.host_browsersession. The upgrade path.chrome.alarms✓: smallest diff, canonical MV3 pattern, no UX cost.Wrinkles for separate followups (not in this PR)
host_browsersession runs longer than that with the popup closed throughout, the SSE will still drop briefly. Offscreen documents are the fix when this surfaces./v1/eventsendpoint doesn't honorLast-Event-Id. Not in scope here; file separately if it bites.Manual verification
bunx tsc --noEmitcleanbunx eslint background/worker.ts types/chrome-globals.d.tscleanbun test— all 120 existing tests passNo dedicated unit test for the keep-alive: testing it in isolation would require mocking the entire
chrome.*surface (alarms + runtime events + storage + the SSE state machine) for ~30 lines that boil down to "did we callchrome.alarms.create". The behavior is small, easy to read, and verified end-to-end on real Chrome.Iteration log
7c3b66c33d— initial keep-alive wiring.9265d62cae— stripped explanatory comments per house style; "why" lives in this PR description.