Skip to content

fix(chrome-extension): keep MV3 service worker alive while SSE bridge expects to be open#29900

Merged
dvargasfuertes merged 2 commits into
mainfrom
apollo/chrome-ext-sw-keepalive
May 8, 2026
Merged

fix(chrome-extension): keep MV3 service worker alive while SSE bridge expects to be open#29900
dvargasfuertes merged 2 commits into
mainfrom
apollo/chrome-ext-sw-keepalive

Conversation

@vellum-apollo-bot
Copy link
Copy Markdown
Contributor

@vellum-apollo-bot vellum-apollo-bot Bot commented May 7, 2026

What

The Vellum Chrome extension's MV3 service worker holds the SSE bridge to the cloud assistant (or local gateway). When the popup is closed, the worker idles and Chrome unloads it after ~30 s — the bridge dies and the cloud can't reach the extension until the next time the popup wakes the worker.

This PR adds a chrome.alarms keep-alive that fires every 30 s while we expect the connection to be live, resetting the SW idle timer so the bridge stays up across popup closures.

Reproduction (before this PR)

  1. Pair the extension to a cloud assistant.
  2. Open Vellum on web, ask the assistant to drive a host_browser tool call.
  3. Click the popup closed and wait 30+ seconds.
  4. Try the assistant again. Symptom: the dispatch is black-holed for the suspension window. Reopen the popup → SW wakes via bootstrap() → SSE reconnects → it works again.

Why now

PR #29829 (May 6) expanded host_browser dispatch to web/iOS turns, so the cloud assistant routinely tries to reach the extension cross-client while the popup is closed. The SW idle suspension was always there — it just rarely mattered until the assistant started using the bridge during normal browsing.

What changed

  • manifest.json: added "alarms" to permissions. Bumped minimum_chrome_version 116 → 120 (production-honored alarm minimum is 0.5 min, the tightest period that resets the SW idle timer; pre-120 the floor was 1 min, too coarse).
  • background/worker.ts: keep-alive section with idempotent ensureKeepaliveAlarm() / clearKeepaliveAlarm(), an onAlarm listener (kicks reconnect if shouldConnect && SSE closed), onInstalled / onStartup registrations gated on the autoConnect storage flag, and ensure/clear wired into the connect/pause/cloud-logout/self-hosted-disconnect/bootstrap-fail/auth-error paths.
  • types/chrome-globals.d.ts: narrow ambient types for chrome.alarms, chrome.runtime.onInstalled, chrome.runtime.onStartup (we don't depend on @types/chrome).

Why chrome.alarms over native alternatives

  • Chrome 116+ "active fetch keeps SW alive": in theory yes; in practice an SSE fetch() with 7 s heartbeat comments did not reliably keep the worker alive across popup closures. MV3 also caps active-fetch SW lifetime at 5 minutes regardless.
  • chrome.runtime.connect ports: only keep the SW alive while the popup is open — doesn't help when it closes.
  • Offscreen documents: avoid SW suspension entirely. More code; better fit if we hit the alarms approach's 5 min cap during a long host_browser session. The upgrade path.
  • chrome.alarms: smallest diff, canonical MV3 pattern, no UX cost.

Wrinkles for separate followups (not in this PR)

  • 5 min hard cap — even with alarms firing, MV3 hard-caps SW life at 5 minutes per wake. If a host_browser session runs longer than that with the popup closed throughout, the SSE will still drop briefly. Offscreen documents are the fix when this surfaces.
  • Lost events during suspension — even with reconnect, events the cloud sent while the SW was dead are gone. Our /v1/events endpoint doesn't honor Last-Event-Id. Not in scope here; file separately if it bites.

Manual verification

  • bunx tsc --noEmit clean
  • bunx eslint background/worker.ts types/chrome-globals.d.ts clean
  • bun test — all 120 existing tests pass
  • 🛠️ Manual smoke test on packaged extension: pair → close popup → wait 60 s → cloud assistant can still reach the extension. (Will run on dev-build install.)

No dedicated unit test for the keep-alive: testing it in isolation would require mocking the entire chrome.* surface (alarms + runtime events + storage + the SSE state machine) for ~30 lines that boil down to "did we call chrome.alarms.create". The behavior is small, easy to read, and verified end-to-end on real Chrome.

Iteration log

  • Commit 7c3b66c33d — initial keep-alive wiring.
  • Commit 9265d62cae — stripped explanatory comments per house style; "why" lives in this PR description.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9d0f1ebc9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +933 to +937
* Production Chrome enforces a minimum `periodInMinutes` of 0.5
* (30 s); smaller values are silently rounded up. The alarm firing
* itself is what counts as "extension API activity" and resets the
* 30 s SW idle timer, so the alarm period needs to be at-or-below
* the idle threshold for the worker to stay alive across popup
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle pre-Chrome-120 alarm floor for keepalive

KEEPALIVE_PERIOD_MIN assumes production alarms can fire every 30s, but this extension still declares minimum_chrome_version: "116" in manifest.json; on Chrome 116–119, repeating alarms are floored to 1 minute, so the worker can still idle out after ~30s and the SSE bridge will keep dropping for those users. Either raise the minimum Chrome version to 120+ or add a fallback keepalive strategy for 116–119.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 7c3b66c33d. The 30 s alarm period is what makes the keepalive actually work, so I bumped minimum_chrome_version from 116 → 120 in manifest.json and updated the docstring to call out the dependency. Pre-120 the floor was 1 minute, which is too coarse to outpace the 30 s SW idle timer — there's no graceful fallback that achieves the same goal on those builds, and Chrome 120 shipped December 2023, so the user impact of bumping is minimal.

Comment on lines +976 to +978
chrome.runtime.onInstalled.addListener(() => {
void ensureKeepaliveAlarm();
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate install-time keepalive alarm on connection intent

Creating the keepalive alarm unconditionally in onInstalled starts a perpetual 30s wake cycle even for users who have never connected (or intentionally remain disconnected), because bootstrap() returns early when autoConnect is false and never clears this alarm. This adds continuous background wakeups and battery/CPU overhead for idle installs; only create the alarm when autoConnect is true (or after a successful connect).

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d. onInstalled now mirrors onStartup — it reads AUTO_CONNECT_KEY and only ensures the alarm if the user already had auto-connect set (e.g. an update over a connected install). For a fresh install with no prior connect, the alarm gets created on the first user-initiated Connect, so we don't burn a wake-up every 30 s on installs that never connect.

Comment on lines +1049 to +1051
// Clear the keep-alive alarm now that the user has explicitly
// stopped — let the worker idle out instead of waking every 30 s.
void clearKeepaliveAlarm();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Clear keepalive alarm on all disconnect/logout paths

The new alarm cleanup is wired only to pause/disconnect, but other user-facing stop flows (cloud-logout and self-hosted-disconnect) also transition to shouldConnect = false and clear autoConnect without clearing the alarm. After those flows, the extension can stay on a 30s wake loop despite being logged out/disconnected; alarm teardown should be centralized for every path that ends connection intent.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d. Both cloud-logout and self-hosted-disconnect handlers now await clearKeepaliveAlarm() before the storage writes, alongside the existing pause/disconnect cleanup.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 5 potential issues.

View 3 additional findings in Devin Review.

Open in Devin Review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚩 SSE onClose auth-error path sets shouldConnect=false but doesn't clear alarm

In the onClose callback at worker.ts:620-631, when an auth error is received, shouldConnect is set to false and sseConnection is nulled, but clearKeepaliveAlarm() is not called. The alarm will continue firing every 30 seconds. The alarm listener at line 964 checks shouldConnect first, so it won't attempt reconnection — the alarm is functionally inert but still wakes the worker. This is similar to the cloud-logout/self-hosted-disconnect omission reported as BUG-0004, but this path is slightly different: the user hasn't explicitly stopped, and a re-auth flow may follow. Clearing the alarm here would be correct but is a less clear-cut omission since the user may re-authenticate shortly.

(Refers to lines 620-631)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d. The SSE onClose auth-error branch now void clearKeepaliveAlarm() alongside setting shouldConnect = false. Auth-required is a hard stop until the user re-signs-in, so there's no value in keeping the worker awake.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in 7c3b66c — both cloud-logout and self-hosted-disconnect handlers call await clearKeepaliveAlarm() immediately after disconnect() and before persisting setAutoConnect(false). SSE auth-error close is also covered (cleared on auth_required health transition).

Comment on lines +905 to +907
// common trigger is the popup closing — see Marina's report from
// May 7, 2026 (Chrome ext "connection drops every time I close
// this... a few seconds, the connection kind of drops").
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Real person name "Marina" in code comment violates AGENTS.md Generic Examples rule

The comment at lines 905–907 references a real person by name ("Marina's report from May 7, 2026"). AGENTS.md §Generic Examples states: "Never include personal user data — real names, emails, phone numbers, account IDs, or other identifying details of specific people — anywhere in the codebase. This covers code, tests, fixtures, documentation, comments, commit messages, and AGENTS.md files." Should use a generic placeholder like "a user's report" instead.

Suggested change
// common trigger is the popup closing — see Marina's report from
// May 7, 2026 (Chrome ext "connection drops every time I close
// this... a few seconds, the connection kind of drops").
// common trigger is the popup closing — see the user report from
// May 7, 2026 (Chrome ext "connection drops every time I close
// this... a few seconds, the connection kind of drops").
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d. Stripped the named user reference from the introductory comment — replaced with a generic description of the popup-close idle path.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9265d62 — stripped the explanatory comment block from worker.ts entirely (and the chrome-globals.d.ts JSDoc additions), per Vargas's preference to keep the "why" content in the PR description rather than the source. The reporter-name violations no longer have a comment to live in.

// - Chrome 116+ extends SW lifetime on active fetches in theory,
// but our SSE's 7 s heartbeat comments did not reliably keep the
// worker alive in practice (Marina's reproducer). MV3 also caps
// active-fetch SW lifetime at 5 minutes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Second real person name "Marina" in code comment violates AGENTS.md Generic Examples rule

Line 920 references "Marina's reproducer", another instance of using a real person's name in a code comment. Same AGENTS.md §Generic Examples violation as the first instance. Should use a generic placeholder like "the user's reproducer".

Suggested change
// active-fetch SW lifetime at 5 minutes.
// worker alive in practice (the reporter's reproducer). MV3 also caps
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d along with the other reference further up. The "alternatives considered" comment no longer attributes the SSE-heartbeat finding to a named user.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9265d62 — stripped the explanatory comment block from worker.ts entirely (and the chrome-globals.d.ts JSDoc additions), per Vargas's preference to keep the "why" content in the PR description rather than the source. The reporter-name violations no longer have a comment to live in.

Comment on lines +976 to +978
chrome.runtime.onInstalled.addListener(() => {
void ensureKeepaliveAlarm();
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 onInstalled unconditionally creates keepalive alarm, wasting resources on fresh installs

The onInstalled listener at line 976 calls ensureKeepaliveAlarm() unconditionally — on fresh installs, Chrome updates, and extension updates — even when autoConnect is false. This contradicts the stated design at line 928 ("The alarm is created on first connect and cleared on pause so idle (paused) installs don't pay a wake-up cost every 30 s"). On a fresh install, autoConnect is unset and shouldConnect is false, so the alarm fires every 30 seconds doing nothing. No code path clears it until the user explicitly connects and later pauses. Compare with the onStartup listener at lines 984–990 which correctly gates on autoConnect.

Suggested change
chrome.runtime.onInstalled.addListener(() => {
void ensureKeepaliveAlarm();
});
chrome.runtime.onInstalled.addListener(() => {
void chrome.storage.local.get(AUTO_CONNECT_KEY).then((result) => {
if (result[AUTO_CONNECT_KEY] === true) {
void ensureKeepaliveAlarm();
}
});
});
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d (same change as the Codex finding above). onInstalled is now gated on AUTO_CONNECT_KEY === true so a fresh install with no connection intent doesn't burn a wake-up every 30 s.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in 7c3b66conInstalled reads AUTO_CONNECT_KEY from chrome.storage.local and only calls ensureKeepaliveAlarm() when it's true, matching onStartup. Fresh installs with no prior connect won't burn wake-ups. (You can see the gated form on lines 994–1000 of the latest worker.ts.)

Comment on lines +1049 to +1051
// Clear the keep-alive alarm now that the user has explicitly
// stopped — let the worker idle out instead of waking every 30 s.
void clearKeepaliveAlarm();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Keepalive alarm not cleared in cloud-logout and self-hosted-disconnect handlers

The PR adds clearKeepaliveAlarm() to the pause handler (worker.ts:1051) and to connect-failure / bootstrap-failure paths, but does not add it to the cloud-logout (worker.ts:1337-1349) or self-hosted-disconnect (worker.ts:1352-1362) handlers. Both handlers set shouldConnect = false and setAutoConnect(false) — the same teardown as pause — but omit the alarm clear. After logout or self-hosted disconnect, the alarm continues firing every 30 seconds, waking the service worker unnecessarily for the lifetime of the browser session.

Prompt for agents
The PR adds clearKeepaliveAlarm() to the pause handler (worker.ts:1049-1051) but misses two other stop-connection paths that perform the same state transitions (shouldConnect=false, setAutoConnect(false), disconnect):

1. cloud-logout handler at worker.ts:1337-1349: Add void clearKeepaliveAlarm() after setConnectionHealth("paused") (around line 1341), matching the pattern in the pause handler.

2. self-hosted-disconnect handler at worker.ts:1352-1362: Same addition after setConnectionHealth("paused") (around line 1356).

Without this, the keepalive alarm continues firing every 30 seconds after the user logs out or disconnects their self-hosted setup, unnecessarily waking the service worker.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 7c3b66c33d (same change as the Codex finding above). cloud-logout and self-hosted-disconnect both clear the alarm before the storage writes, alongside the existing pause/disconnect cleanup.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already addressed in 7c3b66c — both cloud-logout and self-hosted-disconnect handlers call await clearKeepaliveAlarm() immediately after disconnect() and before persisting setAutoConnect(false). SSE auth-error close is also covered (cleared on auth_required health transition).

@vellum-apollo-bot vellum-apollo-bot Bot force-pushed the apollo/chrome-ext-sw-keepalive branch from d9d0f1e to e793fbe Compare May 7, 2026 21:04
… expects to be open

Adds chrome.alarms-based keepalive so the SSE bridge survives popup close,
addressing the SW idle-suspension issue surfaced when host_browser was
extended to web/iOS turns.

- chrome.alarms keepalive at the production minimum period (0.5 min)
- Defensive belt: alarm handler kicks reconnect if SSE is closed but
  shouldConnect remains true
- Pin minimum_chrome_version to 120 (the 30s alarm floor that the
  keepalive depends on shipped in Chrome 120; pre-120 floor is 1 min
  which is too coarse to outpace the 30s SW idle timeout)
- Add 'alarms' permission to manifest
- Lifecycle wiring:
  - bootstrap / connect (success) ensure the alarm
  - pause / disconnect / cloud-logout / self-hosted-disconnect / failed
    connect / SSE auth-error onClose all clear the alarm
  - onInstalled / onStartup gate ensure on existing auto-connect intent
@vellum-apollo-bot vellum-apollo-bot Bot force-pushed the apollo/chrome-ext-sw-keepalive branch from e793fbe to 7c3b66c Compare May 7, 2026 21:04
…ive section

Drop the section banner explainer, period docblock, alarm-handler 'defensive belt' comment, onStartup explainer, and inline call-site comments. Move the 'why' content (MV3 lifecycle, native alternatives considered, follow-ups) to the PR description per house style. Also strip the JSDoc descriptions from chrome-globals.d.ts on alarms / onInstalled / onStartup — the names speak for themselves.

Net: 70 deletions, 0 additions; no behavior change. Resolves the two AGENTS.md generic-examples bot findings (no more user names in source).
"description": "Bridges the Vellum assistant to your live browser session via Chrome DevTools Protocol (CDP) through chrome.debugger.",
"homepage_url": "https://www.vellum.ai",
"permissions": [
"alarms",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @noanflaherty are we as sensitive to new perms now that we are post launch?

@dvargasfuertes dvargasfuertes merged commit 74738e4 into main May 8, 2026
3 checks passed
@dvargasfuertes dvargasfuertes deleted the apollo/chrome-ext-sw-keepalive branch May 8, 2026 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant