feat(web): auto re-pair a local assistant on connect via vellum wake (LUM-2233)#33271
… bootstrap

When connecting to a local assistant from the login page, the connect path now self-heals before surfacing an error: on a repairable failure it runs `vellum wake` (re-seeds the guardian token and restarts the daemon + gateway, leaving the assistant's data and identity untouched), then primes the connection once more. This matches the native client's bootstrap, which re-pairs a stopped, expired, or mis-seeded assistant automatically rather than dead-ending the user on a "Couldn't connect" + Retry card.

Wiring follows the existing hatch/retire host seam end to end:
- `runWake` CLI driver in `@vellumai/local-mode` (mirrors `runRetire`).
- `vellum:localMode:wake` IPC handler + preload contract in the Electron app.
- `/assistant/__local/wake` dev middleware in the web Vite plugin.
- `wakeLocalAssistantHost` seam in `runtime/local-mode-host` (Electron IPC vs dev fetch), and `primeLocalGatewayConnectionWithRepair` wrapping the connect primitive with one wake + retry.

Safe and contained: only the interactive connect (`connectLocalAssistant`) opts into repair; the best-effort boot probe stays on the plain primitive so app launch never spawns daemon processes. A 403 (refused loopback boundary) is non-repairable and surfaces unchanged; a failed wake or a still-failing retry propagates the original error so the existing connect-error UI is preserved. The preload `wake` channel is treated as optional in the renderer so an older Electron shell (the macOS app and web bundle don't release together) degrades gracefully instead of throwing.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
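The control flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `primeWithRepair`, the injected `prime` and `wake` callbacks, and the simplified `GuardianTokenError` class are hypothetical stand-ins, and the real `primeLocalGatewayConnectionWithRepair` may differ in signature and detail.

```typescript
// Hypothetical stand-in for the PR's GuardianTokenError; only `status` matters here.
class GuardianTokenError extends Error {
  status?: number;
  constructor(message: string, status?: number) {
    super(message);
    this.name = "GuardianTokenError";
    this.status = status;
  }
}

type WakeResult = { ok: boolean; error?: string };

// A 403 is the refused loopback boundary, which a wake cannot change.
// Every other failure is treated as repairable.
function isRepairableConnectError(err: unknown): boolean {
  return !(err instanceof GuardianTokenError && err.status === 403);
}

// Dependencies are injected (an assumption of this sketch) so it runs
// without a real daemon or gateway.
async function primeWithRepair(
  prime: () => Promise<void>,
  wake: () => Promise<WakeResult>,
): Promise<void> {
  try {
    await prime();
    return; // happy path: nothing is spawned
  } catch (original) {
    if (!isRepairableConnectError(original)) throw original;

    // One wake attempt only; a failed wake surfaces the original error.
    const repair = await wake();
    if (!repair.ok) throw original;

    try {
      await prime(); // one retry, no loop
    } catch {
      throw original; // still failing: keep the first error for the UI
    }
  }
}
```

Throwing the original error in both failure branches is what keeps the existing connect-error card unchanged: the caller cannot tell whether a repair was attempted.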
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ca8d30093
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…ests

`connectLocalAssistant` now primes through the repair wrapper, so the `@/lib/local-mode` mock must export it; without it the imported function is undefined and the connect tests throw before exercising behavior.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6442c1a94e
`vellum wake` cold-starts a stopped assistant: the CLI waits up to 60s for the daemon (plus a 60s source-daemon fallback) and up to 30s for the gateway. The 60s wrapper timeout could kill a slow-but-succeeding wake before the gateway came up, misreporting auto-repair as a timeout. Raise the safety-net timeout to 180s so it sits above those readiness windows.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
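The arithmetic behind the new timeout can be checked with a small sketch. The window values are the ones quoted in this commit message; treating them as strictly sequential is an assumption of the sketch, and the constant names other than the timeout values are illustrative.

```typescript
// Readiness windows quoted in the commit message, in milliseconds.
const DAEMON_WAIT_MS = 60_000;
const SOURCE_DAEMON_FALLBACK_MS = 60_000;
const GATEWAY_WAIT_MS = 30_000;

const OLD_WAKE_TIMEOUT_MS = 60_000;
const NEW_WAKE_TIMEOUT_MS = 180_000;

// Worst case, assuming the three waits run one after another.
const worstCaseMs =
  DAEMON_WAIT_MS + SOURCE_DAEMON_FALLBACK_MS + GATEWAY_WAIT_MS;

// A safety-net timeout is only safe if it sits above the worst case.
function coversWorstCase(timeoutMs: number): boolean {
  return timeoutMs > worstCaseMs;
}

console.log(worstCaseMs); // 150000
console.log(coversWorstCase(OLD_WAKE_TIMEOUT_MS)); // false
console.log(coversWorstCase(NEW_WAKE_TIMEOUT_MS)); // true
```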
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 67f8955dfb
    await primeLocalGatewayConnection();
    return;
Verify gateway reachability before skipping wake
When a user reconnects to a slept/stopped local assistant while an unexpired gateway token is still cached for the same token URL, primeLocalGatewayConnection() can resolve without touching the gateway: it reads the guardian token from disk, ensureGatewayToken() returns the cached token, and this path returns before wake is attempted. connectLocalAssistant then marks the user logged in even though the gateway is still stopped, so the new auto-repair path misses a common stopped-assistant case until later API calls fail. Consider forcing a real gateway/token probe or bypassing the cached gateway token before deciding repair is unnecessary.
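To make the mechanism concrete, here is a minimal sketch of how a cached token can hide a stopped gateway. Everything in it is hypothetical: `ensureGatewayTokenSketch`, `tokenCache`, and `fetchTokenFromGateway` are illustrative names, not the project's real `ensureGatewayToken()` implementation, and the 30-day TTL is the figure quoted in the reply on this thread.

```typescript
type CachedToken = { token: string; expiresAt: number };

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Hypothetical in-memory cache keyed by token URL.
const tokenCache = new Map<string, CachedToken>();

// Counts how often the (simulated) gateway is actually contacted.
let gatewayCalls = 0;

// Stand-in for the real token exchange; the gateway is "stopped" here.
async function fetchTokenFromGateway(url: string): Promise<string> {
  gatewayCalls += 1;
  throw new Error(`gateway unreachable at ${url}`);
}

async function ensureGatewayTokenSketch(url: string, now: number): Promise<string> {
  const hit = tokenCache.get(url);
  // Cache hit: returns without any network I/O, so a dead gateway goes unnoticed.
  if (hit && hit.expiresAt > now) return hit.token;

  // Cache miss or expired entry: only now is the gateway contacted.
  const token = await fetchTokenFromGateway(url);
  tokenCache.set(url, { token, expiresAt: now + THIRTY_DAYS_MS });
  return token;
}
```

With an unexpired entry in the cache, the call resolves and `gatewayCalls` stays at zero even though the gateway is down, which is the case where the prime succeeds and wake is skipped.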
Same gap as the earlier thread on this file — agreed it's real, and intentionally not fixed here.
ensureGatewayToken() returns the cached token (30-day TTL) without touching the gateway, so a slept/stopped assistant with a live token primes "successfully" and this path returns before wake; the dead gateway only surfaces on the first data-plane request.
Two reasons it stays out of this PR:
- It changes the happy path. A real gateway/token probe (or bypassing the cache) on every connect would spawn the daemon + gateway on every login even when nothing's wrong, which is out of scope for this minimal native-parity change. Note native is the same shape: `forceReBootstrap` fires from `GatewayConnectionManager.attemptRePair()` on a failed gateway request, not a pre-connect probe.
- It's the gateway-liveness half of the recovery work being handed to @noanflaherty.
Tracked as remaining scope on LUM-2232 (gateway-liveness probe / recovery driven off the first failed request). Leaving this thread open as the pointer to that follow-up rather than resolving it.
✦ APPROVE
Value: Closes the last rung the web/Electron local-connect path was missing relative to the native macOS bootstrap (forceReBootstrap → re-provision creds + restart). On a repairable connect failure, the renderer now spawns vellum wake <id> (non-destructive: re-seeds guardian token + restarts daemon+gateway, data/identity intact) and re-primes once. Adds one rung; carves out the gateway-liveness probe + recovery UI as LUM-2232 (handed to @noanflaherty) — clean scope discipline.
Layered seam (mirrors existing retire symmetrically):
- `packages/local-mode/src/wake.ts`: `runWake` CLI driver. Spawns `vellum wake <id>` with the existing `CliInvocation` and a never-reject `{ok, status?, error?}` contract. `WAKE_TIMEOUT_MS = 180_000` (raised in 67f8955 after Codex's correctly-flagged P2, see below).
- `apps/macos/src/main/local-mode.ts`: `vellum:localMode:wake` IPC handler.
- `apps/macos/src/preload/index.ts`: typed `wake` channel on `window.vellum.localMode`.
- `apps/web/src/runtime/local-mode-host.ts`: `wakeLocalAssistantHost`. The Electron branch reads `window.vellum!.localMode.wake` as optional (older shell → `{ok: false, error: 'Wake is not supported by this app version'}` → degrades to no-op repair); the dev branch issues `POST /assistant/__local/wake`.
- `apps/web/vite-plugin-local-mode.ts`: loopback-guarded dev endpoint mirroring retire.
- `apps/web/src/lib/local-mode.ts`: `primeLocalGatewayConnectionWithRepair`. Try-prime → if `isRepairableConnectError` (everything except `GuardianTokenError.status === 403`, which is a host-refused-loopback security boundary wake can't change) → resolve assistantId → `wake` → if `repair.ok`, re-prime; otherwise throw the original error.
- `apps/web/src/stores/auth-store.ts`: ONLY the interactive `connectLocalAssistant` opts in. The boot probe stays on the plain primitive, deliberately, so app launch never spawns daemon processes.
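The never-reject contract mentioned for the CLI driver can be illustrated with a generic sketch built on Node's `child_process.spawn`. This is not the PR's `runWake`: `runNeverReject` and `CliResult` are hypothetical names, the real driver goes through the existing `CliInvocation` (not shown in this thread), and the timeout handling here is one plausible approach among several.

```typescript
import { spawn } from "node:child_process";

type CliResult = { ok: boolean; status?: number; error?: string };

// Runs a command and always resolves: spawn failures, non-zero exits,
// and timeouts all become { ok: false, ... } instead of a rejection.
function runNeverReject(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<CliResult> {
  return new Promise((resolve) => {
    let output = "";
    let settled = false;
    const finish = (result: CliResult): void => {
      if (settled) return; // 'error' and 'close' can both fire
      settled = true;
      clearTimeout(timer);
      resolve(result);
    };

    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });

    // Safety net: terminate a hung child and report a timeout.
    const timer = setTimeout(() => {
      child.kill("SIGTERM");
      finish({ ok: false, error: `timed out after ${timeoutMs}ms` });
    }, timeoutMs);

    child.stdout.on("data", (chunk) => { output += String(chunk); });
    child.stderr.on("data", (chunk) => { output += String(chunk); });

    // Emitted when the process could not be spawned (e.g. binary missing).
    child.on("error", (err) => finish({ ok: false, error: err.message }));

    child.on("close", (code) => {
      if (code === 0) {
        finish({ ok: true, status: 0 });
      } else {
        finish({
          ok: false,
          status: code ?? undefined,
          error: output.trim() || `exited with code ${code}`,
        });
      }
    });
  });
}
```

Under these assumptions, a wake call would look like `runNeverReject("vellum", ["wake", assistantId], 180_000)`, and callers can branch on `result.ok` without a `try`/`catch`.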
Codex P2 at HEAD "Verify gateway reachability before skipping wake" — mooted (out of scope, scope-carved to LUM-2232)
Codex's mechanic is correct: ensureGatewayToken() returns a cached gateway token (30-day TTL) without contacting the gateway, so a slept/stopped assistant with a still-cached token primes "successfully" and this path skips wake — the dead gateway only surfaces on the first data-plane request.
Devin's inline reply at HEAD covers exactly this:
- Native parity says: repair on failure, not preemptively. `forceReBootstrap` in `clients/macos/.../AppDelegate+Bootstrap.swift` fires from `GatewayConnectionManager.attemptRePair()` on a failed gateway request, not a pre-connect probe. Forcing a wake/probe on every connect would change the happy path and spawn daemon+gateway on every login even when nothing's wrong.
- Gateway-liveness is the other half of recovery work, explicitly carved out in the PR body and tracked as remaining scope on LUM-2232 (gateway-liveness probe + recovery driven off the first failed request, handed to @noanflaherty).
The finding is a follow-up pointer, not a regression. Devin left the thread open as the LUM-2232 hand-off marker rather than resolving it — correct ergonomics.
Earlier Codex findings — both auto-resolved by Devin
- First commit P2 "Update local-mode mock for new repair call" → Devin self-fixed in `6442c1a9`: `auth-store.test.ts`'s `@/lib/local-mode` mock now exports both `primeLocalGatewayConnection` (boot probe) and `primeLocalGatewayConnectionWithRepair` (interactive connect). Cleared inline.
- Second commit P2 "Allow wake to outlive CLI startup waits" → Devin self-fixed in `67f8955d`. Original `WAKE_TIMEOUT_MS = 60_000` was below the CLI's documented readiness windows (60s daemon wait + optional 60s source-daemon fallback + 30s gateway wait ≈ 90s prod, up to 150s on dev source-fallback). A cold-start wake could be SIGTERMed at 60s while still succeeding, misreporting as timeout. Raised to 180s with a docstring citing the CLI readiness windows. `retire`'s 60s is correctly unchanged: teardown is a different envelope.
Anti-pattern cross-check
- No `as` runtime-boundary casts in the diff. ✓
- Version skew handled. macOS shell + web bundle don't release together, so a newer renderer can run against an older preload that predates the `wake` channel. The channel is typed optional on `window.vellum.localMode.wake` and the host seam returns `{ok: false, error: 'Wake is not supported by this app version'}` on `undefined`. Caller treats as no-op repair → falls through to existing connect-error card. Same shape the rest of the local-mode-host seam uses. ✓
- Non-destructive repair. Wake re-seeds creds + restarts; data + identity survive (matches native). The `retire` path stays separate for destructive removal. ✓
- Opt-in. Only interactive `connectLocalAssistant`. Boot probe (best-effort) stays on plain `primeLocalGatewayConnection` so app launch never spawns daemon. ✓
- 403 classification. `GuardianTokenError.status === 403` = host-refused-loopback security decision wake can't change → surfaces unchanged. All other failures (missing/expired/malformed token, unreachable/stopped gateway) → repairable. Correct boundary. ✓
- Symmetric with `retire` at every layer: same `{ok, error?}` contract, same `runRetire`-style never-reject CLI driver, same dev-endpoint shape. New code reads exactly like the existing pattern. ✓
- Test coverage. 3 new test files: `wake.test.ts` (spawn argv + non-zero exit + spawn failure), `local-mode-host.test.ts` (dev POST + Electron bridge + older-shell unsupported), `local-mode-repair.test.ts` (clean→no-wake, repairable→wake-then-retry-success, still-failing-retry → original error, failed-wake → original error, 403 → immediate surface, no wake). ✓
Territory check (R11e): Self-hosted local-connect arc — Boss-owned (Devin-authored on Boss's behalf, follow-up to #33241 and #33252 that already merged this session). auth-store.ts is touched but the line range is the connectLocalAssistant action only; #33219 (tri-state platform-session liveness, also Devin/Boss-authored, still open) is a separate concern in the same store. Same arc, same author lineage — no external collision. Not Vargas's SSE/seq territory. Not Mahmoud's vembda territory. ✓
Merge gate at HEAD 67f8955d: Vex ✓ · Codex P2 at HEAD mooted (out-of-scope, scope-carved to LUM-2232, Devin inline rebuttal at HEAD with native-parity citation) · Codex's earlier P2s both self-fixed by Devin and cleared ✓ · CI all green ✓. Devin is last pusher → bot-merge may be blocked by branch protection; will attempt and flag if blocked.
Vellum Constitution — Distinct: this is exactly the kind of "add one rung at a time" PR that keeps a multi-cycle recovery arc legible. Carving out the gateway-liveness probe as LUM-2232 instead of bundling it preserves both the diff's reviewability and the next reviewer's leverage on the harder half (when to force a repair vs trust a cached token) — the right surface area for the right decision.
Prompt / plan
Reach parity with the native macOS client's bootstrap for the web/Electron local-assistant login path, and nothing more: a minimal, unblocking change. The native client (`clients/macos/.../App/AppDelegate+Bootstrap.swift`) re-pairs a stopped, expired, or mis-seeded local assistant automatically (`forceReBootstrap` → re-provision creds + restart) before any error reaches the user. The web/Electron connect path had only the token-refresh rung and dead-ended on a "Couldn't connect" + Retry card. This PR adds the one missing rung: an automatic `vellum wake` + retry on a repairable connect failure. `wake` is the CLI's non-destructive repair primitive: it re-seeds the guardian token from a sibling environment and restarts the daemon + gateway, leaving the assistant's data and identity untouched. It is the CLI equivalent of native's re-pair.

Closes LUM-2233.
Broader recovery UI (a terminal recovery card with Create/Remove and per-lockfile-state cloud fallbacks) is intentionally out of scope and handed off under LUM-2232.
Changes
The wake seam is wired end to end, following the existing hatch/retire pattern at every layer:
- `@vellumai/local-mode`: `runWake` CLI driver (mirrors `runRetire`: spawn `vellum wake <id>`, 60s timeout at first and raised to 180s in 67f8955, never-reject `{ ok }` contract).
- `vellum:localMode:wake` IPC handler.
- `window.vellum.localMode.wake` contract.
- `/assistant/__local/wake` endpoint in the Vite plugin (loopback-guarded, mirrors the retire endpoint).
- `runtime/local-mode-host`: `wakeLocalAssistantHost` seam branching Electron IPC vs dev `fetch`.
- `lib/local-mode`: `primeLocalGatewayConnectionWithRepair`. On a repairable failure, run `wake` once and re-prime; classify a `GuardianTokenError` 403 (refused loopback boundary) as non-repairable.
- `stores/auth-store`: the interactive `connectLocalAssistant` opts into the repair-wrapped primitive; the best-effort boot probe stays on the plain primitive.

Test plan
- `packages/local-mode/src/__tests__/wake.test.ts`: `runWake` spawns the right argv, surfaces CLI output on non-zero exit, and resolves (not rejects) on spawn failure.
- `apps/web/src/runtime/local-mode-host.test.ts`: `wakeLocalAssistantHost` POSTs to the dev endpoint, routes through the Electron bridge without touching `fetch`, and reports an unsupported failure when an older shell lacks the `wake` channel.
- `apps/web/src/lib/local-mode-repair.test.ts`: clean connect never wakes; a repairable failure wakes once then retries to success; a still-failing retry and a failed wake both surface the original error without looping; a non-repairable 403 surfaces immediately and never wakes.
- `bunx tsc --noEmit` green for `apps/web`, `packages/local-mode`, `apps/macos`.
- `bun run lint` green for `apps/web` (pre-existing `exhaustive-deps` warnings only, none in touched files).

Safety
- Only `connectLocalAssistant` changes behavior; the happy path is unchanged (a clean first prime never spawns anything). The boot probe deliberately stays on the plain primitive so app launch never spawns daemon processes.
- `wake` re-pairs in place: it never retires or re-hatches, so the assistant's data and identity survive. This matches native's `forceReBootstrap`, which is destructive only to credentials, not the assistant.
- The `wake` channel is typed optional in the renderer and guarded at the seam, so an older shell degrades to a no-op repair (falls through to the connect error) instead of throwing.

References
- `clients/macos/vellum-assistant/App/AppDelegate+Bootstrap.swift` (`forceReBootstrap`, `ensureActorCredentials`).

CLI verb checklist
Not applicable: this PR adds no new IPC route under `assistant/src/runtime/routes/`. It wires an Electron-main IPC handler and a Vite dev middleware that drive the existing `vellum wake` CLI verb; no new `assistant` verb is needed.

Link to Devin session: https://app.devin.ai/sessions/15bca57bd4c64a3085cfb80e1f26355a
Requested by: @ashleeradka