fix(macos): terminate-first restart + cold-launch avatar cache#27227
Conversation
…cause) Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 60225d0d82
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if let bundleId = Bundle.main.bundleIdentifier { | ||
| let others = NSRunningApplication.runningApplications(withBundleIdentifier: bundleId) | ||
| .filter { $0 != .current && !$0.isTerminated } | ||
| if let existing = others.first { |
There was a problem hiding this comment.
Keep overlap handling for rename relaunch timeout path
The single-instance guard now always terminates any second process, but AppBundleRenamer still uses a bounded wait before relaunch (for i in $(seq 1 100) in the same file) and can open the new bundle while the old process is still alive. In that slow-shutdown case, this guard path will kill the replacement instance and the app can end up not relaunching after retire/rename flows. Before this change, the sentinel allowed this specific overlap window, so removing the exception without removing the timeout introduces a restart failure mode.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — you're right that the sentinel was covering the "timeout-elapsed-but-old-process-still-alive" window in AppBundleRenamer, and removing the guard exception without addressing that window would introduce a failure mode. Fixed in 8da224c by making the renamer script bail (via rm -f "$0"; exit 0) if the PID is still alive after the 10s wait, instead of proceeding to rename filesystem state under a live process and calling open. The rename is retried on next launch via AppBundleRenamer.needsRename, which is the correct recovery path.
…open on PID exit Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8da224c254
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if kill -0 \(pid) 2>/dev/null; then | ||
| exit 0 |
There was a problem hiding this comment.
Match restart watcher timeout to shutdown timeout
The watcher aborts relaunch with exit 0 when the current PID is still alive, but this check runs after only ~10s while app termination can legitimately take up to 15s (VellumCli.stopTimeout in App/VellumCli.swift). In the 10–15s window (e.g., slow daemon/gateway shutdown), clicking Restart will terminate the current app without reopening it, so users get a quit instead of a restart. Increase this wait (or gate on explicit cancel) so normal shutdown latency does not drop relaunches.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — confirmed VellumCli.stopTimeout = 15.0 (VellumCli.swift:358), and applicationWillTerminate adds more teardown on top. The 10s cap would quit-without-relaunch in the 10–15s window.
Fixed in bc01be8 — bumped both the performRestart watcher and AppBundleRenamer helper to 30s (seq 1 300) to cover cli.stop's full timeout plus AppKit teardown headroom, without going unbounded (the abort branch still guards against a cancelled terminate looping forever).
VellumCli.stopTimeout is 15s, and AppKit teardown adds more time on top of that. A 10s watcher cap caused Restart to terminate the app without relaunching whenever shutdown took 10-15s. Both performRestart and AppBundleRenamer now wait up to 30s. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
Persists the assistant's avatar image and character traits to Application Support on every successful gateway fetch and hydrates applicationIconImage synchronously during start() before awaitDaemonReady. The gateway remains the authoritative source - reloadAvatar() still runs on startup and overwrites the cache on any change. Eliminates the Vellum-logo-to-avatar flash users see on every cold launch (not just after Restart). Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
…s-restart-terminate-first
Two related macOS restart/launch fixes in one PR.
1. Restart via terminate-first helper
The menu bar "Restart" relaunched via
NSWorkspace.openApplication(..., createsNewApplicationInstance: true), which starts a new instance before the old one terminates. The resulting overlap window is the root cause of LUM-728 (zombie old instance when termination stalls) and LUM-1042 (avatar cleared after restart — the old instance'sapplicationWillTerminateresets avatar state on top of the new instance's already-hydrated state), and required a sentinel-file + single-instance-guard exception +isRestartingshort-circuit just to paper over it.This switches to the terminate-first self-relaunch pattern used by Sparkle, Electron's
app.relaunch, and the existingAppBundleRenamerin this codebase: spawn a detached/bin/shwatcher that polls our PID withkill -0, then callsopen "<bundle>"once we're gone. Only one instance is ever alive, so the overlap window — and all the machinery built around it — disappears.Why this is safe
applicationShouldTerminate→vellumCli.stop()→applicationWillTerminatecleanup), so the new instance boots against a cleanly-freed daemon state. No PID/port/lockfile fights.connectionManager.disconnect()is still called first (same ordering asperformRetireAsync) soautoWakeIfAssistantDied()doesn't fightcli.stop().Processis orphaned tolaunchdwhen we terminate — standard POSIX semantics;AppBundleRenamerhas used the same pattern in production for a while.Process.run()throws, we reconnect SSE instead of terminating.applicationDidFinishLaunchingis simplified, not weakened — the exception was only needed because we were creating simultaneous instances.Code removed
performRestart()'s sentinel file write/cleanup andNSWorkspace.openApplicationcall.AppDelegate.isRestartingfield and the.terminateNowshort-circuit inapplicationShouldTerminate(every quit now takes the same.terminateLater+cli.stop()path).applicationDidFinishLaunching.AppBundleRenamer(its own script already gates onkill -0 $pid, so no overlap exists there either).2. Local avatar cache for cold-launch hydration
Avatar fetches are deferred until
awaitDaemonReadysucceeds (otherwise they race daemon startup and a connection-refused blanks the dock icon). The user-visible consequence is a Vellum-logo-to-avatar flash on every cold launch while the daemon comes up.AvatarAppearanceManagernow persists the avatar image (PNG) and character traits (JSON) under~/Library/Application Support/<bundleID>/AvatarCache/on every successful gateway fetch or UI save, and hydratesapplicationIconImagesynchronously from that cache at the start ofstart()— before the daemon-ready wait.reloadAvatar()continues to run on startup and overwrites both the cache and in-memory state from the gateway. The gateway remains authoritative; the cache is a write-through client-side mirror.Invalidation points
resetForDisconnect()clearAll()clearCustomAvatar()clearAll()fetchTraitsViaHTTP200fetchAvatarViaHTTP200fetchAvatarViaHTTP/fetchTraitsViaHTTPclearImage/clearTraitsWhy this is safe
Caches/, which the OS may purge). Reference: File System Basics.load()returns empty snapshot → normal gateway-fetch path runs unchanged.try?everywhere, decode failure returns empty snapshot → falls back to gateway.reloadAvatar()runs after daemon-ready.restoreBundleIcon()inapplicationWillTerminate) unchanged.Root cause analysis (restart fix)
createsNewApplicationInstance: truebecauseNSWorkspace.openApplicationis the obvious AppKit API for "launch an app". When LUM-728 surfaced zombies, PR fix: prevent zombie instances on macOS app restart #23763 added theisRestartingflag +.terminateNowshort-circuit + sentinel file — each change made the existing pattern less broken without questioning the pattern itself.AppBundleRenameralready implemented the correct pattern in this codebase. The accumulating workarounds (sentinel file, guard exception, isRestarting flag, terminateNow short-circuit) were a strong signal that the underlying primitive was wrong.performRestart()comment where it's actually needed.Alternatives considered and rejected
Restart:
hatch()/ platform API, same as Settings). Cleaner architecturally and fixes the same bugs, but changes the UX — "Restart" on a menu bar is expected to restart the whole app, not just the background daemon. Rejected by product.createsNewApplicationInstance: true. Doesn't eliminate the overlap window, just narrows it. We've iterated on this path twice (feat(macos): add Restart menu item to status bar menu #8347, fix: prevent zombie instances on macOS app restart #23763) and still have LUM-1042./bin/shdependency), but adds bundle/sign/path-resolve complexity. The shell helper is already proven here byAppBundleRenamer.Avatar cache:
restoreBundleIcon()on the restart path only. Minimal change, but doesn't help generic cold launches and leaves a Vellum-logo flash on pinned-dock tiles between PID exit and new-PID launch.restoreBundleIcon()entirely. One-line fix, but changes pinned-dock behavior (avatar sticks when app is quit) which contradicts Apple's convention of the bundle icon being the canonical "not running" state.~/Library/Caches/. OS can purge at any time, defeating the cold-start hydration purpose.References
NSWorkspace.OpenConfiguration.createsNewApplicationInstance— confirms two simultaneous instances are documented behavior.NSApplicationDelegate.applicationShouldTerminate—.terminateLater/NSApp.reply(...)pattern.Testing
CI skips macOS builds. Verify locally:
ps aux | grep vellum-assistantat any time; old PID exits before new PID launches; no dock-icon flash of a second instance.~/Library/Application Support/<bundleID>/AvatarCache/is emptied.