fix(macos): terminate-first restart + cold-launch avatar cache #27227

Merged: ashleeradka merged 5 commits into main from devin/1776791002-macos-restart-terminate-first on Apr 21, 2026

devin-ai-integration (bot, Contributor) commented on Apr 21, 2026

Two related macOS restart/launch fixes in one PR.

1. Restart via terminate-first helper

The menu bar "Restart" relaunched via NSWorkspace.openApplication(..., createsNewApplicationInstance: true), which starts a new instance before the old one terminates. The resulting overlap window is the root cause of LUM-728 (zombie old instance when termination stalls) and LUM-1042 (avatar cleared after restart — the old instance's applicationWillTerminate resets avatar state on top of the new instance's already-hydrated state), and required a sentinel-file + single-instance-guard exception + isRestarting short-circuit just to paper over it.

This switches to the terminate-first self-relaunch pattern used by Sparkle, Electron's app.relaunch, and the existing AppBundleRenamer in this codebase: spawn a detached /bin/sh watcher that polls our PID with kill -0, then calls open "<bundle>" once we're gone. Only one instance is ever alive, so the overlap window — and all the machinery built around it — disappears.
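As a rough illustration of the watcher pattern described above, here is a minimal sketch. The function names (`shellQuoted`, `makeRelaunchScript`, `spawnRelaunchWatcher`) are hypothetical and not taken from the PR, and the 0.1 s poll interval is an assumption. The loop is left unbounded for brevity.

```swift
import Foundation

/// Hypothetical helper: quotes a string as a single /bin/sh word.
/// Embedded single quotes are rewritten as '\''.
func shellQuoted(_ s: String) -> String {
    "'" + s.replacingOccurrences(of: "'", with: "'\\''") + "'"
}

/// Hypothetical helper: builds a watcher script that polls the old PID with
/// `kill -0` and reopens the bundle once that PID is gone.
/// The 0.1 s poll interval is an assumption, not confirmed by the PR.
func makeRelaunchScript(pid: Int32, bundlePath: String) -> String {
    """
    while kill -0 \(pid) 2>/dev/null; do sleep 0.1; done
    open \(shellQuoted(bundlePath))
    """
}

/// Spawns the watcher under /bin/sh. The caller is expected to request
/// termination afterwards; if `run()` throws, the caller should not terminate.
func spawnRelaunchWatcher(pid: Int32, bundlePath: String) throws {
    let watcher = Process()
    watcher.executableURL = URL(fileURLWithPath: "/bin/sh")
    watcher.arguments = ["-c", makeRelaunchScript(pid: pid, bundlePath: bundlePath)]
    try watcher.run()
}
```

A call site would presumably pass `ProcessInfo.processInfo.processIdentifier` and `Bundle.main.bundlePath`, then ask AppKit to terminate. Because `open` only runs after `kill -0` starts failing, the new instance cannot start while the old PID is alive.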

Why this is safe

  • Daemon/gateway shutdown runs through the normal terminate path (applicationShouldTerminate → vellumCli.stop() → applicationWillTerminate cleanup), so the new instance boots against a cleanly-freed daemon state. No PID/port/lockfile fights.
  • connectionManager.disconnect() is still called first (same ordering as performRetireAsync) so autoWakeIfAssistantDied() doesn't fight cli.stop().
  • The spawned Process is orphaned to launchd when we terminate — standard POSIX semantics; AppBundleRenamer has used the same pattern in production for a while.
  • Error handling preserved: if Process.run() throws, we reconnect SSE instead of terminating.
  • The single-instance guard in applicationDidFinishLaunching is simplified, not weakened — the exception was only needed because we were creating simultaneous instances.

Code removed

  • performRestart()'s sentinel file write/cleanup and NSWorkspace.openApplication call.
  • AppDelegate.isRestarting field and the .terminateNow short-circuit in applicationShouldTerminate (every quit now takes the same .terminateLater + cli.stop() path).
  • The sentinel-read and guard-exception branch in applicationDidFinishLaunching.
  • The redundant sentinel write in AppBundleRenamer (its own script already gates on kill -0 $pid, so no overlap exists there either).

2. Local avatar cache for cold-launch hydration

Avatar fetches are deferred until awaitDaemonReady succeeds; otherwise they race daemon startup, and a connection-refused error blanks the dock icon. The user-visible consequence is a Vellum-logo-to-avatar flash on every cold launch while the daemon comes up.

AvatarAppearanceManager now persists the avatar image (PNG) and character traits (JSON) under ~/Library/Application Support/<bundleID>/AvatarCache/ on every successful gateway fetch or UI save, and hydrates applicationIconImage synchronously from that cache at the start of start() — before the daemon-ready wait. reloadAvatar() continues to run on startup and overwrites both the cache and in-memory state from the gateway. The gateway remains authoritative; the cache is a write-through client-side mirror.
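To make the write-through cache concrete, here is a minimal Foundation-only sketch. The type names (`AvatarTraits`, `AvatarSnapshot`, `AvatarCacheStore`), the trait fields, and the file names `avatar.png` / `traits.json` are all assumptions for illustration; the PR does not show the actual layout.

```swift
import Foundation

// Hypothetical trait payload; the real character-trait schema is not shown in the PR.
struct AvatarTraits: Codable, Equatable {
    var bodyShape: String
    var color: String
}

// What hydration reads back; either field may be missing.
struct AvatarSnapshot {
    var imagePNG: Data?
    var traits: AvatarTraits?
}

// Hypothetical cache store rooted at a directory such as
// ~/Library/Application Support/<bundleID>/AvatarCache/.
struct AvatarCacheStore {
    let directory: URL
    private var imageURL: URL { directory.appendingPathComponent("avatar.png") }
    private var traitsURL: URL { directory.appendingPathComponent("traits.json") }

    // Missing directory, missing files, or undecodable JSON all yield empty fields.
    func load() -> AvatarSnapshot {
        let image = try? Data(contentsOf: imageURL)
        let traits = (try? Data(contentsOf: traitsURL)).flatMap {
            try? JSONDecoder().decode(AvatarTraits.self, from: $0)
        }
        return AvatarSnapshot(imagePNG: image, traits: traits)
    }

    func saveImage(_ png: Data) {
        try? FileManager.default.createDirectory(at: directory, withIntermediateDirectories: true)
        try? png.write(to: imageURL, options: .atomic)
    }

    func saveTraits(_ traits: AvatarTraits) {
        try? FileManager.default.createDirectory(at: directory, withIntermediateDirectories: true)
        if let data = try? JSONEncoder().encode(traits) {
            try? data.write(to: traitsURL, options: .atomic)
        }
    }

    func clearImage() { try? FileManager.default.removeItem(at: imageURL) }
    func clearTraits() { try? FileManager.default.removeItem(at: traitsURL) }
    func clearAll() { try? FileManager.default.removeItem(at: directory) }
}
```

Since `load()` never throws and touches only the local disk, it can run synchronously before the daemon-ready wait. An empty snapshot simply leaves the existing gateway-fetch path to populate the icon.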

Invalidation points

| Trigger | Site | Action |
| --- | --- | --- |
| Daemon disconnect / logout / assistant switch | resetForDisconnect() | clearAll() |
| User clears custom avatar (Settings) | clearCustomAvatar() | clearAll() |
| Image → character mode | fetchTraitsViaHTTP 200 | save traits, clear image |
| Character → image mode | fetchAvatarViaHTTP 200 | save image |
| Authoritative 404 from gateway | fetchAvatarViaHTTP / fetchTraitsViaHTTP | clearImage / clearTraits |
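
The invalidation points listed above can be read as a mapping from trigger to cache actions. The sketch below encodes that mapping with hypothetical enum names (`CacheTrigger`, `CacheAction`); it illustrates the rules as stated here and does not claim to match the PR's code.

```swift
// Hypothetical names; they mirror the triggers and actions listed above.
enum CacheTrigger {
    case disconnectLogoutOrAssistantSwitch  // resetForDisconnect()
    case userClearedCustomAvatar            // clearCustomAvatar()
    case traitsFetchSucceeded               // fetchTraitsViaHTTP 200 (image → character mode)
    case imageFetchSucceeded                // fetchAvatarViaHTTP 200 (character → image mode)
    case imageFetchNotFound                 // authoritative 404 from fetchAvatarViaHTTP
    case traitsFetchNotFound                // authoritative 404 from fetchTraitsViaHTTP
}

enum CacheAction: Equatable {
    case clearAll, saveTraits, saveImage, clearImage, clearTraits
}

// Returns the cache actions for a trigger, in the order they would be applied.
func cacheActions(for trigger: CacheTrigger) -> [CacheAction] {
    switch trigger {
    case .disconnectLogoutOrAssistantSwitch, .userClearedCustomAvatar:
        return [.clearAll]
    case .traitsFetchSucceeded:
        return [.saveTraits, .clearImage]
    case .imageFetchSucceeded:
        return [.saveImage]
    case .imageFetchNotFound:
        return [.clearImage]
    case .traitsFetchNotFound:
        return [.clearTraits]
    }
}
```

Only authoritative gateway responses (200 or 404) and explicit user or session events change the cache. Under this reading, a transient failure such as connection-refused leaves it untouched.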

Why this is safe

  • Apple-recommended location for per-app persistent data (not Caches/, which the OS may purge). Reference: File System Basics.
  • First-launch: cache dir doesn't exist → load() returns empty snapshot → normal gateway-fetch path runs unchanged.
  • Corrupted files: try? everywhere, decode failure returns empty snapshot → falls back to gateway.
  • No migration needed — cache is write-through from the gateway, so any divergence is reconciled the moment reloadAvatar() runs after daemon-ready.
  • Pinned-dock behavior (restoreBundleIcon() in applicationWillTerminate) unchanged.
  • No behavior change — gateway is still the source of truth; only the cold-start dock-icon timing improves.

Root cause analysis (restart fix)

  1. How did the code get into this state? PR #8347 (feat(macos): add Restart menu item to status bar menu) added the menu bar Restart using createsNewApplicationInstance: true because NSWorkspace.openApplication is the obvious AppKit API for "launch an app". When LUM-728 surfaced zombies, PR #23763 (fix: prevent zombie instances on macOS app restart) added the isRestarting flag + .terminateNow short-circuit + sentinel file. Each change made the existing pattern less broken without questioning the pattern itself.
  2. What decisions led to it? Reaching for the first AppKit API that matches the verbal description of the task, instead of researching how self-relaunch is actually done on macOS. Apple provides no first-party self-relaunch API, and every mature macOS app uses a helper-process pattern for exactly this reason.
  3. Warning signs missed. AppBundleRenamer already implemented the correct pattern in this codebase. The accumulating workarounds (sentinel file, guard exception, isRestarting flag, terminateNow short-circuit) were a strong signal that the underlying primitive was wrong.
  4. Preventing recurrence. When a bug fix adds special-cased branches to lifecycle code (shutdown, launch, single-instance), that's a signal the primitive is wrong — step back and question the primitive before adding the branch.
  5. AGENTS.md guidance? No addition — the rule "question the primitive when workarounds accumulate" is too general to encode usefully, and the specific macOS self-relaunch pattern is now documented in the performRestart() comment where it's actually needed.

Alternatives considered and rejected

Restart:

  • Daemon-only restart (redirect menu bar to hatch() / platform API, same as Settings). Cleaner architecturally and fixes the same bugs, but changes the UX — "Restart" on a menu bar is expected to restart the whole app, not just the background daemon. Rejected by product.
  • More fencing around createsNewApplicationInstance: true. Doesn't eliminate the overlap window, just narrows it. We've iterated on this path twice (PR #8347 and PR #23763) and still have LUM-1042.
  • AppleScript-based relaunch. Relies on scripting additions, timing-fragile, and depends on the target responding to AppleScript quit cleanly.
  • Dedicated compiled helper binary (Sparkle-style). Better in principle (signed artifact, no /bin/sh dependency), but adds bundle/sign/path-resolve complexity. The shell helper is already proven here by AppBundleRenamer.

Avatar cache:

  • Skip restoreBundleIcon() on the restart path only. Minimal change, but doesn't help generic cold launches and leaves a Vellum-logo flash on pinned-dock tiles between PID exit and new-PID launch.
  • Remove restoreBundleIcon() entirely. One-line fix, but changes pinned-dock behavior (avatar sticks when app is quit) which contradicts Apple's convention of the bundle icon being the canonical "not running" state.
  • Cache in ~/Library/Caches/. OS can purge at any time, defeating the cold-start hydration purpose.

Testing

CI skips macOS builds. Verify locally:

  1. Menu bar → Restart: only one instance visible in ps aux | grep vellum-assistant at any time; old PID exits before new PID launches; no dock-icon flash of a second instance.
  2. Cold launch: dock icon shows the assistant avatar immediately (within first frame), not the Vellum logo.
  3. Logout / switch assistant: ~/Library/Application Support/<bundleID>/AvatarCache/ is emptied.
  4. Clear custom avatar in Settings: cache is emptied.

…cause)

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 60225d0d82

Comment on lines +364 to 367
if let bundleId = Bundle.main.bundleIdentifier {
    let others = NSRunningApplication.runningApplications(withBundleIdentifier: bundleId)
        .filter { $0 != .current && !$0.isTerminated }
    if let existing = others.first {

P2: Keep overlap handling for rename relaunch timeout path

The single-instance guard now always terminates any second process, but AppBundleRenamer still uses a bounded wait before relaunch (for i in $(seq 1 100) in the same file) and can open the new bundle while the old process is still alive. In that slow-shutdown case, this guard path will kill the replacement instance and the app can end up not relaunching after retire/rename flows. Before this change, the sentinel allowed this specific overlap window, so removing the exception without removing the timeout introduces a restart failure mode.

Reply from the PR author (devin-ai-integration):

Good catch — you're right that the sentinel was covering the "timeout-elapsed-but-old-process-still-alive" window in AppBundleRenamer, and removing the guard exception without addressing that window would introduce a failure mode. Fixed in 8da224c by making the renamer script bail (via rm -f "$0"; exit 0) if the PID is still alive after the 10s wait, instead of proceeding to rename filesystem state under a live process and calling open. The rename is retried on next launch via AppBundleRenamer.needsRename, which is the correct recovery path.

Comment thread clients/macos/vellum-assistant/App/AppDelegate+AuthLifecycle.swift Outdated
…open on PID exit

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
ashleeradka (Contributor) commented:

@codex review

ashleeradka previously approved these changes Apr 21, 2026
chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8da224c254

Comment on lines +192 to +193
if kill -0 \(pid) 2>/dev/null; then
    exit 0

P1: Match restart watcher timeout to shutdown timeout

The watcher aborts relaunch with exit 0 when the current PID is still alive, but this check runs after only ~10s while app termination can legitimately take up to 15s (VellumCli.stopTimeout in App/VellumCli.swift). In the 10–15s window (e.g., slow daemon/gateway shutdown), clicking Restart will terminate the current app without reopening it, so users get a quit instead of a restart. Increase this wait (or gate on explicit cancel) so normal shutdown latency does not drop relaunches.

Reply from the PR author (devin-ai-integration):

Good catch — confirmed VellumCli.stopTimeout = 15.0 (VellumCli.swift:358), and applicationWillTerminate adds more teardown on top. The 10s cap would quit-without-relaunch in the 10–15s window.

Fixed in bc01be8 — bumped both the performRestart watcher and AppBundleRenamer helper to 30s (seq 1 300) to cover cli.stop's full timeout plus AppKit teardown headroom, without going unbounded (the abort branch still guards against a cancelled terminate looping forever).
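
The timeout reasoning can be checked with a small sketch. The names `watcherBudgetSeconds` and `makeBoundedWatcherScript` are hypothetical. The 0.1 s poll interval is inferred from the thread (100 iterations for roughly 10 s) rather than quoted from the code, and the script shape is an approximation of the bounded loop plus abort branch discussed above.

```swift
import Foundation

/// Total time the watcher waits before giving up, in seconds.
func watcherBudgetSeconds(maxPolls: Int, pollInterval: Double) -> Double {
    Double(maxPolls) * pollInterval
}

/// Hypothetical builder for a bounded watcher: poll up to `maxPolls` times,
/// abort without relaunching if the PID is still alive, otherwise reopen.
/// `quotedBundlePath` must already be shell-quoted by the caller.
func makeBoundedWatcherScript(pid: Int32, quotedBundlePath: String, maxPolls: Int) -> String {
    """
    for i in $(seq 1 \(maxPolls)); do
      kill -0 \(pid) 2>/dev/null || break
      sleep 0.1
    done
    if kill -0 \(pid) 2>/dev/null; then
      exit 0
    fi
    open \(quotedBundlePath)
    """
}
```

With 100 polls the budget is about 10 s, which falls short of the 15 s `VellumCli.stopTimeout`. With 300 polls it is about 30 s, which covers the stop timeout plus teardown headroom while still terminating if the quit is cancelled.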

VellumCli.stopTimeout is 15s, and AppKit teardown adds more time on
top of that. A 10s watcher cap caused Restart to terminate the app
without relaunching whenever shutdown took 10-15s. Both performRestart
and AppBundleRenamer now wait up to 30s.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
Persists the assistant's avatar image and character traits to Application
Support on every successful gateway fetch and hydrates applicationIconImage
synchronously during start() before awaitDaemonReady. The gateway remains the
authoritative source - reloadAvatar() still runs on startup and overwrites
the cache on any change. Eliminates the Vellum-logo-to-avatar flash users
see on every cold launch (not just after Restart).

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
devin-ai-integration (bot) changed the title from "fix(macos): restart via terminate-first helper (remove overlap-window race)" to "fix(macos): terminate-first restart + cold-launch avatar cache" on Apr 21, 2026
@ashleeradka ashleeradka merged commit bc26712 into main Apr 21, 2026
6 checks passed
@ashleeradka ashleeradka deleted the devin/1776791002-macos-restart-terminate-first branch April 21, 2026 19:27