
fix: replace bootstrap polling loops with reactive AsyncStream observation#21686

Merged
ashleeradka merged 8 commits into
mainfrom
devin/LUM-486-1774536721
Mar 26, 2026

@devin-ai-integration devin-ai-integration Bot commented Mar 26, 2026


Summary

Fixes LUM-459 / LUM-486 — Sentry ANR VELLUM-ASSISTANT-MACOS-8Z

Problem

awaitDaemonReady() polls connectionManager.isConnected every 500 ms and performInitialBootstrap() polls with exponential backoff (500 ms → 10 s), both on @MainActor. After sleep/wake, these loops resume alongside health checks, token refresh, reconnection, and SSE — all serialised through the main actor — exceeding the 2 s ANR threshold.
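
For contrast, a polling wait of the kind described above might look roughly like the sketch below. The type PollingSketch, the wakeups counter, and the parameter names are hypothetical and exist only for illustration; this is not the removed code. The point is that every iteration resumes on the main actor whether or not anything changed.

```swift
// Hypothetical stand-in for the old polling wait; not the project's code.
@MainActor
final class PollingSketch {
    var isConnected = false
    private(set) var wakeups = 0

    /// Checks `isConnected` up to `maxAttempts` times, sleeping between checks.
    func pollUntilConnected(intervalNanoseconds: UInt64, maxAttempts: Int) async -> Bool {
        for _ in 0..<maxAttempts {
            wakeups += 1 // each iteration is one main-actor wakeup
            if isConnected { return true }
            // Suspends, then resumes on the main actor for the next check.
            try? await Task.sleep(nanoseconds: intervalNanoseconds)
        }
        return isConnected
    }
}
```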

What changed

| File | Change |
| --- | --- |
| GatewayConnectionManager.swift | New isConnectedStream property: an AsyncStream<Bool> bridge over $isConnected via Combine .sink, with onTermination cleanup. |
| AppDelegate+Bootstrap.swift | awaitDaemonReady(): replaced polling loop with withTaskGroup racing isConnectedStream against a timeout (same pattern as existing awaitLocalBootstrapCompleted()). |
| AppDelegate+Bootstrap.swift | performInitialBootstrap(): replaced exponential-backoff connection wait with awaitConnectionEstablished(), which suspends via isConnectedStream until connected. Outer token-bootstrap retry loop unchanged. |
| AppDelegate+Bootstrap.swift | New awaitConnectionEstablished(): @MainActor, suspends on isConnectedStream. |
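
A minimal sketch of what such a bridge could look like, assuming an ObservableObject with a @Published isConnected flag. The class name ConnectionManagerSketch is hypothetical; only isConnected, isConnectedStream, .sink, and onTermination are taken from the description above, and the real property may differ in detail.

```swift
import Combine

// Hypothetical stand-in for GatewayConnectionManager.
@MainActor
final class ConnectionManagerSketch: ObservableObject {
    @Published var isConnected = false

    /// Each access creates a fresh Combine subscription wrapped in an AsyncStream.
    var isConnectedStream: AsyncStream<Bool> {
        AsyncStream { continuation in
            // @Published replays the current value to a new subscriber,
            // so the first element is the state at subscription time.
            let cancellable = $isConnected.sink { value in
                continuation.yield(value)
            }
            // Runs when the consumer finishes or its task is cancelled.
            // Note: AnyCancellable is not Sendable, so this capture may need
            // extra care under strict concurrency checking.
            continuation.onTermination = { _ in
                cancellable.cancel()
            }
        }
    }
}
```

Because the subscription lives inside the stream, a consumer that stops iterating (or is cancelled) tears it down without any manual bookkeeping.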

Why AsyncStream instead of $isConnected.values

Combine's AsyncPublisher (.values) does not terminate its iterator on task cancellation. In a withTaskGroup that calls cancelAll(), the child iterating .values never exits, so the group hangs indefinitely. AsyncStream's iterator returns nil on cancellation, allowing task groups to complete normally. This is a known Combine defect (FB9700937), confirmed by Apple engineer Philippe Hausler.
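
The race described above can be sketched as follows. The function awaitTrue and its parameter names are hypothetical; the shape (one child iterating the stream, one child sleeping, cancelAll() after the first result) follows the description, not the actual awaitDaemonReady() source.

```swift
/// Waits until `stream` yields `true` or the timeout elapses.
/// Returns whether `true` was observed before the timeout.
func awaitTrue(in stream: AsyncStream<Bool>, timeoutNanoseconds: UInt64) async -> Bool {
    await withTaskGroup(of: Bool.self) { group in
        group.addTask {
            for await value in stream where value {
                return true
            }
            // Reached when the stream finishes or this child is cancelled:
            // AsyncStream's iterator returns nil in both cases.
            return false
        }
        group.addTask {
            // A cancelled sleep throws; try? turns that into an early return.
            try? await Task.sleep(nanoseconds: timeoutNanoseconds)
            return false
        }
        // The first child to finish decides the outcome.
        let first = await group.next() ?? false
        // The group still waits for the other child, so it must exit promptly.
        group.cancelAll()
        return first
    }
}
```

If the stream child never returned after cancelAll(), the closure would return but withTaskGroup itself would not, which is the hang attributed to .values above.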

Benefits

  • Eliminates ANR: Zero main-actor wakeups while waiting for connection — suspends entirely instead of polling every 500 ms.
  • Faster reaction: Responds instantly to connection state changes instead of waiting up to 500 ms (or 10 s with exponential backoff) to notice.
  • Cancellation-safe: AsyncStream + onTermination cooperates with Swift structured concurrency, unlike AsyncPublisher.

Why it's safe

  • Behavior is identical: same timeouts, same cancellation semantics — only the wait mechanism changes (reactive suspension vs. polling).
  • @Published emits the current value on .sink subscription, so the guard-then-observe pattern has no TOCTOU gap.
  • isConnectedStream creates a fresh Combine subscription per call; onTermination cleans it up when the consumer finishes or is cancelled.
  • Connection waiting in performInitialBootstrap no longer uses exponential backoff — it suspends entirely until connected. The old backoff added latency with no benefit since connection state changes are event-driven.

References

| Topic | Source |
| --- | --- |
| AsyncStream as the recommended bridge for callback-based APIs → async/await | Apple AsyncStream docs |
| onTermination with .cancelled for cooperative cancellation cleanup | Apple AsyncStream.Continuation.onTermination |
| AsyncPublisher cancellation defect (FB9700937) | Swift Forums discussion |
| withTaskGroup waits for all children before returning | Apple withTaskGroup docs |
| ObservableObject → @Observable migration guidance (not done here; 50+ views) | Apple migration guide |
| Analyzing and eliminating hangs (WWDC23) | WWDC23: Analyze hangs with Instruments |

Review & Testing Checklist for Human

⚠️ CI has no macOS build environment — compilation and runtime behavior are entirely unverified in CI.

  • Verify Xcode compilation: import Combine is new in both files; isConnectedStream uses $isConnected.sink. Confirm no compiler errors with strict concurrency checking enabled.
  • Sleep/wake test — put Mac to sleep with assistant disconnected, wake, confirm no ANR and bootstrap resumes correctly. This is the core bug.
  • Send a message — confirm no hang during normal message send (regression check).
  • Timeout path — kill the daemon, launch app, confirm awaitDaemonReady times out gracefully (shows timeout screen) rather than hanging.
  • First-launch token bootstrap — remove actor token, launch app, verify performInitialBootstrap waits for connection then bootstraps credentials successfully.

Link to Devin session: https://app.devin.ai/sessions/0dbe772658c84bb583dd673f7d076dcb
Requested by: @ashleeradka



Convert awaitDaemonReady() and performInitialBootstrap() from polling
loops (500ms-10s exponential backoff on @MainActor) to reactive Combine
observation using $isConnected.values AsyncPublisher.

Root cause: After Mac sleep/wake, multiple @MainActor tasks resume
simultaneously (health checks, token refresh, reconnection, SSE),
and the polling loops add unnecessary main-thread wakeups that push
total main-actor work past the 2000ms ANR threshold.

Changes:
- awaitDaemonReady(): Uses withTaskGroup to race $isConnected.values
  against a timeout, matching the pattern in awaitLocalBootstrapCompleted()
- performInitialBootstrap(): Replaces exponential-backoff connection
  polling with awaitConnectionEstablished() reactive helper
- New awaitConnectionEstablished(): Suspends via for-await on
  $isConnected.values until connected, producing zero main-actor
  wakeups while disconnected

Refs: LUM-459, LUM-486
Apple refs checked (2026-03-26): WWDC23 'Analyze hangs with Instruments',
Apple docs 'Improving app responsiveness', clients/AGENTS.md §165

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@linear

linear Bot commented Mar 26, 2026

LUM-459 App Hanging: App hanging for at least 2000 ms.


Summary: Sentry reported the macOS app hanging for >= 2000ms (likely an ANR / main-thread stall) in vellum-assistant-macos.

Key Context:

Generated by Linear Issue Context Agent

LUM-486 Convert bootstrap/connection polling to reactive Combine observation

Context

Parent: LUM-459 (App Hanging: App hanging for at least 2000 ms)

Problem

performInitialBootstrap() in AppDelegate+Bootstrap.swift runs an indefinite while loop on @MainActor that polls connectionManager.isConnected with exponential backoff (500ms → 10s). Similarly, awaitDaemonReady() polls at 500ms intervals. These polling loops contribute to main-thread pressure, especially after Mac sleep/wake when multiple @MainActor tasks resume simultaneously.

Acceptance Criteria

  1. Replace the polling loop in performInitialBootstrap() with a reactive wait using connectionManager.$isConnected.values (Combine AsyncSequence bridge)
  2. Replace the polling loop in awaitDaemonReady() with reactive observation + timeout using withTaskGroup
  3. Verify that bootstrap still works correctly when connection becomes available
  4. Verify cancellation behavior is preserved
  5. No regressions in connection establishment flow

Files to Modify

  • clients/macos/vellum-assistant/App/AppDelegate+Bootstrap.swift

Technical Approach

  • Use connectionManager.$isConnected.values (AsyncPublisher from Combine) to reactively wait for connection state changes instead of polling
  • For awaitDaemonReady(), use withTaskGroup to race reactive observation against a timeout
  • This eliminates unnecessary main-actor wakeups every 500ms-10s

LUM-459 App hang: bootstrap polling loops + Combine AsyncPublisher cancellation defect cause main-thread stalls ≥ 2000 ms

Summary

Sentry ANR VELLUM-ASSISTANT-MACOS-8Z — the macOS app hangs for ≥ 2000 ms (first seen 2026-03-25 in vellum-macos@0.5.8).

Root Causes

1. Main-actor polling loops in bootstrap code

awaitDaemonReady() polls connectionManager.isConnected every 500 ms and performInitialBootstrap() polls with exponential backoff (500 ms → 10 s), both on @MainActor. After Mac sleep/wake, these loops resume alongside health checks, token refresh, reconnection, and SSE — all serialized through the main actor — exceeding the 2 s hang threshold.

2. Combine's AsyncPublisher (.values) does not cooperate with Swift task cancellation

The initial fix replaced polling with Combine's $isConnected.values (AsyncPublisher) inside withTaskGroup. However, AsyncPublisher has a known defect: a cancelled for await on .values never terminates. Since withTaskGroup waits for all children to complete before returning (per Apple's structured concurrency contract), the group hangs indefinitely when cancelAll() is called but the .values child never exits.

This is a known issue in the Swift ecosystem — Apple engineer Philippe Hausler confirmed it as a bug (FB9700937). The AsyncPublisher iterator does not properly respond to task cancellation or iterator deallocation.

Fix

PR: #21686

  1. Replace polling loops with reactive observation: awaitDaemonReady() and performInitialBootstrap() now suspend via AsyncStream instead of polling. Zero CPU while waiting; instant reaction on state change.
  2. Use AsyncStream instead of AsyncPublisher — Added GatewayConnectionManager.isConnectedStream, a cancellation-safe AsyncStream<Bool> that bridges $isConnected via Combine's .sink + onTermination. Per Apple's AsyncStream documentation, onTermination receives a .cancelled case when the consuming task is cancelled, and the iterator returns nil — allowing withTaskGroup children to exit cleanly.

Why AsyncStream (Apple's recommended pattern)

Apple's AsyncStream documentation states: "An asynchronous stream is well-suited to adapt callback- or delegation-based APIs to participate with async-await." The onTermination handler with .cancelled case is the designed-for mechanism for cooperative cancellation cleanup.

Why Combine (not Observation framework)

GatewayConnectionManager is an ObservableObject with @Published properties. Migrating to @Observable would touch 50+ SwiftUI views and is a separate effort. Per Apple's migration guide, the existing architecture is valid — bridging via .sink into AsyncStream is the correct async-safe approach for ObservableObject.

Files Changed

  • clients/macos/vellum-assistant/App/AppDelegate+Bootstrap.swift — replaced polling loops with reactive AsyncStream observation
  • clients/shared/Network/GatewayConnectionManager.swift — added isConnectedStream property

Original Sentry context:

  • Sentry issue: VELLUM-ASSISTANT-MACOS-8Z (first seen 2026-03-25)
  • First affected version: vellum-macos@0.5.8
  • Event link: Sentry event

LUM-486 Replace bootstrap polling loops with cancellation-safe AsyncStream observation

Context

Parent: LUM-459 (App hang: bootstrap polling loops + Combine AsyncPublisher cancellation defect)

Problem

Two issues in AppDelegate+Bootstrap.swift:

1. Main-actor polling loops

performInitialBootstrap() runs an indefinite while loop on @MainActor that polls connectionManager.isConnected with exponential backoff (500 ms → 10 s). Similarly, awaitDaemonReady() polls at 500 ms intervals. These polling loops contribute to main-thread pressure, especially after Mac sleep/wake when multiple @MainActor tasks resume simultaneously.

2. Combine's AsyncPublisher does not cooperate with task cancellation

The naive fix of replacing polling with $isConnected.values (Combine's AsyncPublisher) inside withTaskGroup causes the app to hang indefinitely. This is because AsyncPublisher's iterator does not respond to task cancellation — when cancelAll() is called, the for await on .values never terminates, and withTaskGroup waits for all children forever.

This is a known Swift bug (FB9700937), confirmed by Apple engineer Philippe Hausler.

Solution

Use AsyncStream (Apple's recommended bridge for callback/delegation APIs → async/await) instead of AsyncPublisher:

  1. Add GatewayConnectionManager.isConnectedStream — an AsyncStream<Bool> that bridges $isConnected via Combine's .sink with an onTermination handler that cancels the subscription
  2. Replace polling in awaitDaemonReady() with withTaskGroup racing isConnectedStream against a timeout
  3. Replace polling in performInitialBootstrap() with a new awaitConnectionEstablished() helper that suspends via isConnectedStream

Per Apple's AsyncStream documentation, onTermination receives .cancelled when the consuming task is cancelled, and the iterator returns nil — allowing structured concurrency groups to complete promptly.

Acceptance Criteria

  1. Replace the polling loop in performInitialBootstrap() with reactive wait
  2. Replace the polling loop in awaitDaemonReady() with reactive observation + timeout using withTaskGroup
  3. Use AsyncStream (not AsyncPublisher) for cancellation safety
  4. Verify bootstrap works correctly when connection becomes available
  5. Verify no app hang on message send or during bootstrap
  6. Verify cancellation/timeout behavior is preserved

Files Modified

  • clients/macos/vellum-assistant/App/AppDelegate+Bootstrap.swift
  • clients/shared/Network/GatewayConnectionManager.swift

PR

#21686

@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.


GatewayHTTPClient is a stateless enum with only static methods for HTTP
operations (get, post, patch, put, delete, download, stream). None of
these methods touch UI state — they construct URLs, build requests, and
execute network calls via URLSession.

The type-level @MainActor annotation forces ALL callers to serialize
their network I/O through the main actor, which directly contributes
to the ANR reported in LUM-459. After Mac sleep/wake, the main actor
becomes a bottleneck as health checks, bootstrap, SSE, and credential
refresh all queue behind each other.

Removing @MainActor allows network operations to run on any executor,
reducing main-thread pressure. All existing callers (SwiftUI views,
@MainActor stores) continue to work — they just no longer force an
unnecessary actor hop for pure HTTP work.

Refs: LUM-459, LUM-487
Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@devin-ai-integration devin-ai-integration Bot changed the title fix: replace bootstrap polling loops with reactive Combine observation (LUM-459, LUM-486) fix: replace bootstrap polling with reactive Combine observation, remove @MainActor from GatewayHTTPClient (LUM-459) Mar 26, 2026
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration Bot and others added 2 commits March 26, 2026 16:11
After removing @MainActor from GatewayHTTPClient, resolveConnection()
can no longer synchronously access AuthService.shared.baseURL (which is
@MainActor-isolated). The correct fix per Apple's concurrency model:

- resolveConnection() → async throws (awaits AuthService.shared.baseURL)
- isConnectionManaged() → async throws (cascaded from resolveConnection)
- buildURL() → async throws (cascaded from resolveConnection)
- handleAuthenticationFailure() → async (cascaded from isConnectionManaged)
- Update all callers with await

All high-level API methods (get, post, stream, etc.) were already async,
so this change only affects the internal resolution path. No API surface
changes for external callers.
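
The cascade above can be illustrated with a small sketch. AuthServiceSketch, HTTPClientSketch, InvalidURL, and the placeholder base URL are all hypothetical; only the idea (a non-isolated static function must await main-actor state, and every caller on the path becomes async) comes from the commit message.

```swift
import Foundation

// Hypothetical stand-in for the @MainActor-isolated AuthService.
@MainActor
final class AuthServiceSketch {
    static let shared = AuthServiceSketch()
    var baseURL = "https://example.invalid"
}

struct InvalidURL: Error {}

// Hypothetical stand-in for a stateless client with no actor isolation.
enum HTTPClientSketch {
    /// Reading main-actor state from outside the actor requires a hop, hence `await`.
    static func resolveConnection() async -> String {
        await AuthServiceSketch.shared.baseURL
    }

    /// The async requirement cascades to every function on the resolution path.
    static func buildURL(path: String) async throws -> URL {
        let base = await resolveConnection()
        guard let url = URL(string: base + path) else { throw InvalidURL() }
        return url
    }
}
```

Only the brief read of baseURL touches the main actor here; the rest of the function can run on any executor.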

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
Address Devin Review feedback: awaitConnectionEstablished() accesses
connectionManager.$isConnected which is a @Published property on the
@MainActor-isolated GatewayConnectionManager. Adding explicit @MainActor
ensures safety regardless of whether AppDelegate inherits @MainActor
from NSApplicationDelegate, matching the pattern used in
awaitDaemonReady().

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@devin-ai-integration devin-ai-integration Bot changed the title fix: replace bootstrap polling with reactive Combine observation, remove @MainActor from GatewayHTTPClient (LUM-459) fix: eliminate 2s+ ANR on sleep/wake by replacing polling with reactive observation Mar 26, 2026
Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@devin-ai-integration devin-ai-integration Bot changed the title fix: eliminate 2s+ ANR on sleep/wake by replacing polling with reactive observation fix: eliminate 2s+ ANR on sleep/wake by replacing bootstrap polling with reactive observation Mar 26, 2026
devin-ai-integration Bot and others added 3 commits March 26, 2026 17:32
Combine's AsyncPublisher (.values) does not cooperate with Swift task
cancellation — a cancelled 'for await' on .values never terminates,
which causes withTaskGroup to hang (it waits for all children before
returning). Replace with an AsyncStream bridge that properly returns
nil when the consuming task is cancelled, allowing structured
concurrency groups to complete promptly.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
…story

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@devin-ai-integration devin-ai-integration Bot changed the title fix: eliminate 2s+ ANR on sleep/wake by replacing bootstrap polling with reactive observation fix: replace bootstrap polling loops with reactive AsyncStream observation Mar 26, 2026
@ashleeradka ashleeradka merged commit f34b591 into main Mar 26, 2026
4 checks passed
@ashleeradka ashleeradka deleted the devin/LUM-486-1774536721 branch March 26, 2026 20:55