Skip to content

Defer and coalesce setConnected writes in GatewayConnectionManager#29899

Merged
ashleeradka merged 3 commits into
mainfrom
devin/1778178865-lum-1428-defer-coalesce-setconnected
May 7, 2026
Merged

Defer and coalesce setConnected writes in GatewayConnectionManager#29899
ashleeradka merged 3 commits into
mainfrom
devin/1778178865-lum-1428-defer-coalesce-setconnected

Conversation

@ashleeradka
Copy link
Copy Markdown
Contributor

@ashleeradka ashleeradka commented May 7, 2026

Why

GatewayConnectionManager.setConnected writes @Observable properties (isConnected, isConnecting) directly inside its caller's stack frame. Each write fires every registered withObservationTracking onChange callback synchronously inside willSet. With per-instance SwiftUI bodies, multiple observationStream consumers, and .onChange(of:) modifiers in this codebase, the tracking-context count on isConnected reaches the tens-to-low-hundreds. Under memory pressure this fan-out has been observed to stall @MainActor for >2 s during health-check transitions.

The guards inside setConnected and other read-then-write methods also read @Observable properties through the macro-synthesised getter, which calls _$observationRegistrar.access(_:keyPath:) and registers a tracking dependency on whatever withObservationTracking context is active on the current turn — including the one that just dispatched into the method.

What changed

setConnected is now defer + coalesce:

  • Stores the target in pendingConnectedTarget and schedules a single in-flight Task<Void, Never>? (setConnectedTask) that runs on the next @MainActor turn.
  • Back-to-back calls within the same turn collapse to the most-recent target (intermediate willSet fan-outs are skipped).
  • The task body loops over pendingConnectedTarget so any re-entrant queued targets (e.g. from a synchronous onChange handler) are drained by iteration rather than recursion.
  • The actual write + side effects (isConnected =, isConnecting = false, daemonDidReconnect, handlePostSparkleUpdate, autoWakeIfAssistantDied) move into a new private applyConnectedTransition(_:).

Synchronous-read invariants preserved at the public-API boundary:

  • connect() (async) calls await drainPendingConnectionState() after connectImpl so callers can read isConnected immediately after await connect().
  • disconnect() (sync) calls flushPendingConnectionStateSync() after disconnectInternal so callers can read isConnected == false immediately on return.

Backing-storage reads in same-turn guards. Self-write guards now read _<propertyName> instead of going through the synthesised getter, so the guard is a pure storage compare and registers no tracking dependency on the calling context. Applied to: setUpdateInProgress, performHealthCheck (assistant-version guard), handleDaemonVersionChanged, checkVersionCompatibility, handleServerMessage (assistant-version, update-in-progress, key-fingerprint guards), and applyConnectedTransition (isConnecting guard).

Tests — two new tests in GatewayConnectionManagerTests covering: (a) deferral + coalescing of rapid alternating setConnected calls within one actor turn, and (b) coalescing back to the original value as a no-op. Existing connect/disconnect tests are unchanged because the public-API boundary preserves synchronous-read semantics. Two internal hooks (_testSetConnected, _testWaitForPendingConnectionState) added so unit tests can drive the private setter without standing up a URLProtocol fake.

AGENTS.md — two new bullets under "High-Frequency Updates":

  1. Read _<propertyName> backing storage when guarding a self-write to an @Observable property.
  2. Defer + coalesce high-frequency state-machine writes via single in-flight Task<Void, Never>?.

Safety

  • connect() and disconnect() retain their existing post-call read invariants (drain / flush).
  • applyConnectedTransition has the same body as the old setConnected (write + same conditional side effects in the same order), just gated through the coalescing task.
  • No public API surface change; the new helpers and storage are private.
  • The deferred / serialised execution defends against any synchronous re-entry from an onChange handler — the path that causes recursive cycles in observation. Re-entrant queued targets are drained by the task loop, not by recursive call frames.

Alternatives considered and rejected

  • Read _isConnected in the guard only (no defer): cosmetic fix only — does not address willSet fan-out cost and matches the rejected alternative in #28694's design table ("Already exists. The hang occurs during legitimate state transitions, not redundant writes").
  • @ObservationIgnored + manual notification on isConnected: breaks the @Observable contract; SwiftUI views would no longer track changes through .onChange(of:) / property reads.
  • Throttle / debounce window (50 ms+): introduces user-visible latency on legitimate transitions; the issue isn't update rate, it's the cost of a single fan-out.
  • Task.detached for fan-out: incorrect — observers must run on @MainActor; we want to break the synchronous chain, not the actor.
  • Architectural cleanups (consolidate the 5 observationStream consumers behind a shared AsyncStream; split GatewayConnectionManager into smaller @Observable classes): real architectural smells but multi-week scope; out of scope here. They reduce N (the per-write fan-out cost) — the defer-and-coalesce fix is orthogonal and would still apply.

References

Root cause analysis

  1. How did the code get into this state? The class was originally ObservableObject + @Published (PR [LUM-745] Guard @Published property writes against no-op mutations to prevent SwiftUI attribute graph hang #24556 added the isConnected != connected no-op guard). PR [LUM-745] Migrate GatewayConnectionManager to @Observable to fix re-entrant attribute graph hang #25496 migrated to @Observable and inherited the read-then-write guard verbatim. PR fix(network): Defer daemonDidReconnect to next main-actor turn to prevent 2s hang under memory pressure #28694 reduced one source of synchronous fan-out by deferring the daemonDidReconnect NotificationCenter.post. The remaining fan-out — willSet itself, fired by every @Observable property write — was never addressed.

  2. What mistakes / decisions led to it? The ObservableObject → @Observable migration treated the macros as drop-in equivalents, but their willSet semantics differ: @Published short-circuits when the new value equals the old; @Observable always fires willSet/didSet. The guard that was sufficient under @Published has no equivalent benefit under @Observable.

  3. Warning signs? Sentry MACOS-DH (2 s+ main-thread stalls) re-fired after PR fix(network): Defer daemonDidReconnect to next main-actor turn to prevent 2s hang under memory pressure #28694 merged, indicating the fan-out cost itself — not just the NotificationCenter post — was the bottleneck. The Sentry stack alternated between setConnected and isConnected.getter ending in _swift_getKeyPath, which is consistent with synchronous re-entry through an onChange handler.

  4. Prevention. Treat @Observable self-writes as fan-out points: read backing storage in guards, and defer/coalesce writes that come from periodic loops or external event sources. Both are now codified as AGENTS.md rules with links to authoritative sources.

  5. AGENTS.md change. Added two new bullets under "High-Frequency Updates" — see the AGENTS.md diff in this PR.

Test plan

  • New unit tests in GatewayConnectionManagerTests:
    • testSetConnectedDefersWritesAndCoalescesToFinalTarget — drives 5 rapid alternating targets within one actor turn, asserts the property is unchanged synchronously, drains, asserts only the final target was applied.
    • testSetConnectedCoalescesBackToOriginalValueAsNoOp — drives a transient flap that returns to the prior value, asserts the property is unchanged after drain.
  • Existing tests (connect/disconnect lifecycle, multiple cycles) cover the drain/flush invariants at the public-API boundary; unmodified.
  • CI on this repo skips macOS checks (no macOS build environment in CI), so a local Xcode build / swift test on macOS is required to verify the build before merging.

Link to Devin session: https://app.devin.ai/sessions/2da0c092185640b284918ba17b129fa3
Requested by: @ashleeradka


Open in Devin Review

Schedule `setConnected` writes onto the next `@MainActor` turn behind a
single in-flight `Task`, so back-to-back calls within the same turn
collapse to the most-recent target and the synchronous call chain is
broken between the caller and the @observable willSet fan-out.

`connect()` drains the in-flight task before returning so async callers
see the new value synchronously after `await`. `disconnect()` flushes
inline so synchronous callers do too.

Also reads the macro-synthesised `_<propertyName>` backing storage in
the read-then-write guards on isConnected, isConnecting, isUpdateInProgress,
versionMismatch, dismissedMismatchKey, assistantVersion, and keyFingerprint,
so those guards don't register a tracking dependency on the calling
withObservationTracking context.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@linear
Copy link
Copy Markdown

linear Bot commented May 7, 2026

LUM-1428

@devin-ai-integration
Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@ashleeradka ashleeradka requested a review from noanflaherty May 7, 2026 19:03
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 4 additional findings in Devin Review.

Open in Devin Review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 setConnectedTask not cancelled in deinit, violating AGENTS.md rule

The PR introduces a new @ObservationIgnored private var setConnectedTask: Task<Void, Never>? at clients/shared/Network/GatewayConnectionManager.swift:983, but the deinit at line 1121 does not cancel it. The AGENTS.md rule at clients/AGENTS.md:144 is explicit: "Always explicitly cancel unstructured Task {} in deinit — do not rely solely on [weak self] cleanup, as the task continues running until its next cancellation check." The existing deinit already cancels autoWakeTask and reconnectionTask but omits the newly added setConnectedTask. While the task uses [weak self] and would exit harmlessly after deallocation, the rule specifically warns against relying on that pattern alone.

(Refers to lines 1121-1128)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d0e5baesetConnectedTask?.cancel() added to deinit alongside the existing autoWakeTask / reconnectionTask cancellations.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 updateAuthFailedSignal reads isAuthFailed via synthesised getter instead of backing storage

This PR adds the AGENTS.md rule at clients/AGENTS.md:220: "Read the macro-synthesised _<propertyName> backing storage when guarding a self-write to an @Observable property." The PR consistently applies this pattern to isUpdateInProgress, assistantVersion, versionMismatch, dismissedMismatchKey, keyFingerprint, isConnected, and isConnecting. However, updateAuthFailedSignal() at line 432 still reads isAuthFailed via the public getter before conditionally writing it at line 433 — this registers a tracking dependency through _$observationRegistrar.access. The fix is to use _isAuthFailed in the guard, matching the pattern applied everywhere else in this PR.

(Refers to line 432)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d0e5baeupdateAuthFailedSignal now guards on _isAuthFailed (backing storage), matching the pattern used by the other read-then-write sites in this PR.

…iledSignal guard

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@ashleeradka ashleeradka requested a review from dvargasfuertes May 7, 2026 19:05
dvargasfuertes
dvargasfuertes previously approved these changes May 7, 2026
Copy link
Copy Markdown
Contributor

@vex-assistant-bot vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE

Value: Closes LUM-1428 (the setConnected recursive observation cycle I dispatched to Devin earlier today, MACOS-DH — 9 events / 7 users firing on v0.7.3). Same root-cause family as #29898 (ConversationListStore cascade), but a different vector: synchronous fan-out from a state-machine setter rather than O(N²) collection invalidation.

What this does — three-part fix:

  1. setConnected becomes defer + coalesce. Stores target in pendingConnectedTarget, schedules a single in-flight Task<Void, Never>? (setConnectedTask) onto the next @MainActor turn. Back-to-back calls within the same turn collapse to the most-recent target — intermediate willSet fan-outs (which can hit tens-to-low-hundreds of withObservationTracking callbacks per write) are skipped entirely. The task body loops over pendingConnectedTarget instead of recursing, so any onChange handler that re-enters the setter is drained by iteration. Actual write + side effects (daemonDidReconnect, handlePostSparkleUpdate, autoWakeIfAssistantDied) move to a private applyConnectedTransition(_:).

  2. Backing-storage reads (_<propertyName>) for self-write guards. Reading the synthesised getter calls _$observationRegistrar.access(_:keyPath:) and registers a tracking dependency on whatever withObservationTracking context is active on the current turn — including the one that just dispatched into the method. Reading _<propertyName> is a pure storage compare. Applied to setUpdateInProgress, performHealthCheck, handleDaemonVersionChanged, checkVersionCompatibility (both dismissedMismatchKey reads), handleServerMessage (assistant-version, update-in-progress, key-fingerprint), applyConnectedTransition (isConnecting), and updateAuthFailedSignal (after Devin's catch).

  3. Synchronous-read invariants preserved at the public API boundary. connect() is async → calls await drainPendingConnectionState() after connectImpl so callers can read isConnected immediately after await connect(). disconnect() is sync → calls flushPendingConnectionStateSync() after disconnectInternal so callers see isConnected == false on return. Existing connect/disconnect tests still pass unchanged.

Checked against our KB / AGENTS.md:

  • Class is @Observable @MainActor ✅ — pendingConnectedTarget reads/writes are actor-isolated, no concurrency hazard.
  • Test class is @MainActor ✅ — sync setConnected calls accumulate before any task can run, so XCTAssertFalse(client.isConnected) between back-to-back calls and await drain() is deterministic.
  • setConnectedTask is @ObservationIgnored ✅ — required for deinit access on @MainActor @Observable classes (swift#79551).
  • deinit cancels setConnectedTask alongside the existing autoWakeTask / reconnectionTask ✅ (Devin's first catch, fixed at HEAD).
  • updateAuthFailedSignal now uses _isAuthFailed ✅ (Devin's second catch, fixed at HEAD).
  • Swept the whole class for any remaining non-underscore read-then-write sites — clean. The pattern is applied consistently.
  • New AGENTS.md bullets are accurate, link primary sources, and codify the patterns for future contributors. The Task<Void, Never>? coalescing rule complements the existing 100ms-coalescing-window rule (different problem class — state-machine writes vs streaming token writes).

One observation, not a blocker: the task body's defer { self.setConnectedTask = nil } clears the field unconditionally. If flushPendingConnectionStateSync were called while a task is queued (cancels + nils + drains synchronously), then a new setConnected synchronously schedules task B, then the cancelled task A finally runs — task A's defer would clear task B's reference. Worst case: brief double-task scheduling, both drain correctly via the loop, no state corruption. Realistically this requires a synchronous setConnected between disconnect() returning and the next actor yield, which doesn't happen in current call sites. If we ever get bitten, the fix is task-identity comparison: capture let myTask = Task { ... } and defer { if self.setConnectedTask === myTask { self.setConnectedTask = nil } }. Mentioning so it's on record.

Tests are good: the testSetConnectedDefersWritesAndCoalescesToFinalTarget assertion between the sync calls and the await drain() is the correct shape — proves the write didn't apply within the same actor turn. The _test* hooks are minimal and clearly marked production-must-not-call.

Bot status: Codex 👍 ✅. Devin found 2 issues, both fixed at HEAD d0e5bae17b (deinit cancel + _isAuthFailed guard). Devin still mentions "4 additional findings" in the Devin Review UI that I can't see via API — worth a quick scan before merging. CI: Socket Security ✅, FlexFrame Lint ✅, macOS Build/Tests SKIPPED (expected for this branch pattern).

…eDaemonDidReconnect

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
Copy link
Copy Markdown
Contributor

@vex-assistant-bot vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE (re-review on c794eb1a23)

Pure refactor commit, zero behavioral change. Sweep confirmed.

What changed in c794eb1a23:

  1. DRY'd the drain loop into applyAllPendingConnectedTransitions(). The pattern that lived in three places (setConnected's task body, flushPendingConnectionStateSync, and the loop body) is now a single helper. Both the task body and flushPendingConnectedTransitions call it. Drain semantics are byte-identical: while let target = pendingConnectedTarget { ... guard _isConnected != target else { continue } ... }.
  2. Renamed helpers for naming consistency:
    • flushPendingConnectionStateSyncflushPendingConnectedTransitions
    • drainPendingConnectionStateawaitPendingConnectedTransitions
    • _testWaitForPendingConnectionState_testAwaitPendingConnectedTransitions
      The verb pairs (flush = sync, await = async, applyAll = the actual work) read better.
  3. Extracted scheduleDaemonDidReconnect() from inline Task { @MainActor in } in applyConnectedTransition. Same body, same [weak self] capture, same self.isConnected re-check. Method extraction makes the side-effect intent explicit.
  4. AGENTS.md prose trimmed. The two new bullets I noted in my first review have been compressed; primary-source links to Apple Observation docs / WWDC21 / WWDC23 are preserved. The longer prose for both rules now lives as section comments in GatewayConnectionManager.swift itself, which is arguably a better home (closer to the implementation that exemplifies the rule).
  5. Redundant inline comments removed. The "Read backing storage so this guard doesn't register a tracking dependency" comments at five sites have been removed because the AGENTS.md rule covers it once. Tighter, less noise.
  6. Test renames track the helper renames. Test bodies and assertions unchanged.

Verified by sweep:

  • ✅ All 7 self-write guards still use _<propertyName> backing storage (_isUpdateInProgress, _assistantVersion, _versionMismatch, _dismissedMismatchKey ×2, _keyFingerprint, _isConnected, _isConnecting, _isAuthFailed).
  • deinit still cancels setConnectedTask.
  • updateAuthFailedSignal still guards on _isAuthFailed.
  • ✅ Sweep for any remaining non-underscore self-read in guard/if patterns: clean.
  • setConnected early-out still uses _isConnected. Task body still uses [weak self] + defer { self.setConnectedTask = nil }.
  • ✅ Sync-read invariants preserved: connect() awaits awaitPendingConnectedTransitions(), disconnect() calls flushPendingConnectedTransitions().

Observation from my first review still applies (still non-blocking): the task body's defer { self.setConnectedTask = nil } is unconditional, so a sequence of flushPendingConnectedTransitions (cancels task A, clears reference, drains) → synchronous setConnected → schedules task B → cancelled task A finally runs → its defer clears task B's reference. Worst case is brief double-task scheduling; no state corruption. The refactor preserves this exactly. If we ever take a fix, the cleanest shape is task-identity comparison in the defer or an early Task.isCancelled check followed by a conditional setConnectedTask = nil.

Nit (not a blocker): scheduleDaemonDidReconnect reads self.isConnected (synthesised getter) inside the Task body. Behaviourally fine — the Task runs on a fresh @MainActor turn outside any withObservationTracking context, so the access call is a no-op for tracking purposes. But for visual consistency with the rest of the PR's pattern, _isConnected would be defensible. Skip if you want to ship.

Status: Codex 👍 still in. Devin's two flagged issues remain resolved (deinit cancel + _isAuthFailed). CI on c794eb1a23: Socket Security ✅, FlexFrame Lint ✅, macOS Build/Tests SKIPPED (expected). The 4 "additional findings" Devin shows in its UI but doesn't expose via API (https://app.devin.ai/review/vellum-ai/vellum-assistant/pull/29899) are still worth a glance before merge.

Re-approving on c794eb1a23.

@ashleeradka ashleeradka merged commit e33f7f8 into main May 7, 2026
7 checks passed
@ashleeradka ashleeradka deleted the devin/1778178865-lum-1428-defer-coalesce-setconnected branch May 7, 2026 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants