Defer and coalesce setConnected writes in GatewayConnectionManager#29899
Conversation
Schedule `setConnected` writes onto the next `@MainActor` turn behind a single in-flight `Task`, so back-to-back calls within the same turn collapse to the most-recent target and the synchronous call chain is broken between the caller and the @observable willSet fan-out. `connect()` drains the in-flight task before returning so async callers see the new value synchronously after `await`. `disconnect()` flushes inline so synchronous callers do too. Also reads the macro-synthesised `_<propertyName>` backing storage in the read-then-write guards on isConnected, isConnecting, isUpdateInProgress, versionMismatch, dismissedMismatchKey, assistantVersion, and keyFingerprint, so those guards don't register a tracking dependency on the calling withObservationTracking context. Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
🟡 setConnectedTask not cancelled in deinit, violating AGENTS.md rule
The PR introduces a new @ObservationIgnored private var setConnectedTask: Task<Void, Never>? at clients/shared/Network/GatewayConnectionManager.swift:983, but the deinit at line 1121 does not cancel it. The AGENTS.md rule at clients/AGENTS.md:144 is explicit: "Always explicitly cancel unstructured Task {} in deinit — do not rely solely on [weak self] cleanup, as the task continues running until its next cancellation check." The existing deinit already cancels autoWakeTask and reconnectionTask but omits the newly added setConnectedTask. While the task uses [weak self] and would exit harmlessly after deallocation, the rule specifically warns against relying on that pattern alone.
(Refers to lines 1121-1128)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in d0e5bae — setConnectedTask?.cancel() added to deinit alongside the existing autoWakeTask / reconnectionTask cancellations.
There was a problem hiding this comment.
🟡 updateAuthFailedSignal reads isAuthFailed via synthesised getter instead of backing storage
This PR adds the AGENTS.md rule at clients/AGENTS.md:220: "Read the macro-synthesised _<propertyName> backing storage when guarding a self-write to an @Observable property." The PR consistently applies this pattern to isUpdateInProgress, assistantVersion, versionMismatch, dismissedMismatchKey, keyFingerprint, isConnected, and isConnecting. However, updateAuthFailedSignal() at line 432 still reads isAuthFailed via the public getter before conditionally writing it at line 433 — this registers a tracking dependency through _$observationRegistrar.access. The fix is to use _isAuthFailed in the guard, matching the pattern applied everywhere else in this PR.
(Refers to line 432)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
Fixed in d0e5bae — updateAuthFailedSignal now guards on _isAuthFailed (backing storage), matching the pattern used by the other read-then-write sites in this PR.
…iledSignal guard Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
There was a problem hiding this comment.
✦ APPROVE
Value: Closes LUM-1428 (the setConnected recursive observation cycle I dispatched to Devin earlier today, MACOS-DH — 9 events / 7 users firing on v0.7.3). Same root-cause family as #29898 (ConversationListStore cascade), but a different vector: synchronous fan-out from a state-machine setter rather than O(N²) collection invalidation.
What this does — three-part fix:
-
setConnectedbecomes defer + coalesce. Stores target inpendingConnectedTarget, schedules a single in-flightTask<Void, Never>?(setConnectedTask) onto the next@MainActorturn. Back-to-back calls within the same turn collapse to the most-recent target — intermediatewillSetfan-outs (which can hit tens-to-low-hundreds ofwithObservationTrackingcallbacks per write) are skipped entirely. The task body loops overpendingConnectedTargetinstead of recursing, so anyonChangehandler that re-enters the setter is drained by iteration. Actual write + side effects (daemonDidReconnect,handlePostSparkleUpdate,autoWakeIfAssistantDied) move to a privateapplyConnectedTransition(_:). -
Backing-storage reads (
_<propertyName>) for self-write guards. Reading the synthesised getter calls_$observationRegistrar.access(_:keyPath:)and registers a tracking dependency on whateverwithObservationTrackingcontext is active on the current turn — including the one that just dispatched into the method. Reading_<propertyName>is a pure storage compare. Applied tosetUpdateInProgress,performHealthCheck,handleDaemonVersionChanged,checkVersionCompatibility(bothdismissedMismatchKeyreads),handleServerMessage(assistant-version, update-in-progress, key-fingerprint),applyConnectedTransition(isConnecting), andupdateAuthFailedSignal(after Devin's catch). -
Synchronous-read invariants preserved at the public API boundary.
connect()is async → callsawait drainPendingConnectionState()afterconnectImplso callers can readisConnectedimmediately afterawait connect().disconnect()is sync → callsflushPendingConnectionStateSync()afterdisconnectInternalso callers seeisConnected == falseon return. Existing connect/disconnect tests still pass unchanged.
Checked against our KB / AGENTS.md:
- Class is
@Observable @MainActor✅ —pendingConnectedTargetreads/writes are actor-isolated, no concurrency hazard. - Test class is
@MainActor✅ — syncsetConnectedcalls accumulate before any task can run, soXCTAssertFalse(client.isConnected)between back-to-back calls andawait drain()is deterministic. setConnectedTaskis@ObservationIgnored✅ — required fordeinitaccess on@MainActor @Observableclasses (swift#79551).deinitcancelssetConnectedTaskalongside the existingautoWakeTask/reconnectionTask✅ (Devin's first catch, fixed at HEAD).updateAuthFailedSignalnow uses_isAuthFailed✅ (Devin's second catch, fixed at HEAD).- Swept the whole class for any remaining non-underscore read-then-write sites — clean. The pattern is applied consistently.
- New AGENTS.md bullets are accurate, link primary sources, and codify the patterns for future contributors. The
Task<Void, Never>?coalescing rule complements the existing 100ms-coalescing-window rule (different problem class — state-machine writes vs streaming token writes).
One observation, not a blocker: the task body's defer { self.setConnectedTask = nil } clears the field unconditionally. If flushPendingConnectionStateSync were called while a task is queued (cancels + nils + drains synchronously), then a new setConnected synchronously schedules task B, then the cancelled task A finally runs — task A's defer would clear task B's reference. Worst case: brief double-task scheduling, both drain correctly via the loop, no state corruption. Realistically this requires a synchronous setConnected between disconnect() returning and the next actor yield, which doesn't happen in current call sites. If we ever get bitten, the fix is task-identity comparison: capture let myTask = Task { ... } and defer { if self.setConnectedTask === myTask { self.setConnectedTask = nil } }. Mentioning so it's on record.
Tests are good: the testSetConnectedDefersWritesAndCoalescesToFinalTarget assertion between the sync calls and the await drain() is the correct shape — proves the write didn't apply within the same actor turn. The _test* hooks are minimal and clearly marked production-must-not-call.
Bot status: Codex 👍 ✅. Devin found 2 issues, both fixed at HEAD d0e5bae17b (deinit cancel + _isAuthFailed guard). Devin still mentions "4 additional findings" in the Devin Review UI that I can't see via API — worth a quick scan before merging. CI: Socket Security ✅, FlexFrame Lint ✅, macOS Build/Tests SKIPPED (expected for this branch pattern).
…eDaemonDidReconnect Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
There was a problem hiding this comment.
✦ APPROVE (re-review on c794eb1a23)
Pure refactor commit, zero behavioral change. Sweep confirmed.
What changed in c794eb1a23:
- DRY'd the drain loop into
applyAllPendingConnectedTransitions(). The pattern that lived in three places (setConnected's task body,flushPendingConnectionStateSync, and the loop body) is now a single helper. Both the task body andflushPendingConnectedTransitionscall it. Drain semantics are byte-identical:while let target = pendingConnectedTarget { ... guard _isConnected != target else { continue } ... }. - Renamed helpers for naming consistency:
flushPendingConnectionStateSync→flushPendingConnectedTransitionsdrainPendingConnectionState→awaitPendingConnectedTransitions_testWaitForPendingConnectionState→_testAwaitPendingConnectedTransitions
The verb pairs (flush= sync,await= async,applyAll= the actual work) read better.
- Extracted
scheduleDaemonDidReconnect()from inlineTask { @MainActor in }inapplyConnectedTransition. Same body, same[weak self]capture, sameself.isConnectedre-check. Method extraction makes the side-effect intent explicit. - AGENTS.md prose trimmed. The two new bullets I noted in my first review have been compressed; primary-source links to Apple Observation docs / WWDC21 / WWDC23 are preserved. The longer prose for both rules now lives as section comments in
GatewayConnectionManager.swiftitself, which is arguably a better home (closer to the implementation that exemplifies the rule). - Redundant inline comments removed. The "Read backing storage so this guard doesn't register a tracking dependency" comments at five sites have been removed because the AGENTS.md rule covers it once. Tighter, less noise.
- Test renames track the helper renames. Test bodies and assertions unchanged.
Verified by sweep:
- ✅ All 7 self-write guards still use
_<propertyName>backing storage (_isUpdateInProgress,_assistantVersion,_versionMismatch,_dismissedMismatchKey×2,_keyFingerprint,_isConnected,_isConnecting,_isAuthFailed). - ✅
deinitstill cancelssetConnectedTask. - ✅
updateAuthFailedSignalstill guards on_isAuthFailed. - ✅ Sweep for any remaining non-underscore self-read in guard/if patterns: clean.
- ✅
setConnectedearly-out still uses_isConnected. Task body still uses[weak self]+defer { self.setConnectedTask = nil }. - ✅ Sync-read invariants preserved:
connect()awaitsawaitPendingConnectedTransitions(),disconnect()callsflushPendingConnectedTransitions().
Observation from my first review still applies (still non-blocking): the task body's defer { self.setConnectedTask = nil } is unconditional, so a sequence of flushPendingConnectedTransitions (cancels task A, clears reference, drains) → synchronous setConnected → schedules task B → cancelled task A finally runs → its defer clears task B's reference. Worst case is brief double-task scheduling; no state corruption. The refactor preserves this exactly. If we ever take a fix, the cleanest shape is task-identity comparison in the defer or an early Task.isCancelled check followed by a conditional setConnectedTask = nil.
Nit (not a blocker): scheduleDaemonDidReconnect reads self.isConnected (synthesised getter) inside the Task body. Behaviourally fine — the Task runs on a fresh @MainActor turn outside any withObservationTracking context, so the access call is a no-op for tracking purposes. But for visual consistency with the rest of the PR's pattern, _isConnected would be defensible. Skip if you want to ship.
Status: Codex 👍 still in. Devin's two flagged issues remain resolved (deinit cancel + _isAuthFailed). CI on c794eb1a23: Socket Security ✅, FlexFrame Lint ✅, macOS Build/Tests SKIPPED (expected). The 4 "additional findings" Devin shows in its UI but doesn't expose via API (https://app.devin.ai/review/vellum-ai/vellum-assistant/pull/29899) are still worth a glance before merge.
Re-approving on c794eb1a23.
Why
GatewayConnectionManager.setConnectedwrites@Observableproperties (isConnected,isConnecting) directly inside its caller's stack frame. Each write fires every registeredwithObservationTrackingonChangecallback synchronously insidewillSet. With per-instance SwiftUI bodies, multipleobservationStreamconsumers, and.onChange(of:)modifiers in this codebase, the tracking-context count onisConnectedreaches the tens-to-low-hundreds. Under memory pressure this fan-out has been observed to stall@MainActorfor >2 s during health-check transitions.The guards inside
setConnectedand other read-then-write methods also read@Observableproperties through the macro-synthesised getter, which calls_$observationRegistrar.access(_:keyPath:)and registers a tracking dependency on whateverwithObservationTrackingcontext is active on the current turn — including the one that just dispatched into the method.What changed
setConnectedis now defer + coalesce:pendingConnectedTargetand schedules a single in-flightTask<Void, Never>?(setConnectedTask) that runs on the next@MainActorturn.willSetfan-outs are skipped).pendingConnectedTargetso any re-entrant queued targets (e.g. from a synchronousonChangehandler) are drained by iteration rather than recursion.isConnected =,isConnecting = false,daemonDidReconnect,handlePostSparkleUpdate,autoWakeIfAssistantDied) move into a new privateapplyConnectedTransition(_:).Synchronous-read invariants preserved at the public-API boundary:
connect()(async) callsawait drainPendingConnectionState()afterconnectImplso callers can readisConnectedimmediately afterawait connect().disconnect()(sync) callsflushPendingConnectionStateSync()afterdisconnectInternalso callers can readisConnected == falseimmediately on return.Backing-storage reads in same-turn guards. Self-write guards now read
_<propertyName>instead of going through the synthesised getter, so the guard is a pure storage compare and registers no tracking dependency on the calling context. Applied to:setUpdateInProgress,performHealthCheck(assistant-version guard),handleDaemonVersionChanged,checkVersionCompatibility,handleServerMessage(assistant-version, update-in-progress, key-fingerprint guards), andapplyConnectedTransition(isConnectingguard).Tests — two new tests in
GatewayConnectionManagerTestscovering: (a) deferral + coalescing of rapid alternatingsetConnectedcalls within one actor turn, and (b) coalescing back to the original value as a no-op. Existing connect/disconnect tests are unchanged because the public-API boundary preserves synchronous-read semantics. Twointernalhooks (_testSetConnected,_testWaitForPendingConnectionState) added so unit tests can drive the private setter without standing up a URLProtocol fake.AGENTS.md — two new bullets under "High-Frequency Updates":
_<propertyName>backing storage when guarding a self-write to an@Observableproperty.Task<Void, Never>?.Safety
connect()anddisconnect()retain their existing post-call read invariants (drain / flush).applyConnectedTransitionhas the same body as the oldsetConnected(write + same conditional side effects in the same order), just gated through the coalescing task.onChangehandler — the path that causes recursive cycles in observation. Re-entrant queued targets are drained by the task loop, not by recursive call frames.Alternatives considered and rejected
_isConnectedin the guard only (no defer): cosmetic fix only — does not addresswillSetfan-out cost and matches the rejected alternative in #28694's design table ("Already exists. The hang occurs during legitimate state transitions, not redundant writes").@ObservationIgnored+ manual notification onisConnected: breaks the@Observablecontract; SwiftUI views would no longer track changes through.onChange(of:)/ property reads.Task.detachedfor fan-out: incorrect — observers must run on@MainActor; we want to break the synchronous chain, not the actor.observationStreamconsumers behind a sharedAsyncStream; splitGatewayConnectionManagerinto smaller@Observableclasses): real architectural smells but multi-week scope; out of scope here. They reduce N (the per-write fan-out cost) — the defer-and-coalesce fix is orthogonal and would still apply.References
ObservablewithObservationTracking(_:onChange:)NotificationCenter.post(name:object:)Root cause analysis
How did the code get into this state? The class was originally
ObservableObject+@Published(PR [LUM-745] Guard @Published property writes against no-op mutations to prevent SwiftUI attribute graph hang #24556 added theisConnected != connectedno-op guard). PR [LUM-745] Migrate GatewayConnectionManager to @Observable to fix re-entrant attribute graph hang #25496 migrated to@Observableand inherited the read-then-write guard verbatim. PR fix(network): Defer daemonDidReconnect to next main-actor turn to prevent 2s hang under memory pressure #28694 reduced one source of synchronous fan-out by deferring thedaemonDidReconnectNotificationCenter.post. The remaining fan-out —willSetitself, fired by every@Observableproperty write — was never addressed.What mistakes / decisions led to it? The
ObservableObject → @Observablemigration treated the macros as drop-in equivalents, but theirwillSetsemantics differ:@Publishedshort-circuits when the new value equals the old;@Observablealways fireswillSet/didSet. The guard that was sufficient under@Publishedhas no equivalent benefit under@Observable.Warning signs? Sentry MACOS-DH (2 s+ main-thread stalls) re-fired after PR fix(network): Defer daemonDidReconnect to next main-actor turn to prevent 2s hang under memory pressure #28694 merged, indicating the fan-out cost itself — not just the
NotificationCenterpost — was the bottleneck. The Sentry stack alternated betweensetConnectedandisConnected.getterending in_swift_getKeyPath, which is consistent with synchronous re-entry through anonChangehandler.Prevention. Treat
@Observableself-writes as fan-out points: read backing storage in guards, and defer/coalesce writes that come from periodic loops or external event sources. Both are now codified as AGENTS.md rules with links to authoritative sources.AGENTS.md change. Added two new bullets under "High-Frequency Updates" — see the AGENTS.md diff in this PR.
Test plan
GatewayConnectionManagerTests:testSetConnectedDefersWritesAndCoalescesToFinalTarget— drives 5 rapid alternating targets within one actor turn, asserts the property is unchanged synchronously, drains, asserts only the final target was applied.testSetConnectedCoalescesBackToOriginalValueAsNoOp— drives a transient flap that returns to the prior value, asserts the property is unchanged after drain.swift teston macOS is required to verify the build before merging.Link to Devin session: https://app.devin.ai/sessions/2da0c092185640b284918ba17b129fa3
Requested by: @ashleeradka