fix(clients): own SSE URLSession inside the streaming Task#27292
Conversation
Moves the SSE URLSession into the Task that reads from it so no other MainActor caller can invalidate a session mid-await. Eliminates the TOCTOU race that crashed the process with an uncatchable NSGenericException from -[__NSURLSessionLocal taskForClassInfo:] when startSSE/stopSSE/token rotation interleaved with URLSession.bytes(for:). Closes LUM-1001 Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
✦ APPROVE
Value: Eliminates the LUM-1001 TOCTOU race that crashed the process with an uncatchable NSGenericException from -[__NSURLSessionLocal taskForClassInfo:] when startSSE/stopSSE/token rotation interleaved with URLSession.bytes(for:).
What changed: Session ownership moves from instance property (sseSession: URLSession?) to Task-local (let session inside the streaming Task + defer { session.invalidateAndCancel() }). Nothing outside the Task can reach the session, so the race's precondition — shared mutable reference — no longer exists.
Verified:
- ✅ Pattern matches PR #27250 (BtwClient) exactly — Task-local session with defer invalidation. The PR description explicitly cites #27250 as the correct pattern to retrofit
- ✅ Session passed to
GatewayHTTPClient.stream()viasession:param (line 310,session: session).stream()already accepts this param (GatewayHTTPClient.swift:460) - ✅
stopSSE()relies onsseTask?.cancel()— cancellation propagates throughURLSession.bytes(for:delegate:)'swithTaskCancellationHandlerto the data task, then the Task'sdeferinvalidates the session. Clean teardown chain - ✅ Back-to-back
startSSEStream()is safe — old Task cancelled → its defer fires → its session invalidated. New Task creates its own session. No shared state - ✅
deinitsimplified correctly — cancels tasks → their defers fire → sessions cleaned up - ✅ Removed dead code —
invalidateSSESession()method,sseSessionproperty, superseded-session guard (sseSession === session) - ✅ 5 lifecycle tests cover: rapid start/stop cycling (20x), back-to-back start, teardown after start, stop without start, dealloc while running
- ✅ Root cause analysis in PR description is exceptional — decision archaeology tracing through #25396, #25426, #27250, with clear prevention reasoning and AGENTS.md check
URLSession lifecycle family is now consistent:
EventStreamClient(this PR) — Task-local session, defer invalidationBtwClient(#27250) — Task-local session, defer invalidationGatewayHTTPClient.stream/streamPost/streamPostWithRetry— acceptsession:param, default.shared
All three crash tickets (LUM-820, LUM-903, LUM-1001) are now addressed with the same proven pattern.
Moves the SSE
URLSessioninto theTaskthat reads from it so no other@MainActorcaller can invalidate a session mid-await. This eliminates the TOCTOU race that crashed the process with an uncatchableNSGenericExceptionfrom-[__NSURLSessionLocal taskForClassInfo:]whenstartSSE/stopSSE/ token rotation interleaved withURLSession.bytes(for:).Closes LUM-1001.
Root cause analysis
EventStreamClientis@MainActorand storedsseSession: URLSession?as shared mutable state.GatewayHTTPClient.stream(...)isnonisolated async(intentional, from #21729 — network must not block main). At theawait, execution hops off@MainActor; any interleaved@MainActorturn (token rotation, reconnect, explicit stop+start) could callinvalidateSSESession()before the concurrent-executor resumption calledbytes(for:). That resumption synchronously creates a data task — which on an invalidated session raises an uncatchable ObjC exception.sseSession === sessionguard that only closed the window before the Task started, not the window opened by theawait. Both retained shared ownership of the session across@MainActorcallers.BtwClientrecently landed (#27250) using the correct pattern — per-call session local to the Task withdefer { session.invalidateAndCancel() }. Retrofitting SSE to the same pattern was the obvious follow-up.sseTask?.cancel(), whichURLSession.bytes(for:delegate:)honors viawithTaskCancellationHandler.@MainActor Isolation Boundariessection already captures the relevant principle ("keep mutable state on the main actor; offload only the expensive computation"). No new rule needed — a targeted rule like "don't storeURLSessionas an instance property for a single-stream call" would be narrow enough to rot. The existing rule plus the in-file comment citing LUM-1001 is the right level.Holistic changes in this PR
sseSession: URLSession?andinvalidateSSESession()— the shared-ownership bug.startSSEStream— dead code once ownership is fixed.stopSSEnow relies onsseTask?.cancel()alone; session teardown runs via the Task'sdefer.deinitnow cancelstokenRotationTask,sseReconnectTask, andsseTaskexplicitly, perclients/AGENTS.md"Always cancel subscriptions and tasks" — previously only the session was invalidated, leaving the reconnect/token-rotation tasks running with[weak self]closures until their sleep completed.EventStreamClientLifecycleTestsexercising back-to-backstartSSE/stopSSE, idempotentstartSSE,teardownafterstartSSE,stopSSEwithoutstartSSE, and dealloc while running.Alternatives considered
SafeAsyncBytes+ ObjC exception trampoline (#26281, approved but closed). ConvertsSIGABRTtoURLError(.cancelled)but leaves the shared-session architecture intact — defense-in-depth on top of a broken architecture. ~400 LOC of ObjC module + delegate + NSLock-serialized backpressure to maintain. Not needed once ownership is fixed: with this PR,dataTaskWithRequest:is never called on an invalidated session, so the exception path is unreachable.session.dataTask(with:).state). Rejected — doesn't close the window; the race can occur between probe andbytes(for:).GatewayHTTPClient.stream@MainActor. Rejected — deliberately reversed by #21729; SSE backpressure would stall the UI.EventStreamClienttoactoror lock around the session. Rejected — actor reentrancy atawaitpreserves the same TOCTOU window; locks can't be held acrossawait.Why this is safe
EXC_BAD_ACCESS#25396 fixed. That crash requiredTask.cancel()to race with anAsyncBytesiterator onURLSession.shared. Here each stream owns a dedicated session with exactly one data task;URLSession.bytesserializes cancel-vs-iterator teardown internally viawithTaskCancellationHandler, and the session is invalidated bydeferonly after iteration exits.BtwClientsibling case has been running this exact pattern in production since #27250 merged on 2026-04-21 with no crash reports.startSSEStream, which cancels the old Task before spawning a new one — the old Task'sdefertears down its session. No change in externally observable reconnect behavior.EventStreamClientis the only caller ofGatewayHTTPClient.stream(session:), and the only place inclients/that stored aURLSessionas a long-lived property for a single-stream call.Apple references checked (2026-04-21):
URLSession.bytes(for:delegate:),withTaskCancellationHandler, WWDC21 — Use async/await with URLSession, WWDC25 — Embracing Swift concurrency, WWDC21 — Protect mutable state with Swift actors.Link to Devin session: https://app.devin.ai/sessions/407502b1cf444309a4dacb287c0cfaeb
Requested by: @ashleeradka