fix(clients): own SSE URLSession inside the streaming Task #27292

Merged
ashleeradka merged 1 commit into main from devin/lum-1001-1776809031-own-session-in-sse-task
Apr 21, 2026
@devin-ai-integration devin-ai-integration Bot commented Apr 21, 2026

Moves the SSE URLSession into the Task that reads from it so no other @MainActor caller can invalidate a session mid-await. This eliminates the TOCTOU race that crashed the process with an uncatchable NSGenericException from -[__NSURLSessionLocal taskForClassInfo:] when startSSE / stopSSE / token rotation interleaved with URLSession.bytes(for:).

Closes LUM-1001.

Root cause analysis

  1. How did we get here? EventStreamClient is @MainActor and stored sseSession: URLSession? as shared mutable state. GatewayHTTPClient.stream(...) is nonisolated async (intentional, from #21729 — network must not block main). At the await, execution hops off @MainActor; any interleaved @MainActor turn (token rotation, reconnect, explicit stop+start) could call invalidateSSESession() before the concurrent-executor resumption called bytes(for:). That resumption synchronously creates a data task — which on an invalidated session raises an uncatchable ObjC exception.
  2. What decisions led to it? #25396 gave SSE its own session and invalidated it on teardown — correct in isolation. #25426 added an sseSession === session guard that only closed the window before the Task started, not the window opened by the await. Both retained shared ownership of the session across @MainActor callers.
  3. Warning signs we missed? BtwClient recently landed (#27250) using the correct pattern — per-call session local to the Task with defer { session.invalidateAndCancel() }. Retrofitting SSE to the same pattern was the obvious follow-up.
  4. Prevention. Session ownership is now tied to the Task that uses it. Nothing outside the Task can reach it, so the race's precondition (shared reference) no longer exists. External teardown flows through sseTask?.cancel(), which URLSession.bytes(for:delegate:) honors via withTaskCancellationHandler.
  5. AGENTS.md guideline? The existing @MainActor Isolation Boundaries section already captures the relevant principle ("keep mutable state on the main actor; offload only the expensive computation"). No new rule needed — a targeted rule like "don't store URLSession as an instance property for a single-stream call" would be narrow enough to rot. The existing rule plus the in-file comment citing LUM-1001 is the right level.
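
The ownership rule from item 4 can be sketched roughly as follows. This is a minimal model, not the code in this PR: `StandInSession`, `TeardownLog`, and `startOwnedStream` are hypothetical names, the stand-in replaces `URLSession` so the sketch runs without a network, and the sleep loop stands in for iterating `URLSession.bytes(for:)`.

```swift
import Foundation

/// Hypothetical stand-in for `URLSession` (assumption: only teardown matters here).
final class StandInSession: @unchecked Sendable {
    private let lock = NSLock()
    private var invalidated = false

    var isInvalidated: Bool {
        lock.lock(); defer { lock.unlock() }
        return invalidated
    }

    func invalidateAndCancel() {
        lock.lock(); defer { lock.unlock() }
        invalidated = true
    }
}

/// Hypothetical recorder, used only to observe teardown from outside the Task.
final class TeardownLog: @unchecked Sendable {
    private let lock = NSLock()
    private var events: [Bool] = []

    var recorded: [Bool] {
        lock.lock(); defer { lock.unlock() }
        return events
    }

    func record(sessionInvalidated: Bool) {
        lock.lock(); defer { lock.unlock() }
        events.append(sessionInvalidated)
    }
}

/// The session is created, used, and invalidated inside one Task.
/// No reference to it escapes, so no other caller can invalidate it mid-await.
func startOwnedStream(log: TeardownLog) -> Task<Void, Never> {
    return Task {
        let session = StandInSession()
        defer {
            // Runs on every exit path, including cancellation.
            session.invalidateAndCancel()
            log.record(sessionInvalidated: session.isInvalidated)
        }
        do {
            // Stand-in for the streaming loop: suspends until the Task is cancelled.
            while true {
                try await Task.sleep(nanoseconds: 5_000_000)
            }
        } catch {
            // Cancellation surfaces as a thrown error; fall through to `defer`.
        }
    }
}
```

External teardown in this model is just `task.cancel()`: the only handle a caller holds is the Task, never the session.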

Holistic changes in this PR

  • Removed sseSession: URLSession? and invalidateSSESession() — the shared-ownership bug.
  • Removed the superseded-session guard in startSSEStream — dead code once ownership is fixed.
  • stopSSE now relies on sseTask?.cancel() alone; session teardown runs via the Task's defer.
  • deinit now cancels tokenRotationTask, sseReconnectTask, and sseTask explicitly, per clients/AGENTS.md "Always cancel subscriptions and tasks" — previously only the session was invalidated, leaving the reconnect/token-rotation tasks running with [weak self] closures until their sleep completed.
  • Added EventStreamClientLifecycleTests exercising back-to-back startSSE/stopSSE, idempotent startSSE, teardown after startSSE, stopSSE without startSSE, and dealloc while running.

Alternatives considered

  • Keep the guard, tolerate the race. Rejected — today's crash proves the guard cannot close the window.
  • Delegate-driven SafeAsyncBytes + ObjC exception trampoline (#26281, approved but closed). Converts SIGABRT to URLError(.cancelled) but leaves the shared-session architecture intact — defense-in-depth on top of a broken architecture. ~400 LOC of ObjC module + delegate + NSLock-serialized backpressure to maintain. Not needed once ownership is fixed: with this PR, dataTaskWithRequest: is never called on an invalidated session, so the exception path is unreachable.
  • Probe-then-call (session.dataTask(with:).state). Rejected — doesn't close the window; the race can occur between probe and bytes(for:).
  • Make GatewayHTTPClient.stream @MainActor. Rejected — deliberately reversed by #21729; SSE backpressure would stall the UI.
  • Convert EventStreamClient to actor or lock around the session. Rejected — actor reentrancy at await preserves the same TOCTOU window; locks can't be held across await.
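
The reentrancy point in the last alternative can be illustrated with a small deterministic model. `SharedSessionOwner` and its methods are hypothetical, and a Boolean stands in for the session's validity; the sketch only shows that state checked before an `await` inside an actor can change before the code after the `await` runs.

```swift
/// Hypothetical actor that owns shared "session validity" state.
actor SharedSessionOwner {
    private var sessionIsValid = true

    func invalidate() {
        sessionIsValid = false
    }

    /// Checks validity, suspends, then reads validity again at the point of use.
    /// The actor is released during `await suspend()`, so other calls can interleave.
    func checkThenUse(
        suspend: @Sendable () async -> Void
    ) async -> (validAtCheck: Bool, validAtUse: Bool) {
        let validAtCheck = sessionIsValid   // time of check
        await suspend()                     // suspension point: actor is reentrant here
        return (validAtCheck, sessionIsValid) // time of use
    }
}
```

Calling `checkThenUse` with a closure that itself calls `invalidate()` on the same actor makes the interleaving explicit: the check passes, yet the state is invalid by the time it is used.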

Why this is safe

  • Does not reintroduce the EXC_BAD_ACCESS #25396 fixed. That crash required Task.cancel() to race with an AsyncBytes iterator on URLSession.shared. Here each stream owns a dedicated session with exactly one data task; URLSession.bytes serializes cancel-vs-iterator teardown internally via withTaskCancellationHandler, and the session is invalidated by defer only after iteration exits.
  • The BtwClient sibling case has been running this exact pattern in production since #27250 merged on 2026-04-21 with no crash reports.
  • All three reconnect triggers (normal disconnect, token rotation, scheduled backoff) funnel through startSSEStream, which cancels the old Task before spawning a new one — the old Task's defer tears down its session. No change in externally observable reconnect behavior.
  • Grep-verified: EventStreamClient is the only caller of GatewayHTTPClient.stream(session:), and the only place in clients/ that stored a URLSession as a long-lived property for a single-stream call.
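
The reconnect funnel described in the list above can be sketched as below, under several assumptions: `StreamOwner`, `startStream`, `stopStream`, `TeardownCounter`, and `StandInSession` are hypothetical names, the owner is left non-isolated for brevity (the real client is `@MainActor`), and the start and stop methods return the cancelled Task only so that a caller or test can await its teardown.

```swift
import Foundation

/// Hypothetical thread-safe counter of session teardowns.
final class TeardownCounter: @unchecked Sendable {
    private let lock = NSLock()
    private var count = 0

    var value: Int {
        lock.lock(); defer { lock.unlock() }
        return count
    }

    func increment() {
        lock.lock(); defer { lock.unlock() }
        count += 1
    }
}

/// Hypothetical stand-in for `URLSession`; invalidation just bumps the counter.
final class StandInSession: Sendable {
    private let counter: TeardownCounter
    init(counter: TeardownCounter) { self.counter = counter }
    func invalidateAndCancel() { counter.increment() }
}

/// Hypothetical owner: it stores only the Task, never a session.
final class StreamOwner {
    private var streamTask: Task<Void, Never>?
    private let counter: TeardownCounter

    init(counter: TeardownCounter) { self.counter = counter }

    /// Single funnel for every (re)start: cancel the old Task, then spawn a new one.
    @discardableResult
    func startStream() -> Task<Void, Never>? {
        let superseded = streamTask
        superseded?.cancel()
        streamTask = Task { [counter] in
            let session = StandInSession(counter: counter) // Task-local session
            defer { session.invalidateAndCancel() }
            do {
                // Stand-in for the streaming loop: suspends until cancelled.
                while true {
                    try await Task.sleep(nanoseconds: 5_000_000)
                }
            } catch {
                // Cancelled: fall through to `defer`.
            }
        }
        return superseded
    }

    /// Stops the stream by cancelling the Task; session teardown runs in its `defer`.
    @discardableResult
    func stopStream() -> Task<Void, Never>? {
        let stopped = streamTask
        stopped?.cancel()
        streamTask = nil
        return stopped
    }

    deinit { streamTask?.cancel() }
}
```

In this model each start produces exactly one session and one teardown, and a stop without a prior start is a no-op.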

Apple references checked (2026-04-21): URLSession.bytes(for:delegate:), withTaskCancellationHandler, WWDC21 — Use async/await with URLSession, WWDC25 — Embracing Swift concurrency, WWDC21 — Protect mutable state with Swift actors.

Link to Devin session: https://app.devin.ai/sessions/407502b1cf444309a4dacb287c0cfaeb
Requested by: @ashleeradka


Moves the SSE URLSession into the Task that reads from it so no other
MainActor caller can invalidate a session mid-await. Eliminates the
TOCTOU race that crashed the process with an uncatchable
NSGenericException from -[__NSURLSessionLocal taskForClassInfo:] when
startSSE/stopSSE/token rotation interleaved with URLSession.bytes(for:).

Closes LUM-1001

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@devin-ai-integration

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@linear

linear Bot commented Apr 21, 2026

LUM-1001 Crash: NSGenericException — Task created in invalidated URLSession during SSE stream (MACOS-J4)

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

@vex-assistant-bot vex-assistant-bot Bot left a comment

APPROVE

Value: Eliminates the LUM-1001 TOCTOU race that crashed the process with an uncatchable NSGenericException from -[__NSURLSessionLocal taskForClassInfo:] when startSSE/stopSSE/token rotation interleaved with URLSession.bytes(for:).

What changed: Session ownership moves from instance property (sseSession: URLSession?) to Task-local (let session inside the streaming Task + defer { session.invalidateAndCancel() }). Nothing outside the Task can reach the session, so the race's precondition — shared mutable reference — no longer exists.

Verified:

  • Pattern matches PR #27250 (BtwClient) exactly — Task-local session with defer invalidation. The PR description explicitly cites #27250 as the correct pattern to retrofit
  • Session passed to GatewayHTTPClient.stream() via session: param (line 310, session: session). stream() already accepts this param (GatewayHTTPClient.swift:460)
  • stopSSE() relies on sseTask?.cancel() — cancellation propagates through URLSession.bytes(for:delegate:)'s withTaskCancellationHandler to the data task, then the Task's defer invalidates the session. Clean teardown chain
  • Back-to-back startSSEStream() is safe — old Task cancelled → its defer fires → its session invalidated. New Task creates its own session. No shared state
  • deinit simplified correctly — cancels tasks → their defers fire → sessions cleaned up
  • Removed dead code: invalidateSSESession() method, sseSession property, superseded-session guard (sseSession === session)
  • 5 lifecycle tests cover: rapid start/stop cycling (20x), back-to-back start, teardown after start, stop without start, dealloc while running
  • Root cause analysis in PR description is exceptional — decision archaeology tracing through #25396, #25426, #27250, with clear prevention reasoning and AGENTS.md check

URLSession lifecycle family is now consistent:

  • EventStreamClient (this PR) — Task-local session, defer invalidation
  • BtwClient (#27250) — Task-local session, defer invalidation
  • GatewayHTTPClient.stream/streamPost/streamPostWithRetry — accept session: param, default .shared

All three crash tickets (LUM-820, LUM-903, LUM-1001) are now addressed with the same proven pattern.

@ashleeradka ashleeradka merged commit 26219d4 into main Apr 21, 2026
7 checks passed
@ashleeradka ashleeradka deleted the devin/lum-1001-1776809031-own-session-in-sse-task branch April 21, 2026 22:28