Fix cancel race: disarm safety-net timer before QA finalization#6934
Conversation
Co-Authored-By: Claude <noreply@anthropic.com>
| cancelSafetyNetTask = Task { @MainActor in | ||
| try? await Task.sleep(nanoseconds: 2_000_000_000) // 2 seconds | ||
| guard self.isCancelled else { return } // in case state changed | ||
| try? self.daemonClient.send(CuSessionAbortMessage(sessionId: self.id)) |
There was a problem hiding this comment.
🔴 Safety-net task cancellation is ineffective — abort still fires after cancel() on the Task
The safety-net Task created in cancel() uses try? await Task.sleep(...) which swallows CancellationError. After the sleep is interrupted by cancelSafetyNetTask?.cancel(), execution continues to the guard self.isCancelled check — but self.isCancelled is true (set by cancel()), so the guard passes and cu_session_abort is sent anyway.
Root Cause: Task.isCancelled vs self.isCancelled
When run() reaches line 287 and calls cancelSafetyNetTask?.cancel(), the cooperative cancellation sets the task's cancellation flag. The Task.sleep at Session.swift:1065 throws CancellationError, but try? swallows it. The task then continues to line 1066:
guard self.isCancelled else { return }Since self.isCancelled was set to true by cancel() at Session.swift:1055, this guard passes. The task then sends CuSessionAbortMessage at line 1067.
Critically, run() then awaits finalizeQARecording() at line 291, which is a suspension point. This gives the cancelled safety-net task a chance to resume on @MainActor and send the abort before finalizeQARecording() completes — exactly the race condition the PR is supposed to fix.
The fix should check Task.isCancelled inside the safety-net task body, e.g.:
cancelSafetyNetTask = Task { @MainActor in
try? await Task.sleep(nanoseconds: 2_000_000_000)
guard !Task.isCancelled else { return } // <-- check cooperative cancellation
guard self.isCancelled else { return }
try? self.daemonClient.send(CuSessionAbortMessage(sessionId: self.id))
}Impact: The cu_session_abort message can still arrive at the daemon before cu_session_finalized, causing the daemon's handleCuSessionAbort to delete cuSessionMetadata before the summary can be injected — the exact bug this PR was meant to prevent.
(Refers to lines 1064-1068)
Was this helpful? React with 👍 or 👎 to provide feedback.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6357b3394
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| cancelSafetyNetTask = Task { @MainActor in | ||
| try? await Task.sleep(nanoseconds: 2_000_000_000) // 2 seconds | ||
| guard self.isCancelled else { return } // in case state changed |
There was a problem hiding this comment.
Return when canceling the safety-net task
Canceling cancelSafetyNetTask in run() does not actually disarm it because the task body ignores Task.sleep cancellation (try? await Task.sleep(...)) and continues immediately; with isCancelled == true, it still sends cu_session_abort before finalizeQARecording() finishes, so the original metadata-loss race can still occur on slow finalization paths. This is triggered specifically when the new post-loop cancelSafetyNetTask?.cancel() path runs during QA cancellation.
Useful? React with 👍 / 👎.
Co-authored-by: Vellum Assistant <assistant@vellum.ai> Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Vellum Assistant <assistant@vellum.ai> Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Vellum Assistant <assistant@vellum.ai> Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Vellum Assistant <assistant@vellum.ai> Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Vellum Assistant <assistant@vellum.ai> Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Vellum Assistant <assistant@vellum.ai> Co-authored-by: Claude <noreply@anthropic.com>
The 2-second cancel safety-net timer could fire before finalizeQARecording() completes on slow paths, sending cu_session_abort before cu_session_finalized. The daemon's handleCuSessionAbort deletes cuSessionMetadata, so the later cu_session_finalized can't inject the summary. Fix: store the safety-net Task and cancel it when run() reaches the finalization code path, since run() handles finalization + abort itself.