fix(daemon): retry Anthropic stream aborted by transport (LUM-1537)#30586
Conversation
When the Anthropic SDK reports a transport-level abort
("Request was aborted." with status=undefined), the catch-site in
providers/anthropic/client.ts distinguishes between caller cancellation
(tags abortReason) and transport-level termination (leaves abortReason
undefined). The retry layer was already short-circuiting caller
cancellation via the abortReason check, but had no positive predicate
for transport aborts, so the SDK message fell through to the default
"network_error" branch — which checks ECONNRESET-style errno codes that
the SDK never sets on this code path. Net result: every Anthropic
stream that was cut mid-token by Bun's fetch deadline, an edge LB idle
cutoff, or an upstream TCP close propagated to the SSE chat handler as
a terminal error, which surfaced to the web client as a 45 s silent
stall before the watchdog fired and reconnected with no message to
reconcile.
Adds isRetryableTransportAbort() as a sibling predicate to
isRetryableProviderMessage() and isRetryableStreamError(), gated on
ProviderError with statusCode=undefined and a /request was aborted/i
message match. Slots into the existing isRetryableError() chain after
the daemon-cancel and HTTP-status short-circuits, so:
- daemon-initiated cancellations stay non-retryable (abortReason !==
undefined short-circuits first);
- inner-timeout failures stay non-retryable (the catch-site rewrites
the message to "Anthropic stream timed out after Xs (inner
streamTimeoutMs)", which does not match the pattern);
- only true transport aborts retry, using the same exponential-backoff
budget as overloaded_error / 429 / 5xx.
Adds errorType="transport_abort" tagging so the retry log warnings are
distinguishable in Logfire/Sentry from generic network_error.
Three new regression tests in provider-error-scenarios.test.ts cover:
the retry path on transport abort, the abortReason short-circuit on
caller cancellation, and the inner-timeout rewrite path stays
non-retryable.
This is the daemon-side counterpart to vellum-assistant-platform#6708
(vembda graceful close on RemoteProtocolError) and
vellum-assistant-platform#6711 (web SSE observability) — together they
collapse the 45 s silent-stall window described in LUM-1431.
Refs LUM-1537, LUM-1431, LUM-1459, LUM-1536, LUM-1538.
Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad1a27e3a4
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // saw an HTTP response). A real HTTP status here means a server error, | ||
| // which is handled by the status check. | ||
| if (error.statusCode !== undefined) return false; | ||
| return RETRYABLE_TRANSPORT_ABORT_PATTERNS.some((p) => p.test(error.message)); |
There was a problem hiding this comment.
Avoid retrying other providers' timeout aborts
When a non-Anthropic provider's own createStreamTimeout() aborts the SDK call, the OpenAI/Gemini catch sites only preserve abortReason from the external caller signal and otherwise wrap non-API aborts as a ProviderError with statusCode === undefined; if that SDK reports the same Request was aborted text, this new provider-agnostic predicate will retry the deterministic 30-minute timeout three more times instead of surfacing it. Anthropic rewrites its inner timeout before this point, but the other providers do not, so this should be scoped to the Anthropic/OpenRouter-Anthropic transport-abort shape or those catch sites should tag/rewrite inner timeouts first.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — confirmed and fixed in 55ad034.
Walked through the catch-sites:
providers/openai/chat-completions-provider.ts:338-382— computesabortReasonfrom the externalsignal?.aborted, so when the innercreateStreamTimeoutfires (timing outtimeoutSignal, notsignal),abortReasonstaysundefined. The OpenAI SDK throwsAPIUserAbortError extends APIErrorwithstatus === undefinedandmessage === "Request was aborted.", so the catch falls into theOpenAI.APIErrorbranch and throwsProviderError("OpenAI API error (undefined): Request was aborted.", "openai", undefined, {})— no message rewrite. ❌ would have retried.providers/openai/responses-provider.ts:401-443— same shape,providerLabelsubstituted. ❌ would have retried.providers/gemini/client.ts:336-372— same shape,"Gemini API error (undefined): ...". ❌ would have retried.providers/openrouter/client.tsextends chat-completions-provider, so it inherits the OpenAI catch-site path. ❌ would have retried.providers/anthropic/client.ts:1417-1523— the only catch-site with theinnerTimeoutFiredcheck that rewrites the message to"Anthropic stream timed out after Xs (inner streamTimeoutMs)", ensuring inner-timeout failures don't match"Request was aborted". ✓ safe.
Tightened the pattern to /^anthropic api error:\s*request was aborted/i so it matches only the Anthropic catch-site's verbatim output. Non-Anthropic providers' aborted-stream errors (including their (undefined) parenthetical form) fall through to network-error classification, unchanged from before this PR.
Added a regression test (does NOT retry OpenAI/Gemini-shaped 'Request was aborted') that pins this — feeding "OpenAI API error (undefined): Request was aborted." through RetryProvider confirms callCount === 1 (zero retries).
The right long-term fix is for the OpenAI/Gemini catch-sites to grow the same innerTimeoutFired distinction, at which point the pattern set here can be expanded to cover them. Out of scope for LUM-1537 since (a) there's no telemetry showing transport aborts on those providers in our fleet (1,344/4d is Anthropic-specific), and (b) it requires touching three more provider files plus their tests.
The previous /request was aborted/i pattern was provider-agnostic, which risked retrying OpenAI / Gemini / OpenRouter inner-streamTimeoutMs deadline failures. Those catch-sites format their errors as "<Provider> API error (undefined): Request was aborted." and do NOT rewrite inner-timeout failures the way providers/anthropic/client.ts does, so a permissive regex would burn three retries on a deterministic 30-minute deadline failure that fires identically on every attempt. Anchors the regex to ^anthropic api error:\s*request was aborted so only the Anthropic catch-site's transport-abort output matches. Anthropic's inner-timeout failures rewrite the message to start with 'Anthropic stream timed out after Xs (inner streamTimeoutMs)' which doesn't match the prefix. Non-Anthropic providers' aborted-stream errors fall through to network-error classification (unchanged). Adds a regression test verifying OpenAI-shaped 'Request was aborted' errors stay non-retryable until those catch-sites grow the same innerTimeoutFired distinction. Refs LUM-1537.
There was a problem hiding this comment.
✦ APPROVE
Value: Eliminates the #1 daemon error (1,344 events/4d) — Anthropic transport aborts that were silently failing the entire turn instead of retrying, leaving users staring at a blank screen for 45s.
What this does: Adds isRetryableTransportAbort() to the retry predicate chain, gated behind ^anthropic api error:\s*request was aborted — intentionally Anthropic-specific so OpenAI/Gemini inner-timeout failures (which share the same SDK message but with (undefined) parenthetical) aren't retried on their deterministic 30-minute deadline paths.
Analysis:
-
Pattern anchoring is correct. The regex
/^anthropic api error:\s*request was aborted/imatches"Anthropic API error: Request was aborted."but not"OpenAI API error (undefined): Request was aborted."— the(undefined)parenthetical in other providers' format naturally excludes them. Codex's P2 about non-Anthropic provider inner-timeouts was already handled by this anchoring; Devin confirmed in 55ad034. -
Three-way disambiguation is airtight. Caller-cancel (
abortReason !== undefined) short-circuits before reaching this predicate. Inner-timeout gets a rewritten message ("Anthropic stream timed out after Xs") that doesn't match. Only true transport aborts fall through. The docstring makes this explicit. -
Retry budget appropriate.
DEFAULT_MAX_RETRIES = 3with equal-jitter backoff (≤7s total) is the right tradeoff — bursty LB rolling restarts justify more than 1 retry; bounded budget avoids runaway. -
Token replay on retry (pre-existing, non-blocking NIT): If attempt 1 emits partial tokens via
onEventbefore aborting, attempt 2 replays from the start — users could see duplicate output. This is existingRetryProviderbehavior (theevents accumulate across retriestest documents it), not introduced by this PR. Worth a future dedup pass but not blocking here. -
Three test cases are exactly right: transport abort retries (mirrors exact production error shape), caller-cancelled doesn't retry (abortReason check), inner-timeout doesn't retry (catch-site's rewritten message). All three cover distinct code paths.
-
errorType: "transport_abort"tag makes retry-exhaustion alerting actionable in Logfire/Sentry. The RCA calls this out explicitly.
PR description is one of the best I've seen — full failure chain traced with matching event counts, every alternative rejected with clear reasoning, AGENTS.md guidance answered honestly. This is the daemon-side half of the fix; #6708 (vembda) and #6711 (web observability) close out the other layers. Vellum Constitution — Trust-seeking: retry budget is bounded and reasoning is fully auditable.
Prompt / plan
Sentry telemetry over the last 4d showed
[anthropic-client] Anthropic stream aborted by transport — likely bun fetch deadline, edge LB, or network idle cutofffiring 1,344 times — the daemon's #1 error by 5×, on the/v1/agent/loopSSE chat path. Cross-correlating with the platform Sentry projects on the same time window:RemoteProtocolError: peer closed connection... (incomplete chunked read)on/assistants/{assistant_id}/query: 234 + 34 related = 268sse_watchdog_fired: 313That order-of-magnitude match (1,344 daemon aborts → 268 vembda upstream-disconnects → 313 web watchdog fires) traced the failure chain end-to-end:
Anthropic.APIError(status: undefined, message: "Request was aborted.").assistant/src/providers/anthropic/client.ts:1447correctly classifies this as a transport abort (vs. caller-cancel byabortReason, vs. inner-timeout by message rewrite) and re-throws asProviderError(statusCode: undefined, message: "Anthropic API error: Request was aborted.").RetryProviderthen drops the error onto the floor.isRetryableError()short-circuitsabortReason !== undefined(✓ correct), skips the 429/5xx branch (statusCode === undefined),isRetryableProviderMessagematches only/overloaded/i,isRetryableStreamErrormatches only the four SDK stream-corruption patterns, andisRetryableNetworkErrorchecks ECONNRESET-style errno codes that the SDK never sets on this path. So the error rethrows up the agent loop, the turn fails, the daemon→vembda TCP closes mid-frame, and the web client observes 45 s of silence before its watchdog fires.The comment at
retry.ts:97even calls out this gap: "transport-level aborts (retryable) and caller-cancels both surface as 'Request was aborted' from the SDK." — the daemon-cancel side was wired up; the transport-retryable side wasn't.This PR adds the missing positive predicate.
Companion PRs that address the same chain at the other two layers:
vellum-ai/vellum-assistant-platform#6708— vembda emits a graceful-close SSE comment frame onhttpx.RemoteProtocolError(boundary-gated per WHATWG §10.5) so the client gets an SSE-level signal instead of a half-open chunked response.vellum-ai/vellum-assistant-platform#6711— web SSE observability (wasTurnSendingtag,lastByteAgeMs, heartbeat-vs-data counters,attachStacktrace) to split user-harming stalls from benign idle stalls in Sentry.Together these collapse the 45 s silent-stall window described in LUM-1431 and the deeper investigation at LUM-1459. This PR is the daemon-side primary fix — the other two are observability + symptom mitigation.
What this changes
assistant/src/providers/retry.ts:RETRYABLE_TRANSPORT_ABORT_PATTERNS = [/request was aborted/i]constant, separate from the existingRETRYABLE_PROVIDER_MESSAGE_PATTERNS(overloaded) andRETRYABLE_STREAM_PATTERNS(SDK corruption) so log/metric tagging stays distinguishable.isRetryableTransportAbort(error)predicate — gated onProviderErrorwithstatusCode === undefined, message matches the pattern.isRetryableError()calls it afterisRetryableStreamErrorand beforeisRetryableNetworkError, so:abortReason !== undefinedshort-circuits first, unchanged)."Anthropic stream timed out after Xs (inner streamTimeoutMs)", which doesn't contain"request was aborted".DEFAULT_MAX_RETRIES = 3,DEFAULT_BASE_DELAY_MS = 1000with equal jitter) as overloaded_error.errorType: "transport_abort"in the retry log warning so Logfire/Sentry filters can distinguish it from genericnetwork_error.Retry semantics on
sendMessage()are stream-aware: each attempt delivers events independently viaonEvent, the existingRetryProvider — streaming response handling > events accumulate across retriestest covers the partial-token replay case. No new idempotency surface is introduced —sendMessage()always issues a fresh HTTP request, so retries are at-most-once at the SDK layer.Test plan
Three new regression tests in
src/__tests__/provider-error-scenarios.test.ts:retries transport-aborted stream (Anthropic 'Request was aborted' with no abortReason)— mirrors the exactProviderErrorshape from the catch-site, confirms 1 fail + 1 success → 2 calls (one retry).does NOT retry caller-aborted stream (abortReason set short-circuits retry)— same message,abortReason: "user-cancelled"set, confirmscallCount === 1(zero retries, error surfaces as-is).does NOT retry inner-timeout stream (deterministic 30min deadline failure)— uses the catch-site's rewritten message, confirmscallCount === 1.Existing coverage (also re-verified):
retries overloaded_error with undefined statusCode (mid-stream SSE)— overloaded path unchanged.retries on 429 and succeeds after transient rate limit— HTTP-status path unchanged.RETRYABLE_STREAM_PATTERNStests — SDK-corruption path unchanged.events accumulate across retries— onEvent-replay semantics unchanged.Local verification:
Pre-push hook ran all 521 affected tests in
assistant/— all pass.bunx tsc --noEmitclean,bun run lintclean (one pre-existing warning in an unrelated file).CLI verb checklist
No new IPC routes — pure retry-policy change in the provider adapter layer.
References
APIError,APIConnectionError,APIUserAbortError): https://github.com/anthropics/anthropic-sdk-typescript/blob/main/src/error.tscomputeRetryDelayalready implements): https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/Alternatives considered
/request was aborted/itoRETRYABLE_NETWORK_MESSAGE_PATTERNSinutil/retry.ts. Rejected — that predicate is shared withnotifications/adapters/platform.tsandmemory/embed.ts, neither of which has theabortReasonshort-circuit that distinguishes caller-cancel from transport-abort. Adding the pattern there would risk retrying user-cancelled requests in those code paths. Keeping the pattern provider-retry-local preserves the abort-reason invariant.retryable: trueflag on theProviderError). Rejected — the retry layer is intentionally generic across providers (OpenAI, Gemini, OpenRouter, Fireworks). Future providers that surface the same SDK message shape (e.g.@anthropic-ai/sdkis used internally by OpenRouter's Anthropic-compat path) get this fix for free without changing each catch-site.DEFAULT_MAX_RETRIES = 3budget used for 429/5xx). Rejected — empirically the abort rate is bursty (Anthropic edge LB rolling restarts, Bun fetch deadline spikes correlated with cold cache). A single retry leaves users stalled when 2 successive aborts hit; the equal-jitter backoff (500-1000ms, 1000-2000ms, 2000-4000ms) keeps total retry budget bounded at ≤7s. Cheaper than the 45s the user currently waits.Root cause analysis
isRetryableError()chain was built in tranches —isRetryableNetworkErrorhandled errno-coded transport failures,isRetryableStreamErrorhandled SDK stream-corruption patterns,isRetryableProviderMessagehandledoverloaded. The Anthropic transport-abort case wasn't a runtime observation when the chain was written; it became prominent only as call volume scaled.anthropic/client.tscorrectly logged the transport-abort case with a clearWARN-level message ("Anthropic stream aborted by transport — likely bun fetch deadline, edge LB, or network idle cutoff"), but no observability surface paged the team when the rate of those warnings climbed. So the gap between "the warning is logged" and "the warning is acted on by the retry layer" went unnoticed for 1,344 events in 4d.sse_watchdog_firedevents with 100%messagesAddedBucket=0, which is the symptom side of this bug. (c) Vembda Sentry showed 268RemoteProtocolErrorpeer-closed events, which are the proxy-layer symptom. None of these were cross-correlated until this investigation.errorType: "transport_abort"tag on the retry log warning is the first step — once it's deployed, Logfire/Sentry can alert on it specifically rather than collapsing it intonetwork_error. Beyond that, the broader pattern (a logged WARN with no SLO attached) is the systemic issue. Worth following up with: an Logfire/Sentry alert onmodule=retry AND errorType=transport_abort AND count > X/hourso retry exhaustion (i.e. transport aborts that persist through 3 retries) pages the on-call.RETRYABLE_TRANSPORT_ABORT_PATTERNSthat links transport-abort to the LUM-1536 / LUM-1538 companion PRs so future readers see the full chain.Refs LUM-1537, LUM-1431, LUM-1459, LUM-1536, LUM-1538.
Link to Devin session: https://app.devin.ai/sessions/c23057eacba947afb722cea232ba157d
Requested by: @ashleeradka