Skip to content

fix(daemon): retry Anthropic stream aborted by transport (LUM-1537)#30586

Merged
ashleeradka merged 2 commits into
mainfrom
devin/1778696525-lum-1537-anthropic-transport-abort-retry
May 13, 2026
Merged

fix(daemon): retry Anthropic stream aborted by transport (LUM-1537)#30586
ashleeradka merged 2 commits into
mainfrom
devin/1778696525-lum-1537-anthropic-transport-abort-retry

Conversation

@ashleeradka
Copy link
Copy Markdown
Contributor

@ashleeradka ashleeradka commented May 13, 2026

Prompt / plan

Sentry telemetry over the last 4d showed [anthropic-client] Anthropic stream aborted by transport — likely bun fetch deadline, edge LB, or network idle cutoff firing 1,344 times — the daemon's #1 error by 5×, on the /v1/agent/loop SSE chat path. Cross-correlating with the platform Sentry projects on the same time window:

  • vembda RemoteProtocolError: peer closed connection... (incomplete chunked read) on /assistants/{assistant_id}/query: 234 + 34 related = 268
  • web sse_watchdog_fired: 313

That order-of-magnitude match (1,344 daemon aborts → 268 vembda upstream-disconnects → 313 web watchdog fires) traced the failure chain end-to-end:

  1. Anthropic stream cuts off mid-token (Anthropic blip / Bun fetch deadline / LB idle cutoff). SDK throws Anthropic.APIError(status: undefined, message: "Request was aborted.").
  2. The daemon's Anthropic provider catch-site at assistant/src/providers/anthropic/client.ts:1447 correctly classifies this as a transport abort (vs. caller-cancel by abortReason, vs. inner-timeout by message rewrite) and re-throws as ProviderError(statusCode: undefined, message: "Anthropic API error: Request was aborted.").
  3. RetryProvider then drops the error onto the floor. isRetryableError() short-circuits abortReason !== undefined (✓ correct), skips the 429/5xx branch (statusCode === undefined), isRetryableProviderMessage matches only /overloaded/i, isRetryableStreamError matches only the four SDK stream-corruption patterns, and isRetryableNetworkError checks ECONNRESET-style errno codes that the SDK never sets on this path. So the error rethrows up the agent loop, the turn fails, the daemon→vembda TCP closes mid-frame, and the web client observes 45 s of silence before its watchdog fires.

The comment at retry.ts:97 even calls out this gap: "transport-level aborts (retryable) and caller-cancels both surface as 'Request was aborted' from the SDK." — the daemon-cancel side was wired up; the transport-retryable side wasn't.

This PR adds the missing positive predicate.

Companion PRs that address the same chain at the other two layers:

  • vellum-ai/vellum-assistant-platform#6708 — vembda emits a graceful-close SSE comment frame on httpx.RemoteProtocolError (boundary-gated per WHATWG §10.5) so the client gets an SSE-level signal instead of a half-open chunked response.
  • vellum-ai/vellum-assistant-platform#6711 — web SSE observability (wasTurnSending tag, lastByteAgeMs, heartbeat-vs-data counters, attachStacktrace) to split user-harming stalls from benign idle stalls in Sentry.

Together these collapse the 45 s silent-stall window described in LUM-1431 and the deeper investigation at LUM-1459. This PR is the daemon-side primary fix — the other two are observability + symptom mitigation.

What this changes

assistant/src/providers/retry.ts:

  1. New RETRYABLE_TRANSPORT_ABORT_PATTERNS = [/request was aborted/i] constant, separate from the existing RETRYABLE_PROVIDER_MESSAGE_PATTERNS (overloaded) and RETRYABLE_STREAM_PATTERNS (SDK corruption) so log/metric tagging stays distinguishable.
  2. New isRetryableTransportAbort(error) predicate — gated on ProviderError with statusCode === undefined, message matches the pattern.
  3. isRetryableError() calls it after isRetryableStreamError and before isRetryableNetworkError, so:
    • Daemon-initiated cancellations stay non-retryable (abortReason !== undefined short-circuits first, unchanged).
    • HTTP-status errors (429/5xx) take the existing fast path (unchanged).
    • Inner-timeout failures stay non-retryable — the catch-site rewrites the message to "Anthropic stream timed out after Xs (inner streamTimeoutMs)", which doesn't contain "request was aborted".
    • Only true transport aborts retry, with the same exponential-backoff budget (DEFAULT_MAX_RETRIES = 3, DEFAULT_BASE_DELAY_MS = 1000 with equal jitter) as overloaded_error.
  4. New errorType: "transport_abort" in the retry log warning so Logfire/Sentry filters can distinguish it from generic network_error.

Retry semantics on sendMessage() are stream-aware: each attempt delivers events independently via onEvent, the existing RetryProvider — streaming response handling > events accumulate across retries test covers the partial-token replay case. No new idempotency surface is introduced — sendMessage() always issues a fresh HTTP request, so retries are at-most-once at the SDK layer.

Test plan

Three new regression tests in src/__tests__/provider-error-scenarios.test.ts:

  • retries transport-aborted stream (Anthropic 'Request was aborted' with no abortReason) — mirrors the exact ProviderError shape from the catch-site, confirms 1 fail + 1 success → 2 calls (one retry).
  • does NOT retry caller-aborted stream (abortReason set short-circuits retry) — same message, abortReason: "user-cancelled" set, confirms callCount === 1 (zero retries, error surfaces as-is).
  • does NOT retry inner-timeout stream (deterministic 30min deadline failure) — uses the catch-site's rewritten message, confirms callCount === 1.

Existing coverage (also re-verified):

  • retries overloaded_error with undefined statusCode (mid-stream SSE) — overloaded path unchanged.
  • retries on 429 and succeeds after transient rate limit — HTTP-status path unchanged.
  • All four RETRYABLE_STREAM_PATTERNS tests — SDK-corruption path unchanged.
  • events accumulate across retries — onEvent-replay semantics unchanged.

Local verification:

$ bun test src/__tests__/provider-error-scenarios.test.ts src/__tests__/retry-backoff.test.ts src/__tests__/retry-after-extraction.test.ts src/providers/__tests__/retry-callsite.test.ts src/__tests__/openai-provider.test.ts
158 pass, 0 fail

Pre-push hook ran all 521 affected tests in assistant/ — all pass. bunx tsc --noEmit clean, bun run lint clean (one pre-existing warning in an unrelated file).

CLI verb checklist

No new IPC routes — pure retry-policy change in the provider adapter layer.

References

Alternatives considered

  • Add /request was aborted/i to RETRYABLE_NETWORK_MESSAGE_PATTERNS in util/retry.ts. Rejected — that predicate is shared with notifications/adapters/platform.ts and memory/embed.ts, neither of which has the abortReason short-circuit that distinguishes caller-cancel from transport-abort. Adding the pattern there would risk retrying user-cancelled requests in those code paths. Keeping the pattern provider-retry-local preserves the abort-reason invariant.
  • Move the classification into the Anthropic catch-site (set a retryable: true flag on the ProviderError). Rejected — the retry layer is intentionally generic across providers (OpenAI, Gemini, OpenRouter, Fireworks). Future providers that surface the same SDK message shape (e.g. @anthropic-ai/sdk is used internally by OpenRouter's Anthropic-compat path) get this fix for free without changing each catch-site.
  • Cap retries at 1 for transport aborts (vs. the DEFAULT_MAX_RETRIES = 3 budget used for 429/5xx). Rejected — empirically the abort rate is bursty (Anthropic edge LB rolling restarts, Bun fetch deadline spikes correlated with cold cache). A single retry leaves users stalled when 2 successive aborts hit; the equal-jitter backoff (500-1000ms, 1000-2000ms, 2000-4000ms) keeps total retry budget bounded at ≤7s. Cheaper than the 45s the user currently waits.

Root cause analysis

  1. How did the code get into this state? The isRetryableError() chain was built in tranches — isRetryableNetworkError handled errno-coded transport failures, isRetryableStreamError handled SDK stream-corruption patterns, isRetryableProviderMessage handled overloaded. The Anthropic transport-abort case wasn't a runtime observation when the chain was written; it became prominent only as call volume scaled.
  2. What mistakes or decisions led to it? The catch-site in anthropic/client.ts correctly logged the transport-abort case with a clear WARN-level message ("Anthropic stream aborted by transport — likely bun fetch deadline, edge LB, or network idle cutoff"), but no observability surface paged the team when the rate of those warnings climbed. So the gap between "the warning is logged" and "the warning is acted on by the retry layer" went unnoticed for 1,344 events in 4d.
  3. Were there warning signs we missed? Yes — three of them. (a) The retry.ts comment at line 97 explicitly named the gap: "transport-level aborts (retryable) and caller-cancels both surface as 'Request was aborted' from the SDK" — the comment acknowledges retryability but the predicate didn't follow. (b) Web Sentry showed 313 sse_watchdog_fired events with 100% messagesAddedBucket=0, which is the symptom side of this bug. (c) Vembda Sentry showed 268 RemoteProtocolError peer-closed events, which are the proxy-layer symptom. None of these were cross-correlated until this investigation.
  4. What can we do to prevent this pattern from recurring? The new errorType: "transport_abort" tag on the retry log warning is the first step — once it's deployed, Logfire/Sentry can alert on it specifically rather than collapsing it into network_error. Beyond that, the broader pattern (a logged WARN with no SLO attached) is the systemic issue. Worth following up with: an Logfire/Sentry alert on module=retry AND errorType=transport_abort AND count > X/hour so retry exhaustion (i.e. transport aborts that persist through 3 retries) pages the on-call.
  5. AGENTS.md guidance? No new rule. The catch-site / retry-layer separation of concerns is already correct; the bug was a missed predicate, not a missed convention. Adding a rule like "every catch-site that re-throws a logged warning must verify the retry layer handles it" would be too prescriptive — most logged warnings genuinely shouldn't retry. The right durable artifact is the test coverage in this PR (the three new tests pin the desired retry behavior for the three abort sub-cases) plus the in-code docstring on RETRYABLE_TRANSPORT_ABORT_PATTERNS that links transport-abort to the LUM-1536 / LUM-1538 companion PRs so future readers see the full chain.

Refs LUM-1537, LUM-1431, LUM-1459, LUM-1536, LUM-1538.

Link to Devin session: https://app.devin.ai/sessions/c23057eacba947afb722cea232ba157d
Requested by: @ashleeradka


Open in Devin Review

When the Anthropic SDK reports a transport-level abort
("Request was aborted." with status=undefined), the catch-site in
providers/anthropic/client.ts distinguishes between caller cancellation
(tags abortReason) and transport-level termination (leaves abortReason
undefined). The retry layer was already short-circuiting caller
cancellation via the abortReason check, but had no positive predicate
for transport aborts, so the SDK message fell through to the default
"network_error" branch — which checks ECONNRESET-style errno codes that
the SDK never sets on this code path. Net result: every Anthropic
stream that was cut mid-token by Bun's fetch deadline, an edge LB idle
cutoff, or an upstream TCP close propagated to the SSE chat handler as
a terminal error, which surfaced to the web client as a 45 s silent
stall before the watchdog fired and reconnected with no message to
reconcile.

Adds isRetryableTransportAbort() as a sibling predicate to
isRetryableProviderMessage() and isRetryableStreamError(), gated on
ProviderError with statusCode=undefined and a /request was aborted/i
message match. Slots into the existing isRetryableError() chain after
the daemon-cancel and HTTP-status short-circuits, so:

- daemon-initiated cancellations stay non-retryable (abortReason !==
  undefined short-circuits first);
- inner-timeout failures stay non-retryable (the catch-site rewrites
  the message to "Anthropic stream timed out after Xs (inner
  streamTimeoutMs)", which does not match the pattern);
- only true transport aborts retry, using the same exponential-backoff
  budget as overloaded_error / 429 / 5xx.

Adds errorType="transport_abort" tagging so the retry log warnings are
distinguishable in Logfire/Sentry from generic network_error.

Three new regression tests in provider-error-scenarios.test.ts cover:
the retry path on transport abort, the abortReason short-circuit on
caller cancellation, and the inner-timeout rewrite path stays
non-retryable.

This is the daemon-side counterpart to vellum-assistant-platform#6708
(vembda graceful close on RemoteProtocolError) and
vellum-assistant-platform#6711 (web SSE observability) — together they
collapse the 45 s silent-stall window described in LUM-1431.

Refs LUM-1537, LUM-1431, LUM-1459, LUM-1536, LUM-1538.

Co-Authored-By: ashlee@vellum.ai <ashlee@vellum.ai>
@linear
Copy link
Copy Markdown

linear Bot commented May 13, 2026

LUM-1537

LUM-1431

LUM-1459

LUM-1536

LUM-1538

@devin-ai-integration
Copy link
Copy Markdown
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad1a27e3a4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// saw an HTTP response). A real HTTP status here means a server error,
// which is handled by the status check.
if (error.statusCode !== undefined) return false;
return RETRYABLE_TRANSPORT_ABORT_PATTERNS.some((p) => p.test(error.message));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid retrying other providers' timeout aborts

When a non-Anthropic provider's own createStreamTimeout() aborts the SDK call, the OpenAI/Gemini catch sites only preserve abortReason from the external caller signal and otherwise wrap non-API aborts as a ProviderError with statusCode === undefined; if that SDK reports the same Request was aborted text, this new provider-agnostic predicate will retry the deterministic 30-minute timeout three more times instead of surfacing it. Anthropic rewrites its inner timeout before this point, but the other providers do not, so this should be scoped to the Anthropic/OpenRouter-Anthropic transport-abort shape or those catch sites should tag/rewrite inner timeouts first.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — confirmed and fixed in 55ad034.

Walked through the catch-sites:

  • providers/openai/chat-completions-provider.ts:338-382 — computes abortReason from the external signal?.aborted, so when the inner createStreamTimeout fires (timing out timeoutSignal, not signal), abortReason stays undefined. The OpenAI SDK throws APIUserAbortError extends APIError with status === undefined and message === "Request was aborted.", so the catch falls into the OpenAI.APIError branch and throws ProviderError("OpenAI API error (undefined): Request was aborted.", "openai", undefined, {}) — no message rewrite. ❌ would have retried.
  • providers/openai/responses-provider.ts:401-443 — same shape, providerLabel substituted. ❌ would have retried.
  • providers/gemini/client.ts:336-372 — same shape, "Gemini API error (undefined): ...". ❌ would have retried.
  • providers/openrouter/client.ts extends chat-completions-provider, so it inherits the OpenAI catch-site path. ❌ would have retried.
  • providers/anthropic/client.ts:1417-1523 — the only catch-site with the innerTimeoutFired check that rewrites the message to "Anthropic stream timed out after Xs (inner streamTimeoutMs)", ensuring inner-timeout failures don't match "Request was aborted". ✓ safe.

Tightened the pattern to /^anthropic api error:\s*request was aborted/i so it matches only the Anthropic catch-site's verbatim output. Non-Anthropic providers' aborted-stream errors (including their (undefined) parenthetical form) fall through to network-error classification, unchanged from before this PR.

Added a regression test (does NOT retry OpenAI/Gemini-shaped 'Request was aborted') that pins this — feeding "OpenAI API error (undefined): Request was aborted." through RetryProvider confirms callCount === 1 (zero retries).

The right long-term fix is for the OpenAI/Gemini catch-sites to grow the same innerTimeoutFired distinction, at which point the pattern set here can be expanded to cover them. Out of scope for LUM-1537 since (a) there's no telemetry showing transport aborts on those providers in our fleet (1,344/4d is Anthropic-specific), and (b) it requires touching three more provider files plus their tests.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

The previous /request was aborted/i pattern was provider-agnostic, which
risked retrying OpenAI / Gemini / OpenRouter inner-streamTimeoutMs
deadline failures. Those catch-sites format their errors as
"<Provider> API error (undefined): Request was aborted." and do NOT
rewrite inner-timeout failures the way providers/anthropic/client.ts
does, so a permissive regex would burn three retries on a deterministic
30-minute deadline failure that fires identically on every attempt.

Anchors the regex to ^anthropic api error:\s*request was aborted so
only the Anthropic catch-site's transport-abort output matches.
Anthropic's inner-timeout failures rewrite the message to start with
'Anthropic stream timed out after Xs (inner streamTimeoutMs)' which
doesn't match the prefix. Non-Anthropic providers' aborted-stream
errors fall through to network-error classification (unchanged).

Adds a regression test verifying OpenAI-shaped 'Request was aborted'
errors stay non-retryable until those catch-sites grow the same
innerTimeoutFired distinction.

Refs LUM-1537.
Copy link
Copy Markdown
Contributor

@vex-assistant-bot vex-assistant-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE

Value: Eliminates the #1 daemon error (1,344 events/4d) — Anthropic transport aborts that were silently failing the entire turn instead of retrying, leaving users staring at a blank screen for 45s.

What this does: Adds isRetryableTransportAbort() to the retry predicate chain, gated behind ^anthropic api error:\s*request was aborted — intentionally Anthropic-specific so OpenAI/Gemini inner-timeout failures (which share the same SDK message but with (undefined) parenthetical) aren't retried on their deterministic 30-minute deadline paths.

Analysis:

  • Pattern anchoring is correct. The regex /^anthropic api error:\s*request was aborted/i matches "Anthropic API error: Request was aborted." but not "OpenAI API error (undefined): Request was aborted." — the (undefined) parenthetical in other providers' format naturally excludes them. Codex's P2 about non-Anthropic provider inner-timeouts was already handled by this anchoring; Devin confirmed in 55ad034.

  • Three-way disambiguation is airtight. Caller-cancel (abortReason !== undefined) short-circuits before reaching this predicate. Inner-timeout gets a rewritten message ("Anthropic stream timed out after Xs") that doesn't match. Only true transport aborts fall through. The docstring makes this explicit.

  • Retry budget appropriate. DEFAULT_MAX_RETRIES = 3 with equal-jitter backoff (≤7s total) is the right tradeoff — bursty LB rolling restarts justify more than 1 retry; bounded budget avoids runaway.

  • Token replay on retry (pre-existing, non-blocking NIT): If attempt 1 emits partial tokens via onEvent before aborting, attempt 2 replays from the start — users could see duplicate output. This is existing RetryProvider behavior (the events accumulate across retries test documents it), not introduced by this PR. Worth a future dedup pass but not blocking here.

  • Three test cases are exactly right: transport abort retries (mirrors exact production error shape), caller-cancelled doesn't retry (abortReason check), inner-timeout doesn't retry (catch-site's rewritten message). All three cover distinct code paths.

  • errorType: "transport_abort" tag makes retry-exhaustion alerting actionable in Logfire/Sentry. The RCA calls this out explicitly.

PR description is one of the best I've seen — full failure chain traced with matching event counts, every alternative rejected with clear reasoning, AGENTS.md guidance answered honestly. This is the daemon-side half of the fix; #6708 (vembda) and #6711 (web observability) close out the other layers. Vellum Constitution — Trust-seeking: retry budget is bounded and reasoning is fully auditable.

@ashleeradka ashleeradka merged commit cc19e80 into main May 13, 2026
13 checks passed
@ashleeradka ashleeradka deleted the devin/1778696525-lum-1537-anthropic-transport-abort-retry branch May 13, 2026 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant