Skip to content

fix: add retry circuit breaker and backoff cap to prevent infinite retry loops#17790

Open
dawidbednarczyk wants to merge 1 commit intoanomalyco:devfrom
dawidbednarczyk:fix/retry-circuit-breaker
Open

fix: add retry circuit breaker and backoff cap to prevent infinite retry loops#17790
dawidbednarczyk wants to merge 1 commit intoanomalyco:devfrom
dawidbednarczyk:fix/retry-circuit-breaker

Conversation

@dawidbednarczyk
Copy link

@dawidbednarczyk dawidbednarczyk commented Mar 16, 2026

Issue for this PR

Closes #17648

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

This fixes an infinite retry loop in the session processor.

Today processor.ts retries inside a while (true) loop with no exit condition. At the same time, retry.ts caps backoff only when response headers are missing. If headers are present but retry-after is missing, the fallback path returns raw exponential backoff with no cap.

In practice this means transient upstream API failures can leave a session stuck retrying for hours. In my logs this showed up as repeated AI_APICallError: Could not relay message upstream failures with delays growing to 202 seconds between attempts, and the process never recovered without manual intervention.

This PR makes three small changes:

  1. adds RETRY_MAX_ATTEMPTS = 10
  2. adds RETRY_MAX_DELAY_WITH_HEADERS = 60_000
  3. breaks out of the retry loop in processor.ts after max attempts, publishes the error event, and returns the session to idle

This is a minimal fix. It does not change behavior for successful retries or for responses that already provide a valid retry-after header.

This is also the same underlying problem described in #12234 and #17169.

How did you verify your code works?

  • Rebasing onto the latest dev pulled in the recent Windows CI fix (fix(ci): workaround by using hoisted Bun linker on Windows #17751)
  • bun turbo typecheck passes locally across all packages
  • I reproduced the failure mode from production logs and checked that the new cap would reduce the 202s retry delay to 60s
  • I verified the retry loop now stops after the configured max attempts instead of continuing indefinitely
  • The PR CI now passes typecheck, unit (linux), and unit (windows), with e2e jobs running after that

Screenshots / recordings

Not a UI change.

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

…try loops

When API errors trigger retries with response headers present but no
retry-after header, the exponential backoff grows without bound (observed
202s+ delays in production). Combined with the while(true) loop in
processor.ts having no exit condition, this causes sessions to hang
indefinitely burning CPU and tokens.

Changes:
- Add RETRY_MAX_ATTEMPTS (10) to cap total retry count
- Add RETRY_MAX_DELAY_WITH_HEADERS (60s) to cap backoff when headers
  are present but missing retry-after
- Add circuit breaker in processor.ts that breaks the retry loop after
  max attempts, publishes error event, and sets session to idle

Validated against production logs showing 11 retries over 542 seconds
with AI_APICallError: Could not relay message upstream.

Fixes anomalyco#17648
@github-actions github-actions bot added needs:compliance This means the issue will auto-close after 2 hours. and removed needs:compliance This means the issue will auto-close after 2 hours. labels Mar 16, 2026
@github-actions
Copy link
Contributor

Thanks for updating your PR! It now meets our contributing guidelines. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Session processor retries indefinitely with unbounded exponential backoff — no max retries or circuit breaker

1 participant