Skip to content

fix: add retry circuit breaker and backoff cap to prevent infinite retry loops#17668

Closed
dawidbednarczyk wants to merge 1 commit intoanomalyco:devfrom
dawidbednarczyk:fix/retry-circuit-breaker
Closed

fix: add retry circuit breaker and backoff cap to prevent infinite retry loops#17668
dawidbednarczyk wants to merge 1 commit intoanomalyco:devfrom
dawidbednarczyk:fix/retry-circuit-breaker

Conversation

@dawidbednarczyk
Copy link

Issue for this PR

Relates to #17648

Type of change

  • Bug fix
  • New feature
  • Refactor / code improvement
  • Documentation

What does this PR do?

When API errors trigger retries and response headers are present but don't contain retry-after, delay() in retry.ts computes exponential backoff with no upper bound (line 54). The no-headers path is capped at 30s, but the with-headers fallback path isn't — it just returns RETRY_INITIAL_DELAY * 2^(attempt-1) raw.

Meanwhile processor.ts has a while(true) loop with no exit condition on retries — it increments attempt, sleeps the unbounded delay, and continues forever.

I hit this in production: 11 consecutive AI_APICallError: Could not relay message upstream errors over 9 minutes, with delays escalating to 202 seconds between attempts. Process never recovered, had to be killed manually.

Three changes:

  1. RETRY_MAX_DELAY_WITH_HEADERS = 60_000 — caps the with-headers fallback path at 60s (matching the spirit of the existing 30s no-headers cap)
  2. RETRY_MAX_ATTEMPTS = 10 — exported constant for max retry count
  3. Circuit breaker in processor.ts — after attempt > RETRY_MAX_ATTEMPTS, logs the failure, publishes error event, sets session idle, and breaks the loop

The fix is minimal and doesn't change behavior for successful retries or retries that respect retry-after headers.

How did you verify your code works?

  • Analyzed production logs from two stuck processes to confirm the exact failure pattern
  • tsc --noEmit passes clean
  • Verified the 202s delay at attempt 7 would have been capped to 60s, and the loop would have exited at attempt 10 instead of running indefinitely

Checklist

  • I have tested my changes locally
  • I have not included unrelated changes in this PR

…try loops

When API errors trigger retries with response headers present but no
retry-after header, the exponential backoff grows without bound (observed
202s+ delays in production). Combined with the while(true) loop in
processor.ts having no exit condition, this causes sessions to hang
indefinitely burning CPU and tokens.

Changes:
- Add RETRY_MAX_ATTEMPTS (10) to cap total retry count
- Add RETRY_MAX_DELAY_WITH_HEADERS (60s) to cap backoff when headers
  are present but missing retry-after
- Add circuit breaker in processor.ts that breaks the retry loop after
  max attempts, publishes error event, and sets session to idle

Validated against production logs showing 11 retries over 542 seconds
with AI_APICallError: Could not relay message upstream.

Relates to anomalyco#17648
@github-actions
Copy link
Contributor

Thanks for your contribution!

This PR doesn't have a linked issue. All PRs must reference an existing issue.

Please:

  1. Open an issue describing the bug/feature (if one doesn't exist)
  2. Add Fixes #<number> or Closes #<number> to this PR description

See CONTRIBUTING.md for details.

@dawidbednarczyk
Copy link
Author

Closing to resubmit with proper issue linkage (Fixes #17648) and rebased onto latest dev. The e2e (windows) timeout was a pre-existing CI issue (fixed in #17751).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant