Skip to content

Add 429 throttle retry policy#26820

Open
andrewmathew1 wants to merge 4 commits into
Azure:mainfrom
andrewmathew1:andrewmathew1/cosmos-429-retry-policy
Open

Add 429 throttle retry policy#26820
andrewmathew1 wants to merge 4 commits into
Azure:mainfrom
andrewmathew1:andrewmathew1/cosmos-429-retry-policy

Conversation

@andrewmathew1
Copy link
Copy Markdown
Contributor

Adds a dedicated throttle retry policy that retries HTTP 429 responses, honors the x-ms-retry-after-ms response header, and is configurable via ClientOptions.ThrottlingRetryOptions (MaxRetryAttempts, MaxRetryWaitTime). This brings parity with the throttling retry behavior of the .NET, Java, and Python Cosmos SDKs.

Fixes #25638

Adds a dedicated throttle retry policy that retries HTTP 429 responses,
honors the x-ms-retry-after-ms response header, and is configurable via
ClientOptions.ThrottlingRetryOptions (MaxRetryAttempts, MaxRetryWaitTime).
This brings parity with the throttling retry behavior of the .NET, Java,
and Python Cosmos SDKs.

Fixes Azure#25638

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 02:58
@andrewmathew1 andrewmathew1 requested a review from a team as a code owner May 20, 2026 02:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Cosmos-specific throttling retry policy to retry HTTP 429 responses, honoring the x-ms-retry-after-ms header with configurable limits exposed via ClientOptions.ThrottlingRetryOptions.

Changes:

  • Introduces throttleRetryPolicy to retry HTTP 429 responses with a capped attempt count and cumulative wait budget.
  • Wires the new policy into the azcosmos client pipeline and adds ThrottlingRetryOptions to ClientOptions.
  • Adds unit tests validating retry behavior and updates the changelog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
sdk/data/azcosmos/cosmos_throttle_retry_policy.go Implements the dedicated 429 throttling retry policy and retry-after parsing.
sdk/data/azcosmos/cosmos_throttle_retry_policy_test.go Adds unit tests covering success, exhaustion, missing header, and cancellation scenarios.
sdk/data/azcosmos/cosmos_http_constants.go Adds the x-ms-retry-after-ms header constant.
sdk/data/azcosmos/cosmos_client.go Inserts the throttling retry policy into the per-retry pipeline.
sdk/data/azcosmos/cosmos_client_options.go Exposes ThrottlingRetryOptions on ClientOptions.
sdk/data/azcosmos/CHANGELOG.md Records the feature addition.

Comment thread sdk/data/azcosmos/cosmos_throttle_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_throttle_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_throttle_retry_policy.go Outdated
Comment thread sdk/data/azcosmos/cosmos_throttle_retry_policy.go Outdated
andrewmathew1 and others added 3 commits May 19, 2026 23:15
The new throttle retry policy retries 429 with Cosmos-specific
x-ms-retry-after-ms semantics, but azcore's default retry policy also
retries 429. Layered together they compound: with maxRetryAttempts=2 a
single 429 produced 12 transport calls (3 throttle attempts x 4 azcore
attempts) instead of 3.

newClient now copies the embedded ClientOptions and, when the caller
hasn't set Retry.StatusCodes or Retry.ShouldRetry, swaps in azcore's
default status code set without 429 (408, 500, 502, 503, 504). The
throttle policy owns 429 end-to-end. Callers who customize Retry
behavior are left untouched.

Also replace `time.After` in the retry-wait select with
`time.NewTimer` + `Stop()` on cancellation, so cancelled long
delays don't leak timers until they fire.

New tests:
* TestThrottleRetry_RewindsRequestBodyAcrossRetries (POST body is
  re-readable across retries via the documented RewindBody path).
* TestThrottleRetry_NoDoubleRetryWith429 (full newClient pipeline -
  single 429 with delay produces 1 + N throttle attempts, not the
  compounded count).
* TestThrottleRetry_AzcoreStillRetriesOther5xx (503 is retried by
  azcore as before; the 429 carve-out doesn't disable the rest).
* TestThrottleRetry_CallerStatusCodesPreserved (explicit
  Retry.StatusCodes from the caller is not overridden).
* TestDefaultAzcoreRetryStatusCodesWithout429 (helper sanity).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ry-policy-fixes

azcosmos: fix double-retry on 429 and timer leak in throttle policy
Four issues raised by the Copilot reviewer:

1. The doc comment on newThrottleRetryPolicy said "non-positive values
   for either option fall back to defaults" but a negative
   MaxRetryAttempts actually disables retries entirely. Updated the
   comment to describe the real behavior of zero vs. negative.

2. readRetryAfterMs treated `delay <= 0` as "missing header" and fell
   back to defaultDelay, so an explicit `x-ms-retry-after-ms: 0`
   (a valid "retry immediately" hint) was silently replaced with a 5s
   sleep. Changed readRetryAfterMs to return `(duration, ok bool)`
   and only fall back when `!ok`. Explicit 0 is now honored.

3. The retry path called `azruntime.Drain(response)` before
   `req.RewindBody()`, so a rewind failure returned an error along
   with a response whose body was already closed. Reordered so the
   rewind happens first; if it fails, the caller still receives an
   intact 429 response for diagnostics.

4. `strconv.ParseFloat` accepts `NaN`/`Inf` and would produce
   surprising durations when scaled. Added explicit `math.IsNaN` /
   `math.IsInf` guards alongside the existing negative-value check.

Tests:
* TestReadRetryAfterMs extended with explicit-zero, nan, positive-inf,
  and negative-inf cases. Updated to assert the new `(duration, ok)`
  return.
* New TestThrottleRetry_ExplicitZeroRetryAfterIsHonored proves the
  policy retries immediately on `x-ms-retry-after-ms: 0` with a
  configured defaultDelay (5s) and a tight cumulative wait budget
  (100ms): if the fallback were applied, the first retry would either
  exceed the budget or take longer than the asserted threshold.

CHANGELOG: added an Other Changes entry under 1.5.0-beta.7 documenting
the new explicit-zero semantics, the NaN/Inf rejection, and the
rewind-before-drain ordering.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

### Features Added

* Added a dedicated 429 (Too Many Requests) throttling retry policy that honors the `x-ms-retry-after-ms` response header and is configurable via `ClientOptions.ThrottlingRetryOptions` (`MaxRetryAttempts`, `MaxRetryWaitTime`). This brings parity with the throttling retry behavior in the .NET, Java, and Python Cosmos SDKs. When `ClientOptions.Retry.StatusCodes` and `ClientOptions.Retry.ShouldRetry` are both unset, 429 is no longer in the azcore retry policy's default status codes (it is now handled exclusively by the throttling retry policy); the other transient status codes (408, 500, 502, 503, 504) remain.
Copy link
Copy Markdown
Member

@simorenoh simorenoh Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Added a dedicated 429 (Too Many Requests) throttling retry policy that honors the `x-ms-retry-after-ms` response header and is configurable via `ClientOptions.ThrottlingRetryOptions` (`MaxRetryAttempts`, `MaxRetryWaitTime`). This brings parity with the throttling retry behavior in the .NET, Java, and Python Cosmos SDKs. When `ClientOptions.Retry.StatusCodes` and `ClientOptions.Retry.ShouldRetry` are both unset, 429 is no longer in the azcore retry policy's default status codes (it is now handled exclusively by the throttling retry policy); the other transient status codes (408, 500, 502, 503, 504) remain.
* Added a dedicated 429 (Too Many Requests) throttling retry policy that honors the `x-ms-retry-after-ms` response header and is configurable via `ClientOptions.ThrottlingRetryOptions` (`MaxRetryAttempts`, `MaxRetryWaitTime`). When `ClientOptions.Retry.StatusCodes` and `ClientOptions.Retry.ShouldRetry` are both unset, 429 is no longer in the azcore retry policy's default status codes (it is now handled exclusively by the throttling retry policy); the other transient status codes (408, 500, 502, 503, 504) remain. See [PR 26820](https://github.com/Azure/azure-sdk-for-go/pull/26820).

Copy link
Copy Markdown
Member

@simorenoh simorenoh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a changelog comment, LGTM otherwise!


### Other Changes

* Throttling retry policy: an explicit `x-ms-retry-after-ms: 0` header is now honored as "retry immediately" instead of being treated as a missing header (which would have applied the default delay). NaN/Inf values for the header are now rejected as invalid. The request body is rewound before the 429 response body is drained so a rewind failure surfaces a usable 429 response to the caller.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't need this since these changes are also in this PR for the first time right?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] 429 Retry Policy

3 participants