Merged
20 commits
30234a2
[Docs] Add the draft description about feature intro, configurations,…
justinyeh1995 Oct 26, 2025
a746d50
Merge branch 'ray-project:master' into docs/3883-add-apiserver-rety-t…
justinyeh1995 Oct 27, 2025
14638bd
[Fix] Update the retry walk-through
justinyeh1995 Oct 30, 2025
fcfcdf4
Merge branch 'master' of https://github.com/ray-project/kuberay into …
justinyeh1995 Oct 30, 2025
8287448
[Doc] rewrite the first 2 sections
justinyeh1995 Nov 1, 2025
8533fb3
Merge branch 'master' of https://github.com/ray-project/kuberay into …
justinyeh1995 Nov 1, 2025
6a85ad3
Merge branch 'master' of https://github.com/ray-project/kuberay into …
justinyeh1995 Nov 7, 2025
5911cc4
Merge branch 'master' of https://github.com/ray-project/kuberay into …
justinyeh1995 Nov 8, 2025
f656a35
[Doc] Revise documentation wording and add Observing Retry Behavior s…
justinyeh1995 Nov 12, 2025
67c1476
[Fix] fix linting issue by running pre-commit run berfore commiting
justinyeh1995 Nov 12, 2025
da763de
[Fix] fix linting errors in the Markdown linting
justinyeh1995 Nov 12, 2025
9f9e3f4
Merge branch 'master' of https://github.com/ray-project/kuberay into …
justinyeh1995 Nov 13, 2025
fb4874a
[Fix] Clean up the math equation
justinyeh1995 Nov 13, 2025
9ed4b17
Update the math formula of Backoff calculation.
justinyeh1995 Nov 14, 2025
7640567
[Fix] Explicitly mentioned exponential backoff and removed the custom…
justinyeh1995 Nov 15, 2025
9a1e786
[Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer”
justinyeh1995 Nov 16, 2025
784228e
[Docs] Rename retry-configuration.md to retry-behavior.md for accuracy
justinyeh1995 Nov 16, 2025
5d58086
Update Title to KubeRay APIServer Retry Behavior
justinyeh1995 Nov 17, 2025
3e9b06b
[Docs] Add a note about the limitation of retry configuration
justinyeh1995 Nov 17, 2025
6a5e883
Merge branch 'master' of https://github.com/ray-project/kuberay into …
justinyeh1995 Nov 20, 2025
94 changes: 94 additions & 0 deletions apiserversdk/docs/retry-configuration.md
@@ -0,0 +1,94 @@
# APIServer Retry Behavior

By default, the KubeRay APIServer automatically retries failed requests to the Kubernetes API when transient errors occur.
This built-in resilience improves reliability without requiring manual intervention.
This guide explains the retry behavior and how to customize it.

## Prerequisite

Follow the [installation guide](installation.md) to set up the cluster and the KubeRay APIServer.

## Default Retry Behavior

The APIServer automatically retries for these HTTP status codes:
Contributor

Maybe we can explicitly mention that we use exponential backoff when retrying on these transient errors.

Contributor Author

Sounds good. I will add this part into the paragraph.


- 408 (Request Timeout)
- 429 (Too Many Requests)
- 500 (Internal Server Error)
- 502 (Bad Gateway)
- 503 (Service Unavailable)
- 504 (Gateway Timeout)

Note that non-retryable errors (4xx except 408/429) fail immediately without retries.
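
The status-code rule above can be sketched as a small predicate. This is an illustration only: `isRetryable` is a hypothetical name, not necessarily the function used in the KubeRay source.

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryable reports whether a response status is in the documented retry
// list. Hypothetical helper for illustration; not taken from the KubeRay code.
func isRetryable(status int) bool {
	switch status {
	case http.StatusRequestTimeout, // 408
		http.StatusTooManyRequests,     // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	default:
		// Every other status, including other 4xx codes, fails immediately.
		return false
	}
}

func main() {
	for _, code := range []int{404, 408, 429, 503} {
		fmt.Printf("%d retryable: %t\n", code, isRetryable(code))
	}
}
```

With this predicate, a 404 is reported as non-retryable, while 408, 429, and 503 are retried.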

The default retry configuration is:

- **MaxRetry**: 3 retries (4 total attempts including the initial one)
- **InitBackoff**: 500ms (initial wait time)
- **BackoffFactor**: 2.0 (exponential multiplier)
- **MaxBackoff**: 10s (maximum wait time between retries)
- **OverallTimeout**: 30s (total timeout for all attempts)

Retries use exponential backoff. The wait before each retry is

$$\text{Backoff}_i = \min(\text{InitBackoff} \times \text{BackoffFactor}^i, \text{MaxBackoff})$$

where $i$ is the attempt number (starting from 0).
Retries stop once the total elapsed time exceeds `OverallTimeout`.

## Customize the Retry Configuration

Currently, retry configuration is hardcoded. If you need custom retry behavior,
you'll need to modify the source code and rebuild the image.

### Step 1: Modify the config in `apiserversdk/util/config.go`

For example,

```go
const (
	HTTPClientDefaultMaxRetry       = 5                 // Increase retries from 3 to 5
	HTTPClientDefaultBackoffFactor  = float64(2)
	HTTPClientDefaultInitBackoff    = 2 * time.Second   // Longer backoff makes timing visible
	HTTPClientDefaultMaxBackoff     = 20 * time.Second
	HTTPClientDefaultOverallTimeout = 120 * time.Second // Longer timeout to allow more retries
)
```
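
As a rough check on these example values, assuming the backoff formula from the Default Retry Behavior section applies unchanged, the five waits would be:

$$
\begin{aligned}
\text{Backoff}_0 &= \min(2 \times 2^0, 20) = 2\,\text{s} \\
\text{Backoff}_1 &= \min(2 \times 2^1, 20) = 4\,\text{s} \\
\text{Backoff}_2 &= \min(2 \times 2^2, 20) = 8\,\text{s} \\
\text{Backoff}_3 &= \min(2 \times 2^3, 20) = 16\,\text{s} \\
\text{Backoff}_4 &= \min(2 \times 2^4, 20) = 20\,\text{s}
\end{aligned}
$$

That is $2 + 4 + 8 + 16 + 20 = 50$ seconds of waiting in total, which stays under the 120s `OverallTimeout` as long as the requests themselves do not take more than the remaining 70 seconds.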
Collaborator

It seems like currently we do not have a way to configure it without modifying the code. I am thinking in this case we can omit the configuration part and just write about the default behavior?

cc @Future-Outlier @rueian for some advice on this

Collaborator

After discussing offline, we can just document the default behavior here.

Contributor Author

@justinyeh1995 justinyeh1995 Nov 15, 2025

Sounds good. I will remove the customization part.


### Step 2: Rebuild and load the new APIServer image into your Kind cluster

```bash
cd apiserver
export IMG_REPO=kuberay-apiserver
export IMG_TAG=dev
export KIND_CLUSTER_NAME=$(kubectl config current-context | sed 's/^kind-//')

make docker-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG
make load-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG KIND_CLUSTER_NAME=$KIND_CLUSTER_NAME
```

### Step 3: Redeploy the APIServer using Helm, overriding the image to use the new one you just built

```bash
helm upgrade --install kuberay-apiserver ../helm-chart/kuberay-apiserver --wait \
  --set image.repository=$IMG_REPO,image.tag=$IMG_TAG,image.pullPolicy=IfNotPresent \
  --set security=null
```

## Observing Retry Behavior

### In Production

When a retry occurs in production, you won't see explicit logs by default because
the retry mechanism operates silently. However, you can observe its effects:

1. **Monitor request latency**: Retried requests will take longer due to backoff delays
2. **Check Kubernetes API Server logs**: Look for repeated requests from the same client

### In Development

To verify retry behavior during development, you can:

1. Run the unit tests to ensure retry logic works correctly:

```bash
make test
```