[APIServer][Docs] Add user guide for retry behavior & configuration #4144

# APIServer Retry Behavior

By default, the KubeRay APIServer automatically retries failed requests to the Kubernetes API with exponential backoff when transient errors occur.
This built-in resilience improves reliability without requiring manual intervention.
This guide explains the retry behavior and how to customize it.

## Prerequisite

Follow the [installation guide](installation.md) to set up a Kubernetes cluster and install the APIServer.

## Default Retry Behavior

The APIServer automatically retries requests that fail with one of these HTTP status codes:

- 408 (Request Timeout)
- 429 (Too Many Requests)
- 500 (Internal Server Error)
- 502 (Bad Gateway)
- 503 (Service Unavailable)
- 504 (Gateway Timeout)

Note that non-retryable errors (4xx except 408/429) fail immediately without retries.
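
The retryability rule above boils down to a status-code check. The helper below is a minimal, hypothetical sketch of that rule, not the actual KubeRay implementation:

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableStatus is a hypothetical helper mirroring the rule above:
// retry on 408, 429, and the listed 5xx errors; everything else fails fast.
func isRetryableStatus(code int) bool {
	switch code {
	case http.StatusRequestTimeout, // 408
		http.StatusTooManyRequests,     // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	}
	return false
}

func main() {
	fmt.Println(isRetryableStatus(503)) // true: retried
	fmt.Println(isRetryableStatus(404)) // false: fails immediately
}
```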

The retry schedule is controlled by the following default configuration:

- **MaxRetry**: 3 retries (4 total attempts including the initial one)
- **InitBackoff**: 500ms (initial wait time)
- **BackoffFactor**: 2.0 (exponential multiplier)
- **MaxBackoff**: 10s (maximum wait time between retries)
- **OverallTimeout**: 30s (total timeout for all attempts)

The wait time before each retry grows exponentially:

$$\text{Backoff}_i = \min(\text{InitBackoff} \times \text{BackoffFactor}^i, \text{MaxBackoff})$$

where $i$ is the attempt number (starting from 0).
With the defaults above, the waits are 500ms, 1s, and 2s for the three retries.
Retries stop once the total elapsed time exceeds `OverallTimeout`.
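
To make the schedule concrete, here is a small self-contained Go sketch (not KubeRay code) that prints the default backoff sequence:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Default values described above.
	initBackoff := 500 * time.Millisecond
	backoffFactor := 2.0
	maxBackoff := 10 * time.Second
	maxRetry := 3

	for i := 0; i < maxRetry; i++ {
		// Backoff_i = min(InitBackoff * BackoffFactor^i, MaxBackoff)
		backoff := time.Duration(float64(initBackoff) * math.Pow(backoffFactor, float64(i)))
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
		fmt.Printf("retry %d: wait %v\n", i, backoff) // 500ms, 1s, 2s
	}
}
```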
## Customize the Retry Configuration

Currently, the retry configuration is hardcoded. If you need custom retry behavior,
you'll need to modify the source code and rebuild the image.

### Step 1: Modify the config in `apiserversdk/util/config.go`

For example:

```go
const (
    HTTPClientDefaultMaxRetry       = 5                 // Increase retries from 3 to 5
    HTTPClientDefaultBackoffFactor  = float64(2)
    HTTPClientDefaultInitBackoff    = 2 * time.Second   // Longer backoff makes timing visible
    HTTPClientDefaultMaxBackoff     = 20 * time.Second
    HTTPClientDefaultOverallTimeout = 120 * time.Second // Longer timeout to allow more retries
)
```
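
With these example values, the backoff formula above yields waits of 2s, 4s, 8s, 16s, and 20s (the fifth retry is capped by `HTTPClientDefaultMaxBackoff`), about 50s of backoff in total, comfortably within the 120s overall timeout.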

### Step 2: Rebuild and load the new APIServer image into your Kind cluster

```bash
cd apiserver
export IMG_REPO=kuberay-apiserver
export IMG_TAG=dev
export KIND_CLUSTER_NAME=$(kubectl config current-context | sed 's/^kind-//')

make docker-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG
make load-image IMG_REPO=$IMG_REPO IMG_TAG=$IMG_TAG KIND_CLUSTER_NAME=$KIND_CLUSTER_NAME
```
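
To double-check that the image landed on the Kind node, you can list the node's container images. This assumes the default Kind node name `${KIND_CLUSTER_NAME}-control-plane`; adjust it if your cluster uses a different node name:

```bash
# List images on the Kind node and filter for the one we just loaded.
docker exec ${KIND_CLUSTER_NAME}-control-plane crictl images | grep kuberay-apiserver
```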

### Step 3: Redeploy the APIServer using Helm, overriding the image to use the one you just built

```bash
helm upgrade --install kuberay-apiserver ../helm-chart/kuberay-apiserver --wait \
  --set image.repository=$IMG_REPO,image.tag=$IMG_TAG,image.pullPolicy=IfNotPresent \
  --set security=null
```
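
Once the upgrade finishes, you can confirm which image the pod is actually running. The label selector below assumes the chart's default `app.kubernetes.io/name` label, so adjust it if your chart labels differ:

```bash
# Print the container image used by the APIServer pod(s).
kubectl get pods -l app.kubernetes.io/name=kuberay-apiserver \
  -o jsonpath='{.items[*].spec.containers[*].image}'
```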

## Observing Retry Behavior

### In Production

When a retry occurs in production, you won't see explicit logs by default because
the retry mechanism operates silently. However, you can observe its effects:

1. **Monitor request latency**: Retried requests will take longer due to backoff delays.
2. **Check Kubernetes API server logs**: Look for repeated requests from the same client (see the example after this list).
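
On a Kind cluster, one quick way to inspect the Kubernetes API server logs is via its static pod in `kube-system`. The pod name below assumes the default control-plane node name, which is an assumption about your setup:

```bash
# Tail the Kubernetes API server logs and look for repeated requests.
kubectl logs -n kube-system kube-apiserver-${KIND_CLUSTER_NAME}-control-plane --tail=100
```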

### In Development

To verify retry behavior during development, you can:

1. Run the unit tests to ensure the retry logic works correctly:

   ```bash
   make test
   ```
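
   If you'd rather exercise only the packages that contain the retry logic, a narrower invocation along these lines may work (the package path is an assumption inferred from the config file location in Step 1):

   ```bash
   # Test only the apiserversdk packages instead of the full suite
   # (path assumed from apiserversdk/util/config.go in Step 1).
   go test ./apiserversdk/... -v
   ```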

Review comment: Maybe we can explicitly mention that we use exponential backoff when retrying these transient errors.

Reply: Sounds good. I will add this part to the paragraph.