Commit 966cc94
[Docs] Add guidance for RayService initialization timeout to prevent indefinite waiting (ray-project#58238)
## Description

Add guidance for RayService initialization timeout to prevent indefinite waiting with the `ray.io/initializing-timeout` annotation on RayService.

## Related issues

Closes ray-project/kuberay#4138

## Additional information

None

---------

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
1 parent c0f3ee6 commit 966cc94

File tree

1 file changed: +28 -0

doc/source/cluster/kubernetes/troubleshooting/rayservice-troubleshooting.md

Lines changed: 28 additions & 0 deletions
@@ -85,6 +85,7 @@ kubectl exec -it $HEAD_POD -- ray summary actors
 * {ref}`kuberay-raysvc-issue8`
 * {ref}`kuberay-raysvc-issue9`
 * {ref}`kuberay-raysvc-issue10`
+* {ref}`kuberay-raysvc-issue11`

 (kuberay-raysvc-issue1)=
 ### Issue 1: Ray Serve script is incorrect
@@ -309,3 +310,30 @@ If the annotation isn't set, KubeRay automatically uses each RayCluster custom resource
 Hence, both the old and new RayClusters have different `RAY_external_storage_namespace` values, and the new RayCluster is unable to access the old cluster metadata.
 Another solution is to set the `RAY_external_storage_namespace` value manually to a unique value for each RayCluster custom resource.
 See [kuberay#1296](https://github.com/ray-project/kuberay/issues/1296) for more details.
+
+(kuberay-raysvc-issue11)=
+### Issue 11: RayService is stuck in the Initializing state
+
+If one or more underlying Pods are scheduled but fail to start (for example, because of `ImagePullBackOff`, `CrashLoopBackOff`, or other container startup errors), a `RayService` can remain in the Initializing state indefinitely. This state consumes cluster resources and makes the root cause harder to diagnose. Set an initializing timeout so the operator fails fast instead of waiting forever.
+
+#### What to do
+
+KubeRay exposes a configurable initializing timeout via the `ray.io/initializing-timeout` annotation. When the timeout expires, the operator marks the `RayService` as failed and starts cleanup of the associated `RayCluster` resources. Enabling the timeout only requires adding the annotation to the `RayService` metadata; no other changes to the custom resource are necessary.
+
+#### Operator behavior after the timeout
+
+- The operator sets the `RayServiceReady` condition to `False` with reason `InitializingTimeout`, as sketched below.
+- The `RayService` enters a **terminal (failed)** state; updating the spec doesn't trigger a retry. To recover, delete and recreate the `RayService`.
+- The operator clears the cluster names on the `RayService` custom resource, which triggers cleanup of the underlying `RayCluster` resources. Deletions still respect `RayClusterDeletionDelaySeconds`.
+- The operator emits a `Warning` event that documents the timeout and the failure reason.
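+
+As a rough sketch (assuming standard Kubernetes condition fields; the exact message text varies by KubeRay version), a timed-out `RayService` status looks something like this:
+
+```yaml
+status:
+  conditions:
+    - type: RayServiceReady
+      status: "False"
+      reason: InitializingTimeout
+      # Illustrative message only; the operator's actual wording may differ.
+      message: "RayService did not become ready before the initializing timeout"
+```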
+
+#### Enable the timeout
+
+Add the annotation to your `RayService` metadata. The annotation accepts either a Go duration string (for example, `"30m"` or `"1h"`) or an integer number of seconds (for example, `"1800"`):
+
+```yaml
+metadata:
+  annotations:
+    ray.io/initializing-timeout: "30m"
+```
+
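+For orientation, here is a minimal sketch of where the annotation sits in a complete `RayService` manifest; the resource name below is a hypothetical placeholder, and the `spec` body is omitted:
+
+```yaml
+apiVersion: ray.io/v1
+kind: RayService
+metadata:
+  name: example-rayservice        # hypothetical name, not from the upstream docs
+  annotations:
+    # An integer number of seconds, such as "1800", also works.
+    ray.io/initializing-timeout: "30m"
+spec:
+  # serveConfigV2 and rayClusterConfig omitted for brevity
+```
+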
+#### Guidance
+
+- Pick a timeout that balances the expected startup time (for example, large image pulls) against failing fast to conserve cluster resources.
+- See the upstream discussion in [kuberay#4138](https://github.com/ray-project/kuberay/issues/4138) for more implementation details.
