If the annotation isn't set, KubeRay automatically uses each RayCluster custom resource's UID as the `RAY_external_storage_namespace` value. Hence, both the old and new RayClusters have different `RAY_external_storage_namespace` values, and the new RayCluster is unable to access the old cluster metadata.

Another solution is to set the `RAY_external_storage_namespace` value manually to a unique value for each RayCluster custom resource.
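For example, here is a minimal sketch of setting the value directly on the head container of a RayCluster. The field layout follows the RayCluster custom resource; the container name and the namespace string are placeholders:

```yaml
spec:
  headGroupSpec:
    template:
      spec:
        containers:
          - name: ray-head
            env:
              # Placeholder value; use a distinct namespace for each RayCluster custom resource.
              - name: RAY_external_storage_namespace
                value: "raycluster-a-storage-ns"
```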
See [kuberay#1296](https://github.com/ray-project/kuberay/issues/1296) for more details.

(kuberay-raysvc-issue11)=
### Issue 11: RayService stuck in Initializing — use the initializing timeout to fail fast
If one or more underlying Pods are scheduled but fail to start (for example, ImagePullBackOff, CrashLoopBackOff, or other container startup errors), a `RayService` can remain in the Initializing state indefinitely. This state consumes cluster resources and makes the root cause harder to diagnose.
#### What to do
KubeRay exposes a configurable initializing timeout via the annotation `ray.io/initializing-timeout`. When the timeout expires, the operator marks the `RayService` as failed and starts cleanup of the associated `RayCluster` resources. Enabling the timeout only requires adding the annotation to the `RayService` metadata; no other CRD changes are necessary.
#### Operator behavior after timeout
- The `RayServiceReady` condition is set to `False` with reason `InitializingTimeout` (see the illustrative status excerpt after this list).
- The `RayService` is placed into a **terminal (failed)** state; updating the spec will not trigger a retry. Recovery requires deleting and recreating the `RayService`.
- Cluster names on the `RayService` CR are cleared, which triggers cleanup of the underlying `RayCluster` resources. Deletions still respect `RayClusterDeletionDelaySeconds`.
- A `Warning` event is emitted that documents the timeout and the failure reason.
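
To confirm that the timeout fired, inspect the `RayService` status, for example with `kubectl describe rayservice <name>` or `kubectl get rayservice <name> -o yaml`. The excerpt below is only an illustrative sketch: the `type`, `status`, and `reason` values follow the behavior described above, while the `message` and timestamp are placeholders.

```yaml
status:
  conditions:
    - type: RayServiceReady
      status: "False"
      reason: InitializingTimeout
      # Placeholder text; the operator writes its own message and timestamp.
      message: "RayService did not become ready before the initializing timeout expired"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```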
#### Enable the timeout
Add the annotation to your `RayService` metadata. The annotation accepts either Go duration strings (for example, `"30m"` or `"1h"`) or integer seconds (for example, `"1800"`):
```yaml
metadata:
  annotations:
    ray.io/initializing-timeout: "30m"
```
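The annotation also accepts the integer-seconds form mentioned above; the same 30-minute timeout expressed as seconds:

```yaml
metadata:
  annotations:
    ray.io/initializing-timeout: "1800"  # 30 minutes, expressed as seconds
```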
#### Guidance
- Pick a timeout that balances expected startup work with failing fast to conserve cluster resources.
- See the upstream discussion [kuberay#4138](https://github.com/ray-project/kuberay/issues/4138) for more implementation details.