Name and Version
bitnami/kuberay 1.2.19
What architecture are you using?
amd64
What steps will reproduce the bug?
This is my first issue, so I hope I can provide all the information required for a better understanding and easier troubleshooting. I might even be wrong about this, so please bear with me.
Context: the infrastructure was just deployed from scratch using Terraform. All apps/services are up and running except a KubeRay worker (more details below). Using the Helm provider I deployed kuberay-operator with a few custom values (shown below), and I created a sample RayService using Terraform's kubectl provider to deploy the manifest (also shown below).
Kubernetes Cluster: AWS EKS
Helm:
Images:
In this cluster I have deployed other apps/services using Bitnami's charts.
Deploy your Kubernetes cluster as usual. Use Helm to install Bitnami's kuberay chart. Deploy a RayService and check the kuberay-operator logs, as well as the RayService itself.
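For reference, the plain Helm equivalent of the Terraform deployment is roughly the following (release name, namespace and values file name are illustrative placeholders):
$ helm install kuberay oci://registry-1.docker.io/bitnamicharts/kuberay \
    --version 1.2.19 \
    --namespace kuberay --create-namespace \
    -f custom-values.yaml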
Are you using any custom parameters or values?
The reason I am adding the RBAC rules and the service account token is related to the apparent issue I am seeing. The reason I am adding RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV is that the kuberay-operator logs show a message saying that, since the variable was not set, a default value was being used; no big deal with that one, I am just explaining why I added it.
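The custom values themselves did not survive the copy-paste; the kind of overrides described above would look roughly like this (the exact parameter paths are my assumption and should be checked against the chart's values.yaml):
# custom-values.yaml -- illustrative sketch, not the literal file
operator:
  automountServiceAccountToken: true    # the service account token setting mentioned above
  extraEnvVars:
    - name: RAYCLUSTER_DEFAULT_REQUEUE_SECONDS_ENV
      value: "300"                      # example value only
  rbac:
    create: true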
The Ray Service I am using as an example is this one:
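The literal manifest did not make it into this report; the following is a reconstruction from the kubectl describe output shown further down, so the exact field layout should be treated as approximate:
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: rayservice-fake-emails
  namespace: kuberay
spec:
  serveConfigV2: |
    applications:
      - name: fake
        import_path: fake:app
        route_prefix: /
  rayClusterConfig:
    rayVersion: "2.38.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: fjrivas/custom_ray:latest
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              resources:
                limits:
                  cpu: "2"
                  memory: 2Gi
                requests:
                  cpu: "2"
                  memory: 2Gi
    workerGroupSpecs:
      - groupName: small-group
        replicas: 1
        minReplicas: 1
        maxReplicas: 2
        numOfHosts: 1
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: fjrivas/custom_ray:latest
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "1"
                    memory: 2Gi
                  requests:
                    cpu: 500m
                    memory: 2Gi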
Note: I tried version 2.39.0 as well, just in case, but the results are the same. Since the Ray image used by Bitnami's kuberay operator is 2.38 and it is advised to use the same version in custom images, I built my app image with Ray 2.38.
What is the expected behavior?
The RayService in Running state and no error messages in the kuberay-operator logs.
$ kg rayservice -n kuberay
NAME                     SERVICE STATUS   NUM SERVE ENDPOINTS
rayservice-fake-emails   Running          1
What do you see instead?
The RayService is stuck in WaitForServeDeploymentReady:
$ kg rayservice -n kuberay
NAME                     SERVICE STATUS                NUM SERVE ENDPOINTS
rayservice-fake-emails   WaitForServeDeploymentReady
$ kd rayservice -n kuberay rayservice-fake-emails
Name:         rayservice-fake-emails
Namespace:    kuberay
Labels:       <none>
Annotations:  <none>
API Version:  ray.io/v1
Kind:         RayService
Metadata:
  Creation Timestamp:  2024-11-27T12:40:06Z
  Generation:          1
  Resource Version:    5499
  UID:                 a9221370-e409-4943-b0e0-77e3ff693c49
Spec:
  Deployment Unhealthy Second Threshold:  300
  Ray Cluster Config:
    Head Group Spec:
      Ray Start Params:
        Dashboard - Host:  0.0.0.0
      Template:
        Spec:
          Containers:
            Image:  fjrivas/custom_ray:latest
            Name:   ray-head
            Ports:
              Container Port:  6379
              Name:            gcs-server
              Protocol:        TCP
              Container Port:  8265
              Name:            dashboard
              Protocol:        TCP
              Container Port:  10001
              Name:            client
              Protocol:        TCP
              Container Port:  8000
              Name:            serve
              Protocol:        TCP
            Resources:
              Limits:
                Cpu:     2
                Memory:  2Gi
              Requests:
                Cpu:     2
                Memory:  2Gi
    Ray Version:  2.38.0
    Worker Group Specs:
      Group Name:    small-group
      Max Replicas:  2
      Min Replicas:  1
      Num Of Hosts:  1
      Ray Start Params:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Image:  fjrivas/custom_ray:latest
            Lifecycle:
              Pre Stop:
                Exec:
                  Command:
                    /bin/sh
                    -c
                    ray stop
            Name:  ray-worker
            Resources:
              Limits:
                Cpu:     1
                Memory:  2Gi
              Requests:
                Cpu:     500m
                Memory:  2Gi
  serveConfigV2:  applications:
  - name: fake
    import_path: fake:app
    route_prefix: /
  Service Unhealthy Second Threshold:  900
Status:
  Active Service Status:
    Ray Cluster Status:
      Desired CPU:              2500m
      Desired GPU:              0
      Desired Memory:           4Gi
      Desired TPU:              0
      Desired Worker Replicas:  1
      Endpoints:
        Client:        10001
        Dashboard:     8265
        Gcs - Server:  6379
        Metrics:       8080
        Serve:         8000
      Head:
        Pod IP:        10.1.36.167
        Pod Name:      rayservice-fake-emails-raycluster-dh9h2-head-6jq9d
        Service IP:    10.1.36.167
        Service Name:  rayservice-fake-emails-raycluster-dh9h2-head-svc
      Last Update Time:     2024-11-27T12:40:59Z
      Max Worker Replicas:  2
      Min Worker Replicas:  1
      Observed Generation:  1
  Observed Generation:  1
  Pending Service Status:
    Application Statuses:
      Fake:
        Health Last Update Time:  2024-11-27T12:41:27Z
        Serve Deployment Statuses:
          create_fake_email:
            Health Last Update Time:  2024-11-27T12:41:27Z
            Status:                   UPDATING
        Status:  DEPLOYING
    Ray Cluster Name:  rayservice-fake-emails-raycluster-dh9h2
    Ray Cluster Status:
      Desired CPU:     0
      Desired GPU:     0
      Desired Memory:  0
      Desired TPU:     0
      Head:
  Service Status:  WaitForServeDeploymentReady
Events:
  Type    Reason           Age                    From                   Message
  ----    ------           ----                   ----                   -------
  Normal  ServiceNotReady  7m6s (x25 over 7m54s)  rayservice-controller  The service is not ready yet. Controller will perform a round of actions in 2s.
I have read that it is normal for the worker group to show 0/1; in fact, even under these conditions the app works.
$ kgpo -n kuberay
NAME                                                              READY   STATUS    RESTARTS   AGE
kuberay-operator-b5c75fd87-blwj6                                  1/1     Running   0          9m19s
rayservice-fake-emails-raycluster-dh9h2-head-6jq9d                1/1     Running   0          8m54s
rayservice-fake-emails-raycluster-dh9h2-small-grou-worker-l9s5w   0/1     Running   0          8m54s
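For reference, a quick way to confirm the app responds even in this state is to port-forward the head pod's serve port (a sketch, not necessarily how I verified it):
$ kubectl port-forward -n kuberay pod/rayservice-fake-emails-raycluster-dh9h2-head-6jq9d 8000:8000
$ curl http://localhost:8000/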
I also see these in the kuberay-operator logs:
W1127 12:42:08.725162 1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:42:08.725465 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
W1127 12:42:57.122692 1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:42:57.122732 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
W1127 12:43:42.058024 1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:43:42.058075 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
W1127 12:44:29.551260 1 reflector.go:539] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
E1127 12:44:29.551308 1 reflector.go:147] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: endpoints is forbidden: User "system:serviceaccount:kuberay:kuberay-operator" cannot list resource "endpoints" in API group "" at the cluster scope
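The missing permission can also be confirmed directly with kubectl; while the rule is absent, both checks below should report no (sketch):
$ kubectl auth can-i list endpoints --all-namespaces --as=system:serviceaccount:kuberay:kuberay-operator
$ kubectl auth can-i watch endpoints --all-namespaces --as=system:serviceaccount:kuberay:kuberay-operator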
Additional information
Stuff that I have tried:
Adding the RBAC rules to the custom values.yaml.
Editing the cluster role directly: after adding the endpoints resource, the RayService status changed to Running and the messages in the kuberay-operator log were no longer there. What I did was:
$ kubectl edit clusterrole kuberay-kuberay-operator -n kuberay
...
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - list
  - watch
...
If this is in fact an issue, and not a mistake on my part adding these rules in the wrong place, the change would go in clusterrole.yaml, adding the resources and verbs shown above.
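Once the rule is in place (via the chart template or the manual edit above), one quick way to confirm it landed is (sketch):
$ kubectl describe clusterrole kuberay-kuberay-operator | grep endpoints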
I have forked the project and, if this is in fact something to be fixed, I am ready to create a PR with the solution described above.
I hope I am not missing anything.
Update 11/27/2024: I can see the required rules are in the Ray project chart helper.
Thank you so much for reporting. If I understood correctly, it seems that the RBAC rules may not be in sync with some upstream changes. Would you like to submit a PR adding the missing rules?