frenoid opened this issue on Feb 24, 2025 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
dashboard: Issues specific to the Ray Dashboard
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
What happened + What you expected to happen
Context
We are running multiple Ray clusters on version 2.41.0, all sending metrics to a single shared Thanos instance.
Each Ray cluster is launched by the KubeRay operator.
We launch the clusters by creating RayCluster and RayService custom resources.
Observed
In the Overview tab of the Ray Dashboard, the "Cluster Utilization" card shows CPU, memory, and disk metrics for the Ray cluster on which the Dashboard process is running.
We noticed that the Dashboard frontend code queries Thanos for this card with a filter on SessionName.
However, in the Serve tab, the "QPS per application" and "Error QPS per application" panels include applications from multiple clusters.
We noticed that the frontend Thanos queries for these panels do not filter on SessionName or ray_io_cluster.
Expected
The Grafana panels in the Serve tab of the Ray Dashboard should show only the applications of the Ray cluster on which the Dashboard is hosted, not the applications running on every Ray cluster.
Relevant code
I dug into the code base and believe the following lines are relevant.
The Cluster Utilization card in the Overview tab has a SessionName filter:
ray/python/ray/dashboard/client/src/pages/overview/cards/ClusterUtilizationCard.tsx
Line 55 in f100fe8
The queries in the Serve tab do not have such a filter:
ray/python/ray/dashboard/client/src/pages/serve/ServeDeploymentMetricsSection.tsx
Line 186 in f100fe8
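To illustrate the kind of scoping described above, here is a minimal sketch of how a per-cluster label filter could be added to a PromQL query. It is not the Dashboard's actual code: the helper names `buildLabelMatchers` and `scopedRateQuery`, the 5-minute rate window, and the metric name used in the usage note are assumptions made for illustration only.

```typescript
// Hypothetical helpers (not from the Ray code base): build a PromQL label
// matcher list so that a query only matches series from one cluster.
type LabelFilters = Record<string, string | undefined>;

// Escape backslashes and double quotes so the value is safe inside "...".
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Turn { SessionName: "abc" } into SessionName="abc".
// Labels with an undefined or empty value are skipped.
function buildLabelMatchers(filters: LabelFilters): string {
  const parts: string[] = [];
  for (const label of Object.keys(filters)) {
    const value = filters[label];
    if (value !== undefined && value !== "") {
      parts.push(`${label}="${escapeLabelValue(value)}"`);
    }
  }
  return parts.join(",");
}

// Build a rate query over an assumed 5m window, grouped by the given labels
// and restricted to the series that match the filters.
function scopedRateQuery(
  metric: string,
  filters: LabelFilters,
  groupBy: string[],
): string {
  const matchers = buildLabelMatchers(filters);
  return `sum(rate(${metric}{${matchers}}[5m])) by (${groupBy.join(", ")})`;
}
```

For example, `scopedRateQuery("ray_serve_num_http_requests_total", { SessionName: "session_a" }, ["application"])` yields a query that only counts requests from the cluster whose session is `session_a`. The metric name here is assumed for the example and may not match the one used by the Serve panels.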
Versions / Dependencies
Kuberay operator 1.2.2
Ray 2.41.0
Python 3.11.11
Thanos 0.37
Grafana 10.4.4
OpenShift 4.13
Reproduction script
These environment variables are set in the RayCluster and RayService Kubernetes custom resources
Issue Severity
Medium: It is a significant difficulty but I can work around it.