[Dashboard] Serve Grafana panels shows metrics from multiple clusters instead of filtering on SessionName or ray_io_cluster #50850

frenoid · 2025-02-24T07:26:58Z

What happened + What you expected to happen

Context
We are running multiple Ray clusters on version 2.41.0 and sending their metrics to a single shared Thanos instance.

Each Ray cluster is launched by the KubeRay operator.

We launch the clusters by creating RayCluster and RayService custom resources.

Observed
In the Overview tab of the Ray Dashboard, the "Cluster Utilization" card shows CPU, memory, and disk metrics for the Ray cluster on which the Dashboard process is running.

We noticed that the Dashboard frontend code for this card queries Thanos with a filter on SessionName.

However, in the Serve tab, the "QPS per application" and "Error QPS per application" panels include applications from multiple clusters.

We noticed that the frontend Thanos queries for these panels do not filter on SessionName or ray_io_cluster.

Expected

The Grafana panels in the Serve tab of the Ray Dashboard should show only the applications of the Ray cluster on which the Dashboard is hosted, not the applications running on every Ray cluster.

Relevant code
I dug into the code base and believe that these code lines are relevant.

The Cluster Utilization card in the Overview tab has a SessionName filter

ray/python/ray/dashboard/client/src/pages/overview/cards/ClusterUtilizationCard.tsx

Line 55 in f100fe8

    
           src={`${grafanaHost}${path}&refresh&timezone=${currentTimeZone}${timeRangeParams}&var-SessionName=${sessionName}`}

The queries in the Serve tab do not include this filter

ray/python/ray/dashboard/client/src/pages/serve/ServeDeploymentMetricsSection.tsx

Line 186 in f100fe8

`/d-solo/${grafanaServeDashboardUid}?${pathParams}` +
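To make the difference concrete, here is a minimal sketch of how the Serve panel URL could carry the same `var-SessionName` parameter that the Cluster Utilization card uses. `buildServePanelUrl` and its option names are hypothetical and not part of the Ray code base. The sketch also assumes that the Serve Grafana dashboard defines a `SessionName` template variable and that its panel queries use it, which I have not verified.

```typescript
// Hypothetical helper for illustration only; not the actual Ray implementation.
type ServePanelUrlOptions = {
  grafanaHost: string; // e.g. the value of RAY_GRAFANA_IFRAME_HOST
  grafanaServeDashboardUid: string; // UID of the Serve Grafana dashboard
  pathParams: string; // already-encoded query string, e.g. "orgId=1&panelId=7"
  sessionName?: string; // Ray session name of the current cluster, if known
};

function buildServePanelUrl(opts: ServePanelUrlOptions): string {
  const base =
    `${opts.grafanaHost}/d-solo/${opts.grafanaServeDashboardUid}?${opts.pathParams}`;
  // Without a session name, fall back to the unfiltered URL (current behavior).
  if (!opts.sessionName) {
    return base;
  }
  // Append the Grafana template variable so that panels which reference
  // SessionName are restricted to this cluster's session.
  return `${base}&var-SessionName=${encodeURIComponent(opts.sessionName)}`;
}

// Usage with made-up values:
const exampleUrl = buildServePanelUrl({
  grafanaHost: "http://grafana.monitoring.svc",
  grafanaServeDashboardUid: "rayServeDashboard",
  pathParams: "orgId=1&panelId=7",
  sessionName: "session_2025-02-24_07-00-00_000000_1",
});
console.log(exampleUrl);
```

The URL parameter alone would not be enough if the panel queries themselves ignore the variable; in that case the dashboard queries would also need a SessionName (or ray_io_cluster) label matcher.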

Versions / Dependencies

Kuberay operator 1.2.2
Ray 2.41.0
Python 3.11.11
Thanos 0.37
Grafana 10.4.4
OpenShift 4.13

Reproduction script

These environment variables are set in the RayCluster and RayService Kubernetes custom resources:

- name: RAY_GRAFANA_IFRAME_HOST
  value: "https://grafana.apps.uat/.<redacted_root_domain>"
- name: RAY_GRAFANA_HOST
  value: "http://grafana.monitoring.svc"
- name: RAY_PROMETHEUS_NAME
  value: "Thanos"
- name: RAY_PROMETHEUS_HOST
  value: "http://thanos-query.monitoring.svc.cluster.local:9090"

Issue Severity

Medium: It is a significant difficulty but I can work around it.


frenoid added the bug and triage labels Feb 24, 2025

jcotant1 added the dashboard label Feb 24, 2025
