[Dashboard] Serve Grafana panels shows metrics from multiple clusters instead of filtering on SessionName or ray_io_cluster #50850

Open
frenoid opened this issue Feb 24, 2025 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
dashboard: Issues specific to the Ray Dashboard
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

frenoid commented Feb 24, 2025

What happened + What you expected to happen

Context
We are running multiple Ray clusters on version 2.41.0 and sending metrics to a single common Thanos instance.

Each Ray cluster is launched by the KubeRay operator.

We launch the clusters by creating RayCluster and RayService custom resources.

Observed
In the Overview tab of the Ray Dashboard, the "Cluster Utilization" card shows CPU, memory, and disk metrics only for the Ray cluster on which the Dashboard process is running.

We noticed that the Dashboard frontend code queries Thanos and includes a filter on SessionName.

However, in the Serve tab, the "QPS per application" and "Error QPS per application" panels include applications from multiple clusters.

We noticed that the frontend Thanos queries for these panels do not filter on SessionName or ray_io_cluster.
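To illustrate the kind of filter that appears to be missing, here is a minimal sketch of scoping a PromQL selector to one cluster with label matchers. The helper names (`escapeLabelValue`, `withLabelFilters`), the metric name, and the label values are assumptions made for illustration; only the label names SessionName and ray_io_cluster come from this report.

```typescript
// Hypothetical helpers, not part of the Ray code base.

// Escape backslashes and double quotes so the value is safe inside a
// double-quoted PromQL label matcher.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
}

// Build `metric{label1="v1",label2="v2"}` from a metric name and a label map.
function withLabelFilters(
  metric: string,
  labels: { [name: string]: string }
): string {
  const matchers = Object.keys(labels)
    .map((name) => `${name}="${escapeLabelValue(labels[name])}"`)
    .join(",");
  return `${metric}{${matchers}}`;
}

// Illustrative metric name and label values (assumptions, not taken from Ray).
const selector = withLabelFilters("ray_serve_num_http_requests_total", {
  SessionName: "session_example",
  ray_io_cluster: "cluster-a",
});

// A per-application request rate restricted to a single cluster.
const query = `sum by (application) (rate(${selector}[5m]))`;
console.log(query);
```

Without such matchers, a query against a shared Thanos instance aggregates series from every cluster that reports to it, which matches the behavior observed in the Serve tab.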

Expected

The Grafana panels in the Serve tab of the Ray Dashboard should show only the applications of the Ray cluster on which the Dashboard is hosted, not the applications running on every Ray cluster.

Relevant code
I dug into the code base and believe that these code lines are relevant.

The Cluster Utilization card in the Overview tab passes a SessionName filter:

src={`${grafanaHost}${path}&refresh&timezone=${currentTimeZone}${timeRangeParams}&var-SessionName=${sessionName}`}

The queries in the Serve tab do not:

`/d-solo/${grafanaServeDashboardUid}?${pathParams}` +
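One possible direction, sketched under assumptions: append the same `var-SessionName` URL parameter to the Serve panel URLs that the Overview card already uses. The function `buildServePanelUrl` and its option names are hypothetical and do not come from the Ray code base. The sketch also assumes that the Serve Grafana dashboard defines a `SessionName` template variable and that its panel queries reference it; otherwise the URL parameter has no effect.

```typescript
// Hypothetical helper, not part of the Ray code base.
interface PanelUrlOptions {
  grafanaHost: string;   // e.g. the value of RAY_GRAFANA_IFRAME_HOST
  dashboardUid: string;  // UID of the Serve Grafana dashboard
  pathParams: string;    // existing query string, e.g. "orgId=1&panelId=7"
  sessionName?: string;  // Ray session name of the current cluster
}

// Build a Grafana solo-panel URL, adding var-SessionName when a session
// name is available so the panel can be scoped to a single cluster.
function buildServePanelUrl(opts: PanelUrlOptions): string {
  const base = `${opts.grafanaHost}/d-solo/${opts.dashboardUid}?${opts.pathParams}`;
  if (!opts.sessionName) {
    return base;
  }
  return `${base}&var-SessionName=${encodeURIComponent(opts.sessionName)}`;
}

// Example with placeholder values.
const panelUrl = buildServePanelUrl({
  grafanaHost: "http://grafana.example",
  dashboardUid: "serve-dashboard",
  pathParams: "orgId=1&panelId=7",
  sessionName: "session_example",
});
console.log(panelUrl);
```

Falling back to the unfiltered URL when no session name is known keeps the current behavior for setups that do not share a metrics backend.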

Versions / Dependencies

Kuberay operator 1.2.2
Ray 2.41.0
Python 3.11.11
Thanos 0.37
Grafana 10.4.4
OpenShift 4.13

Reproduction script

These environment variables are set in the RayCluster and RayService Kubernetes custom resources:

- name: RAY_GRAFANA_IFRAME_HOST
  value: "https://grafana.apps.uat/.<redacted_root_domain>"
- name: RAY_GRAFANA_HOST
  value: "http://grafana.monitoring.svc"
- name: RAY_PROMETHEUS_NAME
  value: "Thanos"
- name: RAY_PROMETHEUS_HOST
  value: "http://thanos-query.monitoring.svc.cluster.local:9090"

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@frenoid frenoid added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Feb 24, 2025
@jcotant1 jcotant1 added the dashboard Issues specific to the Ray Dashboard label Feb 24, 2025