[core] Fix "RayEventRecorder::StartExportingEvents() should be called only once." #57917

can-anyscale · 2025-10-20T17:20:00Z

This PR fixes the Ray check failure "RayEventRecorder::StartExportingEvents() should be called only once."
The failure can occur in the following scenario:

- The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events.

This PR introduces two fixes:

- In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger).
- Do not call StartExportingEvents if the connection cannot be established.
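The first fix can be illustrated with a minimal sketch. It assumes a hypothetical single-threaded `EventLoop`, a hypothetical `HealthCheck` callback, and a hypothetical `CountReadyCalls` helper; none of these are Ray's actual `MetricsAgentClientImpl` code. The point is only that when each retry is scheduled from inside the health check callback, one attempt ends in exactly one of success, another retry, or giving up, so the ready callback cannot fire alongside the retry-exhausted path.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical stand-in for a single-threaded event loop (not Ray's io_context).
class EventLoop {
 public:
  void Post(std::function<void()> task) { tasks_.push(std::move(task)); }

  // Runs queued tasks one at a time until the queue is empty.
  void Run() {
    while (!tasks_.empty()) {
      auto task = std::move(tasks_.front());
      tasks_.pop();
      task();
    }
  }

 private:
  std::queue<std::function<void()>> tasks_;
};

// Hypothetical health check: returns true if the given attempt succeeds.
using HealthCheck = std::function<bool(int attempt)>;

// Each retry is scheduled only from inside the previous attempt's callback.
void WaitForServerReadyWithRetry(EventLoop &loop,
                                 HealthCheck check,
                                 std::function<void()> on_ready,
                                 int attempt,
                                 int max_retry) {
  assert(attempt >= 0 && max_retry >= 0);
  loop.Post([&loop, check, on_ready, attempt, max_retry]() {
    if (check(attempt)) {
      on_ready();  // Success: no further retry is scheduled.
      return;
    }
    if (attempt >= max_retry) {
      return;  // Retries exhausted: give up without calling on_ready.
    }
    WaitForServerReadyWithRetry(loop, check, on_ready, attempt + 1, max_retry);
  });
}

// Counts on_ready calls when the server turns healthy at attempt `ready_at`.
int CountReadyCalls(int ready_at, int max_retry) {
  EventLoop loop;
  int calls = 0;
  WaitForServerReadyWithRetry(
      loop,
      [ready_at](int attempt) { return attempt >= ready_at; },
      [&calls]() { ++calls; },
      /*attempt=*/0,
      max_retry);
  loop.Run();
  return calls;
}
```

In this sketch, `CountReadyCalls(2, 5)` reaches the ready callback once, while `CountReadyCalls(10, 5)` never reaches it because the retries run out first.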

Test:

CI

Signed-off-by: Cuong Nguyen <[email protected]>

can-anyscale · 2025-10-20T17:21:03Z

src/ray/rpc/metrics_agent_client.h

                   << BuildAddress(address, port);
    grpc_client_ =
        std::make_unique<GrpcClient<ReporterService>>(address, port, client_call_manager);
-    retry_timer_ = std::make_unique<boost::asio::steady_timer>(io_service);


note: we no longer retry asynchronously; instead, we retry in the callback of the connection health check

can-anyscale · 2025-10-20T17:21:28Z

src/ray/rpc/metrics_agent_client.h

                         HealthCheck,
                         grpc_client_,
-                         /*method_timeout_ms*/ -1,
+                         /*method_timeout_ms*/ kMetricAgentInitRetryDelayMs * 1000,


Set a finite timeout so we don't hang indefinitely while retrying.

Pass it in directly when you call the RPC; I think that's more direct.

gemini-code-assist

Code Review

This pull request addresses a race condition that could lead to a RAY_CHECK failure in RayEventRecorder::StartExportingEvents(). The fix involves two main changes: making StartExportingEvents aware of the connection status, and synchronizing the connection logic in MetricsAgentClientImpl. The changes are logical and move towards resolving the issue. However, I've identified a remaining race condition in MetricsAgentClientImpl on the exporter_initialized_ flag, which could still lead to the same RAY_CHECK failure under concurrent execution. My review includes a detailed comment and suggestions to make the implementation fully thread-safe.

gemini-code-assist · 2025-10-20T17:25:00Z

src/ray/rpc/metrics_agent_client.cc

+        if (status.ok()) {
+          if (exporter_initialized_) {
+            return;
+          }
+          init_exporter_fn(status);
+          exporter_initialized_ = true;
+          RAY_LOG(INFO) << "Exporter initialized.";
+          return;


This block has a race condition on exporter_initialized_. Since the io_service can use multiple threads, two concurrent HealthCheck callbacks could both see exporter_initialized_ as false, leading to init_exporter_fn being called twice. This would re-introduce the RAY_CHECK failure this PR aims to fix.

To solve this, you should protect this critical section. One way is to use a mutex. You would add absl::Mutex exporter_mutex_; to MetricsAgentClientImpl (in the .h file, and include absl/synchronization/mutex.h) and then use it here:

    if (status.ok()) {
      absl::MutexLock lock(&exporter_mutex_);
      if (exporter_initialized_) {
        return;
      }
      init_exporter_fn(status);
      exporter_initialized_ = true;
      RAY_LOG(INFO) << "Exporter initialized.";
      return;
    }

An alternative is to make exporter_initialized_ an std::atomic<bool> (which would require including <atomic>).

Thanks, but io_context is single-threaded.
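For reference, the atomic alternative mentioned in the review could look roughly like the sketch below. `ExporterInitGuard` and `CountInitRuns` are hypothetical names for illustration, not part of Ray. With a single-threaded io_context the plain bool in the PR is sufficient, so this would only matter if the callbacks ever ran on several threads.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical helper (not part of Ray): lets only the first caller run init.
class ExporterInitGuard {
 public:
  // Runs init_fn for the first caller only; returns whether it ran.
  bool RunOnce(const std::function<void()> &init_fn) {
    // exchange() atomically sets the flag and returns the previous value,
    // so exactly one caller observes `false` here.
    if (initialized_.exchange(true)) {
      return false;
    }
    init_fn();
    return true;
  }

 private:
  std::atomic<bool> initialized_{false};
};

// Counts how many times the init function runs across `callers` calls.
int CountInitRuns(int callers) {
  assert(callers >= 0);
  ExporterInitGuard guard;
  int runs = 0;
  for (int i = 0; i < callers; ++i) {
    guard.RunOnce([&runs]() { ++runs; });
  }
  return runs;
}
```

One design difference from the snippet under review: this guard sets the flag before running the init function, so a second caller may return while the first is still initializing. The mutex variant holds the lock for the whole initialization instead.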

edoakes · 2025-10-20T17:45:05Z

src/ray/observability/ray_event_recorder.cc

+void RayEventRecorder::StartExportingEvents(
+    const Status &aggregator_agent_connection_status) {
  absl::MutexLock lock(&mutex_);
+  if (!aggregator_agent_connection_status.ok()) {
+    RAY_LOG(ERROR) << "Failed to establish connection to the event aggregator agent. "
+                   << "Events will not be exported. Error: "
+                   << aggregator_agent_connection_status.ToString();
+    return;
+  }


this is a surprising pattern -- why not just avoid calling StartExportingEvents if the health check fails?

I think at some point in the past you suggested (or I might have misunderstood) that we should propagate the error status to these sub-components (e.g., otelrecorder, eventrecorder, etc.) so they can handle errors themselves, which is the current pattern here.

But yes, it’s probably better not to call these sub-components at all.

ok, let's just NOT call export then
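The agreed direction, deciding at the call site rather than inside the recorder, can be sketched as follows. `Status`, `FakeEventRecorder`, `OnServerReady`, and `StartedAfter` are hypothetical stand-ins written for this example; they are not Ray's `ray::Status` or `RayEventRecorder`.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal status type standing in for ray::Status.
struct Status {
  bool is_ok;
  std::string message;
  bool ok() const { return is_ok; }
};

// Hypothetical recorder that, like the real one, must be started only once.
class FakeEventRecorder {
 public:
  void StartExportingEvents() {
    assert(!started_ && "StartExportingEvents() should be called only once.");
    started_ = true;
  }
  bool started() const { return started_; }

 private:
  bool started_ = false;
};

// The caller checks the connection status; the recorder never sees a failure.
bool OnServerReady(const Status &status, FakeEventRecorder &recorder) {
  if (!status.ok()) {
    return false;  // Connection failed: leave the recorder untouched.
  }
  recorder.StartExportingEvents();
  return true;
}

// Reports whether a fresh recorder ends up started for the given status.
bool StartedAfter(const Status &status) {
  FakeEventRecorder recorder;
  OnServerReady(status, recorder);
  return recorder.started();
}
```

This keeps `StartExportingEvents` free of a status parameter, so its "called only once" invariant depends only on how often the caller invokes it.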

can-anyscale · 2025-10-20T18:14:06Z

src/ray/gcs/gcs_server.cc


  // Init metrics and event exporter.
  metrics_agent_client_->WaitForServerReady([this](const Status &server_status) {
    stats::InitOpenTelemetryExporter(config_.metrics_agent_port, server_status);


I'll refactor this call to use the same pattern in a follow-up (to keep this PR minimal).

can-anyscale · 2025-10-20T18:16:34Z

@edoakes, @jjyao for re-review, thanks

edoakes · 2025-10-20T19:32:21Z

build failure @can-anyscale

can-anyscale · 2025-10-20T19:47:25Z

@edoakes - my bad, fixed now, tested that it builds locally

cursor · 2025-10-20T19:49:42Z

src/ray/rpc/metrics_agent_client.cc

+                  init_exporter_fn, retry_count + 1, max_retry, retry_interval_ms);
+            },
+            "MetricsAgentClient.WaitForServerReadyWithRetry",
+            retry_interval_ms * 1000);


Bug: Incorrect Retry Timing in Metrics Exporter

The io_service_.post() call uses retry_interval_ms * 1000 for its delay parameter. Since retry_interval_ms is in milliseconds, this multiplication likely causes incorrect retry timing, making retries either too fast or too slow depending on the expected unit. This impacts the metrics exporter's initialization.
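Whether `retry_interval_ms * 1000` is correct depends on the unit the delay parameter expects, which this thread does not state. If it takes microseconds, the multiplication is a valid conversion from milliseconds. One way to make the unit explicit is to carry delays as `std::chrono` durations and convert at the boundary. The sketch below uses a hypothetical `ToMicrosecondCount` helper and a hypothetical `kRetryInterval` constant, neither of which is part of Ray.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of Ray): converts a typed millisecond delay
// into the raw microsecond count an integer-based timer API might expect.
inline int64_t ToMicrosecondCount(std::chrono::milliseconds delay) {
  assert(delay.count() >= 0 && "retry delay must be non-negative");
  return std::chrono::duration_cast<std::chrono::microseconds>(delay).count();
}

// Hypothetical constant whose type, not its name, carries the unit.
constexpr std::chrono::milliseconds kRetryInterval{500};
```

With typed durations, the conversion factor lives in one place and a mismatched unit becomes a compile-time type error rather than a silent timing bug.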

[core] fix ray_event_recorder initialization

fca491d

Signed-off-by: Cuong Nguyen <[email protected]>

can-anyscale requested a review from a team as a code owner October 20, 2025 17:20

can-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Oct 20, 2025

gemini-code-assist bot reviewed Oct 20, 2025

edoakes reviewed Oct 20, 2025

can-anyscale force-pushed the can-1ev07 branch 3 times, most recently from c47bf08 to 8b65f96 Compare October 20, 2025 18:15

can-anyscale requested review from edoakes and jjyao October 20, 2025 18:16

jjyao approved these changes Oct 20, 2025

can-anyscale force-pushed the can-1ev07 branch from 8b65f96 to 88c170d Compare October 20, 2025 18:31

ray-gardener bot added the "core" label (Issues that should be addressed in Ray Core) Oct 20, 2025

can-anyscale force-pushed the can-1ev07 branch from 88c170d to a7c6fc2 Compare October 20, 2025 19:38

[core][event] fix "RayEventRecorder::StartExportingEvents() should be…

2be8a2b

… called only once." Signed-off-by: Cuong Nguyen <[email protected]>

can-anyscale force-pushed the can-1ev07 branch from a7c6fc2 to 2be8a2b Compare October 20, 2025 19:46

cursor bot reviewed Oct 20, 2025

edoakes approved these changes Oct 20, 2025

can-anyscale enabled auto-merge (squash) October 20, 2025 20:13

can-anyscale merged commit 299eb1b into master Oct 20, 2025
7 checks passed

can-anyscale deleted the can-1ev07 branch October 20, 2025 21:27
