Skip to content

Conversation

@can-anyscale
Copy link
Contributor

@can-anyscale can-anyscale commented Oct 20, 2025

This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:

  • The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events.
  • At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events.

This PR introduces two fixes:

  • In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger).
  • Do not call StartExportingEvents if the connection cannot be established.

Test:

  • CI

@can-anyscale can-anyscale requested a review from a team as a code owner October 20, 2025 17:20
<< BuildAddress(address, port);
grpc_client_ =
std::make_unique<GrpcClient<ReporterService>>(address, port, client_call_manager);
retry_timer_ = std::make_unique<boost::asio::steady_timer>(io_service);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

node: we don't perform retry async anymore; instead, we'll retry during the callback of the connection healthcheck

HealthCheck,
grpc_client_,
/*method_timeout_ms*/ -1,
/*method_timeout_ms*/ kMetricAgentInitRetryDelayMs * 1000,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set a finite amount of timed out so we don't hang indefinitely in retrying

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass in directly when you call the PRC, I think it's more direct

@can-anyscale can-anyscale added the go add ONLY when ready to merge, run all tests label Oct 20, 2025
cursor[bot]

This comment was marked as outdated.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition that could lead to a RAY_CHECK failure in RayEventRecorder::StartExportingEvents(). The fix involves two main changes: making StartExportingEvents aware of the connection status, and synchronizing the connection logic in MetricsAgentClientImpl. The changes are logical and move towards resolving the issue. However, I've identified a remaining race condition in MetricsAgentClientImpl on the exporter_initialized_ flag, which could still lead to the same RAY_CHECK failure under concurrent execution. My review includes a detailed comment and suggestions to make the implementation fully thread-safe.

Comment on lines +48 to +55
if (status.ok()) {
if (exporter_initialized_) {
return;
}
init_exporter_fn(status);
exporter_initialized_ = true;
RAY_LOG(INFO) << "Exporter initialized.";
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

This block has a race condition on exporter_initialized_. Since the io_service can use multiple threads, two concurrent HealthCheck callbacks could both see exporter_initialized_ as false, leading to init_exporter_fn being called twice. This would re-introduce the RAY_CHECK failure this PR aims to fix.

To solve this, you should protect this critical section. One way is to use a mutex. You would add absl::Mutex exporter_mutex_; to MetricsAgentClientImpl (in the .h file, and include absl/synchronization/mutex.h) and then use it here:

if (status.ok()) {
  absl::MutexLock lock(&exporter_mutex_);
  if (exporter_initialized_) {
    return;
  }
  init_exporter_fn(status);
  exporter_initialized_ = true;
  RAY_LOG(INFO) << "Exporter initialized.";
  return;
}

An alternative is to make exporter_initialized_ an std::atomic<bool> (which would require including <atomic>).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, io_context is single threaded

Comment on lines 37 to 45
void RayEventRecorder::StartExportingEvents(
const Status &aggregator_agent_connection_status) {
absl::MutexLock lock(&mutex_);
if (!aggregator_agent_connection_status.ok()) {
RAY_LOG(ERROR) << "Failed to establish connection to the event aggregator agent. "
<< "Events will not be exported. Error: "
<< aggregator_agent_connection_status.ToString();
return;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a surprising pattern -- why not just avoid calling StartExportingEvents if the health check fails?

Copy link
Contributor Author

@can-anyscale can-anyscale Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at some point in the past you suggested (or I might have misunderstood) that we should propagate the error status to these sub-components (e.g., otelrecorder, eventrecorder, etc.) so they can handle errors themselve, which is the current pattern here.

But yes, it’s probably better not to call these sub-components at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's just NOT call export then

@can-anyscale can-anyscale force-pushed the can-1ev07 branch 3 times, most recently from c47bf08 to 8b65f96 Compare October 20, 2025 18:15

// Init metrics and event exporter.
metrics_agent_client_->WaitForServerReady([this](const Status &server_status) {
stats::InitOpenTelemetryExporter(config_.metrics_agent_port, server_status);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll refactor this call to use the same pattern in another follow up (to make this PR minimal)

@can-anyscale can-anyscale requested review from edoakes and jjyao October 20, 2025 18:16
@can-anyscale
Copy link
Contributor Author

@edoakes , @jjyao for re-review, thankks

cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Oct 20, 2025
@edoakes
Copy link
Collaborator

edoakes commented Oct 20, 2025

build failure @can-anyscale

@can-anyscale
Copy link
Contributor Author

@edoakes - my bad, fixed now, tested that it builds locally

init_exporter_fn, retry_count + 1, max_retry, retry_interval_ms);
},
"MetricsAgentClient.WaitForServerReadyWithRetry",
retry_interval_ms * 1000);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Incorrect Retry Timing in Metrics Exporter

The io_service_.post() call uses retry_interval_ms * 1000 for its delay parameter. Since retry_interval_ms is in milliseconds, this multiplication likely causes incorrect retry timing, making retries either too fast or too slow depending on the expected unit. This impacts the metrics exporter's initialization.

Fix in Cursor Fix in Web

@can-anyscale can-anyscale enabled auto-merge (squash) October 20, 2025 20:13
@can-anyscale can-anyscale merged commit 299eb1b into master Oct 20, 2025
7 checks passed
@can-anyscale can-anyscale deleted the can-1ev07 branch October 20, 2025 21:27
kamil-kaczmarek pushed a commit that referenced this pull request Oct 20, 2025
… only once." (#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Kamil Kaczmarek <[email protected]>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
… only once." (ray-project#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: xgui <[email protected]>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
… only once." (#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
… only once." (ray-project#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
… only once." (ray-project#57917)

This PR fixes the Ray check failure
RayEventRecorder::StartExportingEvents() should be called only once..
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Aydin Abiar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants