[core] Fix "RayEventRecorder::StartExportingEvents() should be called only once." #57917
Conversation
Signed-off-by: Cuong Nguyen <[email protected]>
        << BuildAddress(address, port);
grpc_client_ =
    std::make_unique<GrpcClient<ReporterService>>(address, port, client_call_manager);
retry_timer_ = std::make_unique<boost::asio::steady_timer>(io_service);
note: we don't perform the retry asynchronously anymore; instead, we retry inside the callback of the connection health check
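A rough sketch of that retry-in-callback shape (not the actual Ray code: CheckServerReady is a hypothetical stand-in for WaitForServerReadyWithRetry, and a plain boost::asio timer approximates the delayed io_service_.post used in the PR):

#include <boost/asio.hpp>
#include <chrono>
#include <functional>
#include <iostream>
#include <memory>

// Hypothetical sketch: instead of arming a separate retry_timer_ up front,
// the health-check callback itself decides whether to re-schedule the check.
void CheckServerReady(boost::asio::io_context &io,
                      std::function<bool()> health_check,
                      std::function<void()> init_exporter_fn,
                      int retry_count,
                      int max_retry,
                      int retry_interval_ms) {
  if (health_check()) {
    // Success and retry now share a single code path, so the exporter can
    // only ever be initialized from here.
    init_exporter_fn();
    return;
  }
  if (retry_count >= max_retry) {
    std::cerr << "Giving up after " << retry_count << " retries" << std::endl;
    return;
  }
  // Retry from inside the callback: schedule the next attempt on the same
  // io_context after the retry interval.
  auto timer = std::make_shared<boost::asio::steady_timer>(
      io, std::chrono::milliseconds(retry_interval_ms));
  timer->async_wait([=, &io](const boost::system::error_code &) {
    CheckServerReady(io, health_check, init_exporter_fn, retry_count + 1,
                     max_retry, retry_interval_ms);
  });
}

int main() {
  boost::asio::io_context io;
  int attempts = 0;
  CheckServerReady(
      io,
      [&attempts] { return ++attempts >= 3; },  // Pretend the agent comes up on attempt 3.
      [] { std::cout << "exporter initialized" << std::endl; },
      /*retry_count=*/0,
      /*max_retry=*/5,
      /*retry_interval_ms=*/10);
  io.run();
  return 0;
}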
src/ray/rpc/metrics_agent_client.h
Outdated
    HealthCheck,
    grpc_client_,
    /*method_timeout_ms*/ -1,
    /*method_timeout_ms*/ kMetricAgentInitRetryDelayMs * 1000,
set a finite timeout so we don't hang indefinitely while retrying
Pass it in directly when you call the RPC; I think that's more direct
Code Review
This pull request addresses a race condition that could lead to a RAY_CHECK failure in RayEventRecorder::StartExportingEvents(). The fix involves two main changes: making StartExportingEvents aware of the connection status, and synchronizing the connection logic in MetricsAgentClientImpl. The changes are logical and move towards resolving the issue. However, I've identified a remaining race condition in MetricsAgentClientImpl on the exporter_initialized_ flag, which could still lead to the same RAY_CHECK failure under concurrent execution. My review includes a detailed comment and suggestions to make the implementation fully thread-safe.
if (status.ok()) {
  if (exporter_initialized_) {
    return;
  }
  init_exporter_fn(status);
  exporter_initialized_ = true;
  RAY_LOG(INFO) << "Exporter initialized.";
  return;
This block has a race condition on exporter_initialized_. Since the io_service can use multiple threads, two concurrent HealthCheck callbacks could both see exporter_initialized_ as false, leading to init_exporter_fn being called twice. This would re-introduce the RAY_CHECK failure this PR aims to fix.
To solve this, you should protect this critical section. One way is to use a mutex. You would add absl::Mutex exporter_mutex_; to MetricsAgentClientImpl (in the .h file, and include absl/synchronization/mutex.h) and then use it here:
if (status.ok()) {
  absl::MutexLock lock(&exporter_mutex_);
  if (exporter_initialized_) {
    return;
  }
  init_exporter_fn(status);
  exporter_initialized_ = true;
  RAY_LOG(INFO) << "Exporter initialized.";
  return;
}

An alternative is to make exporter_initialized_ an std::atomic<bool> (which would require including <atomic>).
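For reference, a minimal sketch of that atomic alternative (the names below are illustrative stand-ins, not the actual MetricsAgentClientImpl members): exchange() returns the previous value, so only the first callback that flips the flag runs the initialization, even if two callbacks ever raced.

#include <atomic>
#include <functional>
#include <iostream>

// Illustrative flag; in the real class this would be a member of
// MetricsAgentClientImpl.
std::atomic<bool> exporter_initialized_{false};

void MaybeInitExporter(const std::function<void()> &init_exporter_fn) {
  // exchange(true) atomically sets the flag and returns its old value, so
  // at most one caller observes "false" and performs the initialization.
  if (exporter_initialized_.exchange(true)) {
    return;
  }
  init_exporter_fn();
  std::cout << "Exporter initialized." << std::endl;
}

int main() {
  MaybeInitExporter([] { std::cout << "init called" << std::endl; });
  MaybeInitExporter([] { std::cout << "init called" << std::endl; });  // No-op.
  return 0;
}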
thanks, io_context is single threaded
void RayEventRecorder::StartExportingEvents(
    const Status &aggregator_agent_connection_status) {
  absl::MutexLock lock(&mutex_);
  if (!aggregator_agent_connection_status.ok()) {
    RAY_LOG(ERROR) << "Failed to establish connection to the event aggregator agent. "
                   << "Events will not be exported. Error: "
                   << aggregator_agent_connection_status.ToString();
    return;
  }
this is a surprising pattern -- why not just avoid calling StartExportingEvents if the health check fails?
I think at some point in the past you suggested (or I might have misunderstood) that we should propagate the error status to these sub-components (e.g., otelrecorder, eventrecorder, etc.) so they can handle errors themselves, which is the current pattern here.
But yes, it’s probably better not to call these sub-components at all.
ok, let's just NOT call export then
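A minimal sketch of what that call-site guard could look like (Status, Recorder, and OnServerReady below are simplified stand-ins, not the real Ray types); StartExportingEvents is simply never invoked when the health check reports failure:

#include <functional>
#include <iostream>
#include <string>

// Hypothetical stand-ins for the real Status and recorder types; the point is
// the shape of the call site, not the actual Ray classes.
struct Status {
  bool ok_;
  std::string msg_;
  bool ok() const { return ok_; }
  const std::string &ToString() const { return msg_; }
};

struct Recorder {
  void StartExportingEvents() { std::cout << "exporting started" << std::endl; }
};

// Only call StartExportingEvents on success; on failure, just log and return,
// so the recorder never has to reason about connection errors itself.
void OnServerReady(const Status &status, Recorder &recorder) {
  if (!status.ok()) {
    std::cerr << "Event aggregator agent unreachable, events will not be exported: "
              << status.ToString() << std::endl;
    return;
  }
  recorder.StartExportingEvents();
}

int main() {
  Recorder recorder;
  OnServerReady(Status{true, "OK"}, recorder);
  OnServerReady(Status{false, "connection refused"}, recorder);  // No export started.
  return 0;
}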
Force-pushed from c47bf08 to 8b65f96
// Init metrics and event exporter.
metrics_agent_client_->WaitForServerReady([this](const Status &server_status) {
  stats::InitOpenTelemetryExporter(config_.metrics_agent_port, server_status);
I'll refactor this call to use the same pattern in another follow-up (to keep this PR minimal)
Force-pushed from 8b65f96 to 88c170d
build failure @can-anyscale
Force-pushed from 88c170d to a7c6fc2
… called only once." Signed-off-by: Cuong Nguyen <[email protected]>
Force-pushed from a7c6fc2 to 2be8a2b
@edoakes - my bad, fixed now, tested that it builds locally
          init_exporter_fn, retry_count + 1, max_retry, retry_interval_ms);
    },
    "MetricsAgentClient.WaitForServerReadyWithRetry",
    retry_interval_ms * 1000);
Bug: Incorrect Retry Timing in Metrics Exporter
The io_service_.post() call uses retry_interval_ms * 1000 for its delay parameter. Since retry_interval_ms is in milliseconds, this multiplication likely causes incorrect retry timing, making retries either too fast or too slow depending on the expected unit. This impacts the metrics exporter's initialization.
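It isn't visible from this hunk whether the delay parameter expects milliseconds or microseconds (the * 1000 could equally be a deliberate ms-to-us conversion). A small sketch of one way to make the unit explicit, using a hypothetical helper rather than any real Ray API:

#include <chrono>
#include <cstdint>
#include <iostream>

// Hypothetical helper: if the post() delay really is microseconds, naming the
// conversion documents the intent instead of leaving a bare "* 1000".
constexpr int64_t MillisecondsToMicroseconds(int64_t ms) {
  return std::chrono::duration_cast<std::chrono::microseconds>(
             std::chrono::milliseconds(ms))
      .count();
}

int main() {
  const int64_t retry_interval_ms = 1000;
  std::cout << retry_interval_ms << " ms = "
            << MillisecondsToMicroseconds(retry_interval_ms) << " us" << std::endl;
  return 0;
}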
… only once." (#57917)

This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.

The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger).
- Do not call StartExportingEvents if the connection cannot be established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Kamil Kaczmarek <[email protected]>
This PR fixes the Ray check failure RayEventRecorder::StartExportingEvents() should be called only once.

The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger).
- Do not call StartExportingEvents if the connection cannot be established.

Test:
- CI