-
Notifications
You must be signed in to change notification settings - Fork 7k
[core][stats-die/04] kill STATS in the common component #58299
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
7b7adef to
dce9219
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request continues the effort to unify metric definitions by replacing the legacy STATS macro system with the new Metric class-based system in Ray's common components. The changes are logical and follow the intended direction of refactoring.
I have two main points of feedback:
- A critical issue regarding duplicate metric definitions for
operation_*metrics, which could cause runtime problems. The old definitions need to be removed. - A medium-severity performance concern about creating metric objects on a hot path within
StatsHandle. It would be more efficient to create them once inEventTracker.
Overall, this is a good step forward in the metrics refactoring. Addressing these points will improve the correctness and performance of the implementation.
src/ray/common/event_stats.h
Outdated
| ray::stats::Count operation_count_metric_{ray::GetOperationCountCounterMetric()}; | ||
| ray::stats::Gauge operation_active_gauge_metric_{ | ||
| ray::GetOperationActiveCountGaugeMetric()}; | ||
| ray::stats::Histogram operation_run_time_ms_histogram_metric_{ | ||
| ray::GetOperationRunTimeMsHistogramMetric()}; | ||
| ray::stats::Histogram operation_queue_time_ms_histogram_metric_{ | ||
| ray::GetOperationQueueTimeMsHistogramMetric()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metric objects are created as members of StatsHandle. A new StatsHandle is created for every event via EventTracker::RecordStart, which can be very frequent. This means these four metric objects are constructed and likely go through registration logic on every event, which could be a performance concern.
The PR description mentions that metrics are defined inline within client and server classes. Following that pattern, it would be more efficient to define these metrics once as members of EventTracker. Then, StatsHandle could hold pointers or references to them, passed from EventTracker during its construction. This would avoid repeated object creation and registration overhead on a hot path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these metric objects are just wrapper of static object, no re-registration happens
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are the static objects? I agree with Gemini here that these metric objects shouldn't be members of StatsHandle. They should be part of EventTracker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i didn't read gemini's comments fully; part of EventTracker makes sense, let see how to do that
dce9219 to
b5c88e0
Compare
b5c88e0 to
fb3013d
Compare
|
Move the operation metrics to be members of EventTracker instead of StatsHandle. Note that this requires RecordExecution and RecordEnd to be non-static, since they need to access instance members. Making the metric members static is unsafe, as they wrap OpenCensus/OpenTelemetry static objects. CC: @jjyao |
fb3013d to
393118e
Compare
393118e to
eb809b1
Compare
|
ignore buildkite/microcheck failure, buildkite/premerge is a superset and it passed |
| stats_handle = std::move(stats_handle)](const boost::system::error_code &error, | ||
| size_t bytes_transferred) { | ||
| EventTracker::RecordExecution( | ||
| event_stats->RecordExecution( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to capture event_stats. We have this so we can get static_cast<instrumented_io_context &>(socket_.get_executor().context()); again to get the event_stats. Trying to minimizing the things we capture if possible.
| [this, this_ptr, event_stats, stats_handle = std::move(stats_handle)]( | ||
| const boost::system::error_code &ec, size_t bytes_transferred) mutable { | ||
| EventTracker::RecordExecution( | ||
| event_stats->RecordExecution( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
eb809b1 to
47dbeda
Compare
|
@jjyao's comments |
Signed-off-by: Cuong Nguyen <[email protected]>
47dbeda to
43c2382
Compare
…58299) This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all common components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the common component is used as an API across, doing so would feel unnecessarily cumbersome. I decided to define the metrics inline within each client and server class instead. Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: YK <[email protected]>
…58299) This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all common components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the common component is used as an API across, doing so would feel unnecessarily cumbersome. I decided to define the metrics inline within each client and server class instead. Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all common components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the common component is used as an API across, doing so would feel unnecessarily cumbersome. I decided to define the metrics inline within each client and server class instead.
Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities.
Details
Full context of this refactoring work.
Test: