Conversation

@can-anyscale
Contributor

This PR replaces STATS with Metric as the way to define metrics inside Ray (as part of a unification effort) in all common components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, because the common component is used as an API across components, doing so here would be unnecessarily cumbersome, so I decided to define the metrics inline within each client and server class instead.

Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities.
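
To illustrate what "wrapper around a static entity" can mean, here is a minimal sketch. It does not use Ray's, OpenCensus's, or OpenTelemetry's real classes: `StaticMeasure`, `Registry`, and this `Gauge` are hypothetical names invented for the example. The assumption is only that the underlying metric state lives in a process-wide object and the wrapper stores a pointer to it.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the static OpenCensus/OpenTelemetry entity:
// one process-wide record per metric name.
struct StaticMeasure {
  double value = 0.0;
};

// Process-wide registry; an entry is created the first time a name is seen.
inline std::map<std::string, StaticMeasure> &Registry() {
  static std::map<std::string, StaticMeasure> registry;
  return registry;
}

// Thin wrapper: construction looks up (or lazily creates) the static entry
// and keeps a pointer to it. The wrapper owns no metric state itself.
class Gauge {
 public:
  explicit Gauge(const std::string &name) : measure_(&Registry()[name]) {}
  void Record(double v) { measure_->value = v; }
  double Get() const { return measure_->value; }

 private:
  StaticMeasure *measure_;  // points into the static registry
};
```

In this sketch, two `Gauge` objects built with the same name share one registry entry, so constructing a second wrapper does not register a second metric. Each construction still pays for a map lookup.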

Details
Full context of this refactoring work.

  • Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component.
  • In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface.
  • This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding.
  • There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point.
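
The pattern described in the list above (define once at the entry point, pass down through an interface) might look roughly like the following sketch. The `MetricInterface` here is a hypothetical, simplified stand-in for `ray::observability::MetricInterface`, whose real signature may differ; `InMemoryGauge` and `Subcomponent` are invented for the example.

```cpp
// Hypothetical, simplified interface; the real
// ray::observability::MetricInterface may look different.
class MetricInterface {
 public:
  virtual ~MetricInterface() = default;
  virtual void Record(double value) = 0;
};

// Concrete metric. In this sketch only the entry point needs to see this type,
// so changes to it do not force subcomponents to rebuild.
class InMemoryGauge : public MetricInterface {
 public:
  void Record(double value) override { last_ = value; }
  double last() const { return last_; }

 private:
  double last_ = 0.0;
};

// A subcomponent depends only on the interface, never on the concrete metric.
class Subcomponent {
 public:
  explicit Subcomponent(MetricInterface &queue_size) : queue_size_(queue_size) {}
  void OnEnqueue(int n) { queue_size_.Record(static_cast<double>(n)); }

 private:
  MetricInterface &queue_size_;  // non-owning; owned by the entry point
};
```

An entry point would construct the `InMemoryGauge` once and hand a reference to each `Subcomponent` that needs it.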

Test:

  • CI

@can-anyscale can-anyscale requested a review from a team as a code owner October 30, 2025 01:01
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request continues the effort to unify metric definitions by replacing the legacy STATS macro system with the new Metric class-based system in Ray's common components. The changes are logical and follow the intended direction of refactoring.

I have two main points of feedback:

  1. A critical issue regarding duplicate metric definitions for operation_* metrics, which could cause runtime problems. The old definitions need to be removed.
  2. A medium-severity performance concern about creating metric objects on a hot path within StatsHandle. It would be more efficient to create them once in EventTracker.

Overall, this is a good step forward in the metrics refactoring. Addressing these points will improve the correctness and performance of the implementation.

Comment on lines 81 to 87
ray::stats::Count operation_count_metric_{ray::GetOperationCountCounterMetric()};
ray::stats::Gauge operation_active_gauge_metric_{
    ray::GetOperationActiveCountGaugeMetric()};
ray::stats::Histogram operation_run_time_ms_histogram_metric_{
    ray::GetOperationRunTimeMsHistogramMetric()};
ray::stats::Histogram operation_queue_time_ms_histogram_metric_{
    ray::GetOperationQueueTimeMsHistogramMetric()};
Contributor

medium

These metric objects are created as members of StatsHandle. A new StatsHandle is created for every event via EventTracker::RecordStart, which can be very frequent. This means these four metric objects are constructed and likely go through registration logic on every event, which could be a performance concern.

The PR description mentions that metrics are defined inline within client and server classes. Following that pattern, it would be more efficient to define these metrics once as members of EventTracker. Then, StatsHandle could hold pointers or references to them, passed from EventTracker during its construction. This would avoid repeated object creation and registration overhead on a hot path.
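
A rough sketch of this suggestion follows, under stated assumptions: `Counter` is a hypothetical stand-in for a metric object, and the `EventTracker` and `StatsHandle` shown here are simplified illustrations, not Ray's actual classes. The point is only that the tracker owns the metrics once and each per-event handle carries a non-owning reference.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical stand-in for a metric object.
struct Counter {
  int64_t value = 0;
  void Add(int64_t n) { value += n; }
};

// Per-event handle: holds a reference instead of its own metric objects.
struct StatsHandle {
  std::string event_name;
  Counter &active_count;  // non-owning; must not outlive the EventTracker
};

class EventTracker {
 public:
  // Called once per event (the hot path). No metric object is constructed here.
  StatsHandle RecordStart(std::string event_name) {
    operation_count_.Add(1);
    active_count_.Add(1);
    return StatsHandle{std::move(event_name), active_count_};
  }

  // Can stay static in this variant because the handle carries the reference.
  static void RecordEnd(StatsHandle &handle) { handle.active_count.Add(-1); }

  int64_t operation_count() const { return operation_count_.value; }
  int64_t active_count() const { return active_count_.value; }

 private:
  Counter operation_count_;  // constructed once per tracker
  Counter active_count_;
};
```

Every handle returned by one tracker refers to the same `Counter`, so the per-event cost is limited to building the handle itself.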

Contributor Author

These metric objects are just wrappers around static objects; no re-registration happens.

Collaborator

What are the static objects? I agree with Gemini here that these metric objects shouldn't be members of StatsHandle. They should be part of EventTracker.

Contributor Author

Oh, I didn't read Gemini's comments fully; making them part of EventTracker makes sense. Let me see how to do that.

cursor[bot]

This comment was marked as outdated.

@ray-gardener ray-gardener bot added the core and observability labels Oct 30, 2025
@can-anyscale
Contributor Author

Move the operation metrics to be members of EventTracker instead of StatsHandle. Note that this requires RecordExecution and RecordEnd to be non-static, since they need to access instance members. Making the metric members static is unsafe, as they wrap OpenCensus/OpenTelemetry static objects.
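
Why the methods have to become non-static can be seen in a small sketch. This is not the PR's code: the class below is a simplified, hypothetical `EventTracker` in which a `std::vector<double>` stands in for the run-time histogram metric.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

class EventTracker {
 public:
  // Instance method: it writes to run_time_ms_, a per-instance member, so it
  // cannot be declared static without also making that member static.
  void RecordExecution(const std::function<void()> &fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    run_time_ms_.push_back(
        std::chrono::duration<double, std::milli>(end - start).count());
  }

  std::size_t num_samples() const { return run_time_ms_.size(); }

 private:
  // Hypothetical stand-in for the run-time histogram metric member.
  std::vector<double> run_time_ms_;
};
```

Call sites therefore change from `EventTracker::RecordExecution(...)` to a call on a tracker instance.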

CC: @jjyao

@can-anyscale can-anyscale requested a review from jjyao October 30, 2025 18:09
cursor[bot]

This comment was marked as outdated.

@can-anyscale can-anyscale added the go label Nov 14, 2025
@can-anyscale
Contributor Author

Ignore the buildkite/microcheck failure; buildkite/premerge is a superset of it, and it passed.

  stats_handle = std::move(stats_handle)](const boost::system::error_code &error,
  size_t bytes_transferred) {
- EventTracker::RecordExecution(
+ event_stats->RecordExecution(
Collaborator

We don't need to capture event_stats. We already capture this, so inside the handler we can call static_cast<instrumented_io_context &>(socket_.get_executor().context()) again to get the event_stats. Let's try to minimize what we capture where possible.
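
The two capture styles can be compared in a small sketch that avoids Boost entirely. `IoContext` is a hypothetical stand-in for `instrumented_io_context`, and `Connection` and `EventTracker` are simplified illustrations, not the real classes.

```cpp
#include <functional>

// Hypothetical, simplified tracker.
struct EventTracker {
  int executions = 0;
  void RecordExecution() { ++executions; }
};

// Hypothetical stand-in for instrumented_io_context: it owns the tracker.
struct IoContext {
  EventTracker tracker;
  EventTracker &stats() { return tracker; }
};

class Connection {
 public:
  explicit Connection(IoContext &ctx) : ctx_(ctx) {}

  // Capturing both `this` and a tracker pointer obtained up front.
  std::function<void()> MakeHandlerCapturingTracker() {
    EventTracker *event_stats = &ctx_.stats();
    return [this, event_stats] {
      event_stats->RecordExecution();
      ++handled_;
    };
  }

  // Capturing only `this` and re-deriving the tracker inside the handler.
  std::function<void()> MakeHandler() {
    return [this] {
      ctx_.stats().RecordExecution();
      ++handled_;
    };
  }

  int handled() const { return handled_; }

 private:
  IoContext &ctx_;
  int handled_ = 0;
};
```

Both handlers record against the same tracker; the second simply carries one less captured value.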

[this, this_ptr, event_stats, stats_handle = std::move(stats_handle)](
const boost::system::error_code &ec, size_t bytes_transferred) mutable {
EventTracker::RecordExecution(
event_stats->RecordExecution(
Collaborator

same here

@can-anyscale
Contributor Author

Addressed @jjyao's comments.

@can-anyscale can-anyscale enabled auto-merge (squash) November 25, 2025 00:56
@can-anyscale can-anyscale merged commit eb28037 into master Nov 25, 2025
7 checks passed
@can-anyscale can-anyscale deleted the can-statdie04 branch November 25, 2025 01:12
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
…58299)

Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: YK <[email protected]>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
…58299)

Signed-off-by: Cuong Nguyen <[email protected]>