@can-anyscale
Contributor

@can-anyscale can-anyscale commented Oct 20, 2025

This PR replaces STATS with Metric as the way to define metrics inside Ray (as part of a unification effort) in all RPC components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, because the codebase contains many gRPC clients and servers, doing so here would be unnecessarily cumbersome, so I decided to define the metrics inline within each client and server class instead.

Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities.

Details
Full context of this refactoring work:

  • Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component.
  • In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface.
  • This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding.
  • There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point.
  • Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR.
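
The "define once at the entry point, pass down through an interface" pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not Ray's code: `MetricInterface` here is a hypothetical stand-in for `ray::observability::MetricInterface` (the real method names and signatures may differ), and `InMemoryCount`, `TaskManager`, and `RunComponentSketch` are invented for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

namespace sketch {

// Hypothetical stand-in for ray::observability::MetricInterface.
// The real interface may expose different methods.
class MetricInterface {
 public:
  virtual ~MetricInterface() = default;
  virtual void Record(double value,
                      const std::map<std::string, std::string> &tags) = 0;
};

// Hypothetical concrete metric. In the pattern above, a type like this would
// be named only in the component's metrics.h and constructed by the entry point.
class InMemoryCount : public MetricInterface {
 public:
  void Record(double value,
              const std::map<std::string, std::string> & /*tags*/) override {
    total_ += value;
  }
  double total() const { return total_; }

 private:
  double total_ = 0;
};

// A subcomponent depends only on the interface, so a change to the metric
// implementation does not force this class to be rebuilt.
class TaskManager {
 public:
  explicit TaskManager(MetricInterface &finished_tasks)
      : finished_tasks_(finished_tasks) {}

  void OnTaskFinished(const std::string &name) {
    finished_tasks_.Record(1, {{"Name", name}});
  }

 private:
  MetricInterface &finished_tasks_;
};

// Plays the role of a component entry point: it owns the metric and injects it.
inline double RunComponentSketch() {
  InMemoryCount finished_tasks;
  TaskManager manager(finished_tasks);
  manager.OnTaskFinished("a");
  manager.OnTaskFinished("b");
  return finished_tasks.total();
}

}  // namespace sketch
```

Because `TaskManager` sees only the abstract interface, a test can inject a fake metric the same way the entry point injects the real one.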

Test:

  • CI

@can-anyscale can-anyscale requested a review from a team as a code owner October 20, 2025 19:58
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR refactors metric definitions in the RPC component to use the unified Metric class, replacing the old STATS macros. The changes are generally correct and follow the intended direction of unifying metric handling. However, there is a significant issue with how the new metric objects are instantiated. They are created for every RPC request, which is inefficient and can lead to race conditions during metric registration. My review includes suggestions to fix this by ensuring metric objects are created as singletons.

/// the server and/or tweak certain RPC behaviors.
grpc::ClientContext context_;

ray::stats::Count grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()};
Contributor

high

To accompany the change in GetGrpcClientReqFailedMetric to return a reference, this member should be a reference as well. This avoids creating a new metric object for every ClientCallImpl instance, improving performance and ensuring metrics are handled as singletons.

Suggested change
- ray::stats::Count grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()};
+ ray::stats::Count &grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()};

Comment on lines 22 to 77
inline ray::stats::Histogram GetGrpcServerReqProcessTimeMsMetric() {
  return ray::stats::Histogram(
      /*name=*/"grpc_server_req_process_time_ms",
      /*description=*/"Request latency in grpc server",
      /*unit=*/"",
      /*boundaries=*/{0.1, 1, 10, 100, 1000, 10000},
      /*tag_keys=*/{"Method"});
}

inline ray::stats::Count GetGrpcServerReqNewMetric() {
  return ray::stats::Count(
      /*name=*/"grpc_server_req_new",
      /*description=*/"New request number in grpc server",
      /*unit=*/"",
      /*tag_keys=*/{"Method"});
}

inline ray::stats::Count GetGrpcServerReqHandlingMetric() {
  return ray::stats::Count(
      /*name=*/"grpc_server_req_handling",
      /*description=*/"Request number are handling in grpc server",
      /*unit=*/"",
      /*tag_keys=*/{"Method"});
}

inline ray::stats::Count GetGrpcServerReqFinishedMetric() {
  return ray::stats::Count(
      /*name=*/"grpc_server_req_finished",
      /*description=*/"Finished request number in grpc server",
      /*unit=*/"",
      /*tag_keys=*/{"Method"});
}

inline ray::stats::Count GetGrpcServerReqSucceededMetric() {
  return ray::stats::Count(
      /*name=*/"grpc_server_req_succeeded",
      /*description=*/"Succeeded request count in grpc server",
      /*unit=*/"",
      /*tag_keys=*/{"Method"});
}

inline ray::stats::Count GetGrpcServerReqFailedMetric() {
  return ray::stats::Count(
      /*name=*/"grpc_server_req_failed",
      /*description=*/"Failed request count in grpc server",
      /*unit=*/"",
      /*tag_keys=*/{"Method"});
}

inline ray::stats::Count GetGrpcClientReqFailedMetric() {
  return ray::stats::Count(
      /*name=*/"grpc_client_req_failed",
      /*description=*/"Number of gRPC client failures (non-OK response statuses).",
      /*unit=*/"",
      /*tag_keys=*/{"Method"});
}
Contributor

high

These functions create and return new metric objects by value. Since they are called to initialize member variables in ServerCallImpl and ClientCallImpl, which are instantiated for each RPC, this leads to creating new metric objects for every request. This is inefficient and likely incorrect, as metric registration is a one-time setup process.

To fix this, these functions should return a reference to a static local metric object. This ensures that each metric is a singleton, initialized only once.

For example, GetGrpcServerReqProcessTimeMsMetric should be changed to:

inline ray::stats::Histogram &GetGrpcServerReqProcessTimeMsMetric() {
  static ray::stats::Histogram metric(
      /*name=*/"grpc_server_req_process_time_ms",
      /*description=*/"Request latency in grpc server",
      /*unit=*/"",
      /*boundaries=*/{0.1, 1, 10, 100, 1000, 10000},
      /*tag_keys=*/{"Method"});
  return metric;
}

This pattern should be applied to all metric getter functions in this file.
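
To make the difference between the two getter styles concrete, here is a small self-contained sketch. `CountingHistogram` is a hypothetical type invented for the illustration (it is not `ray::stats::Histogram`); it only counts how often its constructor runs. The by-value getter constructs a new object on every call, while the function-local static is constructed on the first call only (C++11 guarantees that this initialization is thread-safe).

```cpp
#include <cassert>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical metric type that counts how many times it is constructed.
class CountingHistogram {
 public:
  explicit CountingHistogram(std::string name) : name_(std::move(name)) {
    ++constructions;
  }
  const std::string &name() const { return name_; }

  static int constructions;  // not thread-safe; enough for a single-threaded demo

 private:
  std::string name_;
};
int CountingHistogram::constructions = 0;

// By-value getter: every call runs the constructor again.
inline CountingHistogram MakeByValue() {
  return CountingHistogram("grpc_server_req_process_time_ms");
}

// Reference getter: the static local is constructed on the first call only.
inline CountingHistogram &GetSingleton() {
  static CountingHistogram metric("grpc_server_req_process_time_ms");
  return metric;
}

// Counts the constructor runs caused by two by-value calls followed by two
// singleton calls (assuming GetSingleton() has not been called before).
inline int CountConstructions() {
  const int before = CountingHistogram::constructions;
  MakeByValue();
  MakeByValue();
  GetSingleton();
  GetSingleton();
  return CountingHistogram::constructions - before;
}

}  // namespace sketch
```

One trade-off of the static-local form is that the object is destroyed during static teardown at process exit, so code that still records metrics at that point can touch a destroyed object.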

Contributor Author

Each of these objects already wraps an OpenTelemetry static metric. The wrapper itself is not static to avoid static destruction problems.
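
A rough sketch of the design this reply describes, under stated assumptions: `Registry` is a hypothetical stand-in for the static OpenCensus/OpenTelemetry entities, and `Count` is a simplified, invented wrapper rather than `ray::stats::Count`. The point is only that constructing many short-lived wrappers registers the underlying metric once, and that the long-lived state sits in an object that is deliberately never destroyed.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

namespace sketch {

// Hypothetical process-wide registry. It is intentionally leaked so that it
// is never destroyed during static teardown.
class Registry {
 public:
  static Registry &Instance() {
    static Registry *instance = new Registry();  // never deleted on purpose
    return *instance;
  }

  // Registers `name` on first use and returns a stable pointer to its slot
  // (std::map nodes do not move when other entries are inserted).
  double *GetOrRegister(const std::string &name) {
    std::lock_guard<std::mutex> lock(mu_);
    auto inserted = slots_.emplace(name, 0.0);
    if (inserted.second) {
      ++registrations_;
    }
    return &inserted.first->second;
  }

  void Add(double *slot, double value) {
    std::lock_guard<std::mutex> lock(mu_);
    *slot += value;
  }

  double Read(const double *slot) {
    std::lock_guard<std::mutex> lock(mu_);
    return *slot;
  }

  int registrations() {
    std::lock_guard<std::mutex> lock(mu_);
    return registrations_;
  }

 private:
  Registry() = default;
  std::mutex mu_;
  std::map<std::string, double> slots_;
  int registrations_ = 0;
};

// Cheap, non-static wrapper: it holds only a pointer into the registry.
class Count {
 public:
  explicit Count(const std::string &name)
      : slot_(Registry::Instance().GetOrRegister(name)) {}
  void Record(double value) { Registry::Instance().Add(slot_, value); }
  double Total() const { return Registry::Instance().Read(slot_); }

 private:
  double *slot_;
};

// Simplified by-value getter in the style of the PR's getters.
inline Count GetGrpcClientReqFailedMetric() {
  return Count("grpc_client_req_failed");
}

struct WrapperSketchResult {
  int registrations;
  double total;
};

// Creates one wrapper per simulated call object, as a per-RPC class would.
inline WrapperSketchResult RunWrapperSketch() {
  for (int i = 0; i < 3; ++i) {
    Count failed = GetGrpcClientReqFailedMetric();
    failed.Record(1);
  }
  Count reader = GetGrpcClientReqFailedMetric();
  return WrapperSketchResult{Registry::Instance().registrations(),
                             reader.Total()};
}

}  // namespace sketch
```

In this sketch the per-wrapper cost is a locked map lookup plus a pointer copy; whether the real wrappers are comparably cheap depends on Ray's implementation and is not shown in this conversation.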

Comment on lines 388 to 394
ray::stats::Histogram grpc_server_req_process_time_ms_metric_{
    GetGrpcServerReqProcessTimeMsMetric()};
ray::stats::Count grpc_server_req_new_metric_{GetGrpcServerReqNewMetric()};
ray::stats::Count grpc_server_req_handling_metric_{GetGrpcServerReqHandlingMetric()};
ray::stats::Count grpc_server_req_finished_metric_{GetGrpcServerReqFinishedMetric()};
ray::stats::Count grpc_server_req_succeeded_metric_{GetGrpcServerReqSucceededMetric()};
ray::stats::Count grpc_server_req_failed_metric_{GetGrpcServerReqFailedMetric()};
Contributor

high

To accompany the change in the metric getter functions to return references, these members should also be references. This avoids creating new metric objects for every ServerCallImpl instance, improving performance and ensuring metrics are handled as singletons.

Suggested change
- ray::stats::Histogram grpc_server_req_process_time_ms_metric_{
-     GetGrpcServerReqProcessTimeMsMetric()};
- ray::stats::Count grpc_server_req_new_metric_{GetGrpcServerReqNewMetric()};
- ray::stats::Count grpc_server_req_handling_metric_{GetGrpcServerReqHandlingMetric()};
- ray::stats::Count grpc_server_req_finished_metric_{GetGrpcServerReqFinishedMetric()};
- ray::stats::Count grpc_server_req_succeeded_metric_{GetGrpcServerReqSucceededMetric()};
- ray::stats::Count grpc_server_req_failed_metric_{GetGrpcServerReqFailedMetric()};
+ ray::stats::Histogram &grpc_server_req_process_time_ms_metric_{
+     GetGrpcServerReqProcessTimeMsMetric()};
+ ray::stats::Count &grpc_server_req_new_metric_{GetGrpcServerReqNewMetric()};
+ ray::stats::Count &grpc_server_req_handling_metric_{GetGrpcServerReqHandlingMetric()};
+ ray::stats::Count &grpc_server_req_finished_metric_{GetGrpcServerReqFinishedMetric()};
+ ray::stats::Count &grpc_server_req_succeeded_metric_{GetGrpcServerReqSucceededMetric()};
+ ray::stats::Count &grpc_server_req_failed_metric_{GetGrpcServerReqFailedMetric()};

@can-anyscale can-anyscale changed the title from "[core][metric] kill STATS in rpc component" to "[core][stats-die/01] kill STATS in rpc component" Oct 20, 2025
cursor[bot]

This comment was marked as outdated.

@can-anyscale can-anyscale added the "go" label (add ONLY when ready to merge, run all tests) Oct 20, 2025
/// the server and/or tweak certain RPC behaviors.
grpc::ClientContext context_;

ray::stats::Count grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()};
Collaborator

Let's follow the naming convention.

@can-anyscale
Contributor Author

@jjyao's comments

@edoakes
Collaborator

edoakes commented Oct 21, 2025

@Sparks0219 PTAL

@ray-gardener ray-gardener bot added the "core" (Issues that should be addressed in Ray Core) and "observability" (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Oct 21, 2025
Contributor

@Sparks0219 Sparks0219 left a comment

LGTM 🚢

@jjyao jjyao merged commit 332b671 into master Oct 21, 2025
6 checks passed
@jjyao jjyao deleted the can-statdie01 branch October 21, 2025 20:41
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
5 participants