-
Notifications
You must be signed in to change notification settings - Fork 7k
[core][stats-die/01] kill STATS in rpc component #57926
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Cuong Nguyen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR refactors metric definitions in the RPC component to use the unified Metric class, replacing the old STATS macros. The changes are generally correct and follow the intended direction of unifying metric handling. However, there is a significant issue with how the new metric objects are instantiated. They are created for every RPC request, which is inefficient and can lead to race conditions during metric registration. My review includes suggestions to fix this by ensuring metric objects are created as singletons.
src/ray/rpc/client_call.h
Outdated
| /// the server and/or tweak certain RPC behaviors. | ||
| grpc::ClientContext context_; | ||
|
|
||
| ray::stats::Count grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To accompany the change in GetGrpcClientReqFailedMetric to return a reference, this member should be a reference as well. This avoids creating a new metric object for every ClientCallImpl instance, improving performance and ensuring metrics are handled as singletons.
| ray::stats::Count grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()}; | |
| ray::stats::Count &grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()}; |
src/ray/rpc/metrics.h
Outdated
| inline ray::stats::Histogram GetGrpcServerReqProcessTimeMsMetric() { | ||
| return ray::stats::Histogram( | ||
| /*name=*/"grpc_server_req_process_time_ms", | ||
| /*description=*/"Request latency in grpc server", | ||
| /*unit=*/"", | ||
| /*boundaries=*/{0.1, 1, 10, 100, 1000, 10000}, | ||
| /*tag_keys=*/{"Method"}); | ||
| } | ||
|
|
||
| inline ray::stats::Count GetGrpcServerReqNewMetric() { | ||
| return ray::stats::Count( | ||
| /*name=*/"grpc_server_req_new", | ||
| /*description=*/"New request number in grpc server", | ||
| /*unit=*/"", | ||
| /*tag_keys=*/{"Method"}); | ||
| } | ||
|
|
||
| inline ray::stats::Count GetGrpcServerReqHandlingMetric() { | ||
| return ray::stats::Count( | ||
| /*name=*/"grpc_server_req_handling", | ||
| /*description=*/"Request number are handling in grpc server", | ||
| /*unit=*/"", | ||
| /*tag_keys=*/{"Method"}); | ||
| } | ||
|
|
||
| inline ray::stats::Count GetGrpcServerReqFinishedMetric() { | ||
| return ray::stats::Count( | ||
| /*name=*/"grpc_server_req_finished", | ||
| /*description=*/"Finished request number in grpc server", | ||
| /*unit=*/"", | ||
| /*tag_keys=*/{"Method"}); | ||
| } | ||
|
|
||
| inline ray::stats::Count GetGrpcServerReqSucceededMetric() { | ||
| return ray::stats::Count( | ||
| /*name=*/"grpc_server_req_succeeded", | ||
| /*description=*/"Succeeded request count in grpc server", | ||
| /*unit=*/"", | ||
| /*tag_keys=*/{"Method"}); | ||
| } | ||
|
|
||
| inline ray::stats::Count GetGrpcServerReqFailedMetric() { | ||
| return ray::stats::Count( | ||
| /*name=*/"grpc_server_req_failed", | ||
| /*description=*/"Failed request count in grpc server", | ||
| /*unit=*/"", | ||
| /*tag_keys=*/{"Method"}); | ||
| } | ||
|
|
||
| inline ray::stats::Count GetGrpcClientReqFailedMetric() { | ||
| return ray::stats::Count( | ||
| /*name=*/"grpc_client_req_failed", | ||
| /*description=*/"Number of gRPC client failures (non-OK response statuses).", | ||
| /*unit=*/"", | ||
| /*tag_keys=*/{"Method"}); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions create and return new metric objects by value. Since they are called to initialize member variables in ServerCallImpl and ClientCallImpl, which are instantiated for each RPC, this leads to creating new metric objects for every request. This is inefficient and likely incorrect, as metric registration is a one-time setup process.
To fix this, these functions should return a reference to a static local metric object. This ensures that each metric is a singleton, initialized only once.
For example, GetGrpcServerReqProcessTimeMsMetric should be changed to:
inline ray::stats::Histogram &GetGrpcServerReqProcessTimeMsMetric() {
static ray::stats::Histogram metric(
/*name=*/"grpc_server_req_process_time_ms",
/*description=*/"Request latency in grpc server",
/*unit=*/"",
/*boundaries=*/{0.1, 1, 10, 100, 1000, 10000},
/*tag_keys=*/{"Method"});
return metric;
}This pattern should be applied to all metric getter functions in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each of these objects already wraps an OpenTelemetry static metric. The wrapper itself is not static to avoid static destruction problems.
src/ray/rpc/server_call.h
Outdated
| ray::stats::Histogram grpc_server_req_process_time_ms_metric_{ | ||
| GetGrpcServerReqProcessTimeMsMetric()}; | ||
| ray::stats::Count grpc_server_req_new_metric_{GetGrpcServerReqNewMetric()}; | ||
| ray::stats::Count grpc_server_req_handling_metric_{GetGrpcServerReqHandlingMetric()}; | ||
| ray::stats::Count grpc_server_req_finished_metric_{GetGrpcServerReqFinishedMetric()}; | ||
| ray::stats::Count grpc_server_req_succeeded_metric_{GetGrpcServerReqSucceededMetric()}; | ||
| ray::stats::Count grpc_server_req_failed_metric_{GetGrpcServerReqFailedMetric()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To accompany the change in the metric getter functions to return references, these members should also be references. This avoids creating new metric objects for every ServerCallImpl instance, improving performance and ensuring metrics are handled as singletons.
| ray::stats::Histogram grpc_server_req_process_time_ms_metric_{ | |
| GetGrpcServerReqProcessTimeMsMetric()}; | |
| ray::stats::Count grpc_server_req_new_metric_{GetGrpcServerReqNewMetric()}; | |
| ray::stats::Count grpc_server_req_handling_metric_{GetGrpcServerReqHandlingMetric()}; | |
| ray::stats::Count grpc_server_req_finished_metric_{GetGrpcServerReqFinishedMetric()}; | |
| ray::stats::Count grpc_server_req_succeeded_metric_{GetGrpcServerReqSucceededMetric()}; | |
| ray::stats::Count grpc_server_req_failed_metric_{GetGrpcServerReqFailedMetric()}; | |
| ray::stats::Histogram &grpc_server_req_process_time_ms_metric_{ | |
| GetGrpcServerReqProcessTimeMsMetric()}; | |
| ray::stats::Count &grpc_server_req_new_metric_{GetGrpcServerReqNewMetric()}; | |
| ray::stats::Count &grpc_server_req_handling_metric_{GetGrpcServerReqHandlingMetric()}; | |
| ray::stats::Count &grpc_server_req_finished_metric_{GetGrpcServerReqFinishedMetric()}; | |
| ray::stats::Count &grpc_server_req_succeeded_metric_{GetGrpcServerReqSucceededMetric()}; | |
| ray::stats::Count &grpc_server_req_failed_metric_{GetGrpcServerReqFailedMetric()}; |
src/ray/rpc/client_call.h
Outdated
| /// the server and/or tweak certain RPC behaviors. | ||
| grpc::ClientContext context_; | ||
|
|
||
| ray::stats::Count grpc_client_req_failed_metric_{GetGrpcClientReqFailedMetric()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's follow the naming convention.
eb2d019 to
300d30d
Compare
|
@jjyao's comments |
Signed-off-by: Cuong Nguyen <[email protected]>
300d30d to
2ef3347
Compare
|
@Sparks0219 PTAL |
Sparks0219
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🚢
Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>
This PR replace STATS with
Metricas a way to define metric inside ray (as a unification effort) in all rpc components. Normally, metrics are defined at the top-level component and passed down to sub-components. However, in this case, because the codebase contains many gRPC clients and servers, doing so would feel unnecessarily cumbersome. I decided to define the metrics inline within each client and server class instead.Note that the metric classes (Metric, Gauge, Sum, etc.) are simply wrappers around static OpenCensus/OpenTelemetry entities.
Details
Full context of this refactoring work.
Test: