-
Notifications
You must be signed in to change notification settings - Fork 7k
[core][stats-die/03] kill STATS in core worker component #58060
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| ray::stats::Gauge task_by_state_gauge_{GetTaskByStateGaugeMetric()}; | ||
| ray::stats::Gauge actor_by_state_gauge_{GetActorByStateGaugeMetric()}; | ||
| std::unique_ptr<ray::stats::Gauge> task_by_state_gauge_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: I changed this to pointer to the metric initialization can happen after stats::Init inside core_worker_process.cc
ba9e8e8 to
a06bd02
Compare
a06bd02 to
b075c33
Compare
|
@ZacAttack PTAL |
| "be placed. This is the time from when the tasks dependencies are " | ||
| "resolved to when it actually reserves resources on a node to run.", | ||
| /*unit=*/"s", | ||
| /*boundaries=*/{0.1, 1, 10, 100, 1000, 10000}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've lately gotten a bit of a crash course in this thing. These.... buckets seem a little off right? scheduler_placement_time_s is in seconds (I'm assuming because of the _s). So we're saying the scheduler placement time falls within 0.1s->1s->10s->100s->16 minutes -> 2.5+ hours
The latter buckets seem absurd right? What is the actual realistic spread of latency for the scheduler to place something? If the _s is misleading and actually it's not in seconds then we should fix that. But we should choose our buckets along the lines of
-->healthy range
-->elevated range
-->error range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, the series of number must’ve been pulled out of a hat at some point ;) - i'll fix them on a PR on top of this, to keep this PR purely a refactoring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it even makes sense to have this metric in seconds at all. It seems like we'd want this to usually be sub second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make total sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there you go #58217
Signed-off-by: Cuong Nguyen <[email protected]>
b075c33 to
b950cff
Compare
| const TaskSpecification &same2, | ||
| const TaskSpecification &different) { | ||
| rpc::Address address; | ||
| ray::observability::FakeHistogram fake_scheduler_placement_time_s_histogram_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug: Local Variable Passed to Constructor Causes Use-After-Free
Variable fake_scheduler_placement_time_s_histogram_ is declared as a local variable inside the TestSchedulingKey helper function at line 1436, but is then passed by reference to NormalTaskSubmitter constructor at line 1465. This creates a use-after-free bug because the local variable will be destroyed before NormalTaskSubmitter uses it. This should be a local variable declaration that persists for the lifetime of the submitter, not a declaration at the function scope start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's fine, the NormalTaskSubmitter is also a local object within the TestSchedulingKey function
This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
…#58060) This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliot-barn <[email protected]>
This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: elliot-barn <[email protected]>
…#58060) This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
…#58060) This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components. **Details** Full context of this refactoring work. - Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component. - In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface. - This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding. - There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point. - Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR. Test: - CI Signed-off-by: Cuong Nguyen <[email protected]> Signed-off-by: Aydin Abiar <[email protected]>

This PR replace STATS with Metric as a way to define metric inside ray (as a unification effort) in all core worker components. For the most parts, metrics are defined as the top level component (core_worker_process.cc) and pass down as an interface to the sub-components.
Details
Full context of this refactoring work.
Test: