@can-anyscale
Contributor

This PR replaces STATS with Metric as the way to define metrics inside Ray (as a unification effort) in all core worker components. For the most part, metrics are defined in the top-level component (core_worker_process.cc) and passed down as an interface to the sub-components.

Details
Full context of this refactoring work.

  • Each component (e.g., gcs, raylet, core_worker, etc.) now has a metrics.h file located in its top-level directory. This file defines all metrics for that component.
  • In most cases, metrics are defined once in the main entry point of each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.). These metrics are then passed down to subcomponents via the ray::observability::MetricInterface.
  • This approach significantly reduces rebuild time when metric infrastructure changes. Previously, a change would trigger a full Ray rebuild; now, only the top-level entry points of each component need rebuilding.
  • There are a few exceptions where metrics are tracked inside object libraries (e.g., task_specification). In these cases, metrics are defined within the library itself, since there is no corresponding top-level entry point.
  • Finally, the obsolete metric_defs.h and metric_defs.cc files can now be completely removed. This paves the way for further dead code cleanup in a future PR.
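
The pattern described in the list above can be sketched roughly as follows. This is a minimal illustration, not Ray's actual code: `sketch::MetricInterface`, `Gauge`, `FakeMetric`, `TaskCounter`, and `EntryPointSketch` are hypothetical stand-ins, and the real ray::observability::MetricInterface may have a different shape.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for ray::observability::MetricInterface.
class MetricInterface {
 public:
  virtual ~MetricInterface() = default;
  virtual void Record(double value) = 0;
};

// Hypothetical concrete metric, owned by the component's entry point.
class Gauge : public MetricInterface {
 public:
  explicit Gauge(std::string name) : name_(std::move(name)) {}
  void Record(double value) override { last_value_ = value; }
  double last_value() const { return last_value_; }

 private:
  std::string name_;
  double last_value_ = 0.0;
};

// Test double that remembers every recorded value.
class FakeMetric : public MetricInterface {
 public:
  void Record(double value) override { values.push_back(value); }
  std::vector<double> values;
};

// A subcomponent only sees the interface, so it does not need to include
// (or be rebuilt against) the concrete metric implementation.
class TaskCounter {
 public:
  explicit TaskCounter(MetricInterface &running_tasks)
      : running_tasks_(running_tasks) {}
  void OnTaskStarted() {
    running_tasks_.Record(static_cast<double>(++num_running_));
  }

 private:
  MetricInterface &running_tasks_;
  int num_running_ = 0;
};

// Plays the role of a top-level entry point: it defines the metric once and
// passes it down by reference.
inline double EntryPointSketch() {
  Gauge running_tasks("running_tasks");
  TaskCounter counter(running_tasks);
  counter.OnTaskStarted();
  counter.OnTaskStarted();
  return running_tasks.last_value();
}

}  // namespace sketch
```

In a unit test, the same `TaskCounter` can be constructed with a `FakeMetric` instead of a `Gauge`, which is presumably the role the fake metric classes play in the tests touched by this PR.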

Test:

  • CI


ray::stats::Gauge task_by_state_gauge_{GetTaskByStateGaugeMetric()};
ray::stats::Gauge actor_by_state_gauge_{GetActorByStateGaugeMetric()};
std::unique_ptr<ray::stats::Gauge> task_by_state_gauge_;
Contributor Author

Note: I changed this to a pointer so that the metric initialization can happen after stats::Init inside core_worker_process.cc.
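
A rough sketch of why a pointer member helps here, under stated assumptions: `InitStats`, `Gauge`, and `Worker` below are hypothetical stand-ins, not Ray's classes, and the "registered" flag only models the idea that a metric built before global setup would not be wired up correctly.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace sketch {

// Hypothetical stand-in for global metrics setup such as stats::Init.
inline bool &StatsInitialized() {
  static bool initialized = false;
  return initialized;
}
inline void InitStats() { StatsInitialized() = true; }

// Hypothetical gauge that captures, at construction time, whether the
// metrics system was already set up.
class Gauge {
 public:
  explicit Gauge(std::string name)
      : name_(std::move(name)), registered_(StatsInitialized()) {}
  bool registered() const { return registered_; }

 private:
  std::string name_;
  bool registered_;
};

class Worker {
 public:
  // The member starts out null, so constructing a Worker does not touch the
  // metrics system at all.
  Worker() = default;

  // Called explicitly once the metrics system is ready.
  void InitMetrics() {
    task_by_state_gauge_ = std::make_unique<Gauge>("task_by_state");
  }

  bool MetricsReady() const {
    return task_by_state_gauge_ != nullptr && task_by_state_gauge_->registered();
  }

 private:
  // A by-value member with a default initializer would be constructed
  // together with the Worker; the pointer defers construction.
  std::unique_ptr<Gauge> task_by_state_gauge_;
};

}  // namespace sketch
```

The trade-off is that every use of the member now has to tolerate (or rule out) a null pointer before `InitMetrics()` has run.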

@can-anyscale can-anyscale added the go add ONLY when ready to merge, run all tests label Oct 23, 2025
@can-anyscale can-anyscale marked this pull request as ready for review October 23, 2025 23:49
@can-anyscale can-anyscale requested a review from a team as a code owner October 23, 2025 23:49
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling labels Oct 24, 2025
@edoakes
Collaborator

edoakes commented Oct 24, 2025

@ZacAttack PTAL

"be placed. This is the time from when the tasks dependencies are "
"resolved to when it actually reserves resources on a node to run.",
/*unit=*/"s",
/*boundaries=*/{0.1, 1, 10, 100, 1000, 10000},
Contributor

So I've lately gotten a bit of a crash course in this thing. These... buckets seem a little off, right? scheduler_placement_time_s is in seconds (I'm assuming, because of the _s). So we're saying the scheduler placement time falls within 0.1 s -> 1 s -> 10 s -> 100 s -> ~17 minutes (1000 s) -> ~2.8 hours (10000 s).

The latter buckets seem absurd, right? What is the actual realistic spread of latency for the scheduler to place something? If the _s is misleading and it's actually not in seconds, then we should fix that. But we should choose our buckets along the lines of:

-->healthy range
-->elevated range
-->error range
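
To make the concern concrete, here is a small sketch of how samples map onto histogram buckets. `BucketIndex` and `DistinctBuckets` are hypothetical helpers, not Ray APIs, and the sketch assumes inclusive upper bounds (value <= boundary), which may not match the convention of the metrics backend actually in use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Returns the index of the bucket that `value` falls into, given sorted upper
// bounds. Index boundaries.size() is the overflow bucket.
inline std::size_t BucketIndex(const std::vector<double> &boundaries,
                               double value) {
  // lower_bound finds the first boundary that is >= value.
  return static_cast<std::size_t>(
      std::lower_bound(boundaries.begin(), boundaries.end(), value) -
      boundaries.begin());
}

// Counts how many distinct buckets a set of samples occupies; a low number
// means the histogram cannot tell those samples apart.
inline std::size_t DistinctBuckets(const std::vector<double> &boundaries,
                                   const std::vector<double> &samples) {
  std::vector<bool> used(boundaries.size() + 1, false);
  for (double sample : samples) {
    used[BucketIndex(boundaries, sample)] = true;
  }
  return static_cast<std::size_t>(std::count(used.begin(), used.end(), true));
}

}  // namespace sketch
```

With the boundaries from the diff, {0.1, 1, 10, 100, 1000, 10000}, made-up sub-second samples such as 0.002, 0.008, 0.03, 0.07, and 0.4 occupy only two buckets. With an illustrative finer set such as {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 10}, the same samples spread over five buckets. Both the samples and the finer boundaries are invented for illustration and are not the values chosen in any follow-up change.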

Contributor Author

Makes sense, the series of numbers must've been pulled out of a hat at some point ;) I'll fix them in a PR on top of this, to keep this PR purely a refactoring.

Contributor Author

[Image]

Yeah, it looks like the value is only meaningful in the range below 10 s; most values are actually below 0.1 s, so a finer breakdown of the buckets below 0.1 s is probably more useful.

Contributor

I wonder if it even makes sense to have this metric in seconds at all. It seems like we'd want this to usually be sub-second.

Contributor Author

Makes total sense.

Contributor Author

there you go #58217

@can-anyscale can-anyscale requested a review from a team October 28, 2025 14:23
const TaskSpecification &same2,
const TaskSpecification &different) {
rpc::Address address;
ray::observability::FakeHistogram fake_scheduler_placement_time_s_histogram_;

Bug: Local Variable Passed to Constructor Causes Use-After-Free

Variable fake_scheduler_placement_time_s_histogram_ is declared as a local variable inside the TestSchedulingKey helper function at line 1436, but is then passed by reference to NormalTaskSubmitter constructor at line 1465. This creates a use-after-free bug because the local variable will be destroyed before NormalTaskSubmitter uses it. This should be a local variable declaration that persists for the lifetime of the submitter, not a declaration at the function scope start.

Contributor Author

It's fine: the NormalTaskSubmitter is also a local object within the TestSchedulingKey function, and it is constructed after the histogram, so it is destroyed first.
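
The lifetime argument rests on a C++ guarantee: objects with automatic storage in the same scope are destroyed in reverse order of declaration. A minimal sketch follows, where `FakeHistogram`, `Submitter`, and `RunHelper` are simplified hypothetical stand-ins for the test's types rather than the real ones.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-in for ray::observability::FakeHistogram.
struct FakeHistogram {
  explicit FakeHistogram(std::vector<std::string> &log) : log_(log) {}
  ~FakeHistogram() { log_.push_back("histogram destroyed"); }
  void Record(double /*value*/) { ++count; }
  int count = 0;
  std::vector<std::string> &log_;
};

// Hypothetical stand-in for NormalTaskSubmitter: it keeps a reference to a
// histogram that it does not own.
struct Submitter {
  Submitter(FakeHistogram &histogram, std::vector<std::string> &log)
      : histogram_(histogram), log_(log) {}
  ~Submitter() {
    // Safe even here: the histogram was declared earlier in the same scope,
    // so it is destroyed after this object.
    histogram_.Record(1.0);
    log_.push_back("submitter destroyed");
  }
  FakeHistogram &histogram_;
  std::vector<std::string> &log_;
};

// Mirrors the shape of the test helper: both objects are locals in one scope.
inline std::vector<std::string> RunHelper() {
  std::vector<std::string> log;
  {
    FakeHistogram histogram(log);         // declared first, destroyed last
    Submitter submitter(histogram, log);  // declared second, destroyed first
  }
  return log;
}

}  // namespace sketch
```

The reference would only dangle if the submitter were moved or copied somewhere that outlives the helper function, which is not what the review thread describes.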

@can-anyscale can-anyscale merged commit dc966a1 into master Oct 29, 2025
6 checks passed
@can-anyscale can-anyscale deleted the can-statdie03 branch October 29, 2025 21:26
can-anyscale added a commit that referenced this pull request Oct 29, 2025
Signed-off-by: Cuong Nguyen <[email protected]>
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
…#58060)

elliot-barn pushed a commit that referenced this pull request Nov 14, 2025
elliot-barn pushed a commit that referenced this pull request Nov 14, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…#58060)

Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…#58060)
