Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[data] Add dataset/operator state, progress, total metrics #50770

Merged

Conversation

omatthew98
Copy link
Contributor

Add various metrics that are captured in the progress bar but are not captured in the prometheus metrics emitted.

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@omatthew98 omatthew98 requested a review from a team as a code owner February 20, 2025 20:06
Copy link
Collaborator

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

w00h00

@@ -266,6 +266,44 @@ def __init__(self, max_stats=1000):
tag_keys=iter_tag_keys,
)

# === Dataset and Operator Metadata Metrics ===
dataset_tags = ("dataset",)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit,

dataset_tags = ("dataset", "jobid", "total", "starttime") (these are all metadata that never changes)

full schema in https://github.com/anyscale/rayturbo/pull/1446/files#diff-dc8f3eb3e3b8d8b0b62c1160389cad945623fd82e78ec1d2d54a45fe26902320

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if i understand the meaning of dataset_tags correctly, but i just mean to also store jobid, total and startime as column/label in prometheus

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total might change afaik so leaving that off. I'm not sure if it makes sense to have jobid or starttime as tags here as that might increase the cardinality of our metrics tags. Are those necessary for the data dashboard?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yess, startime is a column in data dashboard and jobid is an important link to render job status, etc. I believe the cardinality size is affected by the number of combination in the label values. In this case, because starttime and jobid values are singleton to dataset, it doesn't affect cardinality size, if that makes sense.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you believe chatgpt ;)

Screenshot 2025-03-03 at 1 12 12 PM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@omatthew98 can we discuss/do this before merging, otherwise I won't be able to generate enough data for the dashboard with just metrics

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me discuss with the team, adding jobid seems reasonable here, dataset start time as a tag feels like a bit of an antipattern.

FWIW these are the two things suggested by ChatGPT (neither which feel particularly elegant).

For static metadata like a job ID or start time, it's common practice not to create separate gauges for each piece of information. Instead, you can adopt one of these approaches:

Info Metric Pattern:
Create a single gauge (often set to a constant value like 1) with labels representing your static metadata. For instance, you might have a metric called job_info with labels such as jobid and start_time. This pattern is widely used (e.g., many exporters expose a build_info metric) and keeps your metric space clean.

Const Labels:
When initializing a metric that is associated with a particular job, you can add the metadata as constant labels. This is useful if the metadata is truly static for the lifetime of that metric instance and you don't expect many different values that could lead to high cardinality.

Both methods allow you to expose useful metadata in Prometheus without having to create multiple separate gauges. Just be cautious with label cardinality—if your metadata values vary too widely, they might negatively impact your Prometheus queries and storage.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tags don't make sense in Ray Turbo

  • Jobid: we don't know where RT will be running so jobid is not well-defined
  • Starttime: is unbounded hence can't be a tag
  • Total: i don't even understand what this could be

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Const Labels:
When initializing a metric that is associated with a particular job, you can add the metadata as constant labels. This is useful if the metadata is truly static for the lifetime of that metric instance and you don't expect many different values that could lead to high cardinality.

This is what I'm suggesting basically ;). If they are constants then there should be no downside in storing this in prometheus. We actually already have 20+ constant labels (cluster related information) for every metric in prometheus.

@alexeykudinkin: if I understand your comments correctly - don't think about them as tags; labels are just constant columns in prometheus (unless these tags have different semantic in dataset point of views)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Info Metric Pattern:
Create a single gauge (often set to a constant value like 1) with labels representing your static metadata. For instance, you might have a metric called job_info with labels such as jobid and start_time. This pattern is widely used (e.g., many exporters expose a build_info metric) and keeps your metric space clean.

This is also fine except that it's more expensive (additional metrics vs. a constant column like the other solutions)

@can-anyscale can-anyscale self-requested a review March 3, 2025 21:39
tag_keys=dataset_tags,
)
self.dataset_state = Gauge(
"ray_data_dataset_state",
"data_dataset_state",
description=f"State of dataset ({', '.join([f'{s.value}={s.name}' for s in DatasetState])})",
tag_keys=("dataset",),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: dataset_tags

tag_keys=operator_tags,
)
self.operator_total = Gauge(
"data_operator_total",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"data_operator_total",
"data_operator_estimated_total_blocks",

This is the estimated total number of blocks right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for other "total" metrics. let's add "estimated" to avoid confusion.

operator_tags = ("dataset", "operator")
self.operator_progress = Gauge(
"data_operator_progress",
description="Progress of operator execution",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this num rows or blocks? I think we need both

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually the number of completed tasks

"progress": last_state.num_completed_tasks,
. I think we already have the output of the dataset in rows and blocks here and here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make the name more explicit?

@omatthew98 omatthew98 added the go add ONLY when ready to merge, run all tests label Mar 5, 2025
@omatthew98 omatthew98 force-pushed the mowen/add-more-data-prom-metrics branch from ebb795e to c70e5c9 Compare March 6, 2025 06:34
Signed-off-by: Matthew Owen <[email protected]>
Copy link
Collaborator

@can-anyscale can-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.datasets[dataset_tag].update(state)
job_id = self.datasets[dataset_tag].get("job_id", None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: job_id = self.datasets[dataset_tag].get("job_id") (get returns None by default)

also not sure if prometheus handle null/none gracefully (prometheus/client_java#315) so maybe gives it some dummy default value instead of None.

Comment on lines 281 to 285
self.data_dataset_progress = Gauge(
"data_dataset_progress",
description="Progress of dataset execution",
tag_keys=dataset_tags,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation here: #50770 (comment)

I will make more explicit though like @raulchen suggested.

@raulchen raulchen merged commit 0cf2a4b into ray-project:master Mar 7, 2025
5 checks passed
elimelt pushed a commit to elimelt/ray that referenced this pull request Mar 9, 2025
…ct#50770)

Add various metrics that are captured in the progress bar but are not
captured in the prometheus metrics emitted.
---------

Signed-off-by: Matthew Owen <[email protected]>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
…ct#50770)

Add various metrics that are captured in the progress bar but are not
captured in the prometheus metrics emitted.
---------

Signed-off-by: Matthew Owen <[email protected]>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
…ct#50770)

Add various metrics that are captured in the progress bar but are not
captured in the prometheus metrics emitted.
---------

Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Jay Chia <[email protected]>
jaychia pushed a commit to jaychia/ray that referenced this pull request Mar 19, 2025
…ct#50770)

Add various metrics that are captured in the progress bar but are not
captured in the prometheus metrics emitted.
---------

Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Jay Chia <[email protected]>
Drice1999 pushed a commit to Drice1999/ray that referenced this pull request Mar 23, 2025
…ct#50770)

Add various metrics that are captured in the progress bar but are not
captured in the prometheus metrics emitted.
---------

Signed-off-by: Matthew Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants