[data] Add dataset/operator state, progress, total metrics #50770
w00h00
python/ray/data/_internal/stats.py
Outdated
@@ -266,6 +266,44 @@ def __init__(self, max_stats=1000):
tag_keys=iter_tag_keys,
)

# === Dataset and Operator Metadata Metrics ===
dataset_tags = ("dataset",)
nit: dataset_tags = ("dataset", "jobid", "total", "starttime") (these are all metadata that never change)
Not sure if I understand the meaning of dataset_tags correctly, but I just mean to also store jobid, total, and starttime as columns/labels in Prometheus.
Total might change AFAIK, so I'm leaving that off. I'm not sure it makes sense to have jobid or starttime as tags here, as that might increase the cardinality of our metric tags. Are those necessary for the data dashboard?
Yes, starttime is a column in the data dashboard, and jobid is an important link for rendering job status, etc. I believe cardinality is determined by the number of distinct combinations of label values. In this case, because starttime and jobid each have a single value per dataset, they don't increase cardinality, if that makes sense.
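The cardinality argument above can be illustrated with a small, library-free sketch. It assumes that a time series is identified by its unique tuple of label values; the sample data and the series_count helper are hypothetical and are not part of this PR. It also includes the counter-case where a dataset tag is reused with a new start time, which is where the series count would grow.

```python
def series_count(samples, label_keys):
    """Count distinct time series: one per unique tuple of label values."""
    return len({tuple(sample[key] for key in label_keys) for sample in samples})


# Hypothetical samples: each dataset has exactly one jobid and one starttime.
samples = [
    {"dataset": "ds_a", "jobid": "job_1", "starttime": "1700000000"},
    {"dataset": "ds_a", "jobid": "job_1", "starttime": "1700000000"},
    {"dataset": "ds_b", "jobid": "job_1", "starttime": "1700000050"},
    {"dataset": "ds_c", "jobid": "job_2", "starttime": "1700000100"},
]

by_dataset = series_count(samples, ("dataset",))
with_metadata = series_count(samples, ("dataset", "jobid", "starttime"))

# Assumed counter-case: the same dataset tag reappears with a new start time.
rerun = samples + [{"dataset": "ds_a", "jobid": "job_3", "starttime": "1700009999"}]
rerun_by_dataset = series_count(rerun, ("dataset",))
rerun_with_metadata = series_count(rerun, ("dataset", "jobid", "starttime"))

print(by_dataset, with_metadata)              # 3 3
print(rerun_by_dataset, rerun_with_metadata)  # 3 4
```

As long as jobid and starttime are fully determined by the dataset tag, the extra labels add no series; if a dataset tag can recur with different values, each recurrence adds one.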
@omatthew98 can we discuss/do this before merging? Otherwise I won't be able to generate enough data for the dashboard with just metrics.
Let me discuss with the team, adding jobid seems reasonable here, dataset start time as a tag feels like a bit of an antipattern.
FWIW, these are the two things suggested by ChatGPT (neither of which feels particularly elegant).
For static metadata like a job ID or start time, it's common practice not to create separate gauges for each piece of information. Instead, you can adopt one of these approaches:
Info Metric Pattern:
Create a single gauge (often set to a constant value like 1) with labels representing your static metadata. For instance, you might have a metric called job_info with labels such as jobid and start_time. This pattern is widely used (e.g., many exporters expose a build_info metric) and keeps your metric space clean.
Const Labels:
When initializing a metric that is associated with a particular job, you can add the metadata as constant labels. This is useful if the metadata is truly static for the lifetime of that metric instance and you don't expect many different values that could lead to high cardinality.
Both methods allow you to expose useful metadata in Prometheus without having to create multiple separate gauges. Just be cautious with label cardinality: if your metadata values vary too widely, they might negatively impact your Prometheus queries and storage.
These tags don't make sense in Ray Turbo (RT):
- Jobid: we don't know where RT will be running, so jobid is not well-defined
- Starttime: it is unbounded, hence it can't be a tag
- Total: I don't even understand what this could be
Const Labels:
When initializing a metric that is associated with a particular job, you can add the metadata as constant labels. This is useful if the metadata is truly static for the lifetime of that metric instance and you don't expect many different values that could lead to high cardinality.
This is what I'm suggesting basically ;). If they are constants then there should be no downside in storing this in prometheus. We actually already have 20+ constant labels (cluster related information) for every metric in prometheus.
@alexeykudinkin: if I understand your comments correctly, don't think of them as tags; labels are just constant columns in Prometheus (unless these tags have different semantics from the dataset's point of view).
Info Metric Pattern:
Create a single gauge (often set to a constant value like 1) with labels representing your static metadata. For instance, you might have a metric called job_info with labels such as jobid and start_time. This pattern is widely used (e.g., many exporters expose a build_info metric) and keeps your metric space clean.
This is also fine except that it's more expensive (additional metrics vs. a constant column like the other solutions)
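The two options discussed in this thread can be sketched by rendering samples in the style of the Prometheus text exposition format. This is a library-free illustration only: the render_sample helper, the data_dataset_info metric name, and all label values are hypothetical, and label-value escaping is deliberately omitted.

```python
def render_sample(name, labels, value):
    """Render one sample as name{label="value",...} value (no escaping)."""
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"


# Hypothetical static metadata for one dataset.
metadata = {"jobid": "job_1", "start_time": "1700000000"}

# Info metric pattern: one extra series with constant value 1 carries the metadata.
info_line = render_sample("data_dataset_info", {"dataset": "ds_a", **metadata}, 1)
progress_plain = render_sample("data_dataset_progress", {"dataset": "ds_a"}, 42)

# Const labels: every metric for the dataset repeats the metadata labels instead.
progress_const = render_sample(
    "data_dataset_progress", {"dataset": "ds_a", **metadata}, 42
)

print(info_line)
print(progress_plain)
print(progress_const)
```

The first variant costs one additional series per dataset and needs a join on the dataset label at query time; the second adds no series but widens the label set of every existing metric.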
python/ray/data/_internal/stats.py
Outdated
tag_keys=dataset_tags,
)
self.dataset_state = Gauge(
"ray_data_dataset_state",
"data_dataset_state",
description=f"State of dataset ({', '.join([f'{s.value}={s.name}' for s in DatasetState])})",
tag_keys=("dataset",),
nit: dataset_tags
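For readers unfamiliar with the description string in the snippet above, here is a minimal sketch of what that expression evaluates to. The DatasetState members and values below are assumptions made for illustration; the actual enum is defined elsewhere in the codebase and may differ.

```python
from enum import Enum


class DatasetState(Enum):
    # Hypothetical members and values, for illustration only.
    UNKNOWN = 0
    RUNNING = 1
    FINISHED = 2
    FAILED = 3


def state_description(states):
    """Build a description that documents the value-to-name mapping."""
    mapping = ", ".join(f"{s.value}={s.name}" for s in states)
    return f"State of dataset ({mapping})"


description = state_description(DatasetState)
print(description)  # State of dataset (0=UNKNOWN, 1=RUNNING, 2=FINISHED, 3=FAILED)
```

Embedding the mapping in the description lets a dashboard reader decode the numeric gauge value without consulting the source.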
python/ray/data/_internal/stats.py
Outdated
tag_keys=operator_tags,
)
self.operator_total = Gauge(
"data_operator_total",
Suggested change:
- "data_operator_total",
+ "data_operator_estimated_total_blocks",
This is the estimated total number of blocks right?
Same for the other "total" metrics. Let's add "estimated" to avoid confusion.
python/ray/data/_internal/stats.py
Outdated
operator_tags = ("dataset", "operator")
self.operator_progress = Gauge(
"data_operator_progress",
description="Progress of operator execution",
Is this num rows or blocks? I think we need both
Can we make the name more explicit?
Signed-off-by: Matthew Owen <[email protected]>
ebb795e to c70e5c9
self.datasets[dataset_tag].update(state)
job_id = self.datasets[dataset_tag].get("job_id", None)
nit: job_id = self.datasets[dataset_tag].get("job_id")
(get returns None by default)
Also, I'm not sure whether Prometheus handles null/None gracefully (prometheus/client_java#315), so maybe give it a dummy default value instead of None.
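A minimal sketch of both points in this comment, using plain dictionaries: dict.get already returns None for a missing key, and a placeholder string can stand in when no job ID is known. The UNKNOWN_JOB_ID value, the job_id_tag helper, and the sample data are hypothetical, not names from this PR.

```python
# Hypothetical per-dataset state, keyed by dataset tag.
datasets = {
    "ds_a": {"state": "RUNNING"},
    "ds_b": {"state": "RUNNING", "job_id": "job_1"},
    "ds_c": {"state": "RUNNING", "job_id": None},
}

UNKNOWN_JOB_ID = "unknown"  # assumed placeholder so the tag value is never None


def job_id_tag(datasets, dataset_tag):
    """Return a tag value for job_id, falling back to a placeholder string."""
    # `or` also covers the case where the key exists but holds None.
    return datasets[dataset_tag].get("job_id") or UNKNOWN_JOB_ID


missing = datasets["ds_a"].get("job_id")  # same result as .get("job_id", None)
print(missing)                            # None
print(job_id_tag(datasets, "ds_a"))       # unknown
print(job_id_tag(datasets, "ds_b"))       # job_1
print(job_id_tag(datasets, "ds_c"))       # unknown
```

Using `or` rather than a second argument to get is a design choice here: a default passed to get would not replace an explicitly stored None.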
python/ray/data/_internal/stats.py
Outdated
self.data_dataset_progress = Gauge(
"data_dataset_progress",
description="Progress of dataset execution",
tag_keys=dataset_tags,
)
What are these?
Explanation here: #50770 (comment)
I will make the names more explicit, though, as @raulchen suggested.
…ct#50770) Add various metrics that are captured in the progress bar but are not captured in the prometheus metrics emitted. --------- Signed-off-by: Matthew Owen <[email protected]>
Add various metrics that are captured in the progress bar but are not captured in the prometheus metrics emitted.