[doc] Add monitoring and observability guidelines #13
# GitHub Runners Observability Guidelines

> [!NOTE]
> These guidelines are an unreleased DRAFT.

## Overview

PyTorch's CI infrastructure serves as the backbone for continuous integration across multiple cloud providers, supporting thousands of developers and contributors worldwide. The reliability of this infrastructure directly impacts PyTorch's development velocity, code quality, and release cadence. Without comprehensive monitoring, issues such as runner failures, performance degradation, or capacity bottlenecks can go undetected, leading to delayed builds, frustrated contributors, and potential release delays. Monitoring enables proactive identification of problems, ensures optimal resource utilization, and provides the data necessary for capacity planning and infrastructure optimization. This is especially critical given PyTorch's position as a leading machine learning framework, where build reliability directly affects the broader AI/ML ecosystem.

This document defines the mandatory monitoring and observability requirements for GitHub runners added to the PyTorch multi-cloud CI infrastructure. All runners must comply with these guidelines to ensure proper tracking of health, performance, and availability.
This document is split into two parts:

1. [Requirements](#requirements): guidelines on *what* is required to onboard a new runner system, and
2. [Implementation](#implementation): guidelines on *how* to fulfill those requirements in a manner consistent with the rest of the PyTorch CI infrastructure.

## Requirements

### Runner Pool Stability

A candidate runner pool must:

- Undergo stability assessment before deployment in critical CI/CD workflows
- Maintain performance metrics during test jobs
- Track resource utilization and stability patterns
- Document baseline performance metrics for each runner type

### Incident Management

Runner pools must:

- Implement real-time status monitoring
- Configure automated alerts for:
  - Runner pool offline events
  - Capacity reduction incidents
  - Performance degradation
  - Resource exhaustion
- Establish alert routing to:
  - CI infrastructure team
  - Community maintainers
  - System administrators

### Metrics Requirements

All runners must collect and expose the following metrics on [hud.pytorch.org/metrics](https://hud.pytorch.org/metrics).

> Reviewer note: The HUD metrics are available to all CI clouds for free, because the data is sourced directly from GitHub without considering the cloud the jobs are executed on.

#### Lifecycle Metrics

Runners must track:

- Registration/unregistration events

  > Reviewer note: I couldn't find a webhook type for runner registration/unregistration on the GitHub side, so this may have to be produced by the various runners individually. When ARC is used, ARC produces this metric out of the box.
  > Author: Yes, I was thinking of ARC when I wrote this.

- Job start/completion times

  > Reviewer note: GitHub already shares this data by default. For context, we take the webhooks GitHub emits on job/workflow updates and store them directly into ClickHouse. What it sort of lacks is tying that job start/end time to a particular cloud or instance. GitHub does share the runner label, which today is enough to uniquely identify a specific cloud, but I'm not sure how that will work with runner groups.

- Queue wait times

  > Reviewer note: Is this the time the job queues after a runner has been assigned, or the time a job queues waiting for a runner to be assigned to it?
  > Author: The former; the job queues after a runner has been assigned.
  > Reviewer note: We have rough measurements available for this by default today: we take the time the job was first pushed to ClickHouse (which happens after we get the new-job-creation webhook from GitHub) and take the delta against the time the job actually started running. It has a small room for error but is generally accurate enough.

- Job execution duration
- Resource utilization during jobs

  > Reviewer note: Is this CPU/GPU/memory utilization? Is the idea to track this over time within a specific job? The exact shape of what's recorded will be important here. FYI, we have per-job utilization reports available today: go to a commit page and click the "utilization report" button next to a job.
- Error rates and types

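As one way to make the queue-wait metric concrete, the sketch below derives it from the timestamps GitHub's `workflow_job` webhooks carry. Treat the choice of `created_at`/`started_at` as the measurement endpoints as an assumption to confirm, not a settled definition:

```python
from datetime import datetime, timezone

def queue_wait_seconds(created_at: str, started_at: str) -> float:
    """Delta between job creation and job start (ISO-8601 UTC timestamps)."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    started = datetime.strptime(started_at, fmt).replace(tzinfo=timezone.utc)
    return (started - created).total_seconds()

# A job created at 00:00:00 that starts at 00:03:30 waited 210 seconds.
wait = queue_wait_seconds("2024-01-01T00:00:00Z", "2024-01-01T00:03:30Z")
```

This mirrors the rough measurement described in the review discussion (ingest time vs. actual start time), with the same small room for error.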
#### Health Metrics

Runners must monitor:

- Heartbeat status
- System resource usage (CPU, memory, disk)
- Network connectivity
- GitHub API response times

  > Reviewer note: And GitHub API success/failure rates; e.g. GitHub API token exhaustion used to be a frequent source of failures.
- Runner process health

  > Reviewer note: Can you be more specific about what "process health" means? Is it a binary alive/dead question or something more granular?

## Technical Requirements

### OpenTelemetry Integration

All monitoring implementations must:

- Expose metrics in OpenTelemetry format

  > Reviewer note: It would be good to define the exact format each metric is expected to be emitted in, so that they're all in a consistent shape and can be queried easily.
- Follow standardized metric naming conventions
- Use consistent labeling across all runners
- Implement proper metric aggregation and sampling

  > Reviewer note: Is the idea that each cloud will host its own metrics (thus the aggregation/sampling requirements)? Or are we expecting all metrics-related data to be emitted to a central location, with that central service taking care of aggregation/sampling?

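To make "standardized metric naming conventions" concrete, a lightweight check like the one below could gate metric registration. The `pytorch.ci.runner.` prefix is purely illustrative and not an agreed convention; OpenTelemetry's own guidance favors lowercase, dot-separated names:

```python
import re

# Hypothetical convention: lowercase, dot-separated, under a shared prefix.
METRIC_NAME_RE = re.compile(r"pytorch\.ci\.runner\.[a-z0-9_]+(\.[a-z0-9_]+)*")

def is_valid_metric_name(name: str) -> bool:
    """True if the name follows the assumed runner-metric convention."""
    return METRIC_NAME_RE.fullmatch(name) is not None

# "pytorch.ci.runner.queue_wait_seconds" passes; "RunnerQueueWait" does not.
```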
### Service Level Requirements

Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes

  > Reviewer note: I want to square this with the actual behavior we see in the Meta team, to make sure we don't hold other clouds accountable to a higher standard than the Meta cloud is held to. Perhaps something like: p99 queue time of 5 minutes per hour, expecting most jobs to have very little queuing.
- Job execution time variance within ±10% of baseline

  > Reviewer note: Thanks to caching, some jobs tend to have high variance between runs. Maybe we can tighten this up to "P50 job execution time variance...".
- Response time to critical alerts within 15 minutes
- Maximum capacity reduction of 10%

  > Reviewer note: With capacity defined as "theoretical maximum number of jobs of a given type that can be run in this cloud in parallel"?

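Following the reviewer's suggestion of a p99-style queue-time target, the stdlib sketch below shows how such a check might look. The 5-minute bound comes from the requirement above; the percentile framing and everything else are illustrative:

```python
import statistics

MAX_P99_QUEUE_S = 5 * 60  # 5-minute queue-time target

def p99(samples: list[float]) -> float:
    # statistics.quantiles(n=100) yields the 1..99 percentile cut points
    return statistics.quantiles(samples, n=100)[98]

def queue_slo_met(queue_times_s: list[float]) -> bool:
    """True if the tail of queue times stays under the 5-minute target."""
    return p99(queue_times_s) <= MAX_P99_QUEUE_S

# A pool where almost every job starts instantly still fails the SLO if
# a tail of jobs queues for tens of minutes.
```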
### Dashboard Requirements

#### HUD Integration

The PyTorch CI HUD is a dashboard that consolidates metrics and dashboards for tracking the Continuous Integration (CI) system of PyTorch, including metrics related to runners.
The HUD provides a centralized view of these metrics, with dashboards such as the Metrics Dashboard, Flambeau, and Benchmark Metrics offering insights into runner performance and CI health.

Teams providing runners to the pool must:

- Implement OpenTelemetry data source integration with HUD
- Support real-time status overview
- Support resource utilization graphs
- Provide alert history and status
- Provide runner pool capacity visualization

> Reviewer note: Does HUD integration add any requirements on top of the metrics and technical requirements above?
> Author: @ZainRizvi might know better. I was thinking we need to centralize the monitoring in one place, and HUD seemed like a good candidate for it, since it already exists.
> Reviewer note: HUD itself doesn't look at any infrastructure-related data today. Right now all of its data is taken directly from GitHub (so it will automatically keep working for any new cloud that spins up), though we may add more infra data there in the future.
> Reviewer note: For providing a data source, the general criterion might be to provide that data in a queryable format. We're considering migrating our dashboards over to Grafana (which is easier to build new dashboards on than HUD). The common bit both need is an interface that can be queried. Today both HUD and Grafana are powered by our ClickHouse database.

#### Alternative Dashboards

Teams may implement:

- Grafana dashboard implementation
- Custom metrics visualization
- Alert management interface
- Performance reporting

> Reviewer note: What does "teams" refer to here? I think it's worth clarifying.
> Author: Ack.

### Documentation Requirements

Teams must:

- Maintain up-to-date monitoring documentation
- Document an architecture diagram detailing their runner CI infrastructure setup
- Document all custom metrics, monitoring endpoints, and escalation routes
- Document thresholds for raising and resolving alerts
- Document alert response procedures and playbooks for internal SREs/maintainers to follow when resolving alerts
- Conduct regular reviews and updates to keep the documentation from going stale

### Maintenance Requirements

Teams must:

- Conduct regular metric reviews
- Perform alert threshold tuning
- Optimize runner performance
- Plan for capacity

### Compliance Requirements

Teams must:

- Conduct regular reviews of monitoring effectiveness
- Perform quarterly metric analysis
- Update the monitoring strategy annually
- Implement a continuous improvement process

## Implementation

### System Architecture

To provide a clear separation between the PyTorch Foundation runners and community/partner runners, the following guidelines must be followed.
For details on onboarding a new runner, please refer to the [Partners PyTorch CI Runners](https://github.com/pytorch/test-infra/blob/main/docs/partners_pytorch_ci_runners.md) guide.

#### PyTorch Runners

Must implement:

- Dedicated monitoring namespace
- Resource quotas and limits
- Custom metrics for PyTorch-specific workloads
- Integration with existing PyTorch monitoring infrastructure

#### Community Runners

Must implement:

- Separate monitoring namespace
- Basic resource monitoring
- Job execution metrics
- Error tracking and reporting

> Reviewer note: This section is not clear to me; perhaps we can expand on it a bit during the WG?
> Author: Ack.

||
| ### Alerting | ||
|
|
||
| All CI Runners should post alerts to [#pytorch-infra-alerts](https://pytorch.slack.com/archives/C082SHB006Q) channel in case of service degradation. | ||
|
|
||
| Teams must define clear alert thresholds as part of the runner [documentation requirements](#documentation-requirements) | ||
|
|
||
Alerts may be of three types:

1. Raise `warning` alerts when the expected values degrade beyond the P50 nominal threshold values
2. Raise `error` alerts when the expected values degrade beyond the P90 nominal threshold values
3. Raise `critical` alerts when the expected values degrade beyond the P99 nominal threshold values

Alerts need to have both a raise threshold and a clear threshold defined.

[<img src="assets/threshold-setting.png" width="400"/>](threshold-setting.png)

In general, a high raise threshold must be greater than its high clear threshold, and a low raise threshold must be less than its low clear threshold.
A common use case for a high raise threshold is runner HTTP 5XX error rate; a low raise threshold fits metrics such as runner node disk space.

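The raise/clear split described above is hysteresis: once raised, an alert stays active until the metric recovers well past the raise point. A minimal sketch for the high-threshold case (class and field names are illustrative, not an existing PyTorch CI API):

```python
from dataclasses import dataclass

@dataclass
class HighAlert:
    """High-side alert: raise above `raise_at`, clear only below `clear_at`,
    with raise_at > clear_at so the alert does not flap near the boundary."""
    raise_at: float
    clear_at: float
    active: bool = False

    def observe(self, value: float) -> bool:
        if not self.active and value > self.raise_at:
            self.active = True   # e.g. runner HTTP 5XX error rate spiking
        elif self.active and value < self.clear_at:
            self.active = False  # recovered well below the raise level
        return self.active

# A 5XX-rate alert raising at 5% and clearing at 2%: a reading of 3% keeps
# the alert active rather than clearing and re-raising repeatedly.
alert = HighAlert(raise_at=0.05, clear_at=0.02)
states = [alert.observe(v) for v in (0.01, 0.06, 0.03, 0.01)]
```

The low-threshold case (e.g. disk space) is the mirror image: raise below a low bound, clear above a higher one.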
In addition to the above, for successfully managing alerts, teams must:

- Implement best-effort alert deduplication to reduce redundant posts in the channel
- Establish proper escalation paths for tagging maintainers

### Metric Collection

PyTorch HUD uses a ClickHouse Cloud database as the data source for the dashboards, whose schema is defined [here](https://github.com/pytorch/test-infra/tree/main/clickhouse_db_schema).
All runners must publish metrics marked `mandatory` in the tables below.

> [!NOTE]
> TODO :: Fill in this section based on the current state of metrics from the WG meeting.

> Reviewer note: What do you have in mind when you say "performance metrics"?
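For runners publishing metric rows directly, one option is ClickHouse's HTTP interface with `JSONEachRow` input, which both HUD and Grafana could then query. The `runner_metrics` table and its columns here are hypothetical; the real schema lives under `clickhouse_db_schema` in pytorch/test-infra:

```python
import json
import urllib.parse
import urllib.request

def build_insert_request(endpoint: str, row: dict) -> urllib.request.Request:
    """Builds (but does not send) an INSERT ... FORMAT JSONEachRow request
    against ClickHouse's HTTP interface."""
    query = "INSERT INTO runner_metrics FORMAT JSONEachRow"  # hypothetical table
    return urllib.request.Request(
        url=f"{endpoint}/?query={urllib.parse.quote(query)}",
        data=json.dumps(row).encode(),
        method="POST",
    )

# Sending would be: urllib.request.urlopen(build_insert_request(...))
```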