196 changes: 196 additions & 0 deletions Observability-Guidelines.md
@@ -0,0 +1,196 @@
# GitHub Runners Observability Guidelines

> [!NOTE]
> These guidelines are an unreleased DRAFT.

## Overview

PyTorch's CI infrastructure serves as the backbone for continuous integration across multiple cloud providers, supporting thousands of developers and contributors worldwide. The reliability of this infrastructure directly impacts PyTorch's development velocity, code quality, and release cadence. Without comprehensive monitoring, issues such as runner failures, performance degradation, or capacity bottlenecks can go undetected, leading to delayed builds, frustrated contributors, and potential release delays. Monitoring enables proactive identification of problems, ensures optimal resource utilization, and provides the data necessary for capacity planning and infrastructure optimization. This is especially critical given PyTorch's position as a leading machine learning framework where build reliability directly affects the broader AI/ML ecosystem.

This document defines the mandatory monitoring and observability requirements for GitHub runners added to the PyTorch multi-cloud CI infrastructure. All runners must comply with these guidelines to ensure proper tracking of health, performance, and availability.
This document is split into two parts --

1. [Requirements](#requirements): guidelines on `what` is required to onboard a new runner system, and
2. [Implementation](#implementation): guidelines on `how` to fulfill those requirements in a manner consistent with the rest of the PyTorch CI infrastructure

## Requirements

### Runner Pool Stability

A candidate runner pool must:

- Undergo stability assessment before deployment in critical CI/CD workflows
- Maintain performance metrics during test jobs

  > **Reviewer:** What do you have in mind when you say "performance metrics"?

- Track resource utilization and stability patterns
- Document baseline performance metrics for each runner type
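
A baseline like the one required above can be captured as a simple statistical summary of test-job durations. Below is a minimal stdlib sketch; the `RunnerBaseline` shape and its field names are illustrative assumptions, not an existing PyTorch API:

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunnerBaseline:
    """Documented baseline for one runner type (illustrative shape)."""
    runner_type: str
    mean_duration_s: float
    stdev_duration_s: float
    sample_count: int


def compute_baseline(runner_type: str, durations_s: list[float]) -> RunnerBaseline:
    """Summarize test-job durations into a baseline record."""
    return RunnerBaseline(
        runner_type=runner_type,
        mean_duration_s=statistics.mean(durations_s),
        stdev_duration_s=statistics.stdev(durations_s) if len(durations_s) > 1 else 0.0,
        sample_count=len(durations_s),
    )


# Example: baseline from four assessment runs on a hypothetical runner type.
baseline = compute_baseline("linux.4xlarge", [300.0, 310.0, 290.0, 305.0])
```

A record like this, stored per runner type, gives the stability assessment a concrete artifact to compare later runs against.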

### Incident Management

Runner pools must:

- Implement real-time status monitoring
- Configure automated alerts for:
- Runner pool offline events
- Capacity reduction incidents
- Performance degradation
- Resource exhaustion
- Establish alert routing to:
- CI infrastructure team
- Community maintainers
- System administrators
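
As an illustration, the alert routing above could be expressed as a small per-alert-type routing table. The alert-type keys and team names below are placeholders, not an agreed-upon schema:

```python
# Placeholder routing table: alert type -> teams to notify.
ALERT_ROUTES = {
    "runner_pool_offline": ["ci-infrastructure-team", "system-administrators"],
    "capacity_reduction": ["ci-infrastructure-team", "community-maintainers"],
    "performance_degradation": ["ci-infrastructure-team"],
    "resource_exhaustion": ["ci-infrastructure-team", "system-administrators"],
}


def route_alert(alert_type: str) -> list[str]:
    """Return the teams to notify; unknown types default to the CI infra team."""
    return ALERT_ROUTES.get(alert_type, ["ci-infrastructure-team"])
```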

### Metrics Requirements

All runners must collect and expose the following metrics on [hud.pytorch.org/metrics](https://hud.pytorch.org/metrics).

> **Reviewer:** The HUD metrics are available to all CI clouds for free because the data is sourced directly from GitHub without considering the cloud the jobs are executed on.


#### Lifecycle Metrics

Runners must track:

- Registration/unregistration events
  > **Collaborator:** I couldn't find a webhook type for runner registration/unregistration on the GitHub side, so this may have to be produced by the various runners individually. When ARC is used, ARC will produce this metric out of the box.
  >
  > **Author:** Yes, I was thinking of ARC when I wrote this.

- Job start/completion times
  > **Collaborator:** GitHub can trigger webhooks for `workflow_run` and `workflow_job`; we could use those to collect job metrics centrally, but I don't know if that will show a breakdown per runner. Again, when ARC is used, ARC will produce this metric out of the box.
  >
  > **Author:** Yes, I was thinking of ARC when I wrote this.
  >
  > **Reviewer:** GitHub already shares this data by default. For context, we take the webhooks GitHub emits on job/workflow updates and store them directly into ClickHouse. The thing it sort of lacks, though, is tying that job start/end time to a particular cloud or instance. GitHub does share the runner label, which today is enough to uniquely identify a specific cloud, but I'm not sure how that will work with runner groups.

- Queue wait times
  > **Collaborator:** Is this the time the job queues after a runner has been assigned, or the time a job queues waiting for a runner to be assigned to it?
  >
  > **Author:** The former: the job queues after a runner has been assigned.
  >
  > **Reviewer:** We have rough measurements available for this by default today: we take the time the job was first pushed to ClickHouse (which happens after we get the new job-creation webhook from GitHub) and then take the delta against the time the job actually started running. It has a small room for error but is generally accurate enough.

- Job execution duration
- Resource utilization during jobs

  > **Reviewer:** Is this CPU/GPU/memory utilization? Is the idea to track this over time within a specific job? The exact shape of what's recorded will be important here. FYI, we have per-job utilization reports available today. Example report: https://hud.pytorch.org/utilization/15989391092/45100138715/1 (to find it, go to a commit page and click the "utilization report" button next to a job).

- Error rates and types
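
As one of the review comments notes, queue wait can be approximated from webhook timestamps: the delta between when the job-creation event was recorded and when the job actually started. A hedged stdlib sketch (ISO 8601 string inputs and the function name are assumptions for illustration):

```python
from datetime import datetime


def queue_wait_seconds(queued_at: str, started_at: str) -> float:
    """Approximate queue wait as the delta between the job-creation
    webhook timestamp and the job start timestamp (ISO 8601 strings)."""
    t0 = datetime.fromisoformat(queued_at)
    t1 = datetime.fromisoformat(started_at)
    return (t1 - t0).total_seconds()


# Example: a job queued at 10:00:00 that started at 10:03:30 waited 210 s.
wait = queue_wait_seconds("2024-01-01T10:00:00+00:00", "2024-01-01T10:03:30+00:00")
```

As the reviewer notes, this measurement has a small room for error (webhook delivery delay) but is generally accurate enough.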

#### Health Metrics

Runners must monitor:

- Heartbeat status
- System resource usage (CPU, Memory, Disk)
- Network connectivity
- GitHub API response times

  > **Reviewer:** And GitHub API success/failure rates; e.g., GH API token exhaustion used to be a frequent source of failures.

- Runner process health

  > **Reviewer:** Can you be more specific about what "process health" means? Is it a binary alive/dead question or something more granular?


## Technical Requirements

### OpenTelemetry Integration

All monitoring implementations must:

- Expose metrics in OpenTelemetry format

  > **Reviewer:** It would be good to define the exact format each metric is expected to be emitted in, so that they are all in a consistent shape and can be queried easily.

- Follow standardized metric naming conventions
- Use consistent labeling across all runners
- Implement proper metric aggregation and sampling

  > **Reviewer:** Is the idea that each cloud will host its own metrics (thus the aggregation/sampling requirements)? Or are we expecting all metrics-related data to be emitted to a central location, and letting that central service take care of aggregation/sampling?
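
One way to enforce standardized naming and consistent labeling is a small validator run against every metric a runner emits. The `runner.*` naming scheme and the required label set below are assumptions for illustration, not a ratified convention:

```python
import re

# Assumed convention: dot-separated lowercase segments under a "runner." prefix,
# e.g. "runner.job.duration". Illustrative only, not a ratified standard.
METRIC_NAME_RE = re.compile(r"^runner(\.[a-z][a-z0-9_]*)+$")
REQUIRED_LABELS = {"runner_type", "cloud"}  # assumed common label set


def validate_metric(name: str, labels: dict[str, str]) -> list[str]:
    """Return a list of convention violations (empty means compliant)."""
    errors = []
    if not METRIC_NAME_RE.fullmatch(name):
        errors.append(f"non-standard metric name: {name!r}")
    missing = REQUIRED_LABELS - labels.keys()
    if missing:
        errors.append(f"missing labels: {sorted(missing)}")
    return errors
```

Running such a check in CI would catch naming drift before inconsistent metrics reach the shared dashboards.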


### Service Level Requirements

Production runners must maintain:

- Minimum uptime of 99.9%
- Maximum job queue time of 5 minutes

  > **Reviewer:** I want to square this with the actual behavior we see in the Meta team, to make sure we don't hold other clouds accountable to a higher standard than what the Meta cloud is held to. Perhaps something like:
  >
  > - p99 queue time of 5 minutes per hour, expecting most jobs to have very little queuing
  > - pMax queue time of 30 minutes per day; exceeding this means there's an outage

- Job execution time variance within ±10% of baseline

  > **Reviewer:** Thanks to caching, some jobs tend to have high variance between runs. Maybe we can tighten this up to "P50 job execution time variance...".

- Response time to critical alerts within 15 minutes
- Maximum capacity reduction of 10%

  > **Reviewer:** With capacity defined as "the theoretical maximum number of jobs of a given type that can be run in this cloud in parallel"?
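
The reviewer's proposed p99 target could be checked with a simple nearest-rank percentile over queue-wait samples. A stdlib sketch; the 300-second limit reflects the proposed 5-minute target, which is still under discussion:

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (simple, stdlib-only)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def meets_queue_slo(queue_waits_s: list[float], p99_limit_s: float = 300.0) -> bool:
    """Check queue waits against the proposed p99 target (300 s = 5 minutes)."""
    return percentile(queue_waits_s, 99.0) <= p99_limit_s
```

With the nearest-rank definition, a single pathological outlier in a large window does not by itself breach the p99 target; the separate pMax check proposed above would catch it.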


### Dashboard Requirements

#### HUD Integration

The PyTorch CI HUD is a dashboard that consolidates metrics for tracking PyTorch's continuous integration (CI) system, including metrics related to runners.
The HUD provides a centralized view of these metrics, with dashboards such as the Metrics Dashboard, Flambeau, and Benchmark Metrics offering insights into runner performance and CI health.

Teams providing runners to the pool must:

- Implement OpenTelemetry data source integration with HUD
- Support a real-time status overview
- Support resource utilization graphs
- Provide alert history and status
- Provide runner pool capacity visualization
> **Collaborator** (on lines +95 to +101): Does HUD integration add any requirements on top of the metrics and technical requirements above? I think it would be good to call out any metric required for HUD specifically, and not include HUD itself as a requirement for runner providers.
>
> **Author:** @ZainRizvi might know better. I was thinking we need to centralize the monitoring in one place, and HUD seemed like a good candidate for it, since it already exists.
>
> **Reviewer:** HUD itself doesn't look at any infrastructure-related data today. Right now all of its data is taken directly from GitHub (so it'll automatically keep working for any new cloud that spins up), though we may add more infra data there in the future.
>
> **Reviewer:** For providing a data source, the general criterion might be to provide that data in a queryable format. We're considering migrating our dashboards over to Grafana (which is easier to build new dashboards on than HUD). The common bit both need is an interface that can be queried. Today both HUD and Grafana are powered by our ClickHouse database.


#### Alternative Dashboards

Teams may implement:
> **Collaborator:** What does "teams" refer to? I think it's worth clarifying. My take:
>
> - In case the pool is on a public cloud (credits, dedicated billing account, pre-provisioned resources), the team would be the PyTorch infra team.
> - In case of resources on a private cloud or private resource pool, the team would be the individuals from the company sharing resources who are responsible for their operations.
>
> **Author:** Ack.


- Grafana dashboard implementation
- Custom metrics visualization
- Alert management interface
- Performance reporting

### Documentation Requirements

Teams must:

- Maintain up-to-date monitoring documentation
- Maintain an architecture diagram detailing their runner CI infrastructure setup
- Document all custom metrics, monitoring endpoints, and escalation routes
- Document thresholds for raising and resolving alerts
- Document alert response procedures and playbooks for internal SREs/maintainers to follow when resolving alerts
- Review and update the documentation regularly to keep it from going stale

### Maintenance Requirements

Teams must:

- Conduct regular metric review
- Perform alert threshold tuning
- Optimize performance
- Plan for capacity

### Compliance Requirements

Teams must:

- Conduct regular review of monitoring effectiveness
- Perform quarterly metric analysis
- Update monitoring strategy annually
- Implement continuous improvement process

## Implementation

### System Architecture

To provide a clear separation between the PyTorch Foundation runners and community/partner runners, the following guidelines must be followed.
For details on getting started with onboarding a new runner, please refer to the [Partners Pytorch CI Runners](https://github.com/pytorch/test-infra/blob/main/docs/partners_pytorch_ci_runners.md) guide.

#### PyTorch Runners

Must implement:

- Dedicated monitoring namespace
- Resource quotas and limits
- Custom metrics for PyTorch-specific workloads
- Integration with existing PyTorch monitoring infrastructure

#### Community Runners

Must implement:

- Separate monitoring namespace
- Basic resource monitoring
- Job execution metrics
- Error tracking and reporting
> **Collaborator** (on lines +148 to +164): This section is not clear to me; perhaps we can expand on it a bit during the WG?
>
> **Author:** Ack.


### Alerting

All CI runners should post alerts to the [#pytorch-infra-alerts](https://pytorch.slack.com/archives/C082SHB006Q) channel in case of service degradation.

Teams must define clear alert thresholds as part of the runner [documentation requirements](#documentation-requirements).

Alerts may be of three types --

1. Raise `warning` alerts when observed values degrade past the P50 nominal threshold
2. Raise `error` alerts when observed values degrade past the P90 nominal threshold
3. Raise `critical` alerts when observed values degrade past the P99 nominal threshold

Each alert needs both a raise threshold and a clear threshold defined.

[<img src="assets/threshold-setting.png" width="400"/>](threshold-setting.png)

In general, a high raise threshold must be greater than its high clear threshold, and a low raise threshold must be less than its low clear threshold.
A common use case for a high raise threshold is runner HTTP 5XX error rate; a low raise threshold suits metrics such as available runner node disk space.
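
The raise/clear behavior described above amounts to a small hysteresis state machine: because the raise and clear thresholds differ, the alert does not flap when a metric hovers around a single value. A sketch, not part of any existing PyTorch tooling:

```python
class ThresholdAlert:
    """Raise/clear with hysteresis.

    For a "high" metric (e.g. HTTP 5XX error rate), raise_at > clear_at;
    for a "low" metric (e.g. free disk space), pass high=False and use
    raise_at < clear_at.
    """

    def __init__(self, raise_at: float, clear_at: float, high: bool = True):
        self.raise_at, self.clear_at, self.high = raise_at, clear_at, high
        self.active = False

    def observe(self, value: float) -> bool:
        """Update alert state with a new sample; return whether it is active."""
        if self.high:
            breached, recovered = value >= self.raise_at, value <= self.clear_at
        else:
            breached, recovered = value <= self.raise_at, value >= self.clear_at
        if breached:
            self.active = True
        elif recovered:
            self.active = False
        # Between the two thresholds, the previous state is retained.
        return self.active
```

For example, an error-rate alert with `raise_at=5.0` and `clear_at=2.0` raised at 6% stays active at 3% and only clears once the rate drops to 2% or below.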

In addition to the above, for successfully managing alerts, teams must:

- Implement best-effort alert deduplication so as to reduce redundant posts in the channel
- Establish proper escalation paths for tagging maintainers.

### Metric Collection

PyTorch HUD uses a ClickHouse Cloud database as the data source for its dashboards; the schema is defined [here](https://github.com/pytorch/test-infra/tree/main/clickhouse_db_schema).
All runners must publish metrics marked `mandatory` in the tables below.

> [!NOTE]
> TODO :: Fill in this section based on current state of metrics from WG meeting.
Binary file added assets/threshold-setting.png