DORA Metrics for Quay.io

These scripts implement some of the DORA metrics for quay.io. They rely on recent clones of the quay and app-interface repositories.

https://dora.dev

$ git clone https://github.com/quay/quay
$ git clone https://gitlab.cee.redhat.com/service/app-interface
$ git clone https://github.com/quay/dora
$ cd dora

DORA Metrics

We can compute all of the DORA metrics in one shot using dora.py as follows:

$ pip install -r requirements.txt
$ python dora.py --since 2025-01-01

This will generate a Google Spreadsheet with all of the data. (NOTE: currently we are only calculating Deployment Frequency and Change Fail Rate.)

There are additional Python scripts that can be used to pull specific metrics as well.

Deployment Frequency

A higher deployment frequency directly correlates to a more stable service that is able to react faster to issues and changing business conditions. It also keeps the team's skills sharp and operational practices current.

Every time we deploy quay.io, we update the ref in Quay's App Interface SaaS file (data/services/quayio/saas/quayio-py3.yaml). This ref points to a commit on the quay/quay master branch. By cross-referencing the refs recorded in the SaaS file's history against the quay repository, we get a fairly accurate measure of how many times we have deployed to quay.io, as well as the lead time from when a commit landed on master to when it went into production.

$  python time_to_deploy.py > ttd.csv

The output is structured as follows:

git_commit,quay_commit_date,saas_commit_date,elapsed_days
57915a5ef3b0065d878f51fae3bae892f9c019e9,2025-09-17,2025-09-17,0
03abd7c5bdcbe26436152b9ed814fe70ac8f134d,2025-08-11,2025-09-17,36
dc8ad71acdef45e28fe8c6186d1892b91c839f64,2025-08-07,2025-08-28,20
6273fb6046db4e83f38337aed21761998b267dc6,2025-08-01,2025-09-17,47
849da7625659ec1055bfca33a971a53c507b5abb,2025-07-31,2025-08-07,7
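
For illustration, here is a minimal sketch of that cross-referencing idea. It is not the actual time_to_deploy.py; the sibling-clone paths and the naive ref extraction from the SaaS file diff are assumptions.

# sketch_time_to_deploy.py -- illustrative only, not the real script
import csv
import subprocess
import sys
from datetime import date

SAAS_FILE = "data/services/quayio/saas/quayio-py3.yaml"

def git(repo, *args):
    return subprocess.check_output(["git", "-C", repo, *args], text=True)

def commit_date(repo, ref):
    # %cs prints the committer date as YYYY-MM-DD
    return date.fromisoformat(git(repo, "show", "-s", "--format=%cs", ref).strip())

writer = csv.writer(sys.stdout)
writer.writerow(["git_commit", "quay_commit_date", "saas_commit_date", "elapsed_days"])

# Every app-interface commit that touched the SaaS file is a candidate deployment.
for line in git("../app-interface", "log", "--format=%H %cs", "--", SAAS_FILE).splitlines():
    saas_commit, saas_date = line.split()
    # Naively pull the promoted ref(s) out of the diff introduced by that commit.
    diff = git("../app-interface", "show", "--format=", saas_commit, "--", SAAS_FILE)
    refs = [l.split("ref:", 1)[1].strip() for l in diff.splitlines()
            if l.startswith("+") and "ref:" in l]
    for ref in refs:
        try:
            quay_date = commit_date("../quay", ref)
        except subprocess.CalledProcessError:
            continue  # ref not present in the local quay clone
        elapsed = (date.fromisoformat(saas_date) - quay_date).days
        writer.writerow([ref, quay_date.isoformat(), saas_date, elapsed])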

We can also find out how often the service itself has been touched, beyond code deployments, by looking at merges to App Interface with the tenant-quayio label. The easiest way to do that is with the glab command, since plain git has no notion of GitLab labels.

$ glab mr list --merged  --label "tenant-quayio"

Unfortunately this CLI is fairly limited: the only way to get the merged_at time is to dump the entire JSON output (--output "json"), which limits the number of MRs you can retrieve. This still needs more investigation, but it looks promising for capturing non-code deployment changes.
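
One possible workaround is to query the GitLab REST API directly for merged MRs carrying the label, which returns merged_at for every page of results. A minimal sketch, assuming a GITLAB_TOKEN environment variable with API read access (the token variable name and the output layout are assumptions, not part of these scripts):

import csv
import os
import sys
import urllib.parse

import requests

# service/app-interface on gitlab.cee.redhat.com, URL-encoded as a project id
project = urllib.parse.quote("service/app-interface", safe="")
url = f"https://gitlab.cee.redhat.com/api/v4/projects/{project}/merge_requests"
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

writer = csv.writer(sys.stdout)
writer.writerow(["iid", "merged_at", "title"])

page = 1
while True:
    resp = requests.get(url, headers=headers, params={
        "state": "merged", "labels": "tenant-quayio",
        "per_page": 100, "page": page})
    resp.raise_for_status()
    mrs = resp.json()
    if not mrs:
        break
    for mr in mrs:
        writer.writerow([mr["iid"], mr["merged_at"], mr["title"]])
    page += 1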

Change Lead Time

How long it takes for a change to land in production is a direct measure of how fast we can deliver customer value.

(NOTE: we changed how we merge in 2023, so the Python code, which simply uses git log, does not work past that year.)

$ python merge_times.py > merges.csv

Use the gh command instead.

$ cd quay
$ gh pr list --state "merged" --limit=50000 --json number,createdAt,mergedAt | jq -r '(first | keys_unsorted) as $keys | $keys, map([.[ $keys[] ]])[] | @csv'
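
From that CSV, the lead time per PR is simply mergedAt minus createdAt. A small sketch of summarizing it, assuming the gh output was saved to a file such as prs.csv (the filename is an assumption):

import csv
import statistics
import sys
from datetime import datetime

def parse(ts):
    # gh emits RFC 3339 timestamps, e.g. 2025-09-17T12:34:56Z
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

with open(sys.argv[1]) as f:  # e.g. prs.csv saved from the gh command above
    rows = list(csv.DictReader(f))

lead_days = [(parse(r["mergedAt"]) - parse(r["createdAt"])).total_seconds() / 86400
             for r in rows if r["mergedAt"]]

print(f"merged PRs:       {len(lead_days)}")
print(f"median lead time: {statistics.median(lead_days):.1f} days")
print(f"mean lead time:   {statistics.mean(lead_days):.1f} days")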

Change Fail Rate

What percentage of deployments caused failures in production? We can pull the incident data from WebRCA and compare to our Deployment Frequency data.

First install the ocm tool (https://console.redhat.com/openshift/token) and authenticate it using Red Hat SSO. Then we can query the reporting endpoint (https://gitlab.cee.redhat.com/service/web-rca/-/blob/main/docs/reporting.md):

$ ocm get "/api/web-rca/v1/report/product?from=2025-01-01T00:00:00Z&to=2025-10-01T00:00:00Z&products=Quay"

year,month,status,count,mttr
2025,1,closed,1,64.2269660000000000
2025,2,closed,1,0.99674600000000000000
2025,3,closed,1,1177.5007650000000000
2025,4,closed,1,0.01395300000000000000
2025,5,closed,3,267.1782690000000000
2025,7,closed,2,167.1985730000000000
2025,8,resolved,1,20.9090240000000000
2025,9,resolved,1,24.6100230000000000

Next we can correlate incidents with deployments, using the information from Deployment Frequency above. Any incident created on the same date as a deployment counts as a Change Fail. Note that when filtering the /incidents endpoint by product_id we need the raw product identifier, not the product name; you can get it from the status-board endpoint, e.g. ocm get "/api/status-board/v1/products?search=name = 'Quay'".

$ ocm get "/api/web-rca/v1/incidents?size=100&product_id=a54ca112-520c-4a96-b1a7-c8597d16879d"  |  jq -r '(["incident_id", "created_at", "resolved_at"] | @tsv), (.items[] |  [.incident_id, .created_at, .resolved_at] | @tsv)' > incidents.csv

The output is structured as follows:

incident_id	created_at	resolved_at
ITN-2025-00240	2025-09-25T13:09:06.479267Z	2025-09-26T13:45:42.562678Z
ITN-2025-00204	2025-08-25T17:13:10.324841Z	2025-08-26T14:07:42.810192Z
ITN-2025-00174	2025-07-23T17:11:41.013355Z	2025-08-06T15:05:28.832943Z
ITN-2025-00153	2025-07-01T19:07:42.712377Z	2025-07-01T19:37:44.616658Z
...

From this info, let's identify any fails that happened on the same date as a deployment and within 12 hours of it.

$ python change_fail_rate.py --deploys=ttd.csv --incidents=incidents.csv

The output is structured as follows:

incident_id,created_at_datetime,saas_commit_datetime,resolution_hours
ITN-2023-00132,2023-10-13 15:32:58.898556+00:00,2023-10-13 10:07:19+00:00,1.7132465124999998
ITN-2023-00151,2023-11-12 17:51:02.017200+00:00,2023-11-12 16:40:00+00:00,20.755103945277778
ITN-2023-00165,2023-12-04 11:24:19.321817+00:00,2023-12-04 10:46:36+00:00,1.1635148458333333
ITN-2025-00116,2025-05-02 15:21:59.184076+00:00,2025-05-02 13:19:44+00:00,3.5789386155555554
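
For illustration, the matching step can be sketched roughly as follows. This is not the actual change_fail_rate.py: ttd.csv only carries dates, so the 12-hour window is approximated here by matching calendar dates alone.

import csv
from datetime import date, datetime

with open("ttd.csv") as f:
    deploys = [date.fromisoformat(r["saas_commit_date"]) for r in csv.DictReader(f)]

# incidents.csv is the tab-separated dump produced above
with open("incidents.csv") as f:
    incident_dates = {datetime.fromisoformat(r["created_at"].replace("Z", "+00:00")).date()
                      for r in csv.DictReader(f, delimiter="\t")}

failed = [d for d in deploys if d in incident_dates]

# Change Fail Rate = deployments followed by a same-day incident / total deployments
print(f"change fail rate (approx): {len(failed)}/{len(deploys)} = {len(failed) / len(deploys):.1%}")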

This does not necessarily mean the incidents were caused by the deployment; incidents can be raised for other reasons as well.

Failed Deployment Recovery Time

How fast were we able to recover from a failed deployment?

We can use the recovery times from the Change Fail data above to see how consistent we are and whether we are improving.
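
A small sketch of summarizing those recovery times, assuming the change_fail_rate.py output above was saved to change_fails.csv (the filename is an assumption):

import csv
import statistics

with open("change_fails.csv") as f:
    hours = [float(r["resolution_hours"]) for r in csv.DictReader(f)]

print(f"failed deployments:   {len(hours)}")
print(f"median recovery time: {statistics.median(hours):.1f} hours")
print(f"mean recovery time:   {statistics.mean(hours):.1f} hours")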
