ci: add per-host utilization view to runner-utilization report #24102

Merged
Kangyan-Zhou merged 8 commits into main from fix-runner-utilization-per-host on Apr 30, 2026

@alisonshao commented Apr 29, 2026

The runner-utilization report's Active hours and Utilization columns under-counted by ~2-4× on busy days. Three causes, all fixed:

  1. Sibling labels split capacity. A single B200 host advertises `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`, and `4-gpu-b200-kernel-low-disk`. The old denominator put it in every label's capacity bucket, while busy time was credited only to the one label that dispatched the job. Fix: the per-label numerator now sums all jobs on the pool's hosts (any sibling label), so siblings sharing a host set show the same utilization.
  2. Re-runs were dropped. The default `filter=latest` hides re-run attempts that consumed real host time. Switched to `filter=all`.
  3. GH API rate limits silently dropped runs. An empirical scan showed up to ~37% of runs failing on a 24h window, each dropped by `except Exception: return None`. Added retry-with-backoff (10 attempts, ≤60s, jitter), lowered threadpool concurrency 20→4, and pre-filtered out non-GPU workflows (docs/lint/release/etc.) so the API budget covers GPU work. The report now surfaces a "data completeness" warning at the top if any fetches still fail.

Report schema is unchanged: Summary by Runner Label → Concurrency Analysis → Recommendations.

Verified output: https://github.com/sgl-project/sglang/actions/runs/25152808318

Hosts advertise multiple overlapping labels — a single B200 machine
carries `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`,
`4-gpu-b200-kernel-low-disk`; a single H100 GPU pair carries
`2-gpu-runner`, `2-gpu-large`, `2-gpu-h100`, `1-gpu-h100-h200`. The
existing per-label utilization computes capacity as
`num_unique_runners_with_label * window`, so the same machine ends up
in the denominator of every label it advertises while its busy time
only contributes to the one label its job ran under. Result: per-label
utilization looks artificially low (e.g. 4-gpu-b200 at 8.6% while the
hosts are actually saturated and queue depth is high).

Add a "Per Host Utilization" section that aggregates by `runner_name`
so each physical machine is counted once, with its true busy time
clamped to the window. This is the source-of-truth view; the existing
per-label view is kept (with a note explaining the denominator
inflation) for per-class diagnostics.
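
The per-host aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `per_host_utilization` and the shape of the job dicts (datetime values under `started_at` and `completed_at`) are assumptions, and "clamped to the window" is interpreted here as clipping each job interval to the window and capping the per-host total.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def per_host_utilization(jobs, window_start, window_end):
    """Return {runner_name: utilization in [0, 1]} for one reporting window.

    `jobs` is assumed to be a list of dicts with `runner_name`,
    `started_at`, and `completed_at` (timezone-aware datetimes).
    """
    window_s = (window_end - window_start).total_seconds()
    busy = defaultdict(float)
    for job in jobs:
        # Clip the job interval so time outside the window is not counted.
        start = max(job["started_at"], window_start)
        end = min(job["completed_at"], window_end)
        if end > start:
            busy[job["runner_name"]] += (end - start).total_seconds()
    # Cap at the window length so a host never reports more than 100%.
    return {host: min(b, window_s) / window_s for host, b in busy.items()}
```

Because the key is `runner_name`, a host that advertises several labels still appears exactly once, regardless of which label dispatched each job.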

Failure example: https://github.com/sgl-project/sglang/actions/runs/25099608265

The previous per-label utilization computed capacity as
`num_unique_hosts_with_label * window`, but the same machine carries
many overlapping labels (e.g. one B200 host advertises `4-gpu-b200`,
`4-gpu-b200-kernel`, `4-gpu-b200-low-disk`,
`4-gpu-b200-kernel-low-disk`; one H100 GPU pair carries
`2-gpu-runner`, `2-gpu-large`, `2-gpu-h100`, `1-gpu-h100-h200`). The
old denominator put each shared host in every label's capacity
denominator while the busy-time numerator only got credited once,
making the utilization column read e.g. 4-gpu-b200 = 8.0% even when
the underlying hosts were queued up.

Switch to one row per host pool — labels that map to an identical
host set are collapsed (so the four `4-gpu-b200*` labels show up as
one row), and the numerator counts every job that ran on the pool's
hosts (any of the labels) so the percentage reflects real hardware
occupancy. Also drop the Per Host / Concurrency Analysis /
Recommendations sections — the single table is the report.
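
One way to collapse labels that map to an identical host set is to key on a `frozenset` of host names. The sketch below is illustrative only; `group_labels_by_host_set` and the `label_hosts` mapping are hypothetical names, not taken from the report script.

```python
from collections import defaultdict


def group_labels_by_host_set(label_hosts):
    """Collapse labels that share exactly the same hosts into one pool.

    `label_hosts` is assumed to map label -> iterable of host names.
    Returns {frozenset(hosts): sorted list of labels}.
    """
    pools = defaultdict(list)
    for label, hosts in label_hosts.items():
        # frozenset is hashable and order-insensitive, so identical
        # host sets land in the same bucket.
        pools[frozenset(hosts)].append(label)
    return {hosts: sorted(labels) for hosts, labels in pools.items()}
```

With this grouping, the four `4-gpu-b200*` labels produce a single row as long as they advertise the same hosts; a label with even one extra host stays in its own row.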

Failure example: https://github.com/sgl-project/sglang/actions/runs/25099608265
Earlier fix attempt that still showed the wrong column: https://github.com/sgl-project/sglang/actions/runs/25138480934

Active (hrs) and Utilization now reflect the actual host pool's busy
time (sum across all jobs on the hosts that advertise the label,
regardless of which sibling label dispatched them). Sibling labels
like `4-gpu-b200` and `4-gpu-b200-low-disk` share physical hosts,
so their utilization now reflects real hardware saturation instead
of being divided across labels and reported as artificially low.
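
The arithmetic behind this change can be illustrated with a small sketch. It assumes simplified inputs (job durations already in hours, a hypothetical `label_hosts` mapping) and is not the script's actual implementation.

```python
from collections import defaultdict


def pool_utilization(jobs, label_hosts, window_hours):
    """Utilization per label, crediting every job on the label's hosts.

    `jobs` is assumed to be a list of dicts with `runner_name` and `hours`;
    `label_hosts` maps label -> set of host names advertising that label.
    """
    busy_by_host = defaultdict(float)
    for job in jobs:
        # Busy time is attributed to the host, not to the dispatching label.
        busy_by_host[job["runner_name"]] += job["hours"]
    util = {}
    for label, hosts in label_hosts.items():
        capacity = len(hosts) * window_hours
        busy = sum(min(busy_by_host.get(h, 0.0), window_hours) for h in hosts)
        util[label] = busy / capacity if capacity else 0.0
    return util
```

For example, if one host runs 6 hours of jobs dispatched under `4-gpu-b200` and 12 hours under `4-gpu-b200-low-disk` in a 24-hour window, both labels report 75%. Crediting each label only with its own jobs would have given 25% and 50% for the same fully shared machine.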

Reverts the verbose Per Host / Concurrency / Recommendations sections
added in the prior commit; the table is the report.

Two fixes for cases where the previous numbers under-counted real
host activity:

1. Use the union of (currently-online API runners advertising the
   label) and (hosts observed in job data with the label) as the
   host pool. The previous fallback was elif-only — if any host was
   listed in the API, job-observed hosts that had since gone offline
   were dropped. Busy time on those hosts is real capacity
   consumed, so the pool now includes them.

2. Bump pagination safety limits — workflow runs 20 -> 50 pages
   (2000 -> 5000 runs), jobs per run 5 -> 20 pages (500 -> 2000
   jobs). On busy 24h windows the old caps could silently drop runs
   and jobs from busy workflows like pr-test, undercounting busy
   time in the numerator.
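
The host-pool union from item 1 might look like the sketch below. The input shapes are simplified assumptions: runner records are taken to carry a flat list of label strings, and `host_pool` is a hypothetical name.

```python
def host_pool(label, api_runners, jobs):
    """Hosts for `label`: online API runners plus hosts seen in job data.

    `api_runners` is assumed to be a list of dicts with `name` and a flat
    `labels` list of strings; `jobs` a list of dicts with `runner_name`
    and `labels`.
    """
    online = {r["name"] for r in api_runners if label in r["labels"]}
    observed = {
        j["runner_name"]
        for j in jobs
        if label in j["labels"] and j.get("runner_name")
    }
    # Union rather than fallback: a host that ran jobs in the window but
    # has since gone offline still counts toward the pool.
    return online | observed
```

An `elif`-style fallback would return only `online` whenever it is non-empty, which is the behavior the commit describes replacing.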
The /jobs endpoint defaults to filter=latest, returning only the most
recent attempt of each job. Re-runs (manual or via tag-and-rerun-ci)
each consumed real host time on the runner pool — adding filter=all
sums them all in so the utilization numerator reflects actual GPU
time spent, not just the last attempt of each retried job.
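
As a rough sketch of the effect of `filter=all`, the helpers below build the jobs endpoint path with that query parameter and sum busy time across every returned attempt. The helper names are hypothetical, and the job dicts are assumed to carry ISO 8601 `started_at` and `completed_at` strings as the jobs endpoint returns them.

```python
from datetime import datetime


def jobs_endpoint(repo, run_id, page=1, per_page=100):
    """Path for listing a run's jobs, including re-run attempts."""
    return (
        f"repos/{repo}/actions/runs/{run_id}/jobs"
        f"?filter=all&per_page={per_page}&page={page}"
    )


def _parse(ts):
    # Accept the trailing "Z" form on Python versions before 3.11.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def busy_seconds(jobs):
    """Sum run time over all jobs, counting each attempt separately."""
    total = 0.0
    for job in jobs:
        started, completed = job.get("started_at"), job.get("completed_at")
        if started and completed:
            total += (_parse(completed) - _parse(started)).total_seconds()
    return total
```

With `filter=latest`, a job retried twice would contribute only its final attempt; with `filter=all`, each attempt appears as its own entry and is summed.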
The threadpool caller previously had `except Exception: return None`,
which silently dropped any run whose jobs couldn't be fetched —
causing the utilization numerator to fluctuate wildly between back-to-back
runs of the same window (one run saw 11k jobs, the next saw 39k). On
busy windows the secondary rate limit kicks in often, so the report
could under-count the actual workload by 2-3x.

Fixes:
- Add retry-with-backoff in run_gh_command for 5xx, secondary rate
  limit, abuse detection, and network-reset signatures (5 attempts,
  1-16s exponential backoff with jitter).
- Lower threadpool concurrency from 20 to 10 to stay below the
  secondary rate-limit threshold.
- Surface the count of runs that still failed after retries instead
  of silently dropping them, so the next iteration can see whether
  the report is complete.
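
The three fixes above can be sketched together as follows. This is an illustrative outline under stated assumptions, not the code in `run_gh_command`: `TransientError`, `retry_with_backoff`, and `fetch_all` are hypothetical names, the jitter formula is one common choice, and the caller is assumed to map 5xx, rate-limit, and network-reset failures onto `TransientError`.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class TransientError(Exception):
    """Stand-in for retryable failures (5xx, secondary rate limit, resets)."""


def retry_with_backoff(fn, attempts=5, base_delay=1.0, max_delay=16.0,
                       sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of retries: let the caller record the failure.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so workers do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))


def fetch_all(run_ids, fetch_jobs, max_workers=10, sleep=time.sleep):
    """Fetch jobs for every run; return (results, failed_run_ids)."""
    results, failed = {}, []

    def task(run_id):
        return retry_with_backoff(lambda: fetch_jobs(run_id), sleep=sleep)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {run_id: pool.submit(task, run_id) for run_id in run_ids}
    for run_id, future in futures.items():
        try:
            results[run_id] = future.result()
        except Exception:
            # Record the failure instead of silently returning None.
            failed.append(run_id)
    return results, failed
```

Returning `failed` alongside `results` lets the report state how many runs are missing, so a partial fetch is visible rather than looking like a quiet day.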

@Kangyan-Zhou Kangyan-Zhou merged commit 7bb7f60 into main Apr 30, 2026
61 of 65 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the fix-runner-utilization-per-host branch April 30, 2026 17:05
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026