ci: add per-host utilization view to runner-utilization report by alisonshao · Pull Request #24102 · sgl-project/sglang

alisonshao · 2026-04-29T23:03:00Z

The runner-utilization report's Active hours and Utilization columns under-counted by ~2-4× on busy days. Three causes, all fixed:

Sibling labels split capacity. A single B200 host advertises 4-gpu-b200, 4-gpu-b200-kernel, 4-gpu-b200-low-disk, 4-gpu-b200-kernel-low-disk — the old denominator put it in every label's capacity bucket while busy time only credited the one label that dispatched the job. Fix: per-label numerator now sums all jobs on the pool's hosts (any sibling label), so siblings sharing a host set show the same util.
Re-runs were dropped. Default filter=latest hides re-run attempts that consumed real host time. Switched to filter=all.
GH API rate limits silently dropped runs. An empirical scan showed up to ~37% of run fetches failing in a 24h window, each one swallowed by `except Exception: return None`. Added retry-with-backoff (10 attempts, ≤60s, jitter), lowered threadpool concurrency 20→4, and pre-filtered out non-GPU workflows (docs/lint/release/etc.) so the API budget covers GPU work. The report now surfaces a "data completeness" warning at the top if any fetches still fail.
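The sibling-label fix in the first item can be illustrated with a small arithmetic sketch. The data below is a toy example (the host name `b200-host-0` and the busy hours are made up, not taken from the report), and the function names are hypothetical rather than the script's own:

```python
# Toy data, not taken from the real report: one host advertising two sibling labels.
WINDOW_HOURS = 24.0

# label -> hosts that advertise it (host name is hypothetical)
label_hosts = {
    "4-gpu-b200": {"b200-host-0"},
    "4-gpu-b200-kernel": {"b200-host-0"},
}

# (host, label that dispatched the job, busy hours)
jobs = [
    ("b200-host-0", "4-gpu-b200", 6.0),
    ("b200-host-0", "4-gpu-b200-kernel", 12.0),
]


def old_utilization(label):
    """Old behaviour: only jobs dispatched under this exact label are credited."""
    busy = sum(hours for _, lbl, hours in jobs if lbl == label)
    return busy / (len(label_hosts[label]) * WINDOW_HOURS)


def new_utilization(label):
    """New behaviour: every job on the label's hosts is credited, whichever sibling dispatched it."""
    hosts = label_hosts[label]
    busy = sum(hours for host, _, hours in jobs if host in hosts)
    return busy / (len(hosts) * WINDOW_HOURS)


print(old_utilization("4-gpu-b200"))         # 0.25
print(new_utilization("4-gpu-b200"))         # 0.75
print(new_utilization("4-gpu-b200-kernel"))  # 0.75
```

In this toy case the host was busy 18 of 24 hours, but the old calculation shows 25% for `4-gpu-b200` because the 12 hours dispatched under the sibling label are not credited to it.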

Report schema is unchanged: Summary by Runner Label → Concurrency Analysis → Recommendations.

Verified output: https://github.com/sgl-project/sglang/actions/runs/25152808318

Hosts advertise multiple overlapping labels — a single B200 machine carries `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`, `4-gpu-b200-kernel-low-disk`; a single H100 GPU pair carries `2-gpu-runner`, `2-gpu-large`, `2-gpu-h100`, `1-gpu-h100-h200`. The existing per-label utilization computes capacity as `num_unique_runners_with_label * window`, so the same machine ends up in the denominator of every label it advertises while its busy time only contributes to the one label its job ran under. Result: per-label utilization looks artificially low (e.g. 4-gpu-b200 at 8.6% while the hosts are actually saturated and queue depth is high). Add a "Per Host Utilization" section that aggregates by `runner_name` so each physical machine is counted once, with its true busy time clamped to the window. This is the source-of-truth view; the existing per-label view is kept (with a note explaining the denominator inflation) for per-class diagnostics. Failure example: https://github.com/sgl-project/sglang/actions/runs/25099608265
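A minimal sketch of the per-host aggregation described above, assuming each job is available as a `(runner_name, started_at, completed_at)` tuple of timezone-aware datetimes. The function name and the tuple layout are assumptions for illustration; the actual script may structure this differently:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def per_host_utilization(jobs, window_start, window_end):
    """Return {runner_name: utilization in [0, 1]} with each host counted once."""
    window = (window_end - window_start).total_seconds()
    busy = defaultdict(float)
    for runner_name, started, completed in jobs:
        # Clip each job to the reporting window so spill-over is not counted.
        start = max(started, window_start)
        end = min(completed, window_end)
        if end > start:
            busy[runner_name] += (end - start).total_seconds()
    # Clamp the total as well: overlapping jobs on one host could otherwise exceed 100%.
    return {host: min(seconds, window) / window for host, seconds in busy.items()}


if __name__ == "__main__":
    t0 = datetime(2026, 4, 29, tzinfo=timezone.utc)
    t1 = t0 + timedelta(hours=24)
    sample = [("host-a", t0 - timedelta(hours=2), t0 + timedelta(hours=4))]
    print(per_host_utilization(sample, t0, t1))
```

Summing clipped durations and then clamping is the simplest reading of "busy time clamped to the window"; merging overlapping intervals would be a stricter alternative.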


The previous per-label utilization computed capacity as `num_unique_hosts_with_label * window`, but the same machine carries many overlapping labels (e.g. one B200 host advertises `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`, `4-gpu-b200-kernel-low-disk`; one H100 GPU pair carries `2-gpu-runner`, `2-gpu-large`, `2-gpu-h100`, `1-gpu-h100-h200`). The old denominator put each shared host in every label's capacity denominator while the busy-time numerator only got credited once, making the utilization column read e.g. 4-gpu-b200 = 8.0% even when the underlying hosts were queued up. Switch to one row per host pool — labels that map to an identical host set are collapsed (so the four `4-gpu-b200*` labels show up as one row), and the numerator counts every job that ran on the pool's hosts (any of the labels) so the percentage reflects real hardware occupancy. Also drop the Per Host / Concurrency Analysis / Recommendations sections — the single table is the report. Failure example: https://github.com/sgl-project/sglang/actions/runs/25099608265 Earlier fix attempt that still showed the wrong column: https://github.com/sgl-project/sglang/actions/runs/25138480934
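One way to collapse labels that map to an identical host set is to key the grouping on a `frozenset` of host names. This is a sketch under that assumption; the function name and the host names in the usage example are hypothetical:

```python
from collections import defaultdict


def collapse_labels_by_host_pool(label_hosts):
    """Group labels whose host sets are identical into one row per host pool.

    label_hosts: {label: iterable of host names advertising that label}
    Returns {frozenset(hosts): sorted list of labels sharing exactly those hosts}.
    """
    pools = defaultdict(list)
    for label, hosts in label_hosts.items():
        pools[frozenset(hosts)].append(label)
    return {pool: sorted(labels) for pool, labels in pools.items()}


if __name__ == "__main__":
    rows = collapse_labels_by_host_pool({
        "4-gpu-b200": {"b200-0", "b200-1"},
        "4-gpu-b200-kernel": {"b200-0", "b200-1"},
        "2-gpu-h100": {"h100-0"},
    })
    for hosts, labels in rows.items():
        print(sorted(hosts), labels)
```

Labels whose host sets merely overlap, rather than match exactly, stay in separate rows under this scheme.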

Active (hrs) and Utilization now reflect the actual host pool's busy time (sum across all jobs on the hosts that advertise the label, regardless of which sibling label dispatched them). Sibling labels like `4-gpu-b200` and `4-gpu-b200-low-disk` share physical hosts, so their utilization now reflects real hardware saturation instead of being divided across labels and reported as artificially low. Reverts the verbose Per Host / Concurrency / Recommendations sections added in the prior commit — the table is the report.

Two fixes for cases where the previous numbers under-counted real host activity:

1. Use the union of (currently-online API runners advertising the label) and (hosts observed in job data with the label) as the host pool. The previous fallback was elif-only: if any host was listed in the API, job-observed hosts that had since gone offline were dropped. Busy time on those hosts is real capacity consumed; including them captures it.
2. Bump pagination safety limits: workflow runs 20 -> 50 pages (2000 -> 5000 runs), jobs per run 5 -> 20 pages (500 -> 2000 jobs). On busy 24h windows the old caps could silently drop runs and jobs from busy workflows like pr-test, undercounting busy time in the numerator.
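The difference between the fallback-only pool and the union can be sketched as follows. Both function names are hypothetical, and `host_pool_fallback_only` is only a guess at the shape of the earlier logic based on the description above:

```python
def host_pool_fallback_only(api_hosts, observed_hosts):
    """Assumed shape of the earlier logic: observed hosts used only if the API lists none."""
    if api_hosts:
        return set(api_hosts)
    return set(observed_hosts)


def host_pool_union(api_hosts, observed_hosts):
    """New logic: any host seen either online in the API or in job data belongs to the pool."""
    return set(api_hosts) | set(observed_hosts)


if __name__ == "__main__":
    api = {"b200-0"}                 # currently online
    observed = {"b200-0", "b200-1"}  # b200-1 ran jobs, then went offline
    print(sorted(host_pool_fallback_only(api, observed)))  # ['b200-0']
    print(sorted(host_pool_union(api, observed)))          # ['b200-0', 'b200-1']
```

Note that the union raises both the numerator (jobs on `b200-1` are now counted) and the denominator (one more host of capacity).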

The /jobs endpoint defaults to filter=latest, returning only the most recent attempt of each job. Re-runs (manual or via tag-and-rerun-ci) each consumed real host time on the runner pool — adding filter=all sums them all in so the utilization numerator reflects actual GPU time spent, not just the last attempt of each retried job.
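A sketch of what requesting every attempt and summing its time might look like. The `filter=all` query parameter and the `started_at` / `completed_at` job fields are part of the GitHub REST API; the helper names `jobs_path` and `busy_seconds` are hypothetical, and the real script may fetch and sum differently:

```python
from datetime import datetime


def jobs_path(repo, run_id, page=1, per_page=100):
    """API path listing jobs for a workflow run, including re-run attempts."""
    return (f"repos/{repo}/actions/runs/{run_id}/jobs"
            f"?filter=all&per_page={per_page}&page={page}")


def _parse(timestamp):
    # GitHub timestamps look like 2026-04-29T23:03:00Z; map Z to an explicit offset.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def busy_seconds(jobs):
    """Sum wall-clock time over every job attempt that both started and completed."""
    total = 0.0
    for job in jobs:
        started, completed = job.get("started_at"), job.get("completed_at")
        if started and completed:
            total += max(0.0, (_parse(completed) - _parse(started)).total_seconds())
    return total


if __name__ == "__main__":
    print(jobs_path("sgl-project/sglang", 25152808318))
```

With `filter=all`, a job that was re-run once appears twice in the response (once per attempt), so both attempts contribute to the sum.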

The threadpool caller previously had `except Exception: return None`, which silently dropped any run whose jobs couldn't be fetched, causing the utilization numerator to fluctuate wildly between back-to-back runs of the same window (one run saw 11k jobs, the next saw 39k). On busy windows the secondary rate limit kicks in often, so the report could under-count the actual workload by 2-3x. Fixes:

- Add retry-with-backoff in run_gh_command for 5xx, secondary rate limit, abuse detection, and network-reset signatures (5 attempts, 1-16s exponential backoff with jitter).
- Lower threadpool concurrency from 20 to 10 to stay below the secondary rate-limit threshold.
- Surface the count of runs that still failed after retries instead of silently dropping them, so the next iteration can see whether the report is complete.
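The retry behaviour described above could look roughly like the sketch below. It is not the PR's `run_gh_command`: the helper name `call_with_backoff`, the `RetryableError` marker class, and the 10% jitter are assumptions made for illustration, with the attempt count and 1-16s delay range borrowed from the commit message:

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical marker for 5xx, rate-limit, abuse-detection, or connection-reset failures."""


def call_with_backoff(fn, attempts=5, base=1.0, cap=16.0, sleep=time.sleep):
    """Call fn(), retrying on RetryableError with capped exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller count this as a failed fetch.
            delay = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, 8s, capped at 16s
            sleep(delay + random.uniform(0, delay * 0.1))  # jitter spreads out retries
```

Re-raising after the final attempt, rather than returning `None`, lets the caller tally failed runs and report them instead of dropping them silently.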

alisonshao · 2026-04-30T06:21:25Z

https://github.com/sgl-project/sglang/actions/runs/25150621390


alisonshao added 6 commits April 29, 2026 16:35

ci: skip non-GPU workflows + lower concurrency to dodge GH rate limits

0cd6cd9

ci: add back Concurrency Analysis + Recommendations sections

9cd17c8

Kangyan-Zhou merged commit 7bb7f60 into main Apr 30, 2026
61 of 65 checks passed

Kangyan-Zhou deleted the fix-runner-utilization-per-host branch April 30, 2026 17:05

vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

ci: add per-host utilization view to runner-utilization report (sgl-p…

dd09e75

…roject#24102)

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

ci: add per-host utilization view to runner-utilization report (sgl-p…

ff203e2

…roject#24102)

Kangyan-Zhou merged 8 commits into main from fix-runner-utilization-per-host
