ci: add per-host utilization view to runner-utilization report#24102
Merged
Hosts advertise multiple overlapping labels: a single B200 machine carries `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`, `4-gpu-b200-kernel-low-disk`; a single H100 GPU pair carries `2-gpu-runner`, `2-gpu-large`, `2-gpu-h100`, `1-gpu-h100-h200`. The existing per-label utilization computes capacity as `num_unique_runners_with_label * window`, so the same machine ends up in the denominator of every label it advertises while its busy time only contributes to the one label its job ran under. Result: per-label utilization looks artificially low (e.g. `4-gpu-b200` at 8.6% while the hosts are actually saturated and queue depth is high).

Add a "Per Host Utilization" section that aggregates by `runner_name` so each physical machine is counted once, with its true busy time clamped to the window. This is the source-of-truth view; the existing per-label view is kept (with a note explaining the denominator inflation) for per-class diagnostics.

Failure example: https://github.com/sgl-project/sglang/actions/runs/25099608265
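The denominator inflation described above can be reproduced with a small sketch. The job records, the `label` and `duration_s` field names, and both helper functions are hypothetical stand-ins chosen for illustration; they are not the report script's actual code.

```python
from collections import defaultdict


def per_label_utilization(jobs, host_labels, window_s):
    """Per-label view: every host advertising a label counts toward its capacity."""
    hosts_per_label = defaultdict(set)
    for host, labels in host_labels.items():
        for label in labels:
            hosts_per_label[label].add(host)

    # Busy time is credited only to the label the job was dispatched under.
    busy = defaultdict(float)
    for job in jobs:
        busy[job["label"]] += job["duration_s"]

    return {
        label: busy[label] / (len(hosts) * window_s)
        for label, hosts in hosts_per_label.items()
    }


def per_host_utilization(jobs, window_s):
    """Per-host view: each machine is counted once, busy time clamped to the window."""
    busy = defaultdict(float)
    for job in jobs:
        busy[job["runner_name"]] += job["duration_s"]
    return {host: min(b, window_s) / window_s for host, b in busy.items()}


# One physical host advertising four labels, kept busy for the whole window
# by four jobs that were each dispatched under a different label.
LABELS = [
    "4-gpu-b200",
    "4-gpu-b200-kernel",
    "4-gpu-b200-low-disk",
    "4-gpu-b200-kernel-low-disk",
]
host_labels = {"b200-host-0": LABELS}
jobs = [
    {"runner_name": "b200-host-0", "label": label, "duration_s": 25.0}
    for label in LABELS
]

by_label = per_label_utilization(jobs, host_labels, window_s=100.0)
by_host = per_host_utilization(jobs, window_s=100.0)
```

In this toy window every label reads 25% while the only host behind them is fully occupied, which is the same shape as the `4-gpu-b200` reading quoted above.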
The previous per-label utilization computed capacity as `num_unique_hosts_with_label * window`, but the same machine carries many overlapping labels (e.g. one B200 host advertises `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`, `4-gpu-b200-kernel-low-disk`; one H100 GPU pair carries `2-gpu-runner`, `2-gpu-large`, `2-gpu-h100`, `1-gpu-h100-h200`). The old denominator put each shared host in every label's capacity denominator while the busy-time numerator only got credited once, making the utilization column read e.g. `4-gpu-b200` = 8.0% even when the underlying hosts were queued up.

Switch to one row per host pool: labels that map to an identical host set are collapsed (so the four `4-gpu-b200*` labels show up as one row), and the numerator counts every job that ran on the pool's hosts (under any of the labels), so the percentage reflects real hardware occupancy. Also drop the Per Host / Concurrency Analysis / Recommendations sections; the single table is the report.

Failure example: https://github.com/sgl-project/sglang/actions/runs/25099608265

Earlier fix attempt that still showed the wrong column: https://github.com/sgl-project/sglang/actions/runs/25138480934
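One way to collapse labels into host pools is to key them by their host set. The sketch below assumes a `hosts_per_label` mapping and job records with `runner_name` and `duration_s` fields; all names are hypothetical and the report script may structure this differently.

```python
from collections import defaultdict


def collapse_into_pools(hosts_per_label):
    """Group labels whose host sets are identical into one pool."""
    pools = defaultdict(list)
    for label, hosts in hosts_per_label.items():
        pools[frozenset(hosts)].append(label)
    return {hosts: sorted(labels) for hosts, labels in pools.items()}


def pool_rows(pools, jobs, window_s):
    """One row per pool; the numerator counts every job that ran on the pool's hosts."""
    rows = []
    for hosts, labels in pools.items():
        busy = sum(j["duration_s"] for j in jobs if j["runner_name"] in hosts)
        rows.append(
            {
                "labels": labels,
                "hosts": len(hosts),
                "utilization": busy / (len(hosts) * window_s),
            }
        )
    return rows


hosts_per_label = {
    "4-gpu-b200": {"b200-a", "b200-b"},
    "4-gpu-b200-kernel": {"b200-a", "b200-b"},
    "2-gpu-h100": {"h100-a"},
}
jobs = [
    {"runner_name": "b200-a", "duration_s": 50.0},
    {"runner_name": "b200-b", "duration_s": 100.0},
    {"runner_name": "h100-a", "duration_s": 25.0},
]
rows = pool_rows(collapse_into_pools(hosts_per_label), jobs, window_s=100.0)
```

Using a `frozenset` as the key means two labels land in the same row only when their host sets match exactly; labels with partially overlapping hosts stay in separate rows.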
Active (hrs) and Utilization now reflect the actual host pool's busy time (sum across all jobs on the hosts that advertise the label, regardless of which sibling label dispatched them). Sibling labels like `4-gpu-b200` and `4-gpu-b200-low-disk` share physical hosts, so their utilization now reflects real hardware saturation instead of being divided across labels and reported as artificially low. Reverts the verbose Per Host / Concurrency / Recommendations sections added in the prior commit — the table is the report.
Two fixes for cases where the previous numbers under-counted real host activity:

1. Use the union of (currently-online API runners advertising the label) and (hosts observed in job data with the label) as the host pool. The previous fallback was elif-only: if any host was listed in the API, job-observed hosts that had since gone offline were dropped. Their busy time on those hosts is real capacity consumed; including them captures it.
2. Bump pagination safety limits: workflow runs 20 -> 50 pages (2000 -> 5000 runs), jobs per run 5 -> 20 pages (500 -> 2000 jobs). On busy 24h windows the old caps could silently drop runs and jobs from busy workflows like pr-test, undercounting busy time in the numerator.
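The difference between the elif-only fallback and the union can be sketched as follows. Both function names and the exact shape of the old branch are assumptions reconstructed from the description above, and the constant names are hypothetical; only the page counts come from the commit message.

```python
def host_pool_elif(api_hosts, observed_hosts):
    """Assumed shape of the old behavior: job-observed hosts are used only
    when the API lists no hosts at all for the label."""
    if api_hosts:
        return set(api_hosts)
    elif observed_hosts:
        return set(observed_hosts)
    return set()


def host_pool_union(api_hosts, observed_hosts):
    """New behavior: hosts that ran jobs but have since gone offline still count."""
    return set(api_hosts) | set(observed_hosts)


# "b200-b" ran jobs during the window but is offline when the API is queried.
api_hosts = {"b200-a"}
observed_hosts = {"b200-a", "b200-b"}

old_pool = host_pool_elif(api_hosts, observed_hosts)
new_pool = host_pool_union(api_hosts, observed_hosts)

# Pagination caps implied by the numbers above, at 100 items per page.
PER_PAGE = 100
MAX_RUN_PAGES = 50  # previously 20
MAX_JOB_PAGES = 20  # previously 5
max_runs = MAX_RUN_PAGES * PER_PAGE
max_jobs_per_run = MAX_JOB_PAGES * PER_PAGE
```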
The `/jobs` endpoint defaults to `filter=latest`, returning only the most recent attempt of each job. Re-runs (manual or via tag-and-rerun-ci) each consumed real host time on the runner pool. Passing `filter=all` includes every attempt, so the utilization numerator reflects actual GPU time spent, not just the last attempt of each retried job.
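A minimal sketch of the request path and of summing busy time across attempts. The helper names are hypothetical; `filter`, `per_page`, `started_at`, `completed_at`, and `run_attempt` are the parameter and field names used by the GitHub REST API for workflow jobs, and the sample records are made up for illustration.

```python
from datetime import datetime

_TS = "%Y-%m-%dT%H:%M:%SZ"  # timestamp layout of started_at / completed_at


def jobs_endpoint(repo, run_id, include_reruns=True, per_page=100):
    """Build the jobs path for one run; filter=all returns every attempt."""
    flt = "all" if include_reruns else "latest"
    return f"repos/{repo}/actions/runs/{run_id}/jobs?filter={flt}&per_page={per_page}"


def job_seconds(job):
    """Wall-clock seconds one job attempt occupied its runner."""
    start = datetime.strptime(job["started_at"], _TS)
    end = datetime.strptime(job["completed_at"], _TS)
    return (end - start).total_seconds()


# Two attempts of the same job: with filter=latest only attempt 2 would be
# returned, hiding the 30 minutes the first attempt spent on the host.
attempts = [
    {"name": "unit-test", "run_attempt": 1,
     "started_at": "2026-04-29T10:00:00Z", "completed_at": "2026-04-29T10:30:00Z"},
    {"name": "unit-test", "run_attempt": 2,
     "started_at": "2026-04-29T11:00:00Z", "completed_at": "2026-04-29T11:15:00Z"},
]
total_busy_s = sum(job_seconds(j) for j in attempts)
latest_only_s = job_seconds(attempts[-1])
path = jobs_endpoint("sgl-project/sglang", 25099608265)
```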
The threadpool caller previously had `except Exception: return None`, which silently dropped any run whose jobs couldn't be fetched, causing the utilization numerator to fluctuate wildly between back-to-back runs of the same window (one run saw 11k jobs, the next saw 39k). On busy windows the secondary rate limit kicks in often, so this dropped 2-3x the actual workload.

Fixes:

- Add retry-with-backoff in `run_gh_command` for 5xx, secondary rate limit, abuse detection, and network-reset signatures (5 attempts, 1-16s exponential backoff with jitter).
- Lower threadpool concurrency from 20 to 10 to stay below the secondary rate-limit threshold.
- Surface the count of runs that still failed after retries instead of silently dropping them, so the next iteration can see whether the report is complete.
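The retry and failure-surfacing pattern described above might look roughly like this. It is a sketch under stated assumptions: `TransientError`, `with_backoff`, and `fetch_all` are hypothetical names, the jitter range is arbitrary, and the real `run_gh_command` classifies errors by matching CLI output, which is not reproduced here.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


class TransientError(Exception):
    """Stand-in for 5xx, secondary rate limit, abuse detection, network reset."""


def with_backoff(fn, attempts=5, base_s=1.0, cap_s=16.0, sleep=time.sleep):
    """Call fn, retrying transient failures with capped exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller record the failure
            delay = min(cap_s, base_s * 2 ** attempt)
            sleep(delay + random.uniform(0.0, 1.0))


def fetch_all(run_ids, fetch_one, max_workers=10):
    """Fetch jobs for every run; return results plus the run ids that still failed."""

    def task(run_id):
        try:
            return run_id, fetch_one(run_id), None
        except Exception as exc:  # recorded and reported, not silently dropped
            return run_id, None, exc

    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for run_id, jobs, exc in pool.map(task, run_ids):
            if exc is None:
                results[run_id] = jobs
            else:
                failed.append(run_id)
    return results, failed
```

Returning the `failed` list alongside the results lets the report print how many runs are missing, so a low utilization number can be told apart from an incomplete fetch.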
The runner-utilization report's Active hours and Utilization columns under-counted by ~2-4× on busy days. Three causes, all fixed:
1. One host advertises several overlapping labels (e.g. `4-gpu-b200`, `4-gpu-b200-kernel`, `4-gpu-b200-low-disk`, `4-gpu-b200-kernel-low-disk`). The old denominator put it in every label's capacity bucket while busy time only credited the one label that dispatched the job. Fix: the per-label numerator now sums all jobs on the pool's hosts (any sibling label), so siblings sharing a host set show the same utilization.
2. The `/jobs` endpoint's default `filter=latest` hides re-run attempts that consumed real host time. Switched to `filter=all`.
3. Runs whose jobs could not be fetched were silently dropped by `except Exception: return None`. Added retry-with-backoff (10 attempts, ≤60s, jitter), lowered threadpool concurrency 20→4, and pre-filtered out non-GPU workflows (docs/lint/release/etc.) so the API budget covers GPU work. A "data completeness" warning is surfaced at the top if any fetches still fail.

Report schema is unchanged: Summary by Runner Label → Concurrency Analysis → Recommendations.
Verified output: https://github.com/sgl-project/sglang/actions/runs/25152808318