[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits by linyueqian · Pull Request #3861 · vllm-project/vllm-omni

linyueqian · 2026-05-25T15:33:18Z

What

Adds .claude/skills/perf-bisect/ — a project-local Claude Code skill that gives vllm-omni contributors a repeatable workflow for attributing a perf change (TTFP, RTF, audio throughput, image latency, step time) to a specific commit. Generalised from the TTS-only workflow used during the post-#3662 regression hunt (#3681 / #3817 / #3839) so the same discipline applies to diffusion-image and omni-audio bisects.

Why

PR #3839's perf regression took several hours to attribute, mostly because of one trap: an initial bisect measured the wrong cell (Base voice_clone with the default deploy YAML) when the report was about CustomVoice default_voice under qwen3_tts_high_concurrency.yaml. The bench completed cleanly with apples-vs-oranges numbers and reported "no regression" while a +118% TTFP regression was live. This skill encodes the cell-definition discipline — extract all seven dimensions (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups) plus family-specific knobs from the regression report before writing any bench script — so the trap is harder to fall into next time.
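The trap described above can be made mechanical. The following is a minimal sketch, not the skill's actual code: it models the seven generic dimensions as a dataclass and refuses to compare two cells that differ in any of them. The `Cell` class, the `mismatched_dimensions` helper, the `default.yaml` file name, and the dataset and count values are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Cell:
    """The seven generic dimensions of a benchmark cell (hypothetical model)."""
    model: str
    task: str
    deploy_yaml: str
    dataset: str
    num_prompts: int
    max_concurrency: int
    num_warmups: int


def mismatched_dimensions(reported: Cell, measured: Cell) -> list[str]:
    """Return the names of the dimensions where the two cells differ."""
    return [
        f.name
        for f in fields(Cell)
        if getattr(reported, f.name) != getattr(measured, f.name)
    ]


# Illustrative values only: the dataset name, counts, and "default.yaml"
# are placeholders, not values taken from the regression report.
reported = Cell("CustomVoice", "default_voice", "qwen3_tts_high_concurrency.yaml",
                "placeholder_dataset", 512, 64, 1)
measured = Cell("Base", "voice_clone", "default.yaml",
                "placeholder_dataset", 512, 64, 1)

diff = mismatched_dimensions(reported, measured)
if diff:
    print("refusing to compare; cells differ in:", ", ".join(diff))
```

Running a check like this before the bench loop would have flagged the model, task, and deploy_yaml mismatch up front instead of producing a clean but meaningless "no regression" result.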

Layout

.claude/skills/perf-bisect/
├── SKILL.md                                  # discipline + 5-step workflow
├── references/
│   ├── family-knobs.md                       # extra_body / stage_overrides per family
│   └── pitfalls.md                           # 6 mechanical failure modes + remediations
└── scripts/
    ├── run_bisect.sh                         # bench-loop template
    ├── kanban_trend.py                       # metric time series from vllm-omni-kanban
    └── cells/
        ├── README.md                         # <family>_<descriptor>.yaml convention
        ├── tts_default_voice_high_c.yaml     # #3839's regression cell
        └── tts_voice_clone_nightly.yaml      # kanban-parity cell

Contents

SKILL.md: trigger phrases (English + Chinese), paired tools (remote-gpu, tts-perf-check, the kanban repo), the generic 7-tuple cell-discipline table, a family-specific knob TL;DR, the 5-step workflow (kanban triage → bisect span → bench harness → interpret → variance check), a rationalization table of excuses-vs-reality, and a red-flags pre-flight list.
references/family-knobs.md: full knob tables for TTS / diffusion-image / omni-audio (task_type, voice, language, width, height, num_inference_steps, stage_overrides) plus the headline metric each family reports.
references/pitfalls.md: six failure modes caught in real bisects, each with a copy-paste remediation snippet: pytest -k zero-match, venv PATH not inherited by subprocess, stale server PID binding the wrong port, multi-tenant GPU contention, /v1/models returning before CUDA-graph compile, cold model download exceeding the server-ready timeout.
scripts/run_bisect.sh: bench-loop template that pairs vllm serve (with --deploy-config) and vllm bench serve --omni, polls /v1/models with a 30s settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.
scripts/kanban_trend.py: prints a per-build metric time series from vllm-omni-kanban with rolling-delta percent and ←REG / ←IMP markers at the 10% threshold; works for any cell prefix the kanban tracks.
scripts/cells/: two production cells (the #3839 high-c regression cell and the Base voice_clone kanban-parity cell) plus a README documenting the <family>_<descriptor>.yaml convention so diffusion_* / omni_* cells can be added without collisions.
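To illustrate the kind of output kanban_trend.py is described as producing, here is a minimal sketch of a build-over-build delta with ←REG / ←IMP markers at a 10% threshold. It is not the script from the diff: the function name `annotate_trend`, the reading of "rolling delta" as change from the previous build, the `lower_is_better` flag, and the sample series are all assumptions.

```python
def annotate_trend(values, threshold_pct=10.0, lower_is_better=True):
    """Return (value, delta_pct, marker) rows for consecutive builds.

    delta_pct is the percent change from the previous build, or None for
    the first build (and when the previous value is zero).
    """
    rows = []
    prev = None
    for v in values:
        if prev is None or prev == 0:
            rows.append((v, None, ""))
        else:
            delta = (v - prev) / prev * 100.0
            # For latency-style metrics such as TTFP, an increase is a regression.
            worse = delta > 0 if lower_is_better else delta < 0
            if abs(delta) >= threshold_pct:
                marker = "←REG" if worse else "←IMP"
            else:
                marker = ""
            rows.append((v, delta, marker))
        prev = v
    return rows


# Made-up median TTFP values in milliseconds, one per build.
series = [100.0, 104.0, 230.0, 115.0]
for build, (v, delta, marker) in enumerate(annotate_trend(series), start=1):
    delta_s = "   n/a" if delta is None else f"{delta:+6.1f}%"
    print(f"build {build}: {v:8.1f} ms  {delta_s}  {marker}")
```

For throughput-style metrics, where higher is better, passing `lower_is_better=False` flips which direction counts as a regression.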

Triggers

Natural-language phrases that should activate the skill:

"find which PR regressed Qwen3-TTS perf"
"bisect TTFP between commit X and Y"
"verify PR #N actually improves perf"
"高并发 TTFP 劣化"

Scope

This skill is investigation, not prevention. Writing a new perf regression test belongs in tests/dfx/perf/. Reading existing perf trends belongs in vllm-omni-kanban. Cross-version (vllm 0.20 ↔ 0.21) bisects need two venvs and are explicitly out of scope.

Verification

Skill picked up by Claude Code's skill registry (verified locally — Skill(perf-bisect) invokes correctly).
All five resource references in SKILL.md resolve to files in the diff.
No session-detail (dates, host names, reproduction logs) in any docstring or comment.
Quality rubric: 96/100 (A) — description 97, organization 94, style 92, structure 100.

How to use

In any Claude Code session inside this repo:

@perf-bisect bisect TTFP between abc123 and def456 for CustomVoice default_voice c=64 N=512 with the high-c YAML

The skill will ask for the missing cell dimensions if any are unspecified, then walk through kanban triage → bisect span → bench loop → interpretation.
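As a rough sketch of that pre-flight check (the skill itself is prompt-driven, so this is only an illustration), the example request above can be reduced to a dictionary and tested against the seven required dimensions. The `missing_dimensions` helper is hypothetical, and reading c=64 as max_concurrency, N=512 as num_prompts, and "the high-c YAML" as qwen3_tts_high_concurrency.yaml are assumptions.

```python
REQUIRED_DIMENSIONS = (
    "model", "task", "deploy_yaml", "dataset",
    "num_prompts", "max_concurrency", "num_warmups",
)


def missing_dimensions(cell: dict) -> list[str]:
    """Return the required dimensions that the request did not specify."""
    return [d for d in REQUIRED_DIMENSIONS if cell.get(d) is None]


# Dimensions recoverable from the example request (assumed mapping).
request = {
    "model": "CustomVoice",
    "task": "default_voice",
    "deploy_yaml": "qwen3_tts_high_concurrency.yaml",
    "num_prompts": 512,
    "max_concurrency": 64,
}

print(missing_dimensions(request))  # → ['dataset', 'num_warmups']
```

Under these assumptions, the example request would leave dataset and num_warmups for the skill to ask about before any bench script is written.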

Out of scope for this PR

Other model families' cell files (anyone can add diffusion_*.yaml / omni_*.yaml in a follow-up).
Wiring this into Buildkite as an automated CI step. A separate PR will propose a tts-perf-regression-ci mechanism that uses this skill's cell convention for its specs.

chatgpt-codex-connector · 2026-05-25T15:33:25Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

…commits

Adds .claude/skills/perf-bisect/, a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant.
- references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics).
- references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download).
- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.
- scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks.
- scripts/cells/: four cells covering the three families: tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out), plus a README documenting the <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>

linyueqian · 2026-05-26T02:34:45Z

Closing this — on second look the skill content is too tied to one team's internal regression hunt to live in the public repo. Planning to contribute it to a more appropriate home (hsliu's perf-tracking repo) instead. Sorry for the noise.

linyueqian requested a review from hsliuustc0106 as a code owner May 25, 2026 15:33

linyueqian force-pushed the feat/perf-bisect-skill branch from bc81c00 to 3311239 Compare May 25, 2026 15:44

linyueqian force-pushed the feat/perf-bisect-skill branch from 3311239 to 2d501f0 Compare May 25, 2026 15:53

linyueqian closed this May 26, 2026

linyueqian wants to merge 1 commit into vllm-project:main from linyueqian:feat/perf-bisect-skill
