[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits#3861
Closed
linyueqian wants to merge 1 commit into
Closed
[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits#3861linyueqian wants to merge 1 commit into
linyueqian wants to merge 1 commit into
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
bc81c00 to
3311239
Compare
…commits Adds .claude/skills/perf-bisect/ — a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack. The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict. Layout (progressive disclosure): - SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant. - references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics). - references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download). - scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits. - scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks. - scripts/cells/: four cells covering the three families — tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out) — plus a README documenting the <family>_<descriptor>.yaml convention. Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化". Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
3311239 to
2d501f0
Compare
Collaborator
Author
|
Closing this — on second look the skill content is too tied to one team's internal regression hunt to live in the public repo. Planning to contribute it to a more appropriate home (hsliu's perf-tracking repo) instead. Sorry for the noise. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
.claude/skills/perf-bisect/— a project-local Claude Code skill that gives vllm-omni contributors a repeatable workflow for attributing a perf change (TTFP, RTF, audio throughput, image latency, step time) to a specific commit. Generalised from the TTS-only workflow used during the post-#3662 regression hunt (#3681 / #3817 / #3839) so the same discipline applies to diffusion-image and omni-audio bisects.Why
PR #3839's perf regression took several hours to attribute, mostly because of one trap: an initial bisect measured the wrong cell (Base
voice_clonewith the default deploy YAML) when the report was about CustomVoicedefault_voiceunderqwen3_tts_high_concurrency.yaml. The bench completed cleanly with apples-vs-oranges numbers and reported "no regression" while a +118% TTFP regression was live. This skill encodes the cell-definition discipline — extract all seven dimensions (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups) plus family-specific knobs from the regression report before writing any bench script — so the trap is harder to fall into next time.Layout
Contents
SKILL.md— trigger phrases (English + Chinese), paired tools (remote-gpu,tts-perf-check, the kanban repo), the generic 7-tuple cell-discipline table, a family-specific knob TL;DR, the 5-step workflow (kanban triage → bisect span → bench harness → interpret → variance check), a rationalization table of excuses-vs-reality, and a red-flags pre-flight list.references/family-knobs.md— full knob tables for TTS / diffusion-image / omni-audio (task_type,voice,language,width,height,num_inference_steps,stage_overrides) plus the headline metric each family reports.references/pitfalls.md— six failure modes caught in real bisects, each with a copy-paste remediation snippet:pytest -kzero-match, venv PATH not inherited by subprocess, stale server PID binding the wrong port, multi-tenant GPU contention,/v1/modelsreturning before CUDA-graph compile, cold model download exceeding the server-ready timeout.scripts/run_bisect.sh— bench-loop template that pairsvllm serve(with--deploy-config) andvllm bench serve --omni, polls/v1/modelswith a 30s settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.scripts/kanban_trend.py— prints a per-build metric time series fromvllm-omni-kanbanwith rolling-delta percent and←REG/←IMPmarkers at the 10% threshold; works for any cell prefix the kanban tracks.scripts/cells/— two production cells (the [Perf][TTS] Restore Qwen3-TTS default_voice c=64 TTFP to v021 baseline #3839 high-c regression cell and the Base voice_clone kanban-parity cell) plus a README documenting the<family>_<descriptor>.yamlconvention so diffusion_* / omni_* cells can be added without collisions.Triggers
Natural-language phrases that should activate the skill:
Scope
This skill is investigation, not prevention. Writing a new perf regression test belongs in
tests/dfx/perf/. Reading existing perf trends belongs invllm-omni-kanban. Cross-version (vllm 0.20 ↔ 0.21) bisects need two venvs and are explicitly out of scope.Verification
Skill(perf-bisect)invokes correctly).SKILL.mdresolve to files in the diff.How to use
In any Claude Code session inside this repo:
The skill will ask for the missing cell dimensions if any are unspecified, then walk through kanban triage → bisect span → bench loop → interpretation.
Out of scope for this PR
diffusion_*.yaml/omni_*.yamlin a follow-up).tts-perf-regression-cimechanism that uses this skill's cell convention for its specs.