Skip to content

[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits#3861

Closed
linyueqian wants to merge 1 commit into
vllm-project:mainfrom
linyueqian:feat/perf-bisect-skill
Closed

[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits#3861
linyueqian wants to merge 1 commit into
vllm-project:mainfrom
linyueqian:feat/perf-bisect-skill

Conversation

@linyueqian
Copy link
Copy Markdown
Collaborator

What

Adds .claude/skills/perf-bisect/ — a project-local Claude Code skill that gives vllm-omni contributors a repeatable workflow for attributing a perf change (TTFP, RTF, audio throughput, image latency, step time) to a specific commit. Generalised from the TTS-only workflow used during the post-#3662 regression hunt (#3681 / #3817 / #3839) so the same discipline applies to diffusion-image and omni-audio bisects.

Why

PR #3839's perf regression took several hours to attribute, mostly because of one trap: an initial bisect measured the wrong cell (Base voice_clone with the default deploy YAML) when the report was about CustomVoice default_voice under qwen3_tts_high_concurrency.yaml. The bench completed cleanly with apples-vs-oranges numbers and reported "no regression" while a +118% TTFP regression was live. This skill encodes the cell-definition discipline — extract all seven dimensions (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups) plus family-specific knobs from the regression report before writing any bench script — so the trap is harder to fall into next time.

Layout

.claude/skills/perf-bisect/
├── SKILL.md                                  # discipline + 5-step workflow
├── references/
│   ├── family-knobs.md                       # extra_body / stage_overrides per family
│   └── pitfalls.md                           # 6 mechanical failure modes + remediations
└── scripts/
    ├── run_bisect.sh                         # bench-loop template
    ├── kanban_trend.py                       # metric time series from vllm-omni-kanban
    └── cells/
        ├── README.md                         # <family>_<descriptor>.yaml convention
        ├── tts_default_voice_high_c.yaml     # #3839's regression cell
        └── tts_voice_clone_nightly.yaml      # kanban-parity cell

Contents

  • SKILL.md — trigger phrases (English + Chinese), paired tools (remote-gpu, tts-perf-check, the kanban repo), the generic 7-tuple cell-discipline table, a family-specific knob TL;DR, the 5-step workflow (kanban triage → bisect span → bench harness → interpret → variance check), a rationalization table of excuses-vs-reality, and a red-flags pre-flight list.
  • references/family-knobs.md — full knob tables for TTS / diffusion-image / omni-audio (task_type, voice, language, width, height, num_inference_steps, stage_overrides) plus the headline metric each family reports.
  • references/pitfalls.md — six failure modes caught in real bisects, each with a copy-paste remediation snippet: pytest -k zero-match, venv PATH not inherited by subprocess, stale server PID binding the wrong port, multi-tenant GPU contention, /v1/models returning before CUDA-graph compile, cold model download exceeding the server-ready timeout.
  • scripts/run_bisect.sh — bench-loop template that pairs vllm serve (with --deploy-config) and vllm bench serve --omni, polls /v1/models with a 30s settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.
  • scripts/kanban_trend.py — prints a per-build metric time series from vllm-omni-kanban with rolling-delta percent and ←REG / ←IMP markers at the 10% threshold; works for any cell prefix the kanban tracks.
  • scripts/cells/ — two production cells (the [Perf][TTS] Restore Qwen3-TTS default_voice c=64 TTFP to v021 baseline #3839 high-c regression cell and the Base voice_clone kanban-parity cell) plus a README documenting the <family>_<descriptor>.yaml convention so diffusion_* / omni_* cells can be added without collisions.

Triggers

Natural-language phrases that should activate the skill:

  • "find which PR regressed Qwen3-TTS perf"
  • "bisect TTFP between commit X and Y"
  • "verify PR #N actually improves perf"
  • "高并发 TTFP 劣化"

Scope

This skill is investigation, not prevention. Writing a new perf regression test belongs in tests/dfx/perf/. Reading existing perf trends belongs in vllm-omni-kanban. Cross-version (vllm 0.20 ↔ 0.21) bisects need two venvs and are explicitly out of scope.

Verification

  • Skill picked up by Claude Code's skill registry (verified locally — Skill(perf-bisect) invokes correctly).
  • All five resource references in SKILL.md resolve to files in the diff.
  • No session-detail (dates, host names, reproduction logs) in any docstring or comment.
  • Quality rubric: 96/100 (A) — description 97, organization 94, style 92, structure 100.

How to use

In any Claude Code session inside this repo:

@perf-bisect bisect TTFP between abc123 and def456 for CustomVoice default_voice c=64 N=512 with the high-c YAML

The skill will ask for the missing cell dimensions if any are unspecified, then walk through kanban triage → bisect span → bench loop → interpretation.

Out of scope for this PR

  • Other model families' cell files (anyone can add diffusion_*.yaml / omni_*.yaml in a follow-up).
  • Wiring this into Buildkite as an automated CI step. A separate PR will propose a tts-perf-regression-ci mechanism that uses this skill's cell convention for its specs.

@linyueqian linyueqian requested a review from hsliuustc0106 as a code owner May 25, 2026 15:33
@chatgpt-codex-connector
Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@linyueqian linyueqian force-pushed the feat/perf-bisect-skill branch from bc81c00 to 3311239 Compare May 25, 2026 15:44
…commits

Adds .claude/skills/perf-bisect/ — a project-local Claude skill that
encodes a repeatable workflow for attributing a vllm-omni perf change to
a specific commit. Covers TTS, diffusion-image, and omni-audio model
families. Generalised from the workflow used during the post-vllm-project#3662
regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel
blast-radius file lists, per-family bench-harness examples, and
ready-to-paste cells for each model class so the same discipline applies
across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga:
extract the full cell (model, task, deploy_yaml, dataset, num_prompts,
max_concurrency, num_warmups + family knobs) from the regression report
BEFORE writing any bench script. Measuring a sibling cell that does not
exercise the regressed code path is the most common path to a false
"no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition
  discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step
  workflow with parallel TTS / diffusion / omni blast-radius file lists
  and per-family bench-harness snippets, the rationalization table of
  excuses-vs-reality, the red-flags list, and a one-paragraph
  cross-platform invariant.

- references/family-knobs.md: full TTS / diffusion / omni knob tables
  (extra_body, stage_overrides, headline metrics).

- references/pitfalls.md: six mechanical failure modes with copy-paste
  remediations (pytest -k zero-match, venv PATH for ninja subprocess,
  stale server PID, multi-tenant GPUs, /v1/models settle, cold download).

- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with
  vllm bench serve, polls /v1/models with a settle window, parses
  median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up
  the server between commits.

- scripts/kanban_trend.py: per-build metric time series from the
  vllm-omni-kanban repo with rolling-delta percent and regression
  markers; works for any cell prefix the kanban tracks.

- scripts/cells/: four cells covering the three families —
  tts_default_voice_high_c (the vllm-project#3839 regression class),
  tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024
  (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni
  audio-in/audio-out) — plus a README documenting the
  <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y",
"verify PR #N actually improves perf", "find which commit slowed default_voice",
"高并发 TTFP 劣化".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the feat/perf-bisect-skill branch from 3311239 to 2d501f0 Compare May 25, 2026 15:53
@linyueqian
Copy link
Copy Markdown
Collaborator Author

Closing this — on second look the skill content is too tied to one team's internal regression hunt to live in the public repo. Planning to contribute it to a more appropriate home (hsliu's perf-tracking repo) instead. Sorry for the noise.

@linyueqian linyueqian closed this May 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant