[BugFix] Fix prefer_model_sampler token history in async scheduling#3681
Conversation
35431f9 to
5217fa8
Compare
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
@Bounty-hunter PTAL |
|
https://github.com/vllm-project/vllm/blob/a950e9447e38727fc956afdc242bc6e3796ccb77/vllm/v1/worker/gpu_input_batch.py#L1018 will update If we just move the calling |
I try it, and it work. |
Signed-off-by: zengchuang <zengchuang3@huawei.com>
5217fa8 to
654cea6
Compare
nice catch, i tried this and it does work |
|
@zengchuang-hw Please fix CI, thanks! |
Signed-off-by: zengchuang <zengchuang3@huawei.com>
Resolve conflict in tests/e2e/accuracy/test_hunyuan_image3.py by adopting upstream's new stage config API format (stage_args/engine_args structure) while keeping the core fix in gpu_ar_model_runner.py. The PR's core fix moves update_async_output_token_ids() before the prefer_model_sampler branch, ensuring async output token IDs are updated regardless of which sampler is used. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Switch from stage_args format to deploy_config (stages/pipeline) format for consistency with the deploy_config API path. - Rename _BASE_CONFIG to _DEPLOY_CONFIG with stages/pipeline structure - Update _make_config to use stages key instead of stage_args - Update _QUANT_DIT_CONFIG to deploy_config format - Change _run_offline and _run_dit_model to use deploy_config_path - Pass deploy_config via kwargs to OmniRunner Signed-off-by: zengchuang <zengchuang3@huawei.com>
After PR 3681 fix, update_async_output_token_ids() is called BEFORE the model sampler path, so updated should be True, not False. Signed-off-by: zengchuang <zengchuang3@huawei.com>
|
I have solved the ut errors and conflict, PTAL @gcanlin |
…3681) Signed-off-by: zengchuang <zengchuang3@huawei.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…commits Adds .claude/skills/perf-bisect/ — a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack. The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict. Layout (progressive disclosure): - SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant. - references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics). - references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download). - scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits. - scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks. - scripts/cells/: four cells covering the three families — tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out) — plus a README documenting the <family>_<descriptor>.yaml convention. Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化". Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…commits Adds .claude/skills/perf-bisect/ — a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack. The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict. Layout (progressive disclosure): - SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant. - references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics). - references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download). - scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits. - scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks. - scripts/cells/: four cells covering the three families — tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out) — plus a README documenting the <family>_<descriptor>.yaml convention. Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化". Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…llm-project#3681) Signed-off-by: zengchuang <zengchuang3@huawei.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Purpose
as mentioned in #3503:
Fix the stage-transition trigger (
<recaption>after</think>end-of-think token) failure in HunyuanImage-3 when using async scheduling mode (--deploy-config hunyuan_image3.yaml).#3517 tried to fix this bug but seems test result not goes well
HunyuanImage-3's
_stage_transitionsmechanism depends onsampling_metadata.output_token_idsto detect the</think>token and force emit the<recaption>start token. This worked correctly with sync scheduler (--stage-configs-path) but failed with async scheduler (--deploy-config).Root Cause
Separate processes: In async scheduling, scheduler and GPU runner are separate processes. Scheduler's
req_state.output_token_idsmay preallocate placeholders (-1) for tokens not yet sampled, and doesn't reflect actual tokens sampled by GPU runner.Stale placeholder problem: The original code read from
req_output_token_idswhich contained-1placeholders, then tried to replace them withsampled_token_ids_cpu. Butsampled_token_ids_cpuis only set whensampling_metadata.output_token_idsis already populated (vLLM's internal logic).Replacement condition bug: The condition
if output_token_ids and not sampling_metadata.output_token_ids:only replaced when original was empty. In subsequent iterations, original was NOT empty (had stale/wrong content), so correct history wasn't used.Fix
_model_sampler_token_historylocal cache in GPU runner to accumulate token history across iterationsoutput_token_idsfrom cached history + newly sampled tokens fromsampled_token_ids_cpusampling_metadata.output_token_idswith our built history, regardless of original stateTest Plan
expected the output text contains
<recaption>tag after</think>.Test Result
✅ recaption tag missed problem fixed
Before Fix:
After Fix:
<recaption>tag correctly appears after</think>:output_image:

✅ HuanyuanImage Local Test Passed
stage_config test result (though discarded)
deploy_config test result
output_image:
