[BugFix] Fix prefer_model_sampler token history in async scheduling by zengchuang-hw · Pull Request #3681 · vllm-project/vllm-omni

zengchuang-hw · 2026-05-18T03:50:36Z

Purpose

As mentioned in #3503, this PR fixes the stage-transition trigger failure (forcing <recaption> after the </think> end-of-think token) in HunyuanImage-3 when using async scheduling mode (--deploy-config hunyuan_image3.yaml).

#3517 attempted to fix this bug, but its test results did not look good.

HunyuanImage-3's _stage_transitions mechanism depends on sampling_metadata.output_token_ids to detect the </think> token and force emit the <recaption> start token. This worked correctly with sync scheduler (--stage-configs-path) but failed with async scheduler (--deploy-config).
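As a rough illustration of why the transition depends on an accurate token history, here is a minimal sketch. The token IDs, the `STAGE_TRANSITIONS` mapping, and `forced_next_token` are hypothetical names invented for this example; they are not the actual HunyuanImage-3 implementation.

```python
# Hypothetical token ids, chosen only for illustration.
END_THINK_ID = 1001   # stands in for </think>
RECAPTION_ID = 1002   # stands in for <recaption>

# Maps "last emitted token" to "token that should be forced next".
STAGE_TRANSITIONS = {END_THINK_ID: RECAPTION_ID}


def forced_next_token(output_token_ids):
    """Return the token to force next, or None if sampling may proceed freely."""
    if not output_token_ids:
        return None
    return STAGE_TRANSITIONS.get(output_token_ids[-1])


# With an accurate history the transition fires.
print(forced_next_token([5, 6, END_THINK_ID]))
# With a -1 placeholder as the last entry the transition is silently missed.
print(forced_next_token([5, 6, -1]))
```

In this toy model the second call returns None, which corresponds to the missing <recaption> tag described above.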

Root Cause

Separate processes: In async scheduling, the scheduler and the GPU runner run in separate processes. The scheduler's req_state.output_token_ids may preallocate placeholders (-1) for tokens not yet sampled, so it does not reflect the tokens actually sampled by the GPU runner.
Stale placeholder problem: The original code read from req_output_token_ids, which contained -1 placeholders, and then tried to replace them with sampled_token_ids_cpu. However, sampled_token_ids_cpu is only set when sampling_metadata.output_token_ids is already populated (vLLM's internal logic).
Replacement condition bug: The condition if output_token_ids and not sampling_metadata.output_token_ids: only replaced the history when the original was empty. In subsequent iterations the original was not empty (it held stale or wrong content), so the correct history was not used.
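The replacement condition bug can be reproduced in isolation. The sketch below uses plain lists and two hypothetical helper functions (`choose_history_buggy`, `choose_history_always`); only the shape of the condition is taken from the description above.

```python
def choose_history_buggy(built, original):
    # Mirrors the quoted condition: replace only when the original is empty.
    if built and not original:
        return built
    return original


def choose_history_always(built, original):
    # Prefer the freshly built history whenever one is available.
    return built if built else original


stale = [7, -1]   # original history still holding a -1 placeholder
built = [7, 8]    # history rebuilt from the tokens actually sampled

print(choose_history_buggy(built, stale))   # keeps the stale list
print(choose_history_always(built, stale))  # uses the rebuilt list
```

On the first iteration both helpers agree, because the original is empty. They only diverge once the original holds stale content, which is why the failure appeared in later iterations.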

Fix

Add a _model_sampler_token_history local cache in the GPU runner to accumulate token history across iterations.
Build output_token_ids from the cached history plus the newly sampled tokens from sampled_token_ids_cpu.
Always replace sampling_metadata.output_token_ids with the built history, regardless of its original state.
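A minimal sketch of the caching idea listed above, assuming a per-request dictionary keyed by request ID. The class `TokenHistoryCache` and its methods are hypothetical and only illustrate the accumulation step, not the code in gpu_ar_model_runner.py.

```python
from collections import defaultdict


class TokenHistoryCache:
    """Toy per-request history of tokens that were actually sampled."""

    def __init__(self):
        self._history = defaultdict(list)

    def append_sampled(self, req_id, sampled_token_ids):
        # Skip -1 placeholders so only real token ids enter the history.
        self._history[req_id].extend(t for t in sampled_token_ids if t != -1)
        # Return a copy so callers cannot mutate the cache by accident.
        return list(self._history[req_id])

    def drop(self, req_id):
        # Forget a finished request so the cache does not grow without bound.
        self._history.pop(req_id, None)


cache = TokenHistoryCache()
cache.append_sampled("req-0", [11])
print(cache.append_sampled("req-0", [-1, 12]))
```

A cache like this has to be cleared when a request finishes or is preempted, which is one cost of keeping a second copy of the history inside the runner.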

Test Plan

Run HunyuanImage-3 I2T with async scheduling (deploy-config mode):

python vllm-omni/examples/offline_inference/hunyuan_image3/end2end.py   \
--model tencent/HunyuanImage-3.0-Instruct  \
--modality img2img   \
--image-path ./input_0_0.png   \
--deploy-config hunyuan_image3.yaml   \
--bot-task think_recaption   \
--prompts "新年宠物海报，Q版圆润的可爱标题\"新年快乐汪\"，副标题\"HAPPY NEW YEAR\"。 鱼眼镜头，背景是房间门口，近景，上传的主体歪头笑，围着红色围巾，戴着红色毛线帽，高清，绒毛细节，面部特写。 宝丽莱相纸，超现实主义，写实主义，胶片摄影，打印颗粒感肌理。肌理，超写实，复古感。" \
2>&1 | tee output_fix.log

Expected: the output text contains a <recaption> tag immediately after </think>.
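The expectation can be checked mechanically on the generated text. The helper below is a hypothetical convenience for inspecting output such as output_fix.log; it is not part of the repository's test suite.

```python
def recaption_follows_think(text: str) -> bool:
    """Return True if <recaption> is the first thing after the first </think>."""
    marker = "</think>"
    end = text.find(marker)
    if end == -1:
        return False
    rest = text[end + len(marker):]
    # Tolerate whitespace between the two tags.
    return rest.lstrip().startswith("<recaption>")


print(recaption_follows_think("plan</think><recaption>prompt</recaption>"))
print(recaption_follows_think("plan</think>prompt</recaption>"))
```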

Local test for HunyuanImage-3

pytest -s -v tests/e2e/accuracy/test_hunyuan_image3.py

Test Result

✅ Missing recaption tag problem fixed

Before Fix:

[Output] Text:
用户希望将参考图中可爱的金毛幼犬转化为一张具有新年氛围的宠物海报。参考图展示了一只坐在木质地板上、背景有白色蒲公英的幼犬，它正对着镜头微笑。原始指令非常具体，要求添加特定的标题文字、改变背景、调整构图并增加特定的艺术风格。首先，我需要处理文字部分，将“新年快乐汪”和“HAPPY NEW YEAR”以圆润可爱的字体放置在图像上方。接着，背景需要从户外的木地板和蒲公英切换到室内的房间门口，这涉及到场景的完全重构。构图上，指令要求使用鱼眼镜头效果，这意味着画面边缘会有明显的畸变，主体小狗会显得更加突出和圆润。小狗本身的配饰也需要改变，从原来的粉色项圈换成红色的围巾和红色的毛线帽，以契合新年主题。最后，整体风格要模拟宝丽莱相纸的质感，带有胶片颗粒感和复古色调。在改写指令时，我会把这些元素整合在一起，详细描述最终画面的视觉呈现，确保每一个细节都能被准确执行，同时保持小狗原本那种憨态可掬的神情。</think>请基于参考图中的金毛幼犬，创作一张充满节日氛围的新年宠物海报。将背景替换为温馨的室内房间门口，并采用鱼眼镜头拍摄，使画面中心的小狗呈现出可爱的圆润感。在图像顶部添加圆润可爱的艺术字体标题“新年快乐汪”，下方配以较小的“HAPPY NEW YEAR”字样。为小狗戴上一顶红色的针织毛线帽，并围上一条厚实的红色围巾，保留它原本歪头微笑的憨厚表情。整张图片应呈现出宝丽莱相纸的复古胶片风格，带有细腻的打印颗粒肌理和温暖的色调，画面边缘有自然的暗角效果，营造出一种怀旧且温馨的节日纪念感。</recaption>

After Fix:
<recaption> tag correctly appears after </think>:

[Output] Text:
用户希望将参考图中可爱的金毛幼犬转化为一张具有复古胶片风格的新年宠物海报。参考图展示了一只坐在木质地板上的金毛幼犬，背景是户外的白色花丛。原始指令非常具体，要求添加特定的标题文字、改变背景为室内门口、为小狗添加红色围巾和毛线帽，并应用鱼眼镜头效果和宝丽莱相纸质感。这是一个中等复杂度的任务，因为它涉及了主体装饰、背景替换、构图调整以及整体艺术风格的转变。我需要将这些抽象的风格描述转化为具体的视觉元素。首先，背景需要从户外的木地板和花丛切换到室内的门口，这会改变光影和空间感。其次，小狗的配饰——红色围巾和红色毛线帽——需要自然地融入其身体结构，围巾应环绕颈部，帽子应戴在头顶。标题文字“新年快乐汪”和“HAPPY NEW YEAR”需要以圆润可爱的字体呈现在画面上方。最后，为了实现“宝丽莱相纸”和“胶片摄影”的效果，我需要描述出那种带有白色边框、轻微颗粒感、色彩略微褪色且具有复古质感的视觉特征，同时加入鱼眼镜头带来的边缘畸变效果，使画面中心的小狗显得更加突出和可爱。</think><recaption>将参考图中的金毛幼犬制作成一张复古胶片风格的新年海报。首先，将背景从户外的木地板和花丛替换为室内的门口场景，光线应显得柔和且具有室内感。在小狗的颈部添加一条厚实的红色针织围巾，并在其头顶戴上一顶配套的红色毛线帽，确保配饰与小狗的绒毛自然衔接。在画面的上方，使用圆润可爱的艺术字体添加主标题“新年快乐汪”，并在其下方用稍小的字体添加副标题“HAPPY NEW YEAR”。对整张图片应用鱼眼镜头效果，使画面中心的小狗头部略微放大并产生自然的边缘畸变。最后，为图像添加宝丽莱相纸的白色边框，并赋予其胶片摄影的质感，包括细腻的打印颗粒感、轻微的色彩偏移和复古的色调，使整体呈现出一种怀旧且温馨的节日氛围。</recaption>-

output_image:

✅ HunyuanImage local test passed

stage_config test result (though discarded)

+-----------------------------+---------+------------------+
| Metric                      |   Value |   L20x Reference |
+=============================+=========+==================+
| COT similarity to reference |  0.9795 |           0.9644 |
+-----------------------------+---------+------------------+
| COT prefix match            | 29      |          29      |
+-----------------------------+---------+------------------+
| Image-Image similarity      | 91.8283 |          94.5538 |
+-----------------------------+---------+------------------+
| SSIM                        |  0.2501 |           0.242  |
+-----------------------------+---------+------------------+
| PSNR (dB)                   | 13.92   |          14.1    |
+-----------------------------+---------+------------------+
PASSED

deploy_config test result

+-----------------------------+---------+------------------+
| Metric                      |   Value |   L20x Reference |
+=============================+=========+==================+
| COT similarity to reference |  0.9795 |           0.9644 |
+-----------------------------+---------+------------------+
| COT prefix match            | 29      |          29      |
+-----------------------------+---------+------------------+
| Image-Image similarity      | 91.8283 |          94.5538 |
+-----------------------------+---------+------------------+
| SSIM                        |  0.2501 |           0.242  |
+-----------------------------+---------+------------------+
| PSNR (dB)                   | 13.92   |          14.1    |
+-----------------------------+---------+------------------+
PASSED

output_image:

chatgpt-codex-connector · 2026-05-21T03:04:24Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

zengchuang-hw · 2026-05-21T03:33:23Z

@Bounty-hunter PTAL

Bounty-hunter · 2026-05-21T07:16:59Z

https://github.com/vllm-project/vllm/blob/a950e9447e38727fc956afdc242bc6e3796ccb77/vllm/v1/worker/gpu_input_batch.py#L1018 replaces the -1 placeholders with the real token IDs.

If we just move the call to self.input_batch.update_async_output_token_ids() in GPUARModelRunner::_sample to the beginning of the if spec_decode_metadata is None: branch, would that solve the problem?
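The effect of the proposed reordering can be pictured with a toy model. Both functions below (`resolve_placeholders`, `sample_step`) are hypothetical stand-ins written for this example; they do not reproduce vLLM's update_async_output_token_ids() or GPUARModelRunner::_sample.

```python
def resolve_placeholders(output_token_ids, sampled_token_ids):
    """Replace -1 placeholders in place; return True if anything changed."""
    pending = iter(sampled_token_ids)
    updated = False
    for i, tok in enumerate(output_token_ids):
        if tok == -1:
            real = next(pending, None)
            if real is None:
                break  # no more sampled tokens to fill in
            output_token_ids[i] = real
            updated = True
    return updated


def sample_step(output_token_ids, prev_sampled, prefer_model_sampler):
    # Proposed ordering: resolve placeholders first, so either sampler path
    # sees the real history.
    updated = resolve_placeholders(output_token_ids, prev_sampled)
    path = "model_sampler" if prefer_model_sampler else "default_sampler"
    return path, updated, list(output_token_ids)


print(sample_step([3, -1], [9], prefer_model_sampler=True))
```

In this sketch the model-sampler path reports updated as True and sees [3, 9] rather than [3, -1].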

Bounty-hunter · 2026-05-21T07:47:05Z

> https://github.com/vllm-project/vllm/blob/a950e9447e38727fc956afdc242bc6e3796ccb77/vllm/v1/worker/gpu_input_batch.py#L1018 will update -1 to real token id.
>
> If we just move the calling self.input_batch.update_async_output_token_ids() in GPUARModelRunner::_sample to the beginning of if spec_decode_metadata is None:, would that solve the problem?

I tried it, and it works.

zengchuang-hw · 2026-05-21T08:44:22Z

> https://github.com/vllm-project/vllm/blob/a950e9447e38727fc956afdc242bc6e3796ccb77/vllm/v1/worker/gpu_input_batch.py#L1018 will update -1 to real token id.
>
> If we just move the calling self.input_batch.update_async_output_token_ids() in GPUARModelRunner::_sample to the beginning of if spec_decode_metadata is None:, would that solve the problem?

Nice catch. I tried this and it does work.

Bounty-hunter

LGTM

gcanlin · 2026-05-21T14:21:59Z

@zengchuang-hw Please fix CI, thanks!

Resolve conflict in tests/e2e/accuracy/test_hunyuan_image3.py by adopting upstream's new stage config API format (stage_args/engine_args structure) while keeping the core fix in gpu_ar_model_runner.py. The PR's core fix moves update_async_output_token_ids() before the prefer_model_sampler branch, ensuring async output token IDs are updated regardless of which sampler is used. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Switch from stage_args format to deploy_config (stages/pipeline) format for consistency with the deploy_config API path.
- Rename _BASE_CONFIG to _DEPLOY_CONFIG with stages/pipeline structure
- Update _make_config to use stages key instead of stage_args
- Update _QUANT_DIT_CONFIG to deploy_config format
- Change _run_offline and _run_dit_model to use deploy_config_path
- Pass deploy_config via kwargs to OmniRunner
Signed-off-by: zengchuang <zengchuang3@huawei.com>

After PR 3681 fix, update_async_output_token_ids() is called BEFORE the model sampler path, so updated should be True, not False. Signed-off-by: zengchuang <zengchuang3@huawei.com>

zengchuang-hw · 2026-05-22T06:04:42Z

I have fixed the UT errors and resolved the conflict. PTAL @gcanlin

…commits Adds .claude/skills/perf-bisect/ — a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack. The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict. Layout (progressive disclosure): - SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant. - references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics). - references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download). - scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits. 
- scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks. - scripts/cells/: four cells covering the three families — tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out) — plus a README documenting the <family>_<descriptor>.yaml convention. Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化". Signed-off-by: Yueqian Lin <linyueqian@outlook.com>


zengchuang-hw force-pushed the fix-async-sampler-token-history branch 10 times, most recently from 35431f9 to 5217fa8 Compare May 21, 2026 02:56

zengchuang-hw marked this pull request as ready for review May 21, 2026 03:04

zengchuang-hw requested review from gcanlin, tzhouam and yenuo26 as code owners May 21, 2026 03:04

[BugFix] Fix prefer_model_sampler token history in async scheduling

654cea6

Signed-off-by: zengchuang <zengchuang3@huawei.com>

zengchuang-hw force-pushed the fix-async-sampler-token-history branch from 5217fa8 to 654cea6 Compare May 21, 2026 08:31

Bounty-hunter approved these changes May 21, 2026

View reviewed changes

gcanlin approved these changes May 21, 2026

View reviewed changes

gcanlin added the ready (label to trigger buildkite CI) label May 21, 2026

zengchuang-hw added 2 commits May 22, 2026 08:59

Merge branch 'main' into fix-async-sampler-token-history

dac3196

[BugFix] Fix prefer_model_sampler token history in async scheduling

d654b6f

Signed-off-by: zengchuang <zengchuang3@huawei.com>

Bounty-hunter mentioned this pull request May 22, 2026

[Bug]: HunyuanImage-3.0 requires manual configuration of stop_token_ids on NPU. #3722

Closed

1 task

zengchuang-hw and others added 2 commits May 22, 2026 11:47

[Test] Fix test assertion for prefer_model_sampler async update

a2aa921

After PR 3681 fix, update_async_output_token_ids() is called BEFORE the model sampler path, so updated should be True, not False. Signed-off-by: zengchuang <zengchuang3@huawei.com>

gcanlin merged commit f0af3ab into vllm-project:main May 22, 2026
8 checks passed

tzhouam pushed a commit that referenced this pull request May 22, 2026

[BugFix] Fix prefer_model_sampler token history in async scheduling (#…

88d6c8e

…3681) Signed-off-by: zengchuang <zengchuang3@huawei.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

zengchuang-hw mentioned this pull request May 25, 2026

[RFC] HunyuanImage Model Bug Tracking #3731

Open

linyueqian mentioned this pull request May 25, 2026

[Skill] add perf-bisect for attributing vllm-omni perf regressions to commits #3861

Closed

[BugFix] Fix prefer_model_sampler token history in async scheduling #3681: gcanlin merged 6 commits into vllm-project:main from zengchuang-hw:fix-async-sampler-token-history

3 participants
