[BugFix] Fix prefer_model_sampler token history in async scheduling#3681

Merged
gcanlin merged 6 commits into
vllm-project:mainfrom
zengchuang-hw:fix-async-sampler-token-history
May 22, 2026

Conversation

@zengchuang-hw
Contributor

@zengchuang-hw zengchuang-hw commented May 18, 2026

Purpose

As mentioned in #3503, this PR fixes the stage-transition trigger failure in HunyuanImage-3 under async scheduling (--deploy-config hunyuan_image3.yaml): the <recaption> token was not emitted after the </think> end-of-think token.

#3517 attempted to fix this bug, but its test results do not appear to be good.

HunyuanImage-3's _stage_transitions mechanism depends on sampling_metadata.output_token_ids to detect the </think> token and force-emit the <recaption> start token. This worked correctly with the sync scheduler (--stage-configs-path) but failed with the async scheduler (--deploy-config).
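
To make the dependency on the token history concrete, here is a minimal sketch of a stage-transition check. It is not the HunyuanImage-3 implementation: the token IDs, the `STAGE_TRANSITIONS` table, `ToySamplingMetadata`, and `forced_next_tokens` are all hypothetical names chosen for illustration. Only the idea (look at the last output token and force the next one) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical token IDs; the real values come from the model's tokenizer.
END_OF_THINK_ID = 1001  # stands in for </think>
RECAPTION_ID = 1002     # stands in for <recaption>

# Hypothetical table: "last emitted token" -> "token that must come next".
STAGE_TRANSITIONS = {END_OF_THINK_ID: RECAPTION_ID}


@dataclass
class ToySamplingMetadata:
    # One list of already-sampled token IDs per request in the batch.
    output_token_ids: list[list[int]] = field(default_factory=list)


def forced_next_tokens(metadata: ToySamplingMetadata) -> list[Optional[int]]:
    """Return the forced next token per request, or None to sample freely."""
    forced = []
    for history in metadata.output_token_ids:
        last = history[-1] if history else None
        forced.append(STAGE_TRANSITIONS.get(last))
    return forced


# With an accurate history the transition fires for the first request.
accurate = ToySamplingMetadata(output_token_ids=[[5, 6, END_OF_THINK_ID], [5, 6]])
# With a -1 placeholder in place of </think>, the transition is silently missed.
stale = ToySamplingMetadata(output_token_ids=[[5, 6, -1]])
```

In this toy model, a history that ends in a placeholder never matches the transition table, which is the same symptom as the missing <recaption> tag.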

Root Cause

  1. Separate processes: In async scheduling, the scheduler and the GPU runner are separate processes. The scheduler's req_state.output_token_ids may preallocate placeholders (-1) for tokens not yet sampled, so it does not reflect the tokens actually sampled by the GPU runner.

  2. Stale placeholder problem: The original code read from req_output_token_ids, which contained -1 placeholders, then tried to replace them with sampled_token_ids_cpu. But sampled_token_ids_cpu is only set when sampling_metadata.output_token_ids is already populated (vLLM's internal logic).

  3. Replacement condition bug: The condition if output_token_ids and not sampling_metadata.output_token_ids: replaced the history only when the original was empty. In subsequent iterations the original was not empty (it held stale or wrong content), so the correct history was not used.
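
The third point can be reproduced in isolation. The sketch below is a toy, not the runner code: `buggy_replace` is a hypothetical helper that only mirrors the shape of the condition quoted above, and the token values are made up.

```python
PLACEHOLDER = -1  # value the scheduler preallocates for not-yet-sampled tokens


def buggy_replace(original: list[int], rebuilt: list[int]) -> list[int]:
    """Mirror of `if output_token_ids and not sampling_metadata.output_token_ids:`.

    The rebuilt history is used only when the original history is empty.
    """
    if rebuilt and not original:
        return rebuilt
    return original


# First iteration: the original history is empty, so the rebuilt one is used.
step1 = buggy_replace([], [42])

# Later iteration: the original is non-empty but stale (it ends in a
# placeholder), so the correct rebuilt history is ignored.
step2 = buggy_replace([42, PLACEHOLDER], [42, 1001])
```

Here `step1` is the rebuilt history, while `step2` still ends in the placeholder, so any check on the last token sees -1 instead of the real token.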

Fix

  • Add a local _model_sampler_token_history cache in the GPU runner to accumulate token history across iterations
  • Build output_token_ids from the cached history plus the newly sampled tokens in sampled_token_ids_cpu
  • Always replace sampling_metadata.output_token_ids with the rebuilt history, regardless of its original state
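
A rough sketch of the cache approach listed above, under stated assumptions: `TokenHistoryCache` and `build_output_token_ids` are hypothetical names, requests are keyed by a string ID, and -1 placeholders are filtered out before caching. None of this is taken from the PR's diff; it only illustrates "accumulate, then always replace".

```python
from collections import defaultdict

PLACEHOLDER = -1  # assumed placeholder value for tokens not yet sampled


class TokenHistoryCache:
    """Toy per-request history of tokens actually sampled on the runner."""

    def __init__(self) -> None:
        self._history: dict[str, list[int]] = defaultdict(list)

    def extend(self, req_id: str, new_tokens: list[int]) -> list[int]:
        # Assumption: placeholders never belong in the real history.
        self._history[req_id].extend(t for t in new_tokens if t != PLACEHOLDER)
        # Return a copy so callers cannot mutate the cache by accident.
        return list(self._history[req_id])

    def drop(self, req_id: str) -> None:
        # Forget a finished request so its ID can be reused safely.
        self._history.pop(req_id, None)


def build_output_token_ids(
    cache: TokenHistoryCache,
    req_ids: list[str],
    sampled_cpu: list[list[int]],
) -> list[list[int]]:
    """Rebuild the full history for every request, unconditionally."""
    return [cache.extend(r, s) for r, s in zip(req_ids, sampled_cpu)]


cache = TokenHistoryCache()
first = build_output_token_ids(cache, ["a", "b"], [[1], [7]])
second = build_output_token_ids(cache, ["a", "b"], [[2], [PLACEHOLDER]])
```

The caller would then assign the returned lists over whatever history it held before, instead of testing whether the old history was empty.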

Test Plan

  • Run HunyuanImage-3 I2T with async scheduling (deploy-config mode):
python vllm-omni/examples/offline_inference/hunyuan_image3/end2end.py   \
--model tencent/HunyuanImage-3.0-Instruct  \
--modality img2img   \
--image-path ./input_0_0.png   \
--deploy-config hunyuan_image3.yaml   \
--bot-task think_recaption   \
--prompts "新年宠物海报,Q版圆润的可爱标题\"新年快乐汪\",副标题\"HAPPY NEW YEAR\"。 鱼眼镜头,背景是房间门口,近景,上传的主体歪头笑,围着红色围巾,戴着红色毛线帽,高清,绒毛细节,面部特写。 宝丽莱相纸,超现实主义,写实主义,胶片摄影,打印颗粒感肌理。肌理,超写实,复古感。" \
2>&1 | tee output_fix.log

Expected: the output text contains the <recaption> tag after </think>.

  • Local test for HunyuanImage-3
pytest -s -v tests/e2e/accuracy/test_hunyuan_image3.py

Test Result

✅ Missing <recaption> tag problem fixed

Before Fix:

[Output] Text:
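The user wants to turn the cute golden retriever puppy in the reference image into a pet poster with a New Year atmosphere. The reference image shows a puppy sitting on a wooden floor with white dandelions in the background, smiling at the camera. The original instruction is very specific: add specific title text, change the background, adjust the composition, and add a particular artistic style. First, I need to handle the text, placing “新年快乐汪” and “HAPPY NEW YEAR” at the top of the image in a rounded, cute font. Next, the background needs to switch from the outdoor wooden floor and dandelions to an indoor room doorway, which means completely rebuilding the scene. For composition, the instruction calls for a fisheye lens effect, which means visible distortion at the edges of the frame, making the puppy look more prominent and rounded. The puppy's own accessories also need to change, from the original pink collar to a red scarf and a red knit hat, to fit the New Year theme. Finally, the overall style should imitate the texture of Polaroid paper, with film grain and retro tones. When rewriting the instruction, I will combine these elements and describe the final visual result in detail, making sure every detail can be executed accurately while keeping the puppy's original charmingly naive expression.</think>Based on the golden retriever puppy in the reference image, create a New Year pet poster full of festive atmosphere. Replace the background with a cozy indoor room doorway and shoot with a fisheye lens so that the puppy at the center of the frame looks cute and rounded. Add the title “新年快乐汪” at the top of the image in a rounded, cute decorative font, with a smaller “HAPPY NEW YEAR” below it. Put a red knitted hat on the puppy and wrap a thick red scarf around it, keeping its original good-natured expression of tilting its head and smiling. The whole picture should have the retro film style of Polaroid paper, with fine print-grain texture and warm tones, and natural vignetting at the edges, creating a nostalgic and warm sense of festive remembrance.</recaption>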

After Fix:
<recaption> tag correctly appears after </think>:

[Output] Text:
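The user wants to turn the cute golden retriever puppy in the reference image into a New Year pet poster with a retro film style. The reference image shows a golden retriever puppy sitting on a wooden floor, with white flowers outdoors in the background. The original instruction is very specific: add specific title text, change the background to an indoor doorway, give the puppy a red scarf and a knit hat, and apply a fisheye lens effect and Polaroid paper texture. This is a task of medium complexity, because it involves decorating the subject, replacing the background, adjusting the composition, and changing the overall artistic style. I need to turn these abstract style descriptions into concrete visual elements. First, the background needs to switch from the outdoor wooden floor and flowers to an indoor doorway, which changes the lighting and the sense of space. Second, the puppy's accessories (a red scarf and a red knit hat) need to blend naturally with its body: the scarf should wrap around the neck and the hat should sit on top of the head. The title text “新年快乐汪” and “HAPPY NEW YEAR” should appear at the top of the picture in a rounded, cute font. Finally, to achieve the “Polaroid paper” and “film photography” effects, I need to describe visual traits such as a white border, slight grain, slightly faded colors, and a retro texture, while adding the edge distortion of a fisheye lens so that the puppy at the center of the frame looks more prominent and cute.</think><recaption>Turn the golden retriever puppy in the reference image into a New Year poster in a retro film style. First, replace the background of outdoor wooden floor and flowers with an indoor doorway scene; the light should look soft and have an indoor feel. Add a thick red knitted scarf around the puppy's neck and a matching red knit hat on its head, making sure the accessories blend naturally with the puppy's fur. At the top of the picture, add the main title “新年快乐汪” in a rounded, cute decorative font, and below it the subtitle “HAPPY NEW YEAR” in a slightly smaller font. Apply a fisheye lens effect to the whole image so that the puppy's head at the center is slightly enlarged with natural edge distortion. Finally, add the white border of Polaroid paper to the image and give it the feel of film photography, including fine print grain, a slight color shift, and retro tones, so that the whole presents a nostalgic and warm festive atmosphere.</recaption>-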

output_image: [image]

✅ HunyuanImage local test passed

stage_config test result (this configuration was later discarded)

+-----------------------------+---------+------------------+
| Metric                      |   Value |   L20x Reference |
+=============================+=========+==================+
| COT similarity to reference |  0.9795 |           0.9644 |
+-----------------------------+---------+------------------+
| COT prefix match            | 29      |          29      |
+-----------------------------+---------+------------------+
| Image-Image similarity      | 91.8283 |          94.5538 |
+-----------------------------+---------+------------------+
| SSIM                        |  0.2501 |           0.242  |
+-----------------------------+---------+------------------+
| PSNR (dB)                   | 13.92   |          14.1    |
+-----------------------------+---------+------------------+
PASSED

deploy_config test result

+-----------------------------+---------+------------------+
| Metric                      |   Value |   L20x Reference |
+=============================+=========+==================+
| COT similarity to reference |  0.9795 |           0.9644 |
+-----------------------------+---------+------------------+
| COT prefix match            | 29      |          29      |
+-----------------------------+---------+------------------+
| Image-Image similarity      | 91.8283 |          94.5538 |
+-----------------------------+---------+------------------+
| SSIM                        |  0.2501 |           0.242  |
+-----------------------------+---------+------------------+
| PSNR (dB)                   | 13.92   |          14.1    |
+-----------------------------+---------+------------------+
PASSED

output_image: [image]

@zengchuang-hw zengchuang-hw force-pushed the fix-async-sampler-token-history branch 10 times, most recently from 35431f9 to 5217fa8 Compare May 21, 2026 02:56
@zengchuang-hw zengchuang-hw marked this pull request as ready for review May 21, 2026 03:04
@zengchuang-hw
Contributor Author

@Bounty-hunter PTAL

@Bounty-hunter
Contributor

https://github.com/vllm-project/vllm/blob/a950e9447e38727fc956afdc242bc6e3796ccb77/vllm/v1/worker/gpu_input_batch.py#L1018 replaces the -1 placeholders with the real token IDs.

If we just move the call to self.input_batch.update_async_output_token_ids() in GPUARModelRunner::_sample to the beginning of the if spec_decode_metadata is None: branch, would that solve the problem?
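
The reordering suggested here can be illustrated with a toy model. This is a sketch under assumptions, not vLLM or vllm-omni code: `ToyInputBatch`, `pending_sampled`, and the `update_first` switch are hypothetical, and only the method name `update_async_output_token_ids` is borrowed from the comment above.

```python
from typing import Optional

PLACEHOLDER = -1  # assumed placeholder for tokens sampled asynchronously


class ToyInputBatch:
    """Toy stand-in for an input batch with async placeholder patching."""

    def __init__(
        self,
        output_token_ids: list[list[int]],
        pending_sampled: Optional[list[int]],
    ) -> None:
        self.output_token_ids = output_token_ids  # may end with PLACEHOLDER
        self.pending_sampled = pending_sampled    # real IDs from the last step

    def update_async_output_token_ids(self) -> None:
        # Replace each trailing placeholder with the real token ID, once.
        if self.pending_sampled is None:
            return
        for history, real in zip(self.output_token_ids, self.pending_sampled):
            if history and history[-1] == PLACEHOLDER:
                history[-1] = real
        self.pending_sampled = None


def sample(
    batch: ToyInputBatch, prefer_model_sampler: bool, update_first: bool
) -> list[list[int]]:
    """Return the token history that the chosen sampler path would see."""
    if update_first:
        # Proposed ordering: patch placeholders before choosing a path.
        batch.update_async_output_token_ids()
    if not prefer_model_sampler:
        # The default path patches placeholders itself (a no-op if done).
        batch.update_async_output_token_ids()
    return [list(h) for h in batch.output_token_ids]


before = sample(ToyInputBatch([[5, PLACEHOLDER]], [1001]), True, update_first=False)
after = sample(ToyInputBatch([[5, PLACEHOLDER]], [1001]), True, update_first=True)
```

In this toy, the model-sampler path sees the placeholder in `before` and the real token in `after`, while the default path is unaffected by the reordering.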

@Bounty-hunter
Contributor

I tried it, and it works.

Signed-off-by: zengchuang <zengchuang3@huawei.com>
@zengchuang-hw zengchuang-hw force-pushed the fix-async-sampler-token-history branch from 5217fa8 to 654cea6 Compare May 21, 2026 08:31
@zengchuang-hw
Contributor Author

zengchuang-hw commented May 21, 2026

Nice catch. I tried this and it does work.

Contributor

@Bounty-hunter Bounty-hunter left a comment

LGTM

@gcanlin gcanlin added the ready label to trigger buildkite CI label May 21, 2026
@gcanlin
Collaborator

gcanlin commented May 21, 2026

@zengchuang-hw Please fix CI, thanks!

zengchuang-hw and others added 2 commits May 22, 2026 11:47
Resolve conflict in tests/e2e/accuracy/test_hunyuan_image3.py by
adopting upstream's new stage config API format (stage_args/engine_args
structure) while keeping the core fix in gpu_ar_model_runner.py.

The PR's core fix moves update_async_output_token_ids() before the
prefer_model_sampler branch, ensuring async output token IDs are
updated regardless of which sampler is used.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Switch from stage_args format to deploy_config (stages/pipeline) format
for consistency with the deploy_config API path.

- Rename _BASE_CONFIG to _DEPLOY_CONFIG with stages/pipeline structure
- Update _make_config to use stages key instead of stage_args
- Update _QUANT_DIT_CONFIG to deploy_config format
- Change _run_offline and _run_dit_model to use deploy_config_path
- Pass deploy_config via kwargs to OmniRunner

Signed-off-by: zengchuang <zengchuang3@huawei.com>
After PR 3681 fix, update_async_output_token_ids() is called BEFORE
the model sampler path, so updated should be True, not False.

Signed-off-by: zengchuang <zengchuang3@huawei.com>
@zengchuang-hw
Contributor Author

I have fixed the unit test errors and resolved the conflict. PTAL @gcanlin

@gcanlin gcanlin merged commit f0af3ab into vllm-project:main May 22, 2026
8 checks passed
tzhouam pushed a commit that referenced this pull request May 22, 2026
…3681)

Signed-off-by: zengchuang <zengchuang3@huawei.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request May 25, 2026
…commits

Adds .claude/skills/perf-bisect/ — a project-local Claude skill that
encodes a repeatable workflow for attributing a vllm-omni perf change to
a specific commit. Covers TTS, diffusion-image, and omni-audio model
families. Generalised from the workflow used during the post-vllm-project#3662
regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel
blast-radius file lists, per-family bench-harness examples, and
ready-to-paste cells for each model class so the same discipline applies
across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga:
extract the full cell (model, task, deploy_yaml, dataset, num_prompts,
max_concurrency, num_warmups + family knobs) from the regression report
BEFORE writing any bench script. Measuring a sibling cell that does not
exercise the regressed code path is the most common path to a false
"no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition
  discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step
  workflow with parallel TTS / diffusion / omni blast-radius file lists
  and per-family bench-harness snippets, the rationalization table of
  excuses-vs-reality, the red-flags list, and a one-paragraph
  cross-platform invariant.

- references/family-knobs.md: full TTS / diffusion / omni knob tables
  (extra_body, stage_overrides, headline metrics).

- references/pitfalls.md: six mechanical failure modes with copy-paste
  remediations (pytest -k zero-match, venv PATH for ninja subprocess,
  stale server PID, multi-tenant GPUs, /v1/models settle, cold download).

- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with
  vllm bench serve, polls /v1/models with a settle window, parses
  median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up
  the server between commits.

- scripts/kanban_trend.py: per-build metric time series from the
  vllm-omni-kanban repo with rolling-delta percent and regression
  markers; works for any cell prefix the kanban tracks.

- scripts/cells/: four cells covering the three families —
  tts_default_voice_high_c (the vllm-project#3839 regression class),
  tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024
  (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni
  audio-in/audio-out) — plus a README documenting the
  <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y",
"verify PR #N actually improves perf", "find which commit slowed default_voice",
"高并发 TTFP 劣化".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
zengchuang-hw added a commit to zengchuang-hw/vllm-omni that referenced this pull request Jun 1, 2026
…llm-project#3681)

Signed-off-by: zengchuang <zengchuang3@huawei.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Labels

ready label to trigger buildkite CI

3 participants