[BugFix][NPU] Honor prefer_model_sampler in NPU AR runner #3517
Open

gcanlin wants to merge 2 commits into

Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
gcanlin (Collaborator, Author):
@Fishermanykx Could you help test it?

Fishermanykx (Contributor):
Tested on 8x Ascend 64G NPUs; offline output still shows no `<recaption>`.

Collaborator:
fix it asap
Summary

Fix #3503: HunyuanImage3 AR output on NPU is missing the `<recaption>` opening tag after `</think>`, which breaks the downstream DiT stage. The same code path (`prefer_model_sampler`) is also used by CosyVoice3, so this fix benefits any model that opts into custom sampling.

Root Cause
HunyuanImage3 declares `prefer_model_sampler = True` and implements a custom `sample()` method that ports the official `_StageTransitionLogitsProcessor`. After `</think>`, it overrides the logits to force `<recaption>` (and analogous transitions for `</recaption>`).
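The forcing step works roughly as in the sketch below. This is a minimal illustration of the stage-transition idea, not the actual HunyuanImage3 port; the token ids and the function name are hypothetical.

```python
import torch

# Hypothetical token ids for illustration; the real model resolves these
# from its tokenizer.
THINK_END_ID = 101       # </think>
RECAPTION_OPEN_ID = 102  # <recaption>

def force_stage_transition(logits: torch.Tensor, last_token_id: int) -> torch.Tensor:
    """After </think>, mask everything except <recaption> so it is the only
    sampleable next token; anywhere else, leave the logits untouched."""
    if last_token_id == THINK_END_ID:
        forced = torch.full_like(logits, float("-inf"))
        forced[..., RECAPTION_OPEN_ID] = 0.0
        return forced
    return logits
```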
The GPU AR runner honors this contract at `gpu_ar_model_runner.py::_sample`, where the dispatch routes to `model.sample()`. `NPUARModelRunner` had no such override: it inherited `vllm-ascend.NPUModelRunner._sample`, which unconditionally invokes the runner's built-in sampler.
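For contrast, here is the shape of the two paths as described above; the class and attribute names are simplified stand-ins, not the real signatures:

```python
class BaseRunnerSketch:
    """Stand-in for the platform base runner: the stock sampling path."""

    def _sample(self, logits, sampling_metadata):
        return self.sampler(logits, sampling_metadata)

class GPUARRunnerSketch(BaseRunnerSketch):
    """GPU AR runner: checks prefer_model_sampler before falling through."""

    def _sample(self, logits, sampling_metadata):
        if getattr(self.model, "prefer_model_sampler", False):
            # Model-owned sampling: the stage-transition forcing runs here.
            return self.model.sample(logits, sampling_metadata)
        return super()._sample(logits, sampling_metadata)

class NPUARRunnerBeforeFixSketch(BaseRunnerSketch):
    """NPU AR runner before this PR: no override, so the stock path always
    runs and model.sample() is never reached."""
```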
So `model.sample()` was never called on NPU; the stage-transition forcing logic was completely bypassed. With sampling `temperature=0.6 / top_p=0.95 / top_k=1024`, the model would freely sample whatever came after `</think>`, which on Chinese prompts almost always landed on ordinary text (请...) instead of the `<recaption>` token.

This is purely an integration gap, not a sampler-algorithm difference between CUDA and NPU. The dispatch hook was added in #1703 for CosyVoice3 and gates on the generic `prefer_model_sampler` attribute, so HunyuanImage3 (#2713) opted into it for free on GPU. NPU never picked up the same generalization.

Fix
Move `_build_model_sampler_output_token_ids` and `_sampling_metadata_for_model_sampler` from `GPUARModelRunner` to `OmniGPUModelRunner`. They are pure logic over `self.input_batch` with no device-specific code.

The MRO for `OmniNPUModelRunner` is `OmniNPUModelRunner → OmniGPUModelRunner → NPUModelRunner → ...`, so NPU inherits the helpers automatically.
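The MRO claim is ordinary Python linearization; a toy reduction with the real class bodies elided:

```python
class NPUModelRunner:  # stand-in for vllm-ascend's runner
    pass

class OmniGPUModelRunner:  # now hosts the two shared helpers
    def _build_model_sampler_output_token_ids(self, *args):
        ...

class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    pass

print([cls.__name__ for cls in OmniNPUModelRunner.__mro__])
# ['OmniNPUModelRunner', 'OmniGPUModelRunner', 'NPUModelRunner', 'object']
```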
Add a thin `_sample` override in `NPUARModelRunner` mirroring the GPU one. On the fall-through (no model sampler, or spec decode), call `super()._sample(...)` so NPU keeps its `lmhead_tp_enable` logits slicing and `rejection_sampler` path unchanged.
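A sketch of the override's shape; the spec-decode guard and the helper wiring are assumptions about the final code, not copied from it:

```python
class AscendRunnerSketch:
    """Stand-in for vllm-ascend's NPUModelRunner: the lmhead_tp_enable
    slicing and rejection_sampler path sit behind this _sample."""

    def _sample(self, logits, sampling_metadata):
        return self.sampler(logits, sampling_metadata)

class NPUARModelRunnerSketch(AscendRunnerSketch):
    def _sample(self, logits, sampling_metadata):
        use_model_sampler = (
            getattr(self.model, "prefer_model_sampler", False)
            and not self.speculative_config  # hypothetical spec-decode guard
        )
        if use_model_sampler:
            # Model-owned sampling; the shared helper (now on
            # OmniGPUModelRunner) packages the output token ids.
            output = self.model.sample(logits, sampling_metadata)
            return self._build_model_sampler_output_token_ids(output)
        # Fall through: keep NPU's lmhead_tp_enable slicing and
        # rejection_sampler behavior unchanged.
        return super()._sample(logits, sampling_metadata)
```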
Touched files (+83 / -48):

- `vllm_omni/worker/gpu_model_runner.py`: host the two shared helpers
- `vllm_omni/worker/gpu_ar_model_runner.py`: drop the duplicates
- `vllm_omni/platforms/npu/worker/npu_ar_model_runner.py`: drop unused imports, add the `_sample` override

Future Cleanup
This PR is the minimal correctness fix. Two follow-ups are worth doing once we have more `prefer_model_sampler` users or another platform:
1. Push `_sample` itself down to `OmniGPUModelRunner`. The dispatch logic is identical between GPU and NPU; only the fall-through target differs, and `super()._sample(...)` resolves correctly via MRO on both sides (GPU → `vllm.GPUModelRunner._sample`; NPU → `vllm-ascend.NPUModelRunner._sample` with `lmhead_tp_enable` slicing and the rejection sampler). Holding off here only because the logit-bias call uses `self.sampler.logit_bias_state`, and we want one more pair of eyes on whether that shape is identical on both platforms before merging the override.
2. Collapse the `GPUARModelRunner`/`NPUARModelRunner` duplication. The two files are ~1040 lines each with ~95% structural overlap (`_request_final_stage_id`, `_request_needs_downstream_stage_payload`, `_resolve_pooler_payload_req_ids`, `_resolve_req_hidden_states`, `_maybe_update_prefix_cache`, `_resolve_global_request_id`, the `propose_draft_token_ids` shape, etc.). Two viable shapes:

   - Mixins (`ModelSamplerDispatchMixin`, etc.): smallest blast radius; both runners mix in, and each new shared concern becomes its own mixin.
   - Sibling inheritance, `NPUARModelRunner(GPUARModelRunner, NPUModelRunner)`: matches the existing `OmniNPUModelRunner` shape, but pulls in `execute_model` / `sample_tokens` / `_capture_talker_mtp_graphs` / `capture_model` (all device-specific), which NPU still has to override. The surface saved isn't worth the silent-regression risk: any new helper added to the GPU runner that touches `torch.cuda.*` would auto-leak to NPU.

   Recommend the mixin first; revisit the sibling merge if 3+ shared helpers accumulate without device-specific contamination.
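If we go the mixin route, a minimal sketch of the shape (only `ModelSamplerDispatchMixin` is named above; the dispatch body and the example class lines are assumptions):

```python
class ModelSamplerDispatchMixin:
    """Shared dispatch: honor prefer_model_sampler, otherwise defer to the
    platform runner's own _sample via the MRO."""

    def _sample(self, logits, sampling_metadata):
        if getattr(self.model, "prefer_model_sampler", False):
            return self.model.sample(logits, sampling_metadata)
        return super()._sample(logits, sampling_metadata)

# Each runner would mix it in ahead of its platform base, e.g.:
#   class GPUARModelRunner(ModelSamplerDispatchMixin, OmniGPUModelRunner): ...
#   class NPUARModelRunner(ModelSamplerDispatchMixin, OmniNPUModelRunner): ...
# super()._sample then resolves to the platform fall-through on each side.
```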