[BugFix][NPU] Honor prefer_model_sampler in NPU AR runner #3517
Open

gcanlin wants to merge 2 commits into

Conversation
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
gcanlin (Collaborator, Author):
@Fishermanykx Could you help test it?

Fishermanykx (Contributor):
Tested on 8x Ascend 64G NPUs; offline output still shows no `<recaption>`.

Collaborator:
fix it asap
Summary

Fix #3503: HunyuanImage3 AR output on NPU is missing the `<recaption>` opening tag after `</think>`, which breaks the downstream DiT stage. The same code path (`prefer_model_sampler`) is also used by CosyVoice3, so this fix benefits any model that opts into custom sampling.

Root Cause
HunyuanImage3 declares `prefer_model_sampler = True` and implements a custom `sample()` method that ports the official `_StageTransitionLogitsProcessor`. After `</think>`, it overrides the logits to force `<recaption>` (and analogous transitions for `</recaption>`).
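The forcing step works roughly as in the sketch below. This is a minimal illustration of the stage-transition idea, not the actual HunyuanImage3 port; the token ids and the function name are hypothetical.

```python
import torch

# Hypothetical token ids for illustration; the real model resolves these
# from its tokenizer.
THINK_END_ID = 101       # </think>
RECAPTION_OPEN_ID = 102  # <recaption>

def force_stage_transition(logits: torch.Tensor, last_token_id: int) -> torch.Tensor:
    """After </think>, mask everything except <recaption> so it is the only
    sampleable next token; anywhere else, leave the logits untouched."""
    if last_token_id == THINK_END_ID:
        forced = torch.full_like(logits, float("-inf"))
        forced[..., RECAPTION_OPEN_ID] = 0.0
        return forced
    return logits
```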
The GPU AR runner honors this contract at `gpu_ar_model_runner.py::_sample`, where the dispatch routes to `model.sample()`. `NPUARModelRunner` had no such override: it inherited `vllm-ascend.NPUModelRunner._sample`, which unconditionally invokes the runner's built-in sampler.
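For contrast, here is the shape of the two paths as described above; the class and attribute names are simplified stand-ins, not the real signatures:

```python
class BaseRunnerSketch:
    """Stand-in for the platform base runner: the stock sampling path."""

    def _sample(self, logits, sampling_metadata):
        return self.sampler(logits, sampling_metadata)

class GPUARRunnerSketch(BaseRunnerSketch):
    """GPU AR runner: checks prefer_model_sampler before falling through."""

    def _sample(self, logits, sampling_metadata):
        if getattr(self.model, "prefer_model_sampler", False):
            # Model-owned sampling: the stage-transition forcing runs here.
            return self.model.sample(logits, sampling_metadata)
        return super()._sample(logits, sampling_metadata)

class NPUARRunnerBeforeFixSketch(BaseRunnerSketch):
    """NPU AR runner before this PR: no override, so the stock path always
    runs and model.sample() is never reached."""
```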
So `model.sample()` was never called on NPU; the stage-transition forcing logic was completely bypassed. With sampling `temperature=0.6 / top_p=0.95 / top_k=1024`, the model would freely sample whatever came after `</think>`, which on Chinese prompts almost always landed on ordinary text (请...) instead of the `<recaption>` token.

This is purely an integration gap, not a sampler-algorithm difference between CUDA and NPU. The dispatch hook was added in #1703 for CosyVoice3 and gates on the generic `prefer_model_sampler` attribute, so HunyuanImage3 (#2713) opted into it for free on GPU. NPU never picked up the same generalization.

Fix
Move `_build_model_sampler_output_token_ids` and `_sampling_metadata_for_model_sampler` from `GPUARModelRunner` to `OmniGPUModelRunner`. They are pure logic over `self.input_batch` with no device-specific code.

The MRO for `OmniNPUModelRunner` is `OmniNPUModelRunner → OmniGPUModelRunner → NPUModelRunner → ...`, so NPU inherits the helpers automatically.
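The MRO claim is ordinary Python linearization; a toy reduction with the real class bodies elided:

```python
class NPUModelRunner:  # stand-in for vllm-ascend's runner
    pass

class OmniGPUModelRunner:  # now hosts the two shared helpers
    def _build_model_sampler_output_token_ids(self, *args):
        ...

class OmniNPUModelRunner(OmniGPUModelRunner, NPUModelRunner):
    pass

print([cls.__name__ for cls in OmniNPUModelRunner.__mro__])
# ['OmniNPUModelRunner', 'OmniGPUModelRunner', 'NPUModelRunner', 'object']
```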
Add a thin `_sample` override in `NPUARModelRunner` mirroring the GPU one. On the fall-through (no model sampler, or spec decode), call `super()._sample(...)` so NPU keeps its `lmhead_tp_enable` logits slicing and `rejection_sampler` path unchanged.
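A sketch of the override's shape; the spec-decode guard and the helper wiring are assumptions about the final code, not copied from it:

```python
class AscendRunnerSketch:
    """Stand-in for vllm-ascend's NPUModelRunner: the lmhead_tp_enable
    slicing and rejection_sampler path sit behind this _sample."""

    def _sample(self, logits, sampling_metadata):
        return self.sampler(logits, sampling_metadata)

class NPUARModelRunnerSketch(AscendRunnerSketch):
    def _sample(self, logits, sampling_metadata):
        use_model_sampler = (
            getattr(self.model, "prefer_model_sampler", False)
            and not self.speculative_config  # hypothetical spec-decode guard
        )
        if use_model_sampler:
            # Model-owned sampling; the shared helper (now on
            # OmniGPUModelRunner) packages the output token ids.
            output = self.model.sample(logits, sampling_metadata)
            return self._build_model_sampler_output_token_ids(output)
        # Fall through: keep NPU's lmhead_tp_enable slicing and
        # rejection_sampler behavior unchanged.
        return super()._sample(logits, sampling_metadata)
```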
Touched files (+83 / -48):

- `vllm_omni/worker/gpu_model_runner.py`: host the two shared helpers
- `vllm_omni/worker/gpu_ar_model_runner.py`: drop the duplicates
- `vllm_omni/platforms/npu/worker/npu_ar_model_runner.py`: drop unused imports, add the `_sample` override

Future Cleanup
This PR is the minimal correctness fix. Two follow-ups are worth doing once we have more `prefer_model_sampler` users or another platform:
1. Push `_sample` itself down to `OmniGPUModelRunner`. The dispatch logic is identical between GPU and NPU; only the fall-through target differs, and `super()._sample(...)` resolves correctly via MRO on both sides (GPU → `vllm.GPUModelRunner._sample`; NPU → `vllm-ascend.NPUModelRunner._sample` with `lmhead_tp_enable` slicing and the rejection sampler). Holding off here only because the logit-bias call uses `self.sampler.logit_bias_state`, and we want one more pair of eyes on whether that shape is identical on both platforms before merging the override.
2. Collapse the `GPUARModelRunner`/`NPUARModelRunner` duplication. The two files are ~1040 lines each with ~95% structural overlap (`_request_final_stage_id`, `_request_needs_downstream_stage_payload`, `_resolve_pooler_payload_req_ids`, `_resolve_req_hidden_states`, `_maybe_update_prefix_cache`, `_resolve_global_request_id`, the `propose_draft_token_ids` shape, etc.). Two viable shapes:

   - Mixins (`ModelSamplerDispatchMixin`, etc.): smallest blast radius; both runners mix in, and each new shared concern becomes its own mixin.
   - Sibling inheritance, `NPUARModelRunner(GPUARModelRunner, NPUModelRunner)`: matches the existing `OmniNPUModelRunner` shape, but pulls in `execute_model` / `sample_tokens` / `_capture_talker_mtp_graphs` / `capture_model` (all device-specific), which NPU still has to override. The surface saved isn't worth the silent-regression risk: any new helper added to the GPU runner that touches `torch.cuda.*` would auto-leak to NPU.

   Recommend the mixin first; revisit the sibling merge if 3+ shared helpers accumulate without device-specific contamination.
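If we go the mixin route, a minimal sketch of the shape (only `ModelSamplerDispatchMixin` is named above; the dispatch body and the example class lines are assumptions):

```python
class ModelSamplerDispatchMixin:
    """Shared dispatch: honor prefer_model_sampler, otherwise defer to the
    platform runner's own _sample via the MRO."""

    def _sample(self, logits, sampling_metadata):
        if getattr(self.model, "prefer_model_sampler", False):
            return self.model.sample(logits, sampling_metadata)
        return super()._sample(logits, sampling_metadata)

# Each runner would mix it in ahead of its platform base, e.g.:
#   class GPUARModelRunner(ModelSamplerDispatchMixin, OmniGPUModelRunner): ...
#   class NPUARModelRunner(ModelSamplerDispatchMixin, OmniNPUModelRunner): ...
# super()._sample then resolves to the platform fall-through on each side.
```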