[Bugfix] Revert MiMo-Audio local_sampler to greedy to fix text truncation under concurrent batching (followup to #3686) #3817
Merged
hsliuustc0106 merged 3 commits on May 22, 2026
…boundary

When length % chunk_size == 0, _flush_remaining_codes previously returned an empty finished sentinel, dropping the tail audio frames. The vocoder needs the final chunk plus left context to produce a stable tail; otherwise voice cuts off at chunk boundaries. Fall back to chunk_size as the context length in this case, matching the behavior pinned by the new unit tests in tests/model_executor/stage_input_processors/test_mimo_audio_flush_remaining_codes.py.

Signed-off-by: Jialong Liu <88185941+Galleons2029@users.noreply.github.com>
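The fallback described in this commit message can be illustrated with a minimal sketch. The helper name `tail_context_length` and its signature are hypothetical and are not taken from `_flush_remaining_codes`; the only behavior drawn from the message is that an exact multiple of `chunk_size` should yield a full chunk rather than an empty tail. Using the remainder in the non-multiple case is an assumption.

```python
def tail_context_length(length: int, chunk_size: int) -> int:
    """Frames to hand to the vocoder on the final flush (illustrative only).

    Hypothetical helper: the real logic lives in _flush_remaining_codes
    and may differ in detail.
    """
    if length <= 0 or chunk_size <= 0:
        return 0
    remainder = length % chunk_size
    # Exact multiple of chunk_size: fall back to a full chunk instead of
    # returning an empty tail, so the last frames are not dropped.
    return remainder if remainder else chunk_size


print(tail_context_length(60, 30))  # length is an exact multiple of chunk_size
print(tail_context_length(65, 30))  # length leaves a partial final chunk
```

The point of the sketch is only the boundary case: without the fallback, `60 % 30` is zero and the final flush would carry no frames.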
…batching

PR vllm-project#3686 set `local_sampler.do_sample=True` (temperature=0.9, top_p=0.95) intending to fix MiMo-Audio voice instability by avoiding the silent argmax that `MiMoLocalSamplerTensor` enforces on the CUDA-graph path. The change unintentionally destabilises text continuations under concurrent batched requests, surfaced by buildkite/vllm-omni #10167 on the merge-to-main run of `tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001`:

    AssertionError: The output does not contain any of the keywords.
    response.text_content = 'The capital of China is.'   # missing "Beijing"

Root cause is an audio-embedding feedback loop into the text decode path:

    local_forward (stochastic local_sampler)
      -> next_speech_tokens   # random under temp=0.9 top_p=0.95
      -> new_audio_emb = sum_k speech_embeddings[k](next_speech_tokens[..,k,..])
      -> _cached_new_audio_emb_by_req[req_id] = new_audio_emb
      -> next decode step: _prepare_multimodal_embeddings_with_cache adds
         prev_new_audio_emb back into inputs_embeds
      -> self.model(input_ids, positions, inputs_embeds=inputs_embeds)
      -> compute_logits -> global_sampler.sample (greedy)

`global_sampler` is greedy but its *logits* depend on the random audio embedding from the previous step, so the greedy argmax flips for some batch members. At batch=5 with identical prompt "What is the capital of China? Answer in 20 words." we observed 5 different continuations, one of which dropped "Beijing" entirely and emitted "The capital of China is.<eot>" instead.

Reproduction (with --run-level=advanced_model):

    before revert: 4/5 contain "beijing", 1/5 truncates -> FAIL
    after revert : 5/5 contain "beijing"                -> PASS

Setting do_sample=False also restores the CUDA-graph path (`use_cg = (do_sample is None or do_sample is False) and ...`), undoing the ~18% stage-0 per-prompt latency regression Codex flagged on vllm-project#3686.

The voice-instability symptoms PR vllm-project#3686 set out to fix are actually resolved by its other change -- `codec_left_context_frames: 3 -> 40` in the stage-1 vocoder config, which covers `vocoder_attn_window_size=[40, 10]` and prevents acoustic-state resets at chunk boundaries. That change lives in stage_configs / stage_input_processors and is preserved here. Voice diversity, if needed, should be reintroduced in the codec/vocoder path (stage-1) with a per-request seed rather than by randomising the shared local_sampler whose outputs feed back into stage-0 text logits.

Three sites touched, all in mimo_audio_llm.py:
  - __init__: self.local_sampler do_sample True -> False
  - base_local_forward fallback: same
  - local_forward fallback: same

The unrelated `pooling_output is None` guard in stage_input_processors/mimo_audio.py landed earlier on this branch is retained. That guard fixes a separate AttributeError in chunk_transfer_adapter when stage-0 emits None pooling_output on text-only paths. It is independent of the truncation bug.

Fixes vllm-project#3815
Follow-up to vllm-project#3686

Signed-off-by: Galleons2029 <Galleons777@gmail.com>
Signed-off-by: Jialong Liu <88185941+Galleons2029@users.noreply.github.com>
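The feedback loop in this commit message can be reproduced in miniature. The following is a toy sketch, not the MiMo-Audio code: the two candidate text tokens, the logit values, and the scalar standing in for the cached audio embedding are all invented for illustration. It only shows why a greedy text sampler stops being deterministic once its logits depend on a value produced by a stochastic sampler on the previous step.

```python
import random


def greedy_text_token(audio_feedback: float) -> str:
    # Two near-tied text candidates (made-up logits). The cached audio
    # embedding from the previous step shifts one of them slightly.
    logits = {"Beijing": 1.00 + audio_feedback, ".": 1.02}
    # Greedy sampling: always take the argmax.
    return max(logits, key=logits.get)


def decode_step(rng: random.Random, do_sample: bool) -> str:
    # Local sampler: greedy always yields speech token 0; with sampling
    # enabled it may yield token 0 or token 1.
    speech_token = rng.choice([0, 1]) if do_sample else 0
    # Stand-in for new_audio_emb, which is fed back into inputs_embeds.
    audio_feedback = {0: 0.05, 1: -0.05}[speech_token]
    return greedy_text_token(audio_feedback)


def run_batch(do_sample: bool, batch_size: int = 5, seed: int = 0) -> list:
    rng = random.Random(seed)
    return [decode_step(rng, do_sample) for _ in range(batch_size)]


print("greedy local sampler:    ", run_batch(do_sample=False))
print("stochastic local sampler:", run_batch(do_sample=True))
```

With the greedy local sampler every batch member produces the same text token; with sampling enabled, identical prompts can diverge even though the text-side argmax itself never changed.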
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request on May 25, 2026
…commits

Adds .claude/skills/perf-bisect/, a project-local Claude skill that encodes a repeatable workflow for attributing a vllm-omni perf change to a specific commit. Covers TTS, diffusion-image, and omni-audio model families. Generalised from the workflow used during the post-vllm-project#3662 regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel blast-radius file lists, per-family bench-harness examples, and ready-to-paste cells for each model class so the same discipline applies across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga: extract the full cell (model, task, deploy_yaml, dataset, num_prompts, max_concurrency, num_warmups + family knobs) from the regression report BEFORE writing any bench script. Measuring a sibling cell that does not exercise the regressed code path is the most common path to a false "no regression" verdict.

Layout (progressive disclosure):
- SKILL.md: trigger conditions, paired tools, the cell-definition discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step workflow with parallel TTS / diffusion / omni blast-radius file lists and per-family bench-harness snippets, the rationalization table of excuses-vs-reality, the red-flags list, and a one-paragraph cross-platform invariant.
- references/family-knobs.md: full TTS / diffusion / omni knob tables (extra_body, stage_overrides, headline metrics).
- references/pitfalls.md: six mechanical failure modes with copy-paste remediations (pytest -k zero-match, venv PATH for ninja subprocess, stale server PID, multi-tenant GPUs, /v1/models settle, cold download).
- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with vllm bench serve, polls /v1/models with a settle window, parses median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up the server between commits.
- scripts/kanban_trend.py: per-build metric time series from the vllm-omni-kanban repo with rolling-delta percent and regression markers; works for any cell prefix the kanban tracks.
- scripts/cells/: four cells covering the three families: tts_default_voice_high_c (the vllm-project#3839 regression class), tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024 (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni audio-in/audio-out), plus a README documenting the <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y", "verify PR #N actually improves perf", "find which commit slowed default_voice", "高并发 TTFP 劣化" ("TTFP degradation under high concurrency").

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
zengchuang-hw pushed a commit to zengchuang-hw/vllm-omni that referenced this pull request on Jun 1, 2026
…tion under concurrent batching (followup to vllm-project#3686) (vllm-project#3817)

Signed-off-by: Jialong Liu <88185941+Galleons2029@users.noreply.github.com>
Signed-off-by: Galleons2029 <Galleons777@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Summary
Follow-up to merged PR #3686 ("Fix MiMo-Audio voice instability"). That PR set `local_sampler.do_sample=True` (temperature=0.9, top_p=0.95) on the assumption that the local sampler is "entirely internal" to `local_forward`. It is not: its sampled `next_speech_tokens` are embedded into `new_audio_emb`, cached per request, and added back into `inputs_embeds` on the next decode step via `_prepare_multimodal_embeddings_with_cache`. This randomises the text logits even though `global_sampler` stays greedy, causing the greedy argmax to flip for some batch members and dropping tokens from text continuations under concurrent batching.

Symptom in buildkite/vllm-omni #10167, on the merge-to-main run of `tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001`: the keyword assertion fails because one response reads 'The capital of China is.' with no "Beijing".

This PR:
- Reverts the three `do_sample` sites in `vllm_omni/model_executor/models/mimo_audio/mimo_audio_llm.py` from `True` back to `False` (only the `local_sampler` defaults: `__init__`, `base_local_forward` fallback, `local_forward` fallback). `global_sampler` was already greedy and is unchanged.
- Restores the CUDA-graph path through the `use_cg = (do_sample is None or do_sample is False) and ...` gate, undoing the ~18% stage-0 per-prompt latency regression Codex flagged on #3686.
- Keeps `codec_chunk_frames: 3 → 30`, `codec_left_context_frames: 3 → 40`, the min/default constants, and the new `stage_configs/mimo_audio.yaml`. Those address the actual voice-instability root cause (vocoder attention window `[40, 10]` was not covered by the previous left context) and live in the codec/vocoder path, which this revert does not touch.
- Keeps the `pooling_output is None` guard in `vllm_omni/model_executor/stage_input_processors/mimo_audio.py` from an earlier commit on this branch. That guard fixes a separate `AttributeError: 'NoneType' object has no attribute 'get'` in `chunk_transfer_adapter.py:226` on text-only paths (visible 11× in #10167 logs); it is not the cause of the truncation but is a real bug worth keeping fixed in the same code path.

If voice diversity beyond what greedy `local_sampler` + correct vocoder context produces turns out to be needed, the right place to reintroduce it is stage-1 (codec/vocoder) with a per-request seed propagated from `SamplingParams.seed`, not the shared stage-0 `local_sampler` whose outputs feed back into text logits.

Fixes #3815. Follow-up to #3686. Fixes buildkite/vllm-omni #10167 regression.
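For the per-request seed suggested above, a minimal sketch of the idea follows. It is not an existing vllm-omni API: `make_request_rng` and `draw_noise` are hypothetical names, and how stage-1 would actually consume a seed taken from `SamplingParams.seed` is an assumption. The sketch only shows the property that matters here: each request owns its own generator, so randomness is reproducible per request and never leaks across batch members.

```python
import random
from typing import List, Optional


def make_request_rng(seed: Optional[int]) -> random.Random:
    # One generator per request (hypothetical helper). Passing None lets
    # random.Random seed itself from OS entropy, i.e. an unseeded request.
    return random.Random(seed)


def draw_noise(rng: random.Random, n: int) -> List[float]:
    # Stand-in for whatever stochastic input a codec/vocoder stage might use.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# Two requests with the same seed see the same noise, regardless of what
# other requests in the batch draw in between.
rng_a = make_request_rng(1234)
rng_b = make_request_rng(1234)
first = draw_noise(rng_a, 4)
draw_noise(make_request_rng(99), 10)  # an unrelated request in the same batch
print(draw_noise(rng_b, 4) == first)
```

Keeping the generator scoped to the request, rather than sharing one sampler across the batch, is what would keep any added diversity out of the stage-0 text path.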
Test Plan
Setup:
- `MiMo-Audio-7B-Instruct` on RTX 5090
- `tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001`
- `--run-level=advanced_model` (so `assert_omni_response` actually executes the keyword check; under `core_model` the keyword branch is skipped)
- 5 parallel requests (`get_max_batch_size("few")`)
- Prompt: `"What is the capital of China? Answer in 20 words."`
- Expected keyword: `"beijing"` (case-insensitive)

Command:

```bash
pytest tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001 \
  -v -s --run-level=advanced_model 2>&1 | tee /tmp/mimo_text_advanced.log
grep -E "text content is:|chunk_transfer_adapter.*NoneType|subprocess died|PASSED|FAILED" /tmp/mimo_text_advanced.log
```

Differential reproduction (to confirm the regression comes from PR #3686's diff and is not pre-existing):
1. Check out `e949ccf0^` (PR #3686's parent commit on upstream/main), run the same command → expected PASS, 5/5 contain "beijing".
2. Check out `upstream/main` (with PR #3686 merged), run the same command → expected FAIL, at least one response truncates.

Test Result
Before this PR (upstream/main with #3686 merged, `e949ccf0`): reproduces the regression, with 4/5 responses containing "beijing" and 1/5 truncated.

The log additionally shows 11× `ERROR [chunk_transfer_adapter.py:234] Failed to use custom_process_input_func for payload extraction: 'NoneType' object has no attribute 'get'` (independent issue, addressed by the kept `pooling_output is None` guard).

After this PR (this branch HEAD): all 5 parallel responses contain "Beijing".
Differential: running `e949ccf0^` (PR #3686's parent) with the same configuration also PASSES with all 5 responses containing "Beijing", confirming the truncation is introduced by PR #3686's `local_sampler` change and is fully resolved by this revert (not merely masked).

Side observations from the post-fix log:
- No `chunk_transfer_adapter.py:234` `NoneType` errors during the test run (kept guard works).
- The `(StageEngineCoreProc ...) Shutdown initiated` and `subprocess died unexpectedly` lines that follow `PASSED` are normal teardown-phase output from `OmniServer` tearing down stages; they appear after the test result, not during inference.
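The pass/fail criterion used in the test plan, a case-insensitive keyword match over each parallel response, can be sketched as below. This is a stand-in and not the implementation of `assert_omni_response`; the function name `count_keyword_hits` is hypothetical. The truncated response is the one quoted in the failure, while the passing response is an invented example.

```python
from typing import Iterable, List


def count_keyword_hits(responses: Iterable[str], keywords: List[str]) -> int:
    """Count responses containing at least one keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return sum(any(k in r.lower() for k in lowered) for r in responses)


responses = [
    "The capital of China is.",          # truncated response from the failing run
    "The capital of China is Beijing.",  # illustrative passing response
]
hits = count_keyword_hits(responses, ["beijing"])
print(f"{hits}/{len(responses)} responses contain the keyword")
```

A batch passes only when every response counts as a hit, which is why a single truncated continuation out of five is enough to fail the test.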