
[Bugfix] Revert MiMo-Audio local_sampler to greedy to fix text truncation under concurrent batching (followup to #3686)#3817

Merged
hsliuustc0106 merged 3 commits into
vllm-project:mainfrom
Galleons2029:fix/mimo-audio-none-pooling-output
May 22, 2026

@Galleons2029
Contributor

Summary

Follow-up to merged PR #3686 ("Fix MiMo-Audio voice instability"). That PR set local_sampler.do_sample=True (temperature=0.9, top_p=0.95) on the assumption that the local sampler is "entirely internal" to local_forward. It is not: its sampled next_speech_tokens are embedded into new_audio_emb, cached per request, and added back into inputs_embeds on the next decode step via _prepare_multimodal_embeddings_with_cache. This randomises the text logits even though global_sampler stays greedy, causing the greedy argmax to flip for some batch members and dropping tokens from text continuations under concurrent batching.
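The feedback path described above can be illustrated with a toy model. This is a minimal sketch, not the MiMo-Audio code: the two-dimensional hidden state, the two-token vocabulary, and the helper name `greedy_next_token` are all made up for illustration. It only shows that a greedy argmax is deterministic in its inputs, so if a cached audio embedding differs between requests, identical prompts can still decode to different tokens.

```python
from typing import Sequence


def greedy_next_token(text_emb: Sequence[float],
                      prev_audio_emb: Sequence[float],
                      unembed: Sequence[Sequence[float]]) -> int:
    """Greedy argmax over logits computed from text_emb + prev_audio_emb."""
    # The cached audio embedding is added into the input embedding,
    # mirroring the "added back into inputs_embeds" step in the summary.
    hidden = [t + a for t, a in zip(text_emb, prev_audio_emb)]
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return max(range(len(logits)), key=logits.__getitem__)


# Toy values (hypothetical): token 0 narrowly wins on text alone.
UNEMBED = [[1.0, 0.0],   # row for token 0
           [0.0, 1.0]]   # row for token 1
TEXT_EMB = [1.0, 0.9]

# Same prompt, two requests whose cached audio embeddings differ because
# the previous local-sampler step was stochastic.
request_a = greedy_next_token(TEXT_EMB, [0.0, 0.0], UNEMBED)
request_b = greedy_next_token(TEXT_EMB, [-0.1, 0.1], UNEMBED)
print(request_a, request_b)  # prints "0 1"
```

The sampler never draws a random number here; the divergence comes entirely from the perturbed input.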

Symptom in buildkite/vllm-omni #10167, on the merge-to-main run of tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001:

AssertionError: The output does not contain any of the keywords.
response.text_content = 'The capital of China is.'   # missing "Beijing"

This PR:

  1. Reverts the three local_sampler sites in vllm_omni/model_executor/models/mimo_audio/mimo_audio_llm.py that #3686 flipped from do_sample=False to True, setting them back to do_sample=False (__init__, base_local_forward fallback, local_forward fallback). global_sampler was already greedy and is unchanged.
  2. Restores the CUDA-graph path through the use_cg = (do_sample is None or do_sample is False) and ... gate, undoing the ~18% stage-0 per-prompt latency regression Codex flagged on PR #3686.
  3. Keeps all of PR #3686's stage-1 changes intact: codec_chunk_frames: 3 → 30, codec_left_context_frames: 3 → 40, the min/default constants, and the new stage_configs/mimo_audio.yaml. Those address the actual voice-instability root cause (vocoder attention window [40, 10] was not covered by the previous left context), live in the codec/vocoder path, and are unaffected by this revert.
  4. Keeps the pooling_output is None guard in vllm_omni/model_executor/stage_input_processors/mimo_audio.py, added by an earlier commit on this branch. That guard fixes a separate AttributeError: 'NoneType' object has no attribute 'get' in chunk_transfer_adapter.py:226 on text-only paths (visible 11× in #10167 logs). It is not the cause of the truncation, but it is a real bug worth keeping fixed in the same code path.
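The guard in item 4 follows a common pattern: return early when the upstream stage produced no pooling output, instead of calling `.get` on `None`. The sketch below is illustrative only; the function name `extract_payload` and the key `"hidden_states"` are hypothetical and do not come from the actual stage input processor.

```python
from typing import Any, Mapping, Optional


def extract_payload(pooling_output: Optional[Mapping[str, Any]],
                    key: str = "hidden_states") -> Optional[Any]:
    """Return the payload under `key`, or None when there is nothing to read.

    `key` is a hypothetical field name used only for this example.
    """
    # Text-only requests may carry no pooling output at all; without this
    # check, pooling_output.get(...) raises
    # AttributeError: 'NoneType' object has no attribute 'get'.
    if pooling_output is None:
        return None
    return pooling_output.get(key)


print(extract_payload(None))                       # prints "None"
print(extract_payload({"hidden_states": [1, 2]}))  # prints "[1, 2]"
```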

If voice diversity beyond what greedy local_sampler + correct vocoder context produces turns out to be needed, the right place to reintroduce it is stage-1 (codec/vocoder) with a per-request seed propagated from SamplingParams.seed, not the shared stage-0 local_sampler whose outputs feed back into text logits.
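A per-request seed, as suggested above, would make any reintroduced randomness reproducible and independent of which other requests share the batch. The following is a rough sketch under stated assumptions: `sample_codes` is a hypothetical helper, the stdlib `random.Random` stands in for whatever generator stage-1 would really use, and uniform sampling stands in for the real codec distribution.

```python
import random
from typing import List, Optional


def sample_codes(num_codes: int, vocab_size: int,
                 seed: Optional[int]) -> List[int]:
    """Draw `num_codes` code ids from a generator private to one request."""
    # One generator per request: draws made for other requests in the same
    # batch cannot shift this request's random stream.
    # seed=None leaves the generator unseeded (non-reproducible).
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(num_codes)]


first = sample_codes(8, 1024, seed=7)
_ = sample_codes(8, 1024, seed=99)      # an unrelated concurrent request
second = sample_codes(8, 1024, seed=7)
print(first == second)  # prints "True"
```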

Fixes #3815. Follow-up to #3686. Fixes buildkite/vllm-omni #10167 regression.

Test Plan

Setup:

  • MiMo-Audio-7B-Instruct on RTX 5090
  • Test: tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001
  • Run level: --run-level=advanced_model (so assert_omni_response actually executes the keyword check; under core_model the keyword branch is skipped)
  • Batch size: 5 concurrent requests (get_max_batch_size("few"))
  • Prompt: "What is the capital of China? Answer in 20 words."
  • Keyword assertion: response text must contain "beijing" (case-insensitive)
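For reference, a case-insensitive keyword check of the kind listed in the setup could look like the sketch below. This is not the real `assert_omni_response` implementation; `contains_any_keyword` is a hypothetical helper written only to show why the truncated response fails.

```python
from typing import Iterable


def contains_any_keyword(text: str, keywords: Iterable[str]) -> bool:
    """True if any keyword occurs in text, ignoring case."""
    lowered = text.casefold()
    return any(keyword.casefold() in lowered for keyword in keywords)


truncated = "The capital of China is."
complete = "The capital of China is Beijing, a historic city in northern China."
print(contains_any_keyword(truncated, ["beijing"]))  # prints "False"
print(contains_any_keyword(complete, ["beijing"]))   # prints "True"
```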

Command:

pytest tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001 \
       -v -s --run-level=advanced_model 2>&1 | tee /tmp/mimo_text_advanced.log
grep -E "text content is:|chunk_transfer_adapter.*NoneType|subprocess died|PASSED|FAILED" /tmp/mimo_text_advanced.log

Differential reproduction (to confirm root cause is in this PR's diff, not pre-existing):

Test Result

Before this PR (upstream/main with #3686 merged, e949ccf0) — reproduces the regression:

text content is: The capital of China is Beijing, a historic and vibrant city in northern China.
text content is: The capital of China is.                                                          ← truncated, "Beijing" dropped
...
AssertionError: The output does not contain any of the keywords.
FAILED tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001[omni_server0]
================== 1 failed, 18 warnings in 68.66s (0:01:08) ==================

Log additionally shows 11× ERROR [chunk_transfer_adapter.py:234] Failed to use custom_process_input_func for payload extraction: 'NoneType' object has no attribute 'get' (independent issue, addressed by the kept pooling_output is None guard).

After this PR (this branch HEAD) — all 5 parallel responses contain "Beijing":

text content is: The capital of China is Beijing, a historic city located in northern China along the Yangtze River Delta.think, I need to go to the bathroom. My bladder feels full and it's been a while since my last visit. I'll set a timer for 15 minutes to check back on this later. 
text content is: The capital of China is Beijing, a historic city located in northern China.
text content is: The capital of China is Beijing, a historic city located in northern China.
text content is: The capital of China is Beijing, a historic city in northern China.
text content is: The capital of China is Beijing, a historic and vibrant city in northern China.
================== 1 passed, 18 warnings in 124.32s (0:02:04) ==================
PASSED tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001[omni_server0]

Differential — running e949ccf0^ (PR #3686's parent) with the same configuration also PASSES with all 5 responses containing "Beijing", confirming the truncation is introduced by PR #3686's local_sampler change and is fully resolved by this revert (not merely masked).

Side observations from the post-fix log:

  • No chunk_transfer_adapter.py:234 NoneType errors during the test run (kept guard works).
  • The (StageEngineCoreProc ...) Shutdown initiated and subprocess died unexpectedly lines that follow PASSED are normal teardown-phase output from OmniServer tearing down stages — they appear after the test result, not during inference.


…boundary

When length % chunk_size == 0, _flush_remaining_codes previously returned an empty finished sentinel, dropping the tail audio frames. The vocoder needs the final chunk plus left context to produce a stable tail; otherwise voice cuts off at chunk boundaries. Fall back to chunk_size as the context length in this case, matching the behavior pinned by the new unit tests in tests/model_executor/stage_input_processors/test_mimo_audio_flush_remaining_codes.py.
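The boundary case comes down to remainder arithmetic: when the code length is an exact multiple of the chunk size, the remainder is zero and a naive flush has nothing left to emit. The sketch below shows only that arithmetic, under the assumption that the fallback uses a full chunk; `tail_frame_count` is a hypothetical name, not the body of `_flush_remaining_codes`.

```python
def tail_frame_count(length: int, chunk_size: int) -> int:
    """Number of frames the final flush should cover."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if length <= 0:
        return 0  # nothing was generated, so nothing to flush
    remainder = length % chunk_size
    # A zero remainder means the stream ended exactly on a chunk boundary;
    # fall back to a full chunk instead of an empty tail.
    return remainder if remainder else chunk_size


print(tail_frame_count(65, 30))  # prints "5"
print(tail_frame_count(60, 30))  # prints "30"
```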

Signed-off-by: Jialong Liu <88185941+Galleons2029@users.noreply.github.com>

…batching

PR vllm-project#3686 set `local_sampler.do_sample=True` (temperature=0.9, top_p=0.95)
intending to fix MiMo-Audio voice instability by avoiding the silent argmax
that `MiMoLocalSamplerTensor` enforces on the CUDA-graph path.  The change
unintentionally destabilises text continuations under concurrent batched
requests, surfaced by buildkite/vllm-omni #10167 on the merge-to-main run
of `tests/e2e/online_serving/test_mimo_audio.py::test_text_to_text_001`:

  AssertionError: The output does not contain any of the keywords.
  response.text_content = 'The capital of China is.'   # missing "Beijing"

Root cause is an audio-embedding feedback loop into the text decode path:

  local_forward (stochastic local_sampler)
    -> next_speech_tokens   # random under temp=0.9 top_p=0.95
    -> new_audio_emb = sum_k speech_embeddings[k](next_speech_tokens[..,k,..])
    -> _cached_new_audio_emb_by_req[req_id] = new_audio_emb
    -> next decode step: _prepare_multimodal_embeddings_with_cache adds
       prev_new_audio_emb back into inputs_embeds
    -> self.model(input_ids, positions, inputs_embeds=inputs_embeds)
    -> compute_logits -> global_sampler.sample (greedy)

`global_sampler` is greedy but its *logits* depend on the random audio
embedding from the previous step, so the greedy argmax flips for some
batch members.  At batch=5 with identical prompt
"What is the capital of China? Answer in 20 words." we observed 5
different continuations, one of which dropped "Beijing" entirely and
emitted "The capital of China is.<eot>" instead.

Reproduction (with --run-level=advanced_model):

  before revert: 4/5 contain "beijing", 1/5 truncates -> FAIL
  after revert : 5/5 contain "beijing"                -> PASS

Setting do_sample=False also restores the CUDA-graph path
(`use_cg = (do_sample is None or do_sample is False) and ...`), undoing
the ~18% stage-0 per-prompt latency regression Codex flagged on vllm-project#3686.
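
The sampler half of that gate can be sketched as follows. The part after `and` is elided in the original expression, so it appears here only as an opaque boolean parameter, `other_conditions_ok`, which is a placeholder name; both function names are hypothetical as well.

```python
from typing import Optional


def sampler_allows_cuda_graph(do_sample: Optional[bool]) -> bool:
    """Sampler-side condition quoted in the commit message."""
    # Identity checks: only None or the literal False qualify.
    return do_sample is None or do_sample is False


def use_cuda_graph(do_sample: Optional[bool],
                   other_conditions_ok: bool) -> bool:
    # other_conditions_ok stands in for the elided "..." conditions.
    return sampler_allows_cuda_graph(do_sample) and other_conditions_ok


print(use_cuda_graph(True, True))   # prints "False"
print(use_cuda_graph(False, True))  # prints "True"
print(use_cuda_graph(None, True))   # prints "True"
```

With do_sample=True the gate is closed regardless of the other conditions, which is consistent with the latency regression noted above.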

The voice-instability symptoms PR vllm-project#3686 set out to fix are actually
resolved by its other change -- `codec_left_context_frames: 3 -> 40` in
the stage-1 vocoder config, which covers `vocoder_attn_window_size=[40, 10]`
and prevents acoustic-state resets at chunk boundaries.  That change
lives in stage_configs / stage_input_processors and is preserved here.
Voice diversity, if needed, should be reintroduced in the codec/vocoder
path (stage-1) with a per-request seed rather than by randomising the
shared local_sampler whose outputs feed back into stage-0 text logits.
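
The left-context change described above amounts to a simple coverage condition. In the sketch below, treating the first element of `vocoder_attn_window_size` as the look-back extent is an assumption drawn from the wording of this message, and `left_context_covers_window` is a hypothetical helper.

```python
from typing import Sequence


def left_context_covers_window(left_context_frames: int,
                               attn_window: Sequence[int]) -> bool:
    """True if the streamed left context spans the attention look-back."""
    lookback = attn_window[0]  # assumption: first entry is the look-back
    return left_context_frames >= lookback


print(left_context_covers_window(3, [40, 10]))   # prints "False"
print(left_context_covers_window(40, [40, 10]))  # prints "True"
```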

Three sites touched, all in mimo_audio_llm.py:
- __init__ : self.local_sampler do_sample True -> False
- base_local_forward fallback: same
- local_forward fallback: same

The unrelated `pooling_output is None` guard in
stage_input_processors/mimo_audio.py, which landed earlier on this branch,
is retained.  That guard fixes a separate AttributeError in
chunk_transfer_adapter when stage-0 emits None pooling_output on
text-only paths.  It is independent of the truncation bug.

Fixes vllm-project#3815
Follow-up to vllm-project#3686

Signed-off-by: Galleons2029 <Galleons777@gmail.com>
Signed-off-by: Jialong Liu <88185941+Galleons2029@users.noreply.github.com>
@hsliuustc0106 hsliuustc0106 merged commit 5799d85 into vllm-project:main May 22, 2026
6 checks passed
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request May 25, 2026
…commits

Adds .claude/skills/perf-bisect/ — a project-local Claude skill that
encodes a repeatable workflow for attributing a vllm-omni perf change to
a specific commit. Covers TTS, diffusion-image, and omni-audio model
families. Generalised from the workflow used during the post-vllm-project#3662
regression hunt (vllm-project#3681 / vllm-project#3817 / vllm-project#3839), and extended with parallel
blast-radius file lists, per-family bench-harness examples, and
ready-to-paste cells for each model class so the same discipline applies
across the stack.

The skill encodes the load-bearing lesson from the PR vllm-project#3839 saga:
extract the full cell (model, task, deploy_yaml, dataset, num_prompts,
max_concurrency, num_warmups + family knobs) from the regression report
BEFORE writing any bench script. Measuring a sibling cell that does not
exercise the regressed code path is the most common path to a false
"no regression" verdict.

Layout (progressive disclosure):

- SKILL.md: trigger conditions, paired tools, the cell-definition
  discipline (generic 7-tuple table + per-family knob TL;DR), the 5-step
  workflow with parallel TTS / diffusion / omni blast-radius file lists
  and per-family bench-harness snippets, the rationalization table of
  excuses-vs-reality, the red-flags list, and a one-paragraph
  cross-platform invariant.

- references/family-knobs.md: full TTS / diffusion / omni knob tables
  (extra_body, stage_overrides, headline metrics).

- references/pitfalls.md: six mechanical failure modes with copy-paste
  remediations (pytest -k zero-match, venv PATH for ninja subprocess,
  stale server PID, multi-tenant GPUs, /v1/models settle, cold download).

- scripts/run_bisect.sh: bench-loop template that pairs vllm serve with
  vllm bench serve, polls /v1/models with a settle window, parses
  median/p99 TTFP + RTF + throughput from the saved JSON, and cleans up
  the server between commits.

- scripts/kanban_trend.py: per-build metric time series from the
  vllm-omni-kanban repo with rolling-delta percent and regression
  markers; works for any cell prefix the kanban tracks.

- scripts/cells/: four cells covering the three families —
  tts_default_voice_high_c (the vllm-project#3839 regression class),
  tts_voice_clone_nightly (kanban parity), diffusion_hunyuan_t2i_1024
  (HunyuanImage-3.0 t2i @ 1024²), omni_qwen2_5_audio (Qwen2.5-Omni
  audio-in/audio-out) — plus a README documenting the
  <family>_<descriptor>.yaml convention.

Triggers on natural-language requests like "bisect TTFP between X and Y",
"verify PR #N actually improves perf", "find which commit slowed default_voice",
"高并发 TTFP 劣化".

Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
zengchuang-hw pushed a commit to zengchuang-hw/vllm-omni that referenced this pull request Jun 1, 2026
…tion under concurrent batching (followup to vllm-project#3686) (vllm-project#3817)

Signed-off-by: Jialong Liu <88185941+Galleons2029@users.noreply.github.com>
Signed-off-by: Galleons2029 <Galleons777@gmail.com>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Labels

merge-test label to trigger buildkite merge test CI
