
[Feat] Support async_chunk additional_information delivery to V2 model runner#2607

Merged
tzhouam merged 4 commits into vllm-project:dev/migrate-MR-v2 from Sy0307:fix/v2-improvements-on-2522
Apr 19, 2026
Conversation

Contributor

@Sy0307 Sy0307 commented Apr 8, 2026

Purpose

Fix async_chunk mode producing garbage/short audio in V2 model runner.

Root cause: additional_information (containing thinker_decode_embeddings and thinker_output_token_ids) was never propagated from the scheduler's CachedRequestData to the runner's intermediate_buffer during decode steps. The chunk_transfer_adapter correctly polled data from SharedMemoryConnector and attached it to scheduled_cached_reqs.additional_information, but GPUModelRunner.update_requests() does not handle this field — so the data was silently dropped.

Additionally fixes three correctness issues found during review:

  1. _handle_async_chunk_updates passed raw AdditionalInformationPayload objects to intermediate_buffer.update(), which expects dict — causing AttributeError when payload is not pre-resolved
  2. Inline deserialization in scheduler only preserved list_data, silently dropping tensor_data and scalar_data entries
  3. cleanup() (sender+receiver) replaced cleanup_receiver() in the failed-KV-load path, risking race conditions with background save threads
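The missing propagation step in the root cause above can be illustrated with a minimal sketch. The class and field names (`IntermediateBuffer`, `CachedRequestData`, `additional_information`) follow the PR description, but the implementations here are simplified stand-ins, not the actual vllm_omni code:

```python
# Minimal sketch of the propagation step the runner was missing.
# IntermediateBuffer and CachedRequestData are simplified stand-ins.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedRequestData:
    req_id: str
    # e.g. {"thinker_output_token_ids": [...], "thinker_decode_embeddings": ...}
    additional_information: Optional[dict] = None


class IntermediateBuffer:
    def __init__(self):
        self._store: dict = {}

    def update(self, req_id: str, info: dict) -> None:
        # Merge semantics: later decode steps extend earlier ones.
        self._store.setdefault(req_id, {}).update(info)

    def get(self, req_id: str) -> dict:
        return self._store.get(req_id, {})


def update_requests(buffer: IntermediateBuffer,
                    cached_reqs: list) -> None:
    """The step that was missing: forward additional_information from the
    scheduler's cached requests into the runner-side buffer."""
    for req in cached_reqs:
        if req.additional_information:
            buffer.update(req.req_id, req.additional_information)


buf = IntermediateBuffer()
reqs = [CachedRequestData("r0", {"thinker_output_token_ids": [5, 7]})]
update_requests(buf, reqs)
print(buf.get("r0"))  # {'thinker_output_token_ids': [5, 7]}
```

Without the `update_requests` merge, `buf.get("r0")` stays empty even though the scheduler attached the data, which matches the silent-drop behavior described above.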

Test Plan

  • Qwen3-Omni-30B async_chunk end-to-end (use_audio query, 2xH20)
  • ASR verification: Whisper transcription matches expected text output
  • Audio duration: 22.04s (previously 4.57s with the bug)
  • Sync mode regression check (non-async_chunk path unchanged)

Test Result

Before fix: Talker sees thinker_output_token_ids=[], thinker_decode_embeddings=None -> early EOS after ~33 decode steps -> 4.57s noise audio

After fix: Talker correctly receives incremental thinker data -> 329+ decode steps -> 22.04s audio, ASR output:

"The audio contains a man reciting the nursery rhyme Mary had a little lamb. He begins by saying the first words I spoke in the original phonograph before reciting the rhyme. Mary had a little lamb. Its fleece was white as snow, and everywhere that Mary went, the lamb was sure to go."

@Sy0307 Sy0307 changed the title [Bugfix] Fix async_chunk additional_information delivery to V2 model runner [Feat] Support async_chunk additional_information delivery to V2 model runner Apr 8, 2026
[Feat] Support async_chunk additional_information delivery to V2 model runner

- Add update_requests() to OmniGPUModelRunner to propagate
  additional_information from scheduler to intermediate_buffer
- Use _resolve_additional_information for AdditionalInformationPayload
  deserialization in both AR and generation runners
- Revert cleanup() to cleanup_receiver() for concurrent safety
- Fix _safe_get_rope control flow (remove exception-as-goto pattern)
- Add Talker M-RoPE fallback returning 3D sequential positions

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 force-pushed the fix/v2-improvements-on-2522 branch from 4e80cc4 to a6ef196 on April 8, 2026 19:33
@Sy0307 Sy0307 marked this pull request as ready for review April 13, 2026 15:56
@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 13, 2026 15:56

@Sy0307
Contributor Author

Sy0307 commented Apr 13, 2026

cc @Fattysand @tzhouam

(Comment attached to this scheduler code context:)

    and isinstance(getattr(add_info, "entries"), dict)
    ):
    request.additional_information = deserialize_additional_information(add_info)
    from vllm_omni.worker_v2.model_states.intermediate_buffer import (
Collaborator


Does this change keep MR V1 runnable?

Collaborator

@lishunyang12 lishunyang12 left a comment


Review Summary

The core goal — propagating additional_information from OmniCachedRequestData to the V2 model runner's intermediate_buffer — is correct and clearly needed. The in-place update approach in OmniGenerationModelRunner is a good simplification over the previous remove+re-add cycle. The M-RoPE dimensionality check in omni_model_state.py is a sensible hardening.

However, there are two issues that should be addressed before merging:


1. Double intermediate-buffer update in OmniGenerationModelRunner (correctness/perf)

OmniGenerationModelRunner.execute_model calls:

  1. _handle_async_chunk_updates(scheduler_output) — which resolves and merges additional_information into intermediate_buffer
  2. self.update_requests(scheduler_output) — which now (via the new OmniGPUModelRunner.update_requests) also resolves and merges the exact same additional_information into intermediate_buffer

Since OmniGenerationModelRunner inherits OmniGPUModelRunner.update_requests without overriding it, every async_chunk cached request gets _resolve_additional_information + intermediate_buffer.update called twice with the same data. While the merge is idempotent for dict values, the tensor .detach().cpu().contiguous() clone path in intermediate_buffer.update() runs twice per tensor per request per step, which is needless GPU-to-CPU traffic.

Suggested fix: Either (a) override update_requests in OmniGenerationModelRunner to skip the additional_information merge (since _handle_async_chunk_updates already handles it), or (b) remove the intermediate-buffer update from _handle_async_chunk_updates and let the inherited update_requests be the single source of truth.
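Option (a) could look roughly like the sketch below. The class and method names follow the review discussion, but the signatures (in particular the `merge_additional_info` flag and dict-based scheduler output) are invented for illustration and do not match the real vllm_omni code:

```python
# Hypothetical sketch of option (a): merge additional_information exactly
# once per step. Names follow the review text; signatures are assumptions.
class OmniGPUModelRunner:
    def __init__(self):
        self.intermediate_buffer = {}
        self.merge_calls = 0  # instrumentation for this sketch only

    def update_requests(self, scheduler_output, merge_additional_info=True):
        for req in scheduler_output:
            # ... normal cached-request bookkeeping would go here ...
            if merge_additional_info and req.get("additional_information"):
                self.merge_calls += 1
                self.intermediate_buffer.setdefault(req["req_id"], {}).update(
                    req["additional_information"])


class OmniGenerationModelRunner(OmniGPUModelRunner):
    def _handle_async_chunk_updates(self, scheduler_output):
        # This path already resolves and merges additional_information.
        for req in scheduler_output:
            if req.get("additional_information"):
                self.merge_calls += 1
                self.intermediate_buffer.setdefault(req["req_id"], {}).update(
                    req["additional_information"])

    def execute_model(self, scheduler_output):
        self._handle_async_chunk_updates(scheduler_output)
        # Skip the inherited merge so each tensor is copied once per step.
        self.update_requests(scheduler_output, merge_additional_info=False)


runner = OmniGenerationModelRunner()
out = [{"req_id": "r0", "additional_information": {"tok": [1, 2]}}]
runner.execute_model(out)
print(runner.merge_calls)  # 1, not 2
```

Without the flag (or an equivalent override), `merge_calls` would be 2 per request per step, which is the duplicated `.detach().cpu().contiguous()` traffic described above.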


2. _resolve_additional_information drops scalar_data entries — potential regression in scheduler

The PR replaces deserialize_additional_information (from serialization.py) with _resolve_additional_information (from intermediate_buffer.py) in omni_ar_scheduler.py:_free_request. However, these two functions are not equivalent:

  • deserialize_additional_information handles tensor_data, list_data, and scalar_data entries.
  • _resolve_additional_information only handles tensor_data and falls through to getattr(entry, "list_data", None) for everything else — scalar_data entries become None.

This is a regression for the scheduler path. If any additional_information entry uses scalar_data, it will be silently dropped after this change. The PR description itself lists "silently dropping tensor_data and scalar_data entries" as a bug being fixed (issue #2), but the function being switched to has the same gap.

Suggested fix: Add scalar_data handling to _resolve_additional_information in intermediate_buffer.py, matching what deserialize_additional_information does:

tensor_data = getattr(entry, "tensor_data", None)
if tensor_data is not None:
    ...
    info[k] = torch.from_numpy(arr.copy())
elif getattr(entry, "list_data", None) is not None:
    info[k] = entry.list_data
elif getattr(entry, "scalar_data", None) is not None:
    info[k] = entry.scalar_data
else:
    info[k] = None

Minor / style

  • The inline from ... import _resolve_additional_information inside _handle_async_chunk_updates and update_requests is fine for avoiding circular imports, but since both methods are in files that already import from the same package at module level, consider hoisting the import to the top of each file if there is no actual circular dependency. This would be a minor readability improvement.

  • The cleanup_receiver change (issue #3 in the PR description) is sound — only cleaning the receiver side in update_from_output avoids race conditions with background sender threads.
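The receiver-only cleanup rationale can be demonstrated with a toy connector. `Connector`, `cleanup_receiver`, and the buffer fields here are invented for the example; the real SharedMemoryConnector API may differ:

```python
# Illustrative sketch: cleaning only the receiver side leaves an in-flight
# background save untouched. All names here are invented for the example.
import threading
import time


class Connector:
    def __init__(self):
        self.sender_buf = ["pending-save"]
        self.receiver_buf = ["polled-chunk"]
        self.lock = threading.Lock()

    def cleanup_receiver(self):
        with self.lock:
            self.receiver_buf.clear()  # safe: sender state untouched

    def cleanup(self):
        with self.lock:
            self.receiver_buf.clear()
            self.sender_buf.clear()  # would race with a background save


conn = Connector()


def background_save():
    time.sleep(0.01)  # simulate an in-flight KV save
    with conn.lock:
        # Would silently lose work if cleanup() had wiped sender_buf first.
        assert conn.sender_buf, "sender buffer cleared mid-save"


t = threading.Thread(target=background_save)
t.start()
conn.cleanup_receiver()  # failed-KV-load path cleans only its own side
t.join()
print(conn.sender_buf)  # ['pending-save'] still intact
```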

Overall this is a well-motivated fix with good test evidence. Requesting changes only for the two functional issues above.

Sy0307 added 2 commits April 19, 2026 03:33
- Add scalar_data branch to _resolve_additional_information to match
  deserialize_additional_information (scheduler path was silently
  dropping scalar entries)
- Remove duplicate intermediate_buffer.update in
  _handle_async_chunk_updates; inherited update_requests is the single
  source of truth (avoids double CPU clone per tensor per step)
- Hoist inline imports of _resolve_additional_information to module top

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307
Contributor Author

Sy0307 commented Apr 18, 2026

I have fixed all the issues. Please merge this so we can check on the model V2 migration progress later. @tzhouam

@tzhouam tzhouam merged commit b305789 into vllm-project:dev/migrate-MR-v2 Apr 19, 2026
2 checks passed