[Feat] Support async_chunk additional_information delivery to V2 model runner#2607
Conversation
…nner
- Add update_requests() to OmniGPUModelRunner to propagate additional_information from scheduler to intermediate_buffer
- Use _resolve_additional_information for AdditionalInformationPayload deserialization in both AR and generation runners
- Revert cleanup() to cleanup_receiver() for concurrent safety
- Fix _safe_get_rope control flow (remove exception-as-goto pattern)
- Add Talker M-RoPE fallback returning 3D sequential positions

Signed-off-by: Sy03 <1370724210@qq.com>
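The "3D sequential positions" fallback mentioned in the commit can be sketched roughly as below; the function name and the pure-Python shape are illustrative stand-ins (the real code would build torch tensors), not the actual vLLM-Omni implementation:

```python
# Illustrative sketch only; names here are hypothetical. "3D sequential
# positions" is taken to mean the same 0..seq_len-1 range replicated across
# the three M-RoPE axes (temporal, height, width).
def mrope_sequential_fallback(seq_len):
    pos = list(range(seq_len))
    # one copy per axis -> shape (3, seq_len)
    return [pos[:], pos[:], pos[:]]

positions = mrope_sequential_fallback(4)
print(positions)  # [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
```

Such a fallback gives text-only requests well-defined M-RoPE positions when no multimodal grid is available.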
4e80cc4 to a6ef196
    and isinstance(getattr(add_info, "entries"), dict)
):
    request.additional_information = deserialize_additional_information(add_info)
from vllm_omni.worker_v2.model_states.intermediate_buffer import (
Does this change keep MR V1 runnable?
lishunyang12 left a comment
Review Summary
The core goal — propagating additional_information from OmniCachedRequestData to the V2 model runner's intermediate_buffer — is correct and clearly needed. The in-place update approach in OmniGenerationModelRunner is a good simplification over the previous remove+re-add cycle. The M-RoPE dimensionality check in omni_model_state.py is a sensible hardening.
However, there are two issues that should be addressed before merging:
1. Double intermediate-buffer update in OmniGenerationModelRunner (correctness/perf)
OmniGenerationModelRunner.execute_model calls:
1. `_handle_async_chunk_updates(scheduler_output)` — which resolves and merges `additional_information` into `intermediate_buffer`
2. `self.update_requests(scheduler_output)` — which now (via the new `OmniGPUModelRunner.update_requests`) also resolves and merges the exact same `additional_information` into `intermediate_buffer`
Since OmniGenerationModelRunner inherits OmniGPUModelRunner.update_requests without overriding it, every async_chunk cached request gets _resolve_additional_information + intermediate_buffer.update called twice with the same data. While the merge is idempotent for dict values, the tensor .detach().cpu().contiguous() clone path in intermediate_buffer.update() runs twice per tensor per request per step, which is needless GPU-to-CPU traffic.
Suggested fix: Either (a) override update_requests in OmniGenerationModelRunner to skip the additional_information merge (since _handle_async_chunk_updates already handles it), or (b) remove the intermediate-buffer update from _handle_async_chunk_updates and let the inherited update_requests be the single source of truth.
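Option (a) can be sketched with toy stand-ins — only the class and method names mirror the PR; all bodies below are hypothetical, and a counter on the buffer makes the one-merge-per-step invariant explicit:

```python
# Toy model of the double-update problem and fix (a). Real classes live in
# vllm_omni; everything here is a simplified stand-in.
class IntermediateBuffer:
    def __init__(self):
        self.updates = 0  # counts merges, to expose a double update
        self.data = {}

    def update(self, req_id, info):
        self.updates += 1
        self.data.setdefault(req_id, {}).update(info)


class OmniGPUModelRunner:
    def __init__(self):
        self.intermediate_buffer = IntermediateBuffer()

    def update_requests(self, cached_reqs):
        # Base behavior: merge additional_information for every cached request.
        for req_id, info in cached_reqs:
            if info:
                self.intermediate_buffer.update(req_id, info)


class OmniGenerationModelRunner(OmniGPUModelRunner):
    def execute_model(self, cached_reqs):
        self._handle_async_chunk_updates(cached_reqs)
        self.update_requests(cached_reqs)

    def _handle_async_chunk_updates(self, cached_reqs):
        for req_id, info in cached_reqs:
            if info:
                self.intermediate_buffer.update(req_id, info)

    def update_requests(self, cached_reqs):
        # Option (a): skip the additional_information merge here, since
        # _handle_async_chunk_updates already performed it this step.
        pass


runner = OmniGenerationModelRunner()
runner.execute_model([("r0", {"k": 1})])
print(runner.intermediate_buffer.updates)  # 1, not 2
```

Without the override, the same data would be merged (and, in the real code, `.detach().cpu().contiguous()`-cloned) twice per step.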
2. _resolve_additional_information drops scalar_data entries — potential regression in scheduler
The PR replaces deserialize_additional_information (from serialization.py) with _resolve_additional_information (from intermediate_buffer.py) in omni_ar_scheduler.py:_free_request. However, these two functions are not equivalent:
- `deserialize_additional_information` handles `tensor_data`, `list_data`, and `scalar_data` entries.
- `_resolve_additional_information` only handles `tensor_data` and falls through to `getattr(entry, "list_data", None)` for everything else — `scalar_data` entries become `None`.
This is a regression for the scheduler path. If any additional_information entry uses scalar_data, it will be silently dropped after this change. The PR description itself lists "silently dropping tensor_data and scalar_data entries" as a bug being fixed (issue #2), but the function being switched to has the same gap.
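The gap is easy to reproduce with a toy stand-in (`SimpleNamespace` in place of the real payload entry class; the resolver below mirrors only the branching the review describes):

```python
# Toy reproduction of the fallthrough: an entry carrying only scalar_data
# resolves to None, because the resolver checks tensor_data and then falls
# back unconditionally to list_data.
from types import SimpleNamespace

def resolve_entry(entry):
    tensor_data = getattr(entry, "tensor_data", None)
    if tensor_data is not None:
        return tensor_data
    return getattr(entry, "list_data", None)  # scalar_data never consulted

entry = SimpleNamespace(tensor_data=None, list_data=None, scalar_data=3.14)
print(resolve_entry(entry))  # None -> the scalar is silently dropped
```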
Suggested fix: Add scalar_data handling to _resolve_additional_information in intermediate_buffer.py, matching what deserialize_additional_information does:
    tensor_data = getattr(entry, "tensor_data", None)
    if tensor_data is not None:
        ...
        info[k] = torch.from_numpy(arr.copy())
    elif getattr(entry, "list_data", None) is not None:
        info[k] = entry.list_data
    elif getattr(entry, "scalar_data", None) is not None:
        info[k] = entry.scalar_data
    else:
        info[k] = None
Minor / style
- The inline `from ... import _resolve_additional_information` inside `_handle_async_chunk_updates` and `update_requests` is fine for avoiding circular imports, but since both methods are in files that already import from the same package at module level, consider hoisting the import to the top of each file if there is no actual circular dependency. This would be a minor readability improvement.
- The `cleanup_receiver` change (issue #3 in the PR description) is sound — only cleaning the receiver side in `update_from_output` avoids race conditions with background sender threads.
Overall this is a well-motivated fix with good test evidence. Requesting changes only for the two functional issues above.
- Add scalar_data branch to _resolve_additional_information to match deserialize_additional_information (scheduler path was silently dropping scalar entries)
- Remove duplicate intermediate_buffer.update in _handle_async_chunk_updates; inherited update_requests is the single source of truth (avoids double CPU clone per tensor per step)
- Hoist inline imports of _resolve_additional_information to module top

Signed-off-by: Sy03 <1370724210@qq.com>
I have fixed all the issues. Please merge it, and we can check the model V2 migration progress later. @tzhouam
Purpose
Fix async_chunk mode producing garbage/short audio in V2 model runner.
Root cause:
`additional_information` (containing `thinker_decode_embeddings` and `thinker_output_token_ids`) was never propagated from the scheduler's `CachedRequestData` to the runner's `intermediate_buffer` during decode steps. The chunk_transfer_adapter correctly polled data from SharedMemoryConnector and attached it to `scheduled_cached_reqs.additional_information`, but `GPUModelRunner.update_requests()` does not handle this field — so the data was silently dropped.

Additionally fixes three correctness issues found during review:
1. `_handle_async_chunk_updates` passed raw `AdditionalInformationPayload` objects to `intermediate_buffer.update()`, which expects `dict` — causing `AttributeError` when the payload is not pre-resolved
2. `_resolve_additional_information` fell through to `list_data`, silently dropping `tensor_data` and `scalar_data` entries
3. `cleanup()` (sender+receiver) replaced `cleanup_receiver()` in the failed-KV-load path, risking race conditions with background save threads

Test Plan
Test Result
Before fix: Talker sees `thinker_output_token_ids=[]`, `thinker_decode_embeddings=None` -> early EOS after ~33 decode steps -> 4.57s noise audio

After fix: Talker correctly receives incremental thinker data -> 329+ decode steps -> 22.04s audio, ASR output: