[Frontend][Model][Qwen3-Omni] Enable realtime async-chunk commit bridge #3654
indevn wants to merge 1 commit into
Signed-off-by: indevn <indevn@outlook.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82a8792097
    if new_tokens_len:
        input_stream.put_nowait(new_token_ids)
Stop queueing tokens when async bridge has no consumer
In the async-chunk bridge path, _run_async_chunk_bridge_generation creates a fresh input_stream queue but never passes it to any consumer, while _consume_generation_outputs still enqueues new_token_ids on every chunk. That means each realtime request accumulates all generated token-id lists in memory until completion, so long responses (or many concurrent sessions) can cause avoidable memory growth and eventually OOM pressure. Guard this enqueue behind a consumer check, or skip it for the async-chunk bridge path.
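The guard suggested above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the helper name `forward_new_tokens` and the convention of passing `None` when no consumer is attached are assumptions made for illustration; only `input_stream`, `new_token_ids`, and `put_nowait` come from the reviewed snippet.

```python
import asyncio
from typing import Optional


def forward_new_tokens(
    input_stream: Optional[asyncio.Queue],
    new_token_ids: list[int],
) -> bool:
    """Enqueue new token ids only when a consumer owns the queue.

    Hypothetical helper: the caller passes None on the async-chunk
    bridge path, where nothing reads from the queue.
    Returns True if the tokens were enqueued, False if they were skipped.
    """
    if input_stream is None or not new_token_ids:
        # No consumer (or nothing new): skip, so the queue cannot grow unbounded.
        return False
    input_stream.put_nowait(new_token_ids)
    return True
```

With this shape, the bridge path would call `forward_new_tokens(None, new_token_ids)` and nothing is retained, while the streaming-input path keeps passing its real queue.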
@Shirley125 PTAL
@indevn Hi, thanks for the PR. From what I understand, this PR adds temporary compatibility support for the Realtime API when async chunking is enabled. PR #3614 is planned to be merged in the 0.22 release and is intended to provide full Realtime API support under async chunk mode. Because of that, the need to add an intermediate compatibility bridge here may be somewhat limited. Perhaps we could instead discuss and collaborate on other optimizations around audio streaming input :)
Purpose
Enable the OpenAI-compatible `/v1/realtime` WebSocket path for Qwen3-Omni when `async_chunk` is enabled.

Before this PR, the API server rejected realtime sessions whenever `engine_client.async_chunk` was true, even though Qwen3-Omni's default deployment path uses async chunking. This PR removes that hard guard and adds a realtime async-chunk bridge:

- `async_chunk: false` keeps the existing realtime streaming-input path.
- `async_chunk: true` buffers `input_audio_buffer.append` audio, ignores non-final commits, and starts one normal multimodal Qwen3-Omni request after `input_audio_buffer.commit` with `final: true`.
- `transcription.*` and `response.audio.*` realtime events.

This is intentionally a commit-then-generate compatibility bridge. It does not implement early-start streaming input or prompt extension for async chunking; that broader scope is being explored separately in #3614.
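The commit-then-generate behavior described above can be sketched as a small per-session buffer. This is an illustrative sketch under stated assumptions, not the PR's implementation: the class name `CommitBridgeBuffer` and the `handle_event` method are hypothetical, and the `audio` field is assumed to carry base64-encoded audio as in the OpenAI Realtime API. Only the event names and the `final` flag come from the description.

```python
import base64
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CommitBridgeBuffer:
    """Hypothetical per-session buffer for a commit-then-generate bridge."""

    chunks: list[bytes] = field(default_factory=list)

    def handle_event(self, event: dict) -> Optional[bytes]:
        """Return the buffered audio once a final commit arrives, else None."""
        event_type = event.get("type")
        if event_type == "input_audio_buffer.append":
            # Assumption: "audio" holds base64-encoded audio bytes.
            self.chunks.append(base64.b64decode(event["audio"]))
            return None
        if event_type == "input_audio_buffer.commit":
            if not event.get("final", False):
                # Non-final commits are ignored; keep buffering.
                return None
            audio = b"".join(self.chunks)
            self.chunks.clear()
            # The caller would start one normal multimodal request with this audio.
            return audio
        return None
```

The point of this shape is that generation starts exactly once per final commit, which is what separates the bridge from the early-start streaming input explored in #3614.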
Test Plan
Test Result
The realtime e2e covers both the new async-chunk bridge and the existing `--no-async-chunk` realtime path.

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model. Please run `mkdocs serve` to sync the documentation editions to `./docs`.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md