[Feature] add session based audio streaming input #2208
hsliuustc0106 merged 8 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb986338f3
princepride left a comment:
In my opinion, streaming input should emphasize the ability to interrupt the output, which is the fundamental reason vLLM supports streaming input. However, as I understand it, the current code does not seem to support this feature.

Thanks, I've updated the PR.
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99f58381fd
```python
audio_placeholder = Qwen3OmniMoeThinkerForConditionalGeneration.get_placeholder_str("audio", 0)
prompt_template = f"<|im_start|>user\n{audio_placeholder}<|im_end|>\n<|im_start|>assistant\n"

prompt_token_ids = tokenizer.encode(prompt_template)
```
Incorporate input_stream context in realtime audio prompts
buffer_realtime_audio() receives input_stream for carrying prior generated token IDs, but prompt construction uses a fixed prompt_token_ids template and never consumes that queue. As a result, each audio segment is transcribed without prior output context, breaking session-style cumulative prompting across incremental realtime updates.
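A minimal sketch of what consuming that queue could look like (assuming input_stream is an asyncio.Queue of prior output token-ID lists; the exact splice point depends on the chat template):

```python
import asyncio

# Hypothetical sketch, not this PR's code: drain the session's input_stream
# of prior generated token IDs and splice them into the next realtime prompt.
async def build_prompt_token_ids(tokenizer, prompt_template: str,
                                 input_stream: asyncio.Queue) -> list[int]:
    prior_ids: list[int] = []
    while not input_stream.empty():
        prior_ids.extend(input_stream.get_nowait())  # token IDs from earlier segments
    # Prepend the accumulated context so each segment is transcribed with history.
    return prior_ids + tokenizer.encode(prompt_template)
```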
Referencing vLLM's qwen3_asr_realtime model and following vLLM's optimization strategies: vllm-project/vllm#35767
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
```python
request.external_req_id = request_id

# Register with stage 0's output processor.
output_prompt_text = prompt_text
```
Please confirm whether this is necessary?
Yes, this aligns with vLLM: the prompt in RequestState within output_processor.py must be of type str; otherwise, apply_streaming_update in vLLM will raise an error.
```python
async def _handle_streaming_update(self, msg: dict[str, Any]) -> None:
    """Handle a streaming_update message for an existing request."""
    stage_id = 0
```
Should we pass this parameter in instead of hardcoding it?
Same as the _handle_add_request logic: the request first goes through stage 0 and is then passed to the downstream stage via _forward_to_next_stage.
@Sy0307 @lishunyang12 PTAL
```python
audio transcription by uploading an audio file.

Before running this script, you must start the vLLM-Omni server with a realtime-capable
model, for example:
```
Please update the README.md.
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
LGTM. I have tested it on c6396fc before.

@princepride PTAL

Thank you very much for your contribution, LGTM
```python
    info=Qwen3OmniMoeThinkerProcessingInfo,
    dummy_inputs=Qwen3OmniMoeThinkerDummyInputsBuilder,
)
class Qwen3OmniMoeForConditionalGeneration(
```
Consider inheriting from SupportsRealtime:
https://github.com/vllm-project/vllm/blob/551b3fb39f3a95ff3dc3feca9528ab4c90649316/vllm/model_executor/models/qwen3_asr_realtime.py#L179
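A minimal sketch of that suggestion (the import path is an assumption; verify it against the upstream vLLM version linked above):

```python
import torch.nn as nn

# Assumed import path for the upstream interface; check the vLLM version in use.
from vllm.model_executor.models.interfaces import SupportsRealtime

class Qwen3OmniMoeForConditionalGeneration(nn.Module, SupportsRealtime):
    ...
```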
UTs have been added; CI will be considered in the next PR.
| "arrival_time": arrival_time, | ||
| } | ||
| if resumable: | ||
| process_kwargs["resumable"] = True |
There was a problem hiding this comment.
why not just resumable=resumable
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
Includes:
- ca02351 [skip ci][Bugfix] clean useless log (vllm-project#2450)
- 50bb47a [Test] Skip zimage expansion test (vllm-project#2454)
- 728cf6d [Feature] add session based audio streaming input (vllm-project#2208)
- 6211413 Update MRoPE config fallback logic (vllm-project#2278)
- 56 upstream commits pulled in
- Adds /v1/realtime WebSocket endpoint (PR vllm-project#2208, Qwen3-Omni)
- Registry, scheduler, diffusion model updates
- Conflict resolved: README.md kept fork header

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… audio out)
New endpoint: /v1/omni WebSocket for full-duplex omni conversation.
- Client sends PCM16 16kHz audio as binary frames + {type: input.done}
- Server streams WAV audio chunks as binary frames + transcript.delta JSON
- Reuses OmniOpenAIServingChat.create_chat_completion() (proven REST path)
- Multi-turn conversation history maintained per session
- request_id pre-generated before generator iteration to enable abort on
early client disconnect without GPU resource leak
Protocol:
session.config (text) -> binary PCM16 frames -> input.done (text)
turn.start -> transcript.delta -> audio.start -> binary WAV chunks ->
audio.done -> turn.done
Each binary output frame is a self-contained WAV (format: wav_chunk), not a stream under a single concatenated WAV header.
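A hedged client-side sketch of that flow (using the third-party websockets package; endpoint and event payloads beyond the names above are assumptions):

```python
import asyncio
import json
import websockets  # assumption: any asyncio WebSocket client would work

async def omni_turn(pcm_chunks: list[bytes]) -> list[bytes]:
    """Send one turn of PCM16 16kHz audio and collect the WAV chunk replies."""
    wav_chunks: list[bytes] = []
    async with websockets.connect("ws://localhost:8000/v1/omni") as ws:
        await ws.send(json.dumps({"type": "session.config"}))  # config fields assumed
        for chunk in pcm_chunks:
            await ws.send(chunk)                 # binary PCM16 frames
        await ws.send(json.dumps({"type": "input.done"}))
        async for frame in ws:
            if isinstance(frame, bytes):
                wav_chunks.append(frame)         # each frame is a self-contained WAV
            elif json.loads(frame).get("type") == "turn.done":
                break
    return wav_chunks
```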
Why not /v1/realtime or _add_streaming_input_request:
- /v1/realtime = ASR transcription only, no audio output
- _add_streaming_input_request (PR vllm-project#2208) = ASR input streaming for Qwen3
- MiniCPM-o thinker2talker_async_chunk needs full Thinker output for
_find_tts_bound(); streaming input gives no TTFP benefit
Relates to: nextain/naia-os#216
Pattern: OmniStreamingSpeechHandler in serving_speech_stream.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
This is Phase 1 of the session-based audio streaming input RFC.
Introduce a WebSocket interface to support streaming input, aligned with the upstream vLLM implementation.
Goal: enable streaming audio input and text output for the Qwen3-Omni model in vllm-omni.
How it works
Refer to vLLM:
Audio Format
Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono channel.
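For illustration, a small helper that produces that encoding from float samples (numpy-based sketch, not part of the PR):

```python
import base64

import numpy as np

def encode_pcm16_base64(samples: np.ndarray) -> str:
    """Encode mono float samples in [-1, 1] at 16 kHz as base64 PCM16."""
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")  # little-endian int16
    return base64.b64encode(pcm16.tobytes()).decode("ascii")
```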
Protocol Overview
The prompt of each streaming request is the cumulative concatenation of all input prompts so far plus their corresponding output tokens, excluding the final sampled token from each request. All generated tokens are returned to the caller in the session's output stream.
So, for streaming inputs [A1, B1, C1], [A2, B2], [A3, B3] with sampling max tokens = 3:
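Tracing the rule above (O*, P*, Q* are illustrative names for the tokens each request generates; the final sampled token of each request, O3 and P3, is dropped from the next prompt, while all nine generated tokens are still returned on the output stream):

```
request 1 prompt: [A1, B1, C1]                                   -> generates [O1, O2, O3]
request 2 prompt: [A1, B1, C1, O1, O2, A2, B2]                   -> generates [P1, P2, P3]
request 3 prompt: [A1, B1, C1, O1, O2, A2, B2, P1, P2, A3, B3]   -> generates [Q1, Q2, Q3]
```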
Key Changes
API Layer
Add a /v1/realtime WebSocket endpoint, reusing the upstream vLLM streaming input protocol and request handling logic.
On the server side, multiple incremental streaming input requests are grouped into a single session, which continues until an input_audio_buffer.commit event with final=True is received or the connection is closed.
The client continuously receives incremental streaming outputs.
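A hedged example of the client side of such a session (the endpoint path and the input_audio_buffer.commit event with final=True come from this PR; the append event name and payload fields are assumptions modeled on the realtime-API style):

```python
import asyncio
import base64
import json
import websockets  # assumption: any asyncio WebSocket client would work

async def stream_session(pcm_chunks: list[bytes]) -> None:
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        for i, chunk in enumerate(pcm_chunks):
            # "input_audio_buffer.append" is an assumed event name; the commit
            # event with final=True ending the session is from the PR description.
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))
            await ws.send(json.dumps({
                "type": "input_audio_buffer.commit",
                "final": i == len(pcm_chunks) - 1,
            }))
        async for frame in ws:
            print(json.loads(frame))  # incremental streaming outputs
```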
Engine Layer
Introduce an add_streaming_update_async method. For streaming input requests, add a resumable flag so that the scheduler can identify streaming inputs and append historical prompts accordingly.
Reuse the existing request state.
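A rough sketch of what the resumable path implies (names and signatures are illustrative, not the actual vllm-omni API):

```python
# Illustrative only: a resumable update reuses the stored session state and
# prepends the accumulated history instead of scheduling a fresh prompt.
async def add_streaming_update_async(self, request_id: str,
                                     new_prompt_ids: list[int],
                                     resumable: bool = True) -> None:
    state = self._sessions[request_id]            # reuse the existing request state
    if resumable:
        prompt_ids = state.history_token_ids + new_prompt_ids
    else:
        prompt_ids = new_prompt_ids
    await self._engine.add_request_async(request_id, prompt_ids)
```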
Test Plan
Test Result
pass
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
Future work
Phase 1
Introduce a WebSocket interface to support streaming input, aligned with upstream vLLM implementation.
Goal: Enable streaming audio input and text output for the Qwen3-Omni model in vllm-omni. (this PR)
Phase 2
Align accuracy across different stages of Qwen3-Omni streaming input and support audio output. (in progress)
Phase 3
Support streaming input with prefix cache reuse compatibility and performance optimization.
Phase 4
Support streaming input with async chunk processing enabled.