[Feature] add session based audio streaming input #2208
hsliuustc0106 merged 8 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cb986338f3
princepride left a comment:
In my opinion, streaming input should emphasize the ability to interrupt the output, which is the fundamental reason vLLM supports streaming input. However, as I understand it, the current code does not seem to support this feature.

Thanks, I've updated the PR.
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99f58381fd
```python
audio_placeholder = Qwen3OmniMoeThinkerForConditionalGeneration.get_placeholder_str("audio", 0)
prompt_template = f"<|im_start|>user\n{audio_placeholder}<|im_end|>\n<|im_start|>assistant\n"

prompt_token_ids = tokenizer.encode(prompt_template)
```
Incorporate input_stream context in realtime audio prompts
buffer_realtime_audio() receives input_stream for carrying prior generated token IDs, but prompt construction uses a fixed prompt_token_ids template and never consumes that queue. As a result, each audio segment is transcribed without prior output context, breaking session-style cumulative prompting across incremental realtime updates.
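A minimal sketch of what consuming that queue could look like (assuming input_stream is an asyncio.Queue of prior output token-ID lists; the exact splice point depends on the chat template):

```python
import asyncio

# Hypothetical sketch, not this PR's code: drain the session's input_stream
# of prior generated token IDs and splice them into the next realtime prompt.
async def build_prompt_token_ids(tokenizer, prompt_template: str,
                                 input_stream: asyncio.Queue) -> list[int]:
    prior_ids: list[int] = []
    while not input_stream.empty():
        prior_ids.extend(input_stream.get_nowait())  # token IDs from earlier segments
    # Prepend the accumulated context so each segment is transcribed with history.
    return prior_ids + tokenizer.encode(prompt_template)
```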
Referencing vLLM's qwen3_asr_realtime model and following vLLM's optimization strategies: vllm-project/vllm#35767
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
```python
request.external_req_id = request_id

# Register with stage 0's output processor.
output_prompt_text = prompt_text
```
Please confirm whether this is necessary?
Yes, this aligns with vLLM: the prompt in RequestState within output_processor.py must be of type str; otherwise, apply_streaming_update in vLLM will raise an error.
```python
async def _handle_streaming_update(self, msg: dict[str, Any]) -> None:
    """Handle a streaming_update message for an existing request."""
    stage_id = 0
```
Should we pass this parameter in instead of hardcoding it?
Same as the _handle_add_request logic: the request first goes through stage 0 and is then passed to the downstream stage via _forward_to_next_stage.
@Sy0307 @lishunyang12 PTAL
```python
audio transcription by uploading an audio file.

Before running this script, you must start the vLLM-Omni server with a realtime-capable
model, for example:
```
Please update the README.md.
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
LGTM. I have tested it on c6396fc before.

@princepride PTAL

Thank you very much for your contribution, LGTM
```python
    info=Qwen3OmniMoeThinkerProcessingInfo,
    dummy_inputs=Qwen3OmniMoeThinkerDummyInputsBuilder,
)
class Qwen3OmniMoeForConditionalGeneration(
```
Consider inheriting from SupportsRealtime:
https://github.com/vllm-project/vllm/blob/551b3fb39f3a95ff3dc3feca9528ab4c90649316/vllm/model_executor/models/qwen3_asr_realtime.py#L179
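A minimal sketch of that suggestion (the import path is an assumption; verify it against the upstream vLLM version linked above):

```python
import torch.nn as nn

# Assumed import path for the upstream interface; check the vLLM version in use.
from vllm.model_executor.models.interfaces import SupportsRealtime

class Qwen3OmniMoeForConditionalGeneration(nn.Module, SupportsRealtime):
    ...
```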
UTs have been added; CI will be considered in the next PR.
| "arrival_time": arrival_time, | ||
| } | ||
| if resumable: | ||
| process_kwargs["resumable"] = True |
There was a problem hiding this comment.
why not just resumable=resumable
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
Includes:
- ca02351 [skip ci][Bugfix] clean useless log (vllm-project#2450)
- 50bb47a [Test] Skip zimage expansion test (vllm-project#2454)
- 728cf6d [Feature] add session based audio streaming input (vllm-project#2208)
- 6211413 Update MRoPE config fallback logic (vllm-project#2278)
- 56 upstream commits pulled in
- Adds /v1/realtime WebSocket endpoint (PR vllm-project#2208, Qwen3-Omni)
- Registry, scheduler, diffusion model updates
- Conflict resolved: README.md kept fork header

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… audio out)
New endpoint: /v1/omni WebSocket for full-duplex omni conversation.
- Client sends PCM16 16kHz audio as binary frames + {type: input.done}
- Server streams WAV audio chunks as binary frames + transcript.delta JSON
- Reuses OmniOpenAIServingChat.create_chat_completion() (proven REST path)
- Multi-turn conversation history maintained per session
- request_id pre-generated before generator iteration to enable abort on
early client disconnect without GPU resource leak
Protocol:
session.config (text) -> binary PCM16 frames -> input.done (text)
turn.start -> transcript.delta -> audio.start -> binary WAV chunks ->
audio.done -> turn.done
Each binary output frame is a self-contained WAV (format: wav_chunk), not a stream under a single concatenated WAV header.
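A hedged client-side sketch of that flow (using the third-party websockets package; endpoint and event payloads beyond the names above are assumptions):

```python
import asyncio
import json
import websockets  # assumption: any asyncio WebSocket client would work

async def omni_turn(pcm_chunks: list[bytes]) -> list[bytes]:
    """Send one turn of PCM16 16kHz audio and collect the WAV chunk replies."""
    wav_chunks: list[bytes] = []
    async with websockets.connect("ws://localhost:8000/v1/omni") as ws:
        await ws.send(json.dumps({"type": "session.config"}))  # config fields assumed
        for chunk in pcm_chunks:
            await ws.send(chunk)                 # binary PCM16 frames
        await ws.send(json.dumps({"type": "input.done"}))
        async for frame in ws:
            if isinstance(frame, bytes):
                wav_chunks.append(frame)         # each frame is a self-contained WAV
            elif json.loads(frame).get("type") == "turn.done":
                break
    return wav_chunks
```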
Why not /v1/realtime or _add_streaming_input_request:
- /v1/realtime = ASR transcription only, no audio output
- _add_streaming_input_request (PR vllm-project#2208) = ASR input streaming for Qwen3
- MiniCPM-o thinker2talker_async_chunk needs full Thinker output for
_find_tts_bound(); streaming input gives no TTFP benefit
Relates to: nextain/naia-os#216
Pattern: OmniStreamingSpeechHandler in serving_speech_stream.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
This is Phase 1 of the session-based audio streaming input RFC.
Introduce a WebSocket interface to support streaming input, aligned with the upstream vLLM implementation.
Goal: enable streaming audio input and text output for the Qwen3-Omni model in vllm-omni.
How it works
Refer to vLLM:
Audio Format
Audio must be sent as base64-encoded PCM16 audio at 16kHz sample rate, mono channel.
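For illustration, a small helper that produces that encoding from float samples (numpy-based sketch, not part of the PR):

```python
import base64

import numpy as np

def encode_pcm16_base64(samples: np.ndarray) -> str:
    """Encode mono float samples in [-1, 1] at 16 kHz as base64 PCM16."""
    pcm16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype("<i2")  # little-endian int16
    return base64.b64encode(pcm16.tobytes()).decode("ascii")
```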
Protocol Overview
The prompt of each streaming request is the cumulative concatenation of all input prompts so far plus their corresponding output tokens, excluding the final sampled token from each request. All generated tokens are returned to the caller in the session's output stream.
So, for streaming inputs [A1, B1, C1], [A2, B2], [A3, B3] with sampling max tokens = 3:
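Tracing the rule above (O*, P*, Q* are illustrative names for the tokens each request generates; the final sampled token of each request, O3 and P3, is dropped from the next prompt, while all nine generated tokens are still returned on the output stream):

```
request 1 prompt: [A1, B1, C1]                                   -> generates [O1, O2, O3]
request 2 prompt: [A1, B1, C1, O1, O2, A2, B2]                   -> generates [P1, P2, P3]
request 3 prompt: [A1, B1, C1, O1, O2, A2, B2, P1, P2, A3, B3]   -> generates [Q1, Q2, Q3]
```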
Key Changes
API Layer
Add a /v1/realtime WebSocket endpoint, reusing the upstream vLLM streaming input protocol and request handling logic.
On the server side, multiple incremental streaming input requests are grouped into a single session, which continues until an input_audio_buffer.commit event with final=True is received or the connection is closed.
The client continuously receives incremental streaming outputs.
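A hedged example of the client side of such a session (the endpoint path and the input_audio_buffer.commit event with final=True come from this PR; the append event name and payload fields are assumptions modeled on the realtime-API style):

```python
import asyncio
import base64
import json
import websockets  # assumption: any asyncio WebSocket client would work

async def stream_session(pcm_chunks: list[bytes]) -> None:
    async with websockets.connect("ws://localhost:8000/v1/realtime") as ws:
        for i, chunk in enumerate(pcm_chunks):
            # "input_audio_buffer.append" is an assumed event name; the commit
            # event with final=True ending the session is from the PR description.
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))
            await ws.send(json.dumps({
                "type": "input_audio_buffer.commit",
                "final": i == len(pcm_chunks) - 1,
            }))
        async for frame in ws:
            print(json.loads(frame))  # incremental streaming outputs
```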
Engine Layer
Introduce an add_streaming_update_async method. For streaming input requests, add a resumable flag so that the scheduler can identify streaming inputs and append historical prompts accordingly.
Reuse the existing request state.
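A rough sketch of what the resumable path implies (names and signatures are illustrative, not the actual vllm-omni API):

```python
# Illustrative only: a resumable update reuses the stored session state and
# prepends the accumulated history instead of scheduling a fresh prompt.
async def add_streaming_update_async(self, request_id: str,
                                     new_prompt_ids: list[int],
                                     resumable: bool = True) -> None:
    state = self._sessions[request_id]            # reuse the existing request state
    if resumable:
        prompt_ids = state.history_token_ids + new_prompt_ids
    else:
        prompt_ids = new_prompt_ids
    await self._engine.add_request_async(request_id, prompt_ids)
```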
Test Plan
Test Result
pass
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
Future work
Phase 1
Introduce a WebSocket interface to support streaming input, aligned with upstream vLLM implementation.
Goal: Enable streaming audio input and text output for the Qwen3-Omni model in vllm-omni. (this PR)
Phase 2
Align accuracy across different stages of Qwen3-Omni streaming input and support audio output. (in progress)
Phase 3
Support streaming input with prefix cache reuse compatibility and performance optimization.
Phase 4
Support streaming input with async chunk processing enabled.