
[Feature] add session based audio streaming input #2208

Merged
hsliuustc0106 merged 8 commits into vllm-project:main from Shirley125:realtime
Apr 2, 2026

Conversation

Contributor

@Shirley125 Shirley125 commented Mar 26, 2026


Purpose

This is Phase 1 of the session-based audio streaming input RFC.
Introduce a WebSocket interface to support streaming input, aligned with the upstream vLLM implementation.
Goal: Enable streaming audio input and text output for the Qwen3-Omni model in vllm-omni.

How it works

Refer to vllm:
Audio Format
Audio must be sent as base64-encoded PCM16 at a 16 kHz sample rate, mono channel.
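
For illustration, a minimal sketch of producing such chunks from a WAV file that is already 16 kHz mono PCM16, using only the standard library (chunk size here is an arbitrary choice, not mandated by the protocol):

```python
import base64
import wave

def pcm16_chunks(path: str, frames_per_chunk: int = 3200):
    """Yield base64-encoded PCM16 chunks (3200 frames = 200 ms at 16 kHz)."""
    with wave.open(path, "rb") as wav:
        assert wav.getnchannels() == 1, "expected mono audio"
        assert wav.getframerate() == 16000, "expected 16 kHz sample rate"
        assert wav.getsampwidth() == 2, "expected 16-bit PCM"
        while frames := wav.readframes(frames_per_chunk):
            yield base64.b64encode(frames).decode("ascii")
```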

Protocol Overview

1. Client connects to ws://host/v1/realtime
2. Server sends session.created event
3. Client optionally sends session.update with model/params
4. Client sends input_audio_buffer.append events with base64 PCM16 chunks
5. Client sends input_audio_buffer.commit when ready
6. Server sends transcription.delta events with incremental text
7. Server sends transcription.done with final text + usage
8. Repeat from step 4 for the next utterance
9. Optionally, the client sends input_audio_buffer.commit with final=True to signal that audio input is finished. Useful when streaming audio files. A minimal client sketch follows this list.
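
Here is a minimal client sketch of this flow, assuming the third-party websockets package; the payload field names (audio, delta, usage, final) follow the events above but are illustrative, not authoritative:

```python
import asyncio
import json

import websockets  # third-party: pip install websockets

async def transcribe(chunks) -> None:
    async with websockets.connect("ws://localhost:8366/v1/realtime") as ws:
        print(json.loads(await ws.recv()))  # session.created
        for chunk in chunks:  # base64-encoded PCM16 chunks
            await ws.send(json.dumps(
                {"type": "input_audio_buffer.append", "audio": chunk}))
        await ws.send(json.dumps(
            {"type": "input_audio_buffer.commit", "final": True}))
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "transcription.delta":
                print(event.get("delta", ""), end="", flush=True)
            elif event["type"] == "transcription.done":
                print("\n", event.get("usage"))
                break

if __name__ == "__main__":
    asyncio.run(transcribe([]))  # pass real base64 chunks here
```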

The prompt of each streaming request is the cumulative concatenation of all input prompts so far plus their corresponding output tokens, excluding the final sampled token from each request. All generated tokens are returned to the caller in the session's output stream.
So for streaming inputs [A1, B1, C1], [A2, B2], [A3, B3] with sampling max tokens = 3:

First prompt [A1, B1, C1] generates [D1, E1, F1]
Second prompt is [A1, B1, C1, D1, E1, A2, B2], generates [C2, D2, E2] (F1 discarded)
Third prompt is [A1, B1, C1, D1, E1, A2, B2, C2, D2, A3, B3], generates [C3, D3, E3] (E2 discarded)
Streamed output tokens would be D1, E1, F1, C2, D2, E2, C3, D3, E3
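
A toy sketch of this rule, with generate standing in for the engine and all names illustrative:

```python
def run_session(inputs, generate, max_tokens=3):
    history: list[str] = []
    streamed: list[str] = []
    for new_input in inputs:
        prompt = history + new_input
        output = generate(prompt, max_tokens)  # e.g. ["D1", "E1", "F1"]
        streamed.extend(output)                # caller sees every token
        history = prompt + output[:-1]         # final sampled token discarded
    return streamed

# run_session([["A1","B1","C1"], ["A2","B2"], ["A3","B3"]], generate)
# yields D1, E1, F1, C2, D2, E2, C3, D3, E3 as above.
```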

Key Changes

API Layer
Add a /v1/realtime WebSocket endpoint, reusing the upstream vLLM streaming input protocol and request handling logic.
On the server side, multiple incremental streaming input requests are grouped into a single session, which continues until an input_audio_buffer.commit event with final=True is received or the connection is closed.
The client continuously receives incremental streaming outputs.

Engine Layer
Introduce an add_streaming_update_async method. For streaming input requests, add a resumable flag so that the scheduler can identify streaming inputs and append historical prompts accordingly.
Reuse the existing request state.
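
As a rough sketch of the shape this could take (a hypothetical stub; the real signature in vllm_omni/engine/async_omni_engine.py may differ):

```python
class AsyncOmniEngineSketch:
    """Hypothetical stub; the real class lives in
    vllm_omni/engine/async_omni_engine.py and may differ."""

    async def add_streaming_update_async(
        self,
        request_id: str,
        prompt_token_ids: list[int],
        arrival_time: float | None = None,
        resumable: bool = False,
    ) -> None:
        """Append an incremental update to an existing streaming request.

        When resumable=True, the scheduler identifies the request as a
        streaming input and prepends the session's historical prompt
        (prior inputs + outputs, minus each turn's final sampled token)
        before scheduling the new chunk.
        """
        ...
```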

Test Plan

pytest tests/engine/test_async_omni_engine_input.py
python examples/online_serving/qwen3_omni/openai_realtime_client.py --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8366 --audio_path audio.wav

Test Result

pass

python examples/online_serving/qwen3_omni/openai_realtime_client.py --model /home/models/Qwen/Qwen3-Omni-30B-A3B-Instruct --port 8366 --audio_path audio.wav
Session created: sess-9f0d7e91c0b50924
Loading audio from: output6.wav
Sending 228 audio chunks...
Audio sent. Waiting for transcription...

Transcription: The Qwen-Omni model is capable of accepting.text and other modalities as input, and generate text.or audio responses in various human-like voice styles.Supports multilingual and dialectal speech output, applicable tocontent moderation, text creation, visual recognition,and audio-video interactive assistants, among other scenarios.

Final transcription: The Qwen-Omni model is capable of accepting.text and other modalities as input, and generate text.or audio responses in various human-like voice styles.Supports multilingual and dialectal speech output, applicable tocontent moderation, text creation, visual recognition,and audio-video interactive assistants, among other scenarios.
Usage: {'prompt_tokens': 75, 'total_tokens': 146, 'completion_tokens': 71, 'prompt_tokens_details': None}

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

Future work

Phase 1
Introduce a WebSocket interface to support streaming input, aligned with upstream vLLM implementation.
Goal: Enable streaming audio input and text output for the Qwen3-Omni model in vllm-omni. (this PR)
Phase 2
Align accuracy across different stages of Qwen3-Omni streaming input and support audio output. (in progress)
Phase 3
Support streaming input with prefix cache reuse compatibility and performance optimization.
Phase 4
Support streaming input with async chunk processing enabled.

[mermaid diagram]


@Shirley125 Shirley125 marked this pull request as draft March 26, 2026 03:39

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cb986338f3


Comment thread vllm_omni/distributed/omni_connectors/transfer_adapter/chunk_transfer_adapter.py Outdated
Comment thread vllm_omni/entrypoints/async_omni.py Outdated
Collaborator

@princepride princepride left a comment


In my opinion, streaming input should emphasize the ability to interrupt the output, which is the fundamental reason why vLLM supports streaming input. However, as I understand it, the current code does not seem to support this feature.

Comment thread vllm_omni/engine/output_processor.py Outdated
Comment thread vllm_omni/engine/orchestrator.py Outdated
Comment thread vllm_omni/engine/orchestrator.py Outdated
Comment thread vllm_omni/model_executor/stage_input_processors/qwen3_omni.py Outdated
@Shirley125
Contributor Author

Shirley125 commented Mar 31, 2026

In my opinion, streaming input should emphasize the ability to interrupt the output, which is the fundamental reason why vLLM supports streaming input. However, as I understand it, the current code does not seem to support this feature.

Thanks, I’ve updated the PR.
In Phase 1, I will directly integrate vLLM’s capabilities to ensure alignment with vLLM for audio input and text output.

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
@Shirley125 Shirley125 force-pushed the realtime branch 2 times, most recently from ed6b0f2 to df69957 on March 31, 2026 07:32
@Shirley125 Shirley125 marked this pull request as ready for review March 31, 2026 08:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99f58381fd


Comment thread vllm_omni/entrypoints/async_omni.py
audio_placeholder = Qwen3OmniMoeThinkerForConditionalGeneration.get_placeholder_str("audio", 0)
prompt_template = f"<|im_start|>user\n{audio_placeholder}<|im_end|>\n<|im_start|>assistant\n"

prompt_token_ids = tokenizer.encode(prompt_template)

P1: Incorporate input_stream context in realtime audio prompts

buffer_realtime_audio() receives input_stream for carrying prior generated token IDs, but prompt construction uses a fixed prompt_token_ids template and never consumes that queue. As a result, each audio segment is transcribed without prior output context, breaking session-style cumulative prompting across incremental realtime updates.
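
A hypothetical sketch of the fix being suggested here; the function name, queue payload shape, and splice point are guesses for illustration only:

```python
import asyncio

def build_prompt_token_ids(tokenizer, prompt_template: str,
                           input_stream: "asyncio.Queue[list[int]]") -> list[int]:
    # Drain token IDs generated in earlier turns of the session...
    prior: list[int] = []
    while not input_stream.empty():
        prior.extend(input_stream.get_nowait())
    # ...and splice them ahead of the fixed template encoding, so each
    # segment is transcribed with cumulative session context.
    return prior + tokenizer.encode(prompt_template)
```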


Contributor Author


Referencing vLLM’s qwen3_asr_realtime model and following vLLM’s optimization strategies: vllm-project/vllm#35767

Comment thread vllm_omni/entrypoints/async_omni.py Outdated
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
@Shirley125 Shirley125 changed the title from [WIP][Feature] add session based audio streaming input to [Feature] add session based audio streaming input on Mar 31, 2026
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
@Shirley125
Contributor Author

request.external_req_id = request_id

# Register with stage 0's output processor.
output_prompt_text = prompt_text
Contributor


Please confirm whether this is necessary?

Contributor Author


Yes, this aligns with vLLM: the prompt in RequestState within output_processor.py should be of type str; otherwise, apply_streaming_update in vLLM will raise an error.


async def _handle_streaming_update(self, msg: dict[str, Any]) -> None:
"""Handle a streaming_update message for an existing request."""
stage_id = 0
Contributor


Should we pass this parameter in instead of hardcoding it?

Contributor Author

@Shirley125 Shirley125 Apr 2, 2026


Same as the _handle_add_request logic: it first goes through Stage 0 and is then passed to the downstream stage via _forward_to_next_stage.

@amy-why-3459
Contributor

@Sy0307 @lishunyang12 PTAL

audio transcription by uploading an audio file.

Before running this script, you must start the vLLM-Omni server with a realtime-capable
model, for example:
Contributor


Please update the README.md

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
@Sy0307
Contributor

Sy0307 commented Apr 2, 2026

LGTM. I have tested it on c6396fc before.

@lishunyang12 lishunyang12 added the ready label to trigger buildkite CI on Apr 2, 2026
Collaborator

@lishunyang12 lishunyang12 left a comment


LGTM overall

@lishunyang12
Collaborator

@princepride PTAL

Collaborator

@princepride princepride left a comment


LGTM

@amy-why-3459
Contributor

Thank you very much for your contribution, LGTM

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


any test added?

info=Qwen3OmniMoeThinkerProcessingInfo,
dummy_inputs=Qwen3OmniMoeThinkerDummyInputsBuilder,
)
class Qwen3OmniMoeForConditionalGeneration(
Collaborator


Contributor Author


sure

@Shirley125
Contributor Author

any test added?

Unit tests have been added. CI will be considered in the next PR.

Comment thread vllm_omni/engine/async_omni_engine.py Outdated
"arrival_time": arrival_time,
}
if resumable:
process_kwargs["resumable"] = True
Collaborator


why not just resumable=resumable?

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
@hsliuustc0106 hsliuustc0106 merged commit 728cf6d into vllm-project:main Apr 2, 2026
8 checks passed
linyueqian pushed a commit to JuanPZuluaga/vllm-omni that referenced this pull request Apr 3, 2026
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmal.com>
Sy0307 pushed a commit to Sy0307/vllm-omni that referenced this pull request Apr 6, 2026
Includes:
- ca02351 [skip ci][Bugfix] clean useless log (vllm-project#2450)
- 50bb47a [Test] Skip zimage expansion test (vllm-project#2454)
- 728cf6d [Feature] add session based audio streaming input (vllm-project#2208)
- 6211413 Update MRoPE config fallback logic (vllm-project#2278)
luke-n-alpha pushed a commit to nextain/vllm-omni that referenced this pull request Apr 6, 2026
- 56 upstream commits pulled in
- Adds /v1/realtime WebSocket endpoint (PR vllm-project#2208, Qwen3-Omni)
- Registry, scheduler, diffusion model updates
- Conflict resolved: README.md kept fork header

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
luke-n-alpha pushed a commit to nextain/vllm-omni that referenced this pull request Apr 6, 2026
… audio out)

New endpoint: /v1/omni WebSocket for full-duplex omni conversation.
- Client sends PCM16 16kHz audio as binary frames + {type: input.done}
- Server streams WAV audio chunks as binary frames + transcript.delta JSON
- Reuses OmniOpenAIServingChat.create_chat_completion() (proven REST path)
- Multi-turn conversation history maintained per session
- request_id pre-generated before generator iteration to enable abort on
  early client disconnect without GPU resource leak

Protocol:
  session.config (text) -> binary PCM16 frames -> input.done (text)
  turn.start -> transcript.delta -> audio.start -> binary WAV chunks ->
  audio.done -> turn.done

Each binary output frame is a self-contained WAV (format: wav_chunk).
Not a streamable concatenated WAV header.

Why not /v1/realtime or _add_streaming_input_request:
- /v1/realtime = ASR transcription only, no audio output
- _add_streaming_input_request (PR vllm-project#2208) = ASR input streaming for Qwen3
- MiniCPM-o thinker2talker_async_chunk needs full Thinker output for
  _find_tts_bound(); streaming input gives no TTFP benefit

Relates to: nextain/naia-os#216

Pattern: OmniStreamingSpeechHandler in serving_speech_stream.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>