[Feat][Qwen3-TTS] Support streaming audio output for websocket by Sy0307 · Pull Request #1719 · vllm-project/vllm-omni

Sy0307 · 2026-03-06T21:40:15Z

Summary

This PR implements streaming audio output for the Qwen3-TTS WebSocket endpoint and addresses #1479.

Today, /v1/audio/speech/stream supports incremental text input over WebSocket and sentence-level audio generation, but each sentence is returned as a single binary frame only after full synthesis completes.

This PR adds sentence-level streaming audio output:

audio.start
one or more binary PCM chunks
audio.done

This allows clients to start playback before full sentence synthesis finishes.

Relationship to #1230

This PR is written as a standalone diff against main, but conceptually it builds on the WebSocket text-input foundation introduced in #1230.

In other words:

#1230 provides the WebSocket session / incremental text input / sentence splitting baseline
this PR adds chunked audio output for each sentence

Because of that, part of the diff includes the WebSocket text-input scaffolding as the base needed for #1479.

What changed

WebSocket protocol

Added stream_audio: bool = false to StreamingSpeechSessionConfig
Enforced stream_audio=true => response_format="pcm"
Enforced stream_audio=true => speed=1.0
Added initial_codec_chunk_frames to the WebSocket session config to match the REST API
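The two stream_audio constraints above can be pictured with a small validation sketch. This is not the PR's code: `SessionConfigSketch` is a hypothetical stand-in for StreamingSpeechSessionConfig, the default `response_format` is an assumption, and "enforced" is read here as "reject other values" (the real implementation may coerce them instead).

```python
from dataclasses import dataclass


@dataclass
class SessionConfigSketch:
    """Hypothetical subset of the WebSocket session config fields."""

    voice: str = "Vivian"
    response_format: str = "wav"  # assumed default, for illustration only
    speed: float = 1.0
    stream_audio: bool = False

    def __post_init__(self) -> None:
        # The constraints only apply when chunked audio output is requested.
        if not self.stream_audio:
            return
        if self.response_format != "pcm":
            raise ValueError('stream_audio=true requires response_format="pcm"')
        if self.speed != 1.0:
            raise ValueError("stream_audio=true requires speed=1.0")
```

With this reading, `SessionConfigSketch(response_format="pcm", stream_audio=True)` is accepted, while the same config with `response_format="wav"` or `speed=1.5` raises `ValueError`.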

Shared speech generation path

Refactored OmniOpenAIServingSpeech so REST and WebSocket reuse the same generation preparation path
Added a shared PCM chunk iterator for streaming output
Kept the non-streaming byte aggregation path for backward compatibility

WebSocket streaming audio output

Replaced the per-sentence blocking audio generation path in the WebSocket handler
When stream_audio=true, multiple PCM binary frames are sent per sentence as chunks arrive
Preserved audio.start / audio.done framing
Added total_bytes to audio.done
Added sample_rate=24000 to audio.start for PCM streaming
Added error to audio.done so clients can distinguish failed sentence completion
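Since the streamed frames are raw PCM, a client has to supply the container itself if it wants a playable file. A minimal sketch using the standard `wave` module follows. The 24000 Hz rate comes from audio.start; 16-bit mono samples are an assumption, not something this PR states, and both helper names are made up for illustration.

```python
import io
import wave


def pcm_chunks_to_wav(chunks, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM chunks in a WAV container (assumes 16-bit mono by default)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    return buf.getvalue()


def pcm_duration_seconds(total_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Playback duration implied by a byte count such as audio.done's total_bytes."""
    return total_bytes / (sample_rate * channels * sample_width)
```

Under those assumptions, a total_bytes of 96000 corresponds to 2.0 seconds of audio.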

Client and docs

Updated the example WebSocket client to append multiple chunks for the same sentence instead of overwriting on every binary frame
Updated docs to describe the PCM chunked WebSocket protocol

Protocol example

Client:

{"type":"session.config","voice":"Vivian","response_format":"pcm","stream_audio":true}
{"type":"input.text","text":"Hello world. "}
{"type":"input.done"}

Server:

{"type":"audio.start","sentence_index":0,"sentence_text":"Hello world.","format":"pcm","sample_rate":24000}

<binary PCM chunk 1>
<binary PCM chunk 2>

{"type":"audio.done","sentence_index":0,"total_bytes":96000,"error":false}
{"type":"session.done","total_sentences":1}

Backward compatibility

Default behavior is unchanged: stream_audio=false
In non-streaming mode, each sentence still returns a single binary frame
Streaming over WebSocket is limited to PCM, matching the existing REST streaming constraints

Tests

Passed locally:

python -m pytest tests/entrypoints/openai_api/test_serving_speech_stream.py tests/entrypoints/openai_api/test_text_splitter.py -q

Also added:

WebSocket online e2e coverage for Qwen3-TTS
coverage for empty input, multi-sentence indexing, unknown message handling, config timeout, generation failure, and splitter carry behavior

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14e90c3259

chatgpt-codex-connector · 2026-03-06T21:46:02Z

-                status_code=result.error.code if result.error else 400,
-            )
-        return result
+        return await handler.create_speech(request, raw_request)


Restore non-200 status mapping for speech errors

This handler now returns handler.create_speech(...) directly, but create_speech still returns ErrorResponse objects for validation/generation failures; without the removed isinstance(result, ErrorResponse) conversion, FastAPI will serialize those as HTTP 200 responses. That makes invalid requests (e.g., bad TTS params) look successful to clients and breaks error handling/retry logic that depends on 4xx/5xx status codes.
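The mapping the review asks to restore can be sketched without any framework. The classes below are hypothetical stand-ins for the real ErrorResponse type; only the `result.error.code if result.error else 400` rule is taken from the removed lines shown above. In a FastAPI handler the resulting pair would typically be passed to `JSONResponse(content=..., status_code=...)`.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class ErrorInfoSketch:
    """Hypothetical stand-in for the error payload."""

    message: str
    code: int


@dataclass
class ErrorResponseSketch:
    """Hypothetical stand-in for ErrorResponse."""

    error: Optional[ErrorInfoSketch] = None


def status_and_body(result: Any) -> Tuple[int, Any]:
    """Map a create_speech-style result to (HTTP status, body)."""
    if isinstance(result, ErrorResponseSketch):
        # Same rule as the removed code: use the error's code, else fall back to 400.
        status = result.error.code if result.error else 400
        message = result.error.message if result.error else "unknown error"
        return status, {"error": {"message": message, "code": status}}
    # Anything else is treated as a successful payload.
    return 200, result
```

For example, a result carrying `ErrorInfoSketch("bad TTS params", 422)` maps to status 422 instead of being serialized as a 200.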

chatgpt-codex-connector · 2026-03-06T21:46:02Z

        await omni_init_app_state(engine_client, app.state, args)

-        # Conditionally register profiler endpoints based on config or env var
-        if _should_enable_profiler_endpoints(args):
-            logger.warning("Profiler endpoints are enabled. This should ONLY be used for local development!")
-            app.include_router(profiler_router)
-
        vllm_config = await engine_client.get_vllm_config()


Re-enable profiler route registration

The server bootstrap no longer conditionally mounts the profiler router, so /start_profile and /stop_profile are never exposed even when profiling is configured. This is a functional regression for profiling workflows: internal benchmarking code still calls these endpoints (vllm_omni/benchmarks/patch/patch.py uses .../start_profile and .../stop_profile), so profile runs will now fail with missing-route errors.

chatgpt-codex-connector · 2026-03-06T21:46:02Z

            # Validate ref_audio format
            if not (request.ref_audio.startswith(("http://", "https://")) or request.ref_audio.startswith("data:")):
                return "ref_audio must be a URL (http/https) or base64 data URL (data:...)"


Enforce ref_text for Base cloning requests

Base-task validation now only checks that ref_audio exists and is URL/data-URI, but it no longer rejects missing/empty ref_text when x_vector_only_mode is not enabled. That lets invalid voice-cloning requests pass validation and fail later in model execution instead of returning a clean client error, which is a regression in API robustness for Base mode.
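A rough sketch of the check being asked for is below. The function name, the parameter list, and the two new error strings are hypothetical; only the ref_audio format message is copied from the snippet above. Like that snippet, it returns an error string, or `None` when the request is valid.

```python
from typing import Optional


def validate_base_request(
    ref_audio: Optional[str],
    ref_text: Optional[str],
    x_vector_only_mode: bool = False,
) -> Optional[str]:
    """Return an error message for an invalid Base (voice cloning) request, else None."""
    if not ref_audio:
        return "Base task requires ref_audio"  # hypothetical wording
    # Validate ref_audio format (same rule as the snippet above).
    if not (ref_audio.startswith(("http://", "https://")) or ref_audio.startswith("data:")):
        return "ref_audio must be a URL (http/https) or base64 data URL (data:...)"
    # ref_text may only be omitted when x_vector_only_mode is enabled.
    if not x_vector_only_mode and not (ref_text and ref_text.strip()):
        return "Base task requires ref_text unless x_vector_only_mode is enabled"  # hypothetical wording
    return None
```

Rejecting the request at this point turns a late model-execution failure into an immediate client error.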

linyueqian · 2026-03-06T22:04:14Z

Thanks for this PR! I ran some benchmarks comparing connection setup overhead across transports (HTTP SSE, WebSocket, WebRTC).

Connection setup latency (localhost, 10 runs):

Transport	Mean	P50
WebSocket (TCP + upgrade)	2.9ms	1.7ms
HTTP SSE (TCP + POST)	47ms	43ms
WebRTC (offer + ICE + DTLS)	55ms	54ms

Projected connection overhead by network RTT:

RTT	HTTP (1 RTT)	WebSocket (2 RTTs)	WebRTC (~6 RTTs)
0.2ms (localhost)	0.2ms	0.4ms	1.2ms
50ms (cross-region)	50ms	100ms	300ms
100ms (cross-continent)	100ms	200ms	600ms
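The projection table is plain multiplication: overhead = RTT × number of round trips. A short sketch that reproduces the rows, using the round-trip counts from the table header (the WebRTC count is approximate there as well):

```python
# Round trips needed before the first payload byte, per the table header.
ROUND_TRIPS = {"HTTP": 1, "WebSocket": 2, "WebRTC": 6}


def projected_overhead_ms(rtt_ms):
    """Projected connection overhead in milliseconds for each transport."""
    return {name: round(rtt_ms * trips, 3) for name, trips in ROUND_TRIPS.items()}


for rtt in (0.2, 50, 100):
    print(rtt, projected_overhead_ms(rtt))
```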

WebSocket is the clear winner for TTS streaming. Fast connection setup, no STUN/TURN dependency, and the incremental text input support is a nice bonus. WebRTC's UDP advantage doesn't really help for one directional audio push.

Looking forward to a follow up PR to update the Gradio demo to use the WebSocket endpoint.

linyueqian

Took a look at the code, left some inline comments on a few things I noticed.

linyueqian · 2026-03-06T22:19:01Z

+        await websocket.send_json(start_payload)
+
+        total_bytes = 0
+        generation_failed = False


If websocket.send_bytes(chunk) throws WebSocketDisconnect, this except Exception block catches it and tries to send an error back to the already disconnected client. Then the finally block also tries to send_json the audio.done frame on the closed socket, which throws again.

Two things worth fixing:

Catch WebSocketDisconnect separately and re-raise it so the outer handler in handle_session deals with it cleanly

Wrap the finally block send in try/except so it doesn't mask the original exception

except WebSocketDisconnect:
    raise
except Exception as e:
    generation_failed = True
    ...
finally:
    try:
        await websocket.send_json({...})
    except Exception:
        pass
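To make the suggested control flow concrete, here is a self-contained sketch that runs without a server. `WebSocketDisconnect` and `FakeWebSocket` are local stand-ins (the real exception lives in Starlette), and `stream_sentence` is a hypothetical simplification of the handler, not the PR's code.

```python
import asyncio


class WebSocketDisconnect(Exception):
    """Local stand-in for starlette.websockets.WebSocketDisconnect."""


class FakeWebSocket:
    """Records sent frames; disconnects after `fail_after` binary frames."""

    def __init__(self, fail_after=None):
        self.sent = []
        self.binary_sent = 0
        self.fail_after = fail_after
        self.closed = False

    async def send_bytes(self, data):
        if self.closed or (self.fail_after is not None and self.binary_sent >= self.fail_after):
            self.closed = True
            raise WebSocketDisconnect()
        self.binary_sent += 1
        self.sent.append(data)

    async def send_json(self, payload):
        if self.closed:
            raise WebSocketDisconnect()
        self.sent.append(payload)


async def stream_sentence(websocket, chunks, sentence_index=0):
    await websocket.send_json({"type": "audio.start", "sentence_index": sentence_index})
    total_bytes = 0
    generation_failed = False
    try:
        async for chunk in chunks:
            await websocket.send_bytes(chunk)
            total_bytes += len(chunk)
    except WebSocketDisconnect:
        raise  # let the outer session handler deal with the disconnect
    except Exception:
        generation_failed = True  # report the failure in audio.done instead
    finally:
        try:
            await websocket.send_json({
                "type": "audio.done",
                "sentence_index": sentence_index,
                "total_bytes": total_bytes,
                "error": generation_failed,
            })
        except Exception:
            pass  # socket already closed; do not mask the original exception
    return total_bytes
```

On a disconnect, the `finally` send fails silently and the original `WebSocketDisconnect` still propagates. On a generation error, the client receives audio.done with error set to true.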

linyueqian · 2026-03-06T22:19:01Z

+            "sentence_index": sentence_index,
+            "sentence_text": sentence_text,
+            "format": response_format,
+        }


When the client disconnects mid-generation, the async generator from _iter_pcm_audio_bytes is not explicitly closed. Python should call aclose() during GC but that's non-deterministic, so the underlying engine generator may keep running and burning GPU until it finishes naturally.

Worth wrapping this in contextlib.aclosing and calling abort(request_id) on disconnect:

from contextlib import aclosing

async with aclosing(self._speech_service._iter_pcm_audio_bytes(request)) as stream:
    async for chunk in stream:
        ...

iancarrasco-b10 · 2026-03-07T04:01:31Z

I tried testing this PR using ICL mode + streaming, but I'm noticing a lot of noise at the beginning of the generated output. Is ICL supported here?

The reference audio is clean and ~17s long with transcript.

Generated output on H100:
sentence_000.wav

Note: When running with x-vector-only mode this doesn't seem to be an issue

Sy0307 · 2026-03-08T18:02:04Z

Please see PR #1731 and check whether it fixes your issue. Sorry, I missed the ICL requirements and some extra reference info that was needed in the refactor for Qwen3-TTS async_chunk.

lishunyang12

left a comment inline

lishunyang12 · 2026-03-09T01:51:47Z

RFC_1479_streaming_ws_tts_plan.md should probably live in a GitHub issue or discussion rather than the repo root

Sy0307 · 2026-03-09T12:49:24Z

Fixed.

Sy0307 · 2026-03-11T17:50:13Z

I have fixed all the issues you mentioned. PTAL again @linyueqian. Thanks.

zhaotyer · 2026-03-13T06:36:36Z

@linyueqian @Sy0307 This PR causes some endpoints in api_server.py to be reverted.

Sy0307 requested a review from hsliuustc0106 as a code owner March 6, 2026 21:40

chatgpt-codex-connector Bot reviewed Mar 6, 2026

linyueqian reviewed Mar 6, 2026

Sy0307 mentioned this pull request Mar 8, 2026

[Fix][Qwen3-TTS] Preserve ref_code decoder context for Base ICL #1731

Merged

Sy0307 force-pushed the feat/qwen3-tts-stream-audio-websocket-1479 branch 2 times, most recently from bfd0cc7 to d4bb651 Compare March 8, 2026 18:58

lishunyang12 reviewed Mar 9, 2026

Sy0307 force-pushed the feat/qwen3-tts-stream-audio-websocket-1479 branch 2 times, most recently from 0485e3a to 96419ed Compare March 9, 2026 12:47

linyueqian mentioned this pull request Mar 10, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

[Feat][Qwen3-TTS] Stream audio output over WebSocket (vllm-project#1479)

ffc6476

Signed-off-by: Sy03 <1370724210@qq.com>

Sy0307 force-pushed the feat/qwen3-tts-stream-audio-websocket-1479 branch from 96419ed to ffc6476 Compare March 11, 2026 17:18

fix: remove duplicate get_supported_tasks in async_omni

deb92c0

Signed-off-by: Sy03 <1370724210@qq.com>

linyueqian added the ready label to trigger buildkite CI label Mar 11, 2026

linyueqian and others added 2 commits March 11, 2026 19:45

Merge branch 'main' into feat/qwen3-tts-stream-audio-websocket-1479

2c55e3c

Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Fix stray line from merge conflict in test_serving_speech.py

3b438d0

Signed-off-by: linyueqian <linyueqian@outlook.com>

linyueqian enabled auto-merge (squash) March 12, 2026 03:48

linyueqian approved these changes Mar 12, 2026

linyueqian merged commit 5d28b6a into vllm-project:main Mar 12, 2026
6 of 7 checks passed

linyueqian mentioned this pull request Mar 12, 2026

[Feat][Qwen3-TTS] Better Qwen3-TTS online serving demo #1857

Merged

5 tasks

JuanPZuluaga mentioned this pull request Mar 13, 2026

feat(tts): add voice upload API for Qwen3-TTS #1201

Merged

linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Mar 13, 2026

Restore voice upload API and profiler endpoints reverted by vllm-proj…

e78e139

…ect#1719 Signed-off-by: linyueqian <linyueqian@outlook.com>

linyueqian mentioned this pull request Mar 13, 2026

[Bugfix] Restore voice upload API and profiler endpoints reverted by #1719 #1879

Merged

3 tasks

hsliuustc0106 pushed a commit that referenced this pull request Mar 13, 2026

[Bugfix] Restore voice upload API and profiler endpoints reverted by #…

cebf78b

…1719 (#1879) Signed-off-by: linyueqian <linyueqian@outlook.com>

dorhuri123 pushed a commit to dorhuri123/vllm-omni that referenced this pull request Mar 13, 2026

[Bugfix] Restore voice upload API and profiler endpoints reverted by v…

406498c

…llm-project#1719 (vllm-project#1879) Signed-off-by: linyueqian <linyueqian@outlook.com>

wtomin pushed a commit to wtomin/vllm-omni that referenced this pull request Mar 16, 2026

[Bugfix] Restore voice upload API and profiler endpoints reverted by v…

99c448e

…llm-project#1719 (vllm-project#1879) Signed-off-by: linyueqian <linyueqian@outlook.com>

lishunyang12 mentioned this pull request Mar 16, 2026

[RFC] Streaming Audio Output for WebSocket TTS #1479

Closed

Copilot AI mentioned this pull request Mar 23, 2026

[WIP] Compare changes between PR #1719 and PR #1883 LJH-LBJ/vllm-omni#2

Closed

14 tasks

gcanlin mentioned this pull request Mar 30, 2026

[RFC]: Improving Qwen3-TTS Performance on NPU #2328

Open

9 tasks

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Bugfix] Restore voice upload API and profiler endpoints reverted by v…

1a58aec

…llm-project#1719 (vllm-project#1879) Signed-off-by: linyueqian <linyueqian@outlook.com>
