[Feat][Qwen3-TTS] Support streaming audio output for websocket #1719

Merged: linyueqian merged 4 commits into vllm-project:main from Sy0307:feat/qwen3-tts-stream-audio-websocket-1479 on Mar 12, 2026

Sy0307 (Contributor) commented Mar 6, 2026

Summary

This PR implements streaming audio output for the Qwen3-TTS WebSocket endpoint and addresses #1479.

Today, /v1/audio/speech/stream supports incremental text input over WebSocket and sentence-level audio generation, but each sentence is returned as a single binary frame only after full synthesis completes.

This PR adds sentence-level streaming audio output:

  • audio.start
  • one or more binary PCM chunks
  • audio.done

This allows clients to start playback before full sentence synthesis finishes.

Relationship to #1230

This PR is written as a standalone diff against main, but conceptually it builds on the WebSocket text-input foundation introduced in #1230.

In other words:

  • #1230 provides the WebSocket session / incremental text input / sentence splitting baseline
  • this PR adds chunked audio output for each sentence

Because of that, part of the diff includes the WebSocket text-input scaffolding as the base needed for #1479.

What changed

WebSocket protocol

  • Added stream_audio: bool = false to StreamingSpeechSessionConfig
  • Enforced stream_audio=true => response_format="pcm"
  • Enforced stream_audio=true => speed=1.0
  • Added initial_codec_chunk_frames to the WebSocket session config to match the REST API
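
The two constraints on stream_audio listed above could be checked roughly as follows. This is a minimal sketch, not the PR's code: SessionConfigSketch is a hypothetical stand-in for StreamingSpeechSessionConfig, and the function name, the error messages, and the "wav" default are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionConfigSketch:
    """Hypothetical stand-in for StreamingSpeechSessionConfig (fields assumed)."""
    response_format: str = "wav"  # assumed default, not taken from the PR
    speed: float = 1.0
    stream_audio: bool = False


def validate_stream_config(cfg: SessionConfigSketch) -> Optional[str]:
    """Return an error message, or None when the config is acceptable."""
    if not cfg.stream_audio:
        return None  # non-streaming mode: no extra constraints
    if cfg.response_format != "pcm":
        return "stream_audio=true requires response_format='pcm'"
    if cfg.speed != 1.0:
        return "stream_audio=true requires speed=1.0"
    return None
```

Returning a message instead of raising mirrors the string-returning validation style visible elsewhere in this thread; a real implementation might use a model validator instead.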

Shared speech generation path

  • Refactored OmniOpenAIServingSpeech so REST and WebSocket reuse the same generation preparation path
  • Added a shared PCM chunk iterator for streaming output
  • Kept the non-streaming byte aggregation path for backward compatibility

WebSocket streaming audio output

  • Replaced the per-sentence blocking audio generation path in the WebSocket handler
  • When stream_audio=true, multiple PCM binary frames are sent per sentence as chunks arrive
  • Preserved audio.start / audio.done framing
  • Added total_bytes to audio.done
  • Added sample_rate=24000 to audio.start for PCM streaming
  • Added error to audio.done so clients can distinguish failed sentence completion

Client and docs

  • Updated the example WebSocket client to append multiple chunks for the same sentence instead of overwriting on every binary frame
  • Updated docs to describe the PCM chunked WebSocket protocol

Protocol example

Client:

{"type":"session.config","voice":"Vivian","response_format":"pcm","stream_audio":true}
{"type":"input.text","text":"Hello world. "}
{"type":"input.done"}

Server:

{"type":"audio.start","sentence_index":0,"sentence_text":"Hello world.","format":"pcm","sample_rate":24000}
<binary PCM chunk 1>
<binary PCM chunk 2>
{"type":"audio.done","sentence_index":0,"total_bytes":96000,"error":false}
{"type":"session.done","total_sentences":1}
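
On the client side, the exchange above amounts to a small state machine: JSON text frames open and close a sentence, and every binary frame in between is appended to that sentence's buffer. The sketch below is an illustration under assumptions, not the example client shipped with the PR. The function name collect_sentences is hypothetical, frames are modelled as str (JSON) or bytes (PCM), total_bytes is assumed to count the bytes of all binary frames for that sentence, and dropping the audio of a failed sentence is an assumed policy.

```python
import json
from typing import Dict, Iterable, Optional, Union

Frame = Union[str, bytes]


def collect_sentences(frames: Iterable[Frame]) -> Dict[int, bytes]:
    """Group binary PCM chunks by sentence_index using audio.start/audio.done."""
    audio: Dict[int, bytearray] = {}
    current: Optional[int] = None  # sentence currently being received
    for frame in frames:
        if isinstance(frame, bytes):
            if current is None:
                raise ValueError("binary frame outside audio.start/audio.done")
            audio[current].extend(frame)  # append chunks, never overwrite
            continue
        msg = json.loads(frame)
        kind = msg.get("type")
        if kind == "audio.start":
            current = msg["sentence_index"]
            audio[current] = bytearray()
        elif kind == "audio.done":
            idx = msg["sentence_index"]
            received = len(audio.get(idx, b""))
            if msg.get("error"):
                audio.pop(idx, None)  # assumed policy: discard failed sentences
            elif "total_bytes" in msg and msg["total_bytes"] != received:
                raise ValueError(f"byte count mismatch for sentence {idx}")
            current = None
        elif kind == "session.done":
            break
    return {idx: bytes(buf) for idx, buf in audio.items()}
```

A real client would feed this loop from its WebSocket library's receive call and could start playback as soon as the first chunk of a sentence arrives rather than waiting for audio.done.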

Backward compatibility

  • Default behavior is unchanged: stream_audio=false
  • In non-streaming mode, each sentence still returns a single binary frame
  • Streaming over WebSocket is limited to PCM, matching the existing REST streaming constraints
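
Because the streamed chunks are raw PCM, they carry no header, so a client that wants a playable file has to add one itself. The sketch below uses Python's standard wave module. The sample format is not stated in this thread: the code assumes 16-bit mono samples at the 24000 Hz rate reported in audio.start, and both helper names are hypothetical. Under that assumption, the 96000 bytes in the protocol example would correspond to 2 seconds of audio.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM in a WAV container (16-bit mono is an assumption)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def pcm_duration_seconds(num_bytes: int, sample_rate: int = 24000,
                         channels: int = 1, sample_width: int = 2) -> float:
    """Duration implied by a byte count for the assumed sample format."""
    return num_bytes / (sample_rate * channels * sample_width)
```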

Tests

Passed locally:

  • python -m pytest tests/entrypoints/openai_api/test_serving_speech_stream.py tests/entrypoints/openai_api/test_text_splitter.py -q

Also added:

  • WebSocket online e2e coverage for Qwen3-TTS
  • coverage for empty input, multi-sentence indexing, unknown message handling, config timeout, generation failure, and splitter carry behavior

@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner March 6, 2026 21:40

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14e90c3259

status_code=result.error.code if result.error else 400,
)
return result
return await handler.create_speech(request, raw_request)

[P1] Restore non-200 status mapping for speech errors

This handler now returns handler.create_speech(...) directly, but create_speech still returns ErrorResponse objects for validation/generation failures; without the removed isinstance(result, ErrorResponse) conversion, FastAPI will serialize those as HTTP 200 responses. That makes invalid requests (e.g., bad TTS params) look successful to clients and breaks error handling/retry logic that depends on 4xx/5xx status codes.
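
The status mapping this comment asks to restore can be sketched as below. It is illustrative only: ErrorInfoSketch and ErrorResponseSketch are hypothetical stand-ins for the project's ErrorResponse type, and to_http_result is an assumed helper name. The fallback of 400 follows the `result.error.code if result.error else 400` fragment quoted above; in the actual route the pair would be turned into a FastAPI JSON response.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class ErrorInfoSketch:
    message: str
    code: int


@dataclass
class ErrorResponseSketch:
    """Hypothetical stand-in for the ErrorResponse returned by create_speech."""
    error: Optional[ErrorInfoSketch] = None


def to_http_result(result: Any) -> Tuple[int, Any]:
    """Map a handler result to (status_code, body)."""
    if isinstance(result, ErrorResponseSketch):
        # Use the error's own code when present, otherwise fall back to 400.
        status = result.error.code if result.error else 400
        message = result.error.message if result.error else "unknown error"
        return status, {"error": message}
    return 200, result  # success path: pass the payload through unchanged
```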

Comment on lines 229 to 231
await omni_init_app_state(engine_client, app.state, args)

# Conditionally register profiler endpoints based on config or env var
if _should_enable_profiler_endpoints(args):
    logger.warning("Profiler endpoints are enabled. This should ONLY be used for local development!")
    app.include_router(profiler_router)

vllm_config = await engine_client.get_vllm_config()

[P1] Re-enable profiler route registration

The server bootstrap no longer conditionally mounts the profiler router, so /start_profile and /stop_profile are never exposed even when profiling is configured. This is a functional regression for profiling workflows: internal benchmarking code still calls these endpoints (vllm_omni/benchmarks/patch/patch.py uses .../start_profile and .../stop_profile), so profile runs will now fail with missing-route errors.

Comment on lines 188 to 190
# Validate ref_audio format
if not (request.ref_audio.startswith(("http://", "https://")) or request.ref_audio.startswith("data:")):
    return "ref_audio must be a URL (http/https) or base64 data URL (data:...)"

[P1] Enforce ref_text for Base cloning requests

Base-task validation now only checks that ref_audio exists and is URL/data-URI, but it no longer rejects missing/empty ref_text when x_vector_only_mode is not enabled. That lets invalid voice-cloning requests pass validation and fail later in model execution instead of returning a clean client error, which is a regression in API robustness for Base mode.
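
A validation order that would satisfy this comment might look like the following. It is a sketch under assumptions: validate_base_request and its flat parameters are hypothetical (the real code validates a request object), and only the ref_audio format message is taken from the snippet above; the other messages are invented for illustration.

```python
from typing import Optional


def validate_base_request(ref_audio: Optional[str],
                          ref_text: Optional[str],
                          x_vector_only_mode: bool = False) -> Optional[str]:
    """Return an error message for an invalid Base-task request, else None."""
    if not ref_audio:
        return "Base task requires ref_audio"
    # Validate ref_audio format
    if not ref_audio.startswith(("http://", "https://", "data:")):
        return "ref_audio must be a URL (http/https) or base64 data URL (data:...)"
    # ref_text is only optional when x_vector_only_mode is enabled
    if not x_vector_only_mode and not (ref_text and ref_text.strip()):
        return "Base task requires non-empty ref_text unless x_vector_only_mode is enabled"
    return None
```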

linyueqian (Collaborator) commented

Thanks for this PR! I ran some benchmarks comparing connection setup overhead across transports (HTTP SSE, WebSocket, WebRTC).

Connection setup latency (localhost, 10 runs):

| Transport | Mean | P50 |
|---|---|---|
| WebSocket (TCP + upgrade) | 2.9ms | 1.7ms |
| HTTP SSE (TCP + POST) | 47ms | 43ms |
| WebRTC (offer + ICE + DTLS) | 55ms | 54ms |

Projected connection overhead by network RTT:

| RTT | HTTP (1 RTT) | WebSocket (2 RTTs) | WebRTC (~6 RTTs) |
|---|---|---|---|
| 0.2ms (localhost) | 0.2ms | 0.4ms | 1.2ms |
| 50ms (cross-region) | 50ms | 100ms | 300ms |
| 100ms (cross-continent) | 100ms | 200ms | 600ms |
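
The projection above is simply the network RTT multiplied by the number of round trips each transport needs before audio can flow. A minimal sketch of that arithmetic, using the round-trip counts from the table headers (the WebRTC figure is given there as approximate), with a hypothetical function name:

```python
from typing import Dict

# Round trips needed for connection setup, as listed in the table headers.
ROUND_TRIPS: Dict[str, int] = {"HTTP": 1, "WebSocket": 2, "WebRTC": 6}


def projected_setup_ms(rtt_ms: float) -> Dict[str, float]:
    """Projected connection overhead in milliseconds for a given network RTT."""
    return {name: trips * rtt_ms for name, trips in ROUND_TRIPS.items()}
```

For example, projected_setup_ms(50) reproduces the cross-region row of the table. The projection ignores server-side processing time, which is why the measured localhost numbers are larger than the projected ones.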

WebSocket is the clear winner for TTS streaming: fast connection setup, no STUN/TURN dependency, and the incremental text input support is a nice bonus. WebRTC's UDP advantage doesn't really help for a one-directional audio push.

Looking forward to a follow-up PR to update the Gradio demo to use the WebSocket endpoint.

linyueqian (Collaborator) left a comment

Took a look at the code, left some inline comments on a few things I noticed.

await websocket.send_json(start_payload)

total_bytes = 0
generation_failed = False

If websocket.send_bytes(chunk) throws WebSocketDisconnect, this except Exception block catches it and tries to send an error back to the already disconnected client. Then the finally block also tries to send_json the audio.done frame on the closed socket, which throws again.

Two things worth fixing:

  1. Catch WebSocketDisconnect separately and re-raise it so the outer handler in handle_session deals with it cleanly
  2. Wrap the send in the finally block in try/except so it doesn't mask the original exception

except WebSocketDisconnect:
    raise
except Exception as e:
    generation_failed = True
    ...
finally:
    try:
        await websocket.send_json({...})
    except Exception:
        pass

"sentence_index": sentence_index,
"sentence_text": sentence_text,
"format": response_format,
}

When the client disconnects mid-generation, the async generator from _iter_pcm_audio_bytes is not explicitly closed. Python should call aclose() during GC, but that's non-deterministic, so the underlying engine generator may keep running and burning GPU until it finishes naturally.

Worth wrapping this in contextlib.aclosing and calling abort(request_id) on disconnect:

from contextlib import aclosing

async with aclosing(self._speech_service._iter_pcm_audio_bytes(request)) as stream:
    async for chunk in stream:
        ...

iancarrasco-b10 (Contributor) commented Mar 7, 2026

I tried testing this PR using ICL mode + streaming, but I'm noticing a lot of noise at the beginning of the generated output. Is ICL supported here?

The reference audio is clean and ~17s long with transcript.

Generated output on H100:
sentence_000.wav

Note: When running with x-vector-only mode this doesn't seem to be an issue

Sy0307 (Contributor Author) commented Mar 8, 2026

> I tried testing this PR using ICL mode + streaming, but I'm noticing a lot of noise at the beginning of the generated output. Is ICL supported here?
>
> The reference audio is clean and ~17s long with transcript.
>
> Generated output on H100: sentence_000.wav
>
> Note: When running with x-vector-only mode this doesn't seem to be an issue

Please see PR #1731 and check whether it fixes your issue. Sorry, in the refactor for Qwen3-TTS async_chunk I missed that ICL needs some extra reference info.

@Sy0307 Sy0307 force-pushed the feat/qwen3-tts-stream-audio-websocket-1479 branch 2 times, most recently from bfd0cc7 to d4bb651 Compare March 8, 2026 18:58
lishunyang12 (Collaborator) left a comment

Left a comment inline.

lishunyang12 (Collaborator) commented

RFC_1479_streaming_ws_tts_plan.md should probably live in a GitHub issue or discussion rather than the repo root

@Sy0307 Sy0307 force-pushed the feat/qwen3-tts-stream-audio-websocket-1479 branch 2 times, most recently from 0485e3a to 96419ed Compare March 9, 2026 12:47
Sy0307 (Contributor Author) commented Mar 9, 2026

> RFC_1479_streaming_ws_tts_plan.md should probably live in a GitHub issue or discussion rather than the repo root

Fixed.

@Sy0307 Sy0307 force-pushed the feat/qwen3-tts-stream-audio-websocket-1479 branch from 96419ed to ffc6476 Compare March 11, 2026 17:18
Signed-off-by: Sy03 <1370724210@qq.com>
Sy0307 (Contributor Author) commented Mar 11, 2026

I have fixed all the issues you mentioned. PTAL again, @linyueqian. Thanks.

linyueqian added the "ready" label (label to trigger buildkite CI) Mar 11, 2026
linyueqian and others added 2 commits March 11, 2026 19:45
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian enabled auto-merge (squash) March 12, 2026 03:48
@linyueqian linyueqian merged commit 5d28b6a into vllm-project:main Mar 12, 2026
6 of 7 checks passed
meghaagr13 pushed a commit to meghaagr13/vllm-omni that referenced this pull request Mar 12, 2026
…project#1719)

Signed-off-by: Sy03 <1370724210@qq.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: Megha Agarwal <agarwalmegha1308@gmail.com>
zhaotyer (Contributor) commented

@linyueqian @Sy0307 This PR causes some endpoints in api_server.py to be reverted.

linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Mar 13, 2026
hsliuustc0106 pushed a commit that referenced this pull request Mar 13, 2026
…1719 (#1879)

Signed-off-by: linyueqian <linyueqian@outlook.com>