[Feat][Qwen3-TTS] Support streaming audio output for websocket #1719
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14e90c3259
| status_code=result.error.code if result.error else 400, | ||
| ) | ||
| return result | ||
| return await handler.create_speech(request, raw_request) |
Restore non-200 status mapping for speech errors
This handler now returns handler.create_speech(...) directly, but create_speech still returns ErrorResponse objects for validation/generation failures; without the removed isinstance(result, ErrorResponse) conversion, FastAPI will serialize those as HTTP 200 responses. That makes invalid requests (e.g., bad TTS params) look successful to clients and breaks error handling/retry logic that depends on 4xx/5xx status codes.
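A minimal sketch of the status mapping this comment asks to restore. It mirrors the removed `result.error.code if result.error else 400` expression from the diff; the `ErrorInfo` and `ErrorResponse` dataclasses and the `status_for` helper are hypothetical stand-ins for illustration, not the project's actual types.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ErrorInfo:
    """Hypothetical stand-in for the error payload carried by an ErrorResponse."""
    message: str
    code: int


@dataclass
class ErrorResponse:
    """Hypothetical stand-in for the ErrorResponse returned by create_speech."""
    error: Optional[ErrorInfo] = None


def status_for(result: Any) -> int:
    """Pick the HTTP status for a create_speech result.

    Error results use their own code (falling back to 400, as in the removed
    lines); anything else is treated as a successful 200 response.
    """
    if isinstance(result, ErrorResponse):
        return result.error.code if result.error else 400
    return 200


print(status_for(ErrorResponse(ErrorInfo("invalid voice", 422))))
print(status_for(ErrorResponse()))
print(status_for(b"audio-bytes"))
```

In the real handler the returned status would be passed to the JSON response wrapper rather than printed.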
| await omni_init_app_state(engine_client, app.state, args) | ||
|
|
||
| # Conditionally register profiler endpoints based on config or env var | ||
| if _should_enable_profiler_endpoints(args): | ||
| logger.warning("Profiler endpoints are enabled. This should ONLY be used for local development!") | ||
| app.include_router(profiler_router) | ||
|
|
||
| vllm_config = await engine_client.get_vllm_config() |
Re-enable profiler route registration
The server bootstrap no longer conditionally mounts the profiler router, so /start_profile and /stop_profile are never exposed even when profiling is configured. This is a functional regression for profiling workflows: internal benchmarking code still calls these endpoints (vllm_omni/benchmarks/patch/patch.py uses .../start_profile and .../stop_profile), so profile runs will now fail with missing-route errors.
| # Validate ref_audio format | ||
| if not (request.ref_audio.startswith(("http://", "https://")) or request.ref_audio.startswith("data:")): | ||
| return "ref_audio must be a URL (http/https) or base64 data URL (data:...)" |
Enforce ref_text for Base cloning requests
Base-task validation now only checks that ref_audio exists and is URL/data-URI, but it no longer rejects missing/empty ref_text when x_vector_only_mode is not enabled. That lets invalid voice-cloning requests pass validation and fail later in model execution instead of returning a clean client error, which is a regression in API robustness for Base mode.
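One way the missing check could look, as a sketch only. The `ref_audio` test and its error message come from the diff above; the `BaseSpeechRequest` dataclass, the `validate_base_request` name, and the other two error messages are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseSpeechRequest:
    """Hypothetical stand-in holding only the fields this check needs."""
    ref_audio: Optional[str] = None
    ref_text: Optional[str] = None
    x_vector_only_mode: bool = False


def validate_base_request(request: BaseSpeechRequest) -> Optional[str]:
    """Return an error message for an invalid Base request, or None if it is valid."""
    if not request.ref_audio:
        return "Base task requires ref_audio"  # assumed wording
    # Same format check as in the diff: http(s) URL or data URL.
    if not request.ref_audio.startswith(("http://", "https://", "data:")):
        return "ref_audio must be a URL (http/https) or base64 data URL (data:...)"
    # The check the review says went missing: ref_text is mandatory unless
    # the request opts into x-vector-only cloning.
    if not request.x_vector_only_mode and not (request.ref_text or "").strip():
        return "Base task requires ref_text unless x_vector_only_mode is enabled"  # assumed wording
    return None


print(validate_base_request(BaseSpeechRequest(ref_audio="https://example.com/ref.wav")))
```

Returning the message early lets the endpoint answer with a client error before the request reaches model execution.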
Thanks for this PR! I ran some benchmarks comparing connection setup overhead across transports (HTTP SSE, WebSocket, WebRTC).
Connection setup latency (localhost, 10 runs):
Projected connection overhead by network RTT:
WebSocket is the clear winner for TTS streaming: fast connection setup, no STUN/TURN dependency, and the incremental text input support is a nice bonus. WebRTC's UDP advantage doesn't really help for one-directional audio push. Looking forward to a follow-up PR to update the Gradio demo to use the WebSocket endpoint.
linyueqian left a comment
Took a look at the code, left some inline comments on a few things I noticed.
| await websocket.send_json(start_payload) | ||
|
|
||
| total_bytes = 0 | ||
| generation_failed = False |
If websocket.send_bytes(chunk) throws WebSocketDisconnect, this except Exception block catches it and tries to send an error back to the already disconnected client. Then the finally block also tries to send_json the audio.done frame on the closed socket, which throws again.
Two things worth fixing:
- Catch WebSocketDisconnect separately and re-raise it so the outer handler in handle_session deals with it cleanly
- Wrap the finally block send in try/except so it doesn't mask the original exception
except WebSocketDisconnect:
    raise
except Exception as e:
    generation_failed = True
    ...
finally:
    try:
        await websocket.send_json({...})
    except Exception:
        pass
| "sentence_index": sentence_index, | ||
| "sentence_text": sentence_text, | ||
| "format": response_format, | ||
| } |
When the client disconnects mid-generation, the async generator from _iter_pcm_audio_bytes is not explicitly closed. Python should call aclose() during GC but that's non-deterministic, so the underlying engine generator may keep running and burning GPU until it finishes naturally.
Worth wrapping this in contextlib.aclosing and calling abort(request_id) on disconnect:
from contextlib import aclosing
async with aclosing(self._speech_service._iter_pcm_audio_bytes(request)) as stream:
    async for chunk in stream:
        ...
I tried testing this PR using ICL mode + streaming, but I'm noticing a lot of noise at the beginning of the generated output. Is ICL supported here? The reference audio is clean and ~17s long with a transcript. Generated output on H100:
Note: when running with x-vector-only mode this doesn't seem to be an issue.
Please see PR #1731 and check whether it fixes your issue. Sorry, I missed that ICL needs some extra reference info in the refactor for Qwen3-TTS async_chunk.
Force-pushed: bfd0cc7 to d4bb651
lishunyang12 left a comment
left a comment inline
Force-pushed: 0485e3a to 96419ed
Fixed.
Signed-off-by: Sy03 <1370724210@qq.com>
Force-pushed: 96419ed to ffc6476
Signed-off-by: Sy03 <1370724210@qq.com>
Have fixed all the issues you mentioned. PTAL again @linyueqian. Thanks.
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
…project#1719) Signed-off-by: Sy03 <1370724210@qq.com> Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: linyueqian <linyueqian@outlook.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: linyueqian <linyueqian@outlook.com> Signed-off-by: Megha Agarwal <agarwalmegha1308@gmail.com>
…project#1719) Signed-off-by: Sy03 <1370724210@qq.com> Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: linyueqian <linyueqian@outlook.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: linyueqian <linyueqian@outlook.com>
@linyueqian @Sy0307 This PR causes some endpoints in api_server.py to be reverted.
…ect#1719 Signed-off-by: linyueqian <linyueqian@outlook.com>
…llm-project#1719 (vllm-project#1879) Signed-off-by: linyueqian <linyueqian@outlook.com>
…project#1719) Signed-off-by: Sy03 <1370724210@qq.com> Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Signed-off-by: linyueqian <linyueqian@outlook.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: linyueqian <linyueqian@outlook.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>
…llm-project#1719 (vllm-project#1879) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>
Summary
This PR implements streaming audio output for the Qwen3-TTS WebSocket endpoint and addresses #1479.
Today, /v1/audio/speech/stream supports incremental text input over WebSocket and sentence-level audio generation, but each sentence is returned as a single binary frame only after full synthesis completes.
This PR adds sentence-level streaming audio output:
- audio.start
- audio.done
This allows clients to start playback before full sentence synthesis finishes.
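As a rough client-side illustration of this framing, the sketch below groups binary PCM frames between audio.start and audio.done and checks them against total_bytes. The field names come from the protocol example later in this description; the `collect_sentences` helper, the in-memory frame list, and the 16-bit mono assumption used for the duration estimate are illustrative only.

```python
import json
from typing import Iterable, Union

Frame = Union[str, bytes]  # text frames carry JSON, binary frames carry PCM


def collect_sentences(frames: Iterable[Frame]) -> list:
    """Group binary PCM frames by the audio.start / audio.done pair around them."""
    sentences = []
    current = None
    for frame in frames:
        if isinstance(frame, bytes):
            if current is not None:
                current["audio"] += frame  # a player could start playback here
            continue
        msg = json.loads(frame)
        kind = msg.get("type")
        if kind == "audio.start":
            current = {
                "index": msg["sentence_index"],
                "sample_rate": msg.get("sample_rate"),
                "audio": bytearray(),
            }
        elif kind == "audio.done" and current is not None:
            current["error"] = bool(msg.get("error", False))
            # total_bytes lets the client detect dropped or truncated chunks.
            current["complete"] = msg.get("total_bytes") == len(current["audio"])
            sentences.append(current)
            current = None
        elif kind == "session.done":
            break
    return sentences


# Simulated server output: two PCM chunks for one sentence.
frames = [
    '{"type":"audio.start","sentence_index":0,"sentence_text":"Hello world.","format":"pcm","sample_rate":24000}',
    b"\x00" * 48000,
    b"\x00" * 48000,
    '{"type":"audio.done","sentence_index":0,"total_bytes":96000,"error":false}',
    '{"type":"session.done","total_sentences":1}',
]
result = collect_sentences(frames)
# Assumption: 16-bit mono PCM, i.e. 2 bytes per sample.
seconds = len(result[0]["audio"]) / (result[0]["sample_rate"] * 2)
print(result[0]["complete"], seconds)
```

A real client would read these frames from a WebSocket connection and hand each binary chunk to an audio sink as it arrives instead of buffering the whole sentence.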
Relationship to #1230
This PR is written as a standalone diff against main, but conceptually it builds on the WebSocket text-input foundation introduced in #1230.
In other words:
- #1230 provides the WebSocket session / incremental text input / sentence splitting baseline
Because of that, part of the diff includes the WebSocket text-input scaffolding as the base needed for #1479.
What changed
WebSocket protocol
- Add stream_audio: bool = false to StreamingSpeechSessionConfig
- stream_audio=true => response_format="pcm"
- stream_audio=true => speed=1.0
- Add initial_codec_chunk_frames to the WebSocket session config to match the REST API
Shared speech generation path
- OmniOpenAIServingSpeech: REST and WebSocket reuse the same generation preparation path
WebSocket streaming audio output
- With stream_audio=true, multiple PCM binary frames are sent per sentence as chunks arrive
- audio.start / audio.done framing
- Add total_bytes to audio.done
- Add sample_rate=24000 to audio.start for PCM streaming
- Add error to audio.done so clients can distinguish failed sentence completion
Client and docs
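On the client side, the stream_audio constraints listed under "WebSocket protocol" can be checked before sending session.config. This is a sketch under assumptions: it reads the two "=>" rules as requirements (the server might instead coerce the values), treats a missing response_format as a problem, and the `check_session_config` helper is hypothetical.

```python
def check_session_config(config: dict) -> list:
    """Return human-readable problems with a session.config payload.

    Only the stream_audio-related rules are checked here.
    """
    problems = []
    if config.get("stream_audio", False):  # stream_audio defaults to false
        if config.get("response_format") != "pcm":
            problems.append('stream_audio=true requires response_format="pcm"')
        if config.get("speed", 1.0) != 1.0:
            problems.append("stream_audio=true requires speed=1.0")
    return problems


ok = {"type": "session.config", "voice": "Vivian", "response_format": "pcm", "stream_audio": True}
bad = {"type": "session.config", "voice": "Vivian", "response_format": "wav", "speed": 1.5, "stream_audio": True}
print(check_session_config(ok))
print(check_session_config(bad))
```

Configs without stream_audio pass untouched, matching the non-streaming default.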
Protocol example
Client:
{"type":"session.config","voice":"Vivian","response_format":"pcm","stream_audio":true}
{"type":"input.text","text":"Hello world. "}
{"type":"input.done"}
Server:
{"type":"audio.start","sentence_index":0,"sentence_text":"Hello world.","format":"pcm","sample_rate":24000}
{"type":"audio.done","sentence_index":0,"total_bytes":96000,"error":false}
{"type":"session.done","total_sentences":1}
Backward compatibility
- stream_audio=false
Tests
Passed locally:
python -m pytest tests/entrypoints/openai_api/test_serving_speech_stream.py tests/entrypoints/openai_api/test_text_splitter.py -q
Also added: