[Feature] Streaming text input for Qwen3-TTS #1883
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25fc012eb2
    except WebSocketDisconnect:
        logger.info(f"[{_req_id}] Client disconnected")
Abort backend TTS request on WebSocket disconnect
When the WebSocket client disconnects, this branch only logs and exits, but it never aborts the in-flight request created by create_speech_streaming. In that scenario (e.g., mobile/network drop mid-audio), the engine can keep decoding audio with no consumer until it naturally stops, wasting GPU time and request capacity; add explicit request cancellation/abort in the disconnect path.
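A minimal sketch of the suggested fix, under stated assumptions: `FakeEngine`, its `abort` method, `stream_audio`, and the `None` end-of-stream sentinel are hypothetical stand-ins for illustration, not the PR's actual API. The local `WebSocketDisconnect` class stands in for `starlette.websockets.WebSocketDisconnect` so the example runs without a server.

```python
import asyncio


class WebSocketDisconnect(Exception):
    """Stand-in for starlette.websockets.WebSocketDisconnect."""


class FakeEngine:
    """Hypothetical engine exposing an async abort(request_id) method."""

    def __init__(self) -> None:
        self.aborted = []

    async def abort(self, request_id: str) -> None:
        self.aborted.append(request_id)


async def stream_audio(websocket_recv, engine: FakeEngine, req_id: str) -> str:
    """Relay loop that aborts the backend request if the client goes away."""
    try:
        while True:
            msg = await websocket_recv()
            if msg is None:  # hypothetical end-of-stream sentinel
                return "completed"
    except WebSocketDisconnect:
        # Cancel the in-flight request so the engine stops decoding
        # audio that nobody will consume.
        await engine.abort(req_id)
        return "aborted"


async def _demo():
    engine = FakeEngine()

    async def recv():
        # Simulate a client that drops the connection mid-stream.
        raise WebSocketDisconnect()

    status = await stream_audio(recv, engine, "req-1")
    return status, engine.aborted


status, aborted = asyncio.run(_demo())
```

The point of the sketch is only that the `except` branch does more than log: it releases the request explicitly instead of waiting for generation to stop on its own.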
Hi @thrashingstate. There is already a PR with the same goal, #1230. Unfortunately, I closed it because I didn't have enough bandwidth to test it and couldn't find much benefit for the current TTS model. Streaming input would be particularly useful for a model that accepts speech input, but the TTS model doesn't. If you can provide a thorough argument for its usefulness, the community will continue this effort.
@@ -0,0 +1,286 @@
# SPDX-License-Identifier: Apache-2.0
This PR unifies the test case style for qwen-tts. Is it sufficient to cover the corresponding streaming scenarios?
#1911
For test-case style and test levels, refer to: https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/ci/CI_5levels.md
def _maybe_resume_request(self, req_id: str) -> None:
    """Resume a paused request if it was waiting for an update."""
    req = self.requests.get(req_id)
    if req is not None and req.status == RequestStatus.WAITING_FOR_CHUNK:
Can we use the RequestStatus.WAITING_FOR_STREAMING_REQ state in vLLM?
Please resolve the conflicts.
Gaohan123 left a comment
Could you please provide an e2e use case and update the docs?
I've been busy with other tasks. I'll try to find time this week to address the feedback.
@thrashingstate are there any updates? Thanks!
@gcanlin PTAL
I will take over this task for streaming text input for Qwen3 TTS. cc @linyueqian @amy-why-3459 |
Purpose
Add true streaming text input support for Qwen3-TTS via an UPDATE_REQUEST mechanism. Resolves #1766.
Text token IDs arrive incrementally (e.g. from an LLM), get embedded on GPU, and are injected into a running TTS generation with zero voice discontinuity. The model pauses when the text queue runs low and resumes when more tokens arrive.
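The pause/resume behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `StreamingTextRequest`, `Status`, the `min_pending` threshold, and the `final` flag are hypothetical names, and the real scheduler tracks this state on its own request objects.

```python
from collections import deque
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    WAITING_FOR_CHUNK = auto()


class StreamingTextRequest:
    """Hypothetical per-request text queue with starvation-based pausing."""

    def __init__(self, initial_ids, min_pending=1):
        self.pending = deque(initial_ids)
        self.min_pending = min_pending  # assumed low-water mark
        self.input_done = False
        self.status = Status.RUNNING

    def append(self, token_ids, final=False):
        """Queue newly arrived text tokens and resume if no longer starved."""
        self.pending.extend(token_ids)
        self.input_done = self.input_done or final
        starved = len(self.pending) < self.min_pending and not self.input_done
        if self.status is Status.WAITING_FOR_CHUNK and not starved:
            self.status = Status.RUNNING

    def step(self):
        """Consume one text token, or pause if the queue has run low."""
        if self.status is Status.WAITING_FOR_CHUNK:
            return None
        if len(self.pending) < self.min_pending and not self.input_done:
            self.status = Status.WAITING_FOR_CHUNK
            return None
        if not self.pending:
            return None  # input finished and fully drained
        return self.pending.popleft()


req = StreamingTextRequest([1, 2])
consumed = [req.step(), req.step(), req.step()]  # third step hits starvation
paused = req.status is Status.WAITING_FOR_CHUNK
req.append([3], final=True)  # late chunk arrives and marks end of input
resumed = [req.step(), req.step()]
```

Pausing rather than finishing on an empty queue is what keeps the voice continuous: the request keeps its decoding state while it waits for the next chunk.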
Key changes:
- omni_ar_scheduler.py: Queue/flush/drain logic for pending additional_information updates on running requests, with early-arrival buffering for requests not yet registered and pause/resume on text starvation
- gpu_model_runner.py, gpu_ar_model_runner.py: Append-mode merge for streaming_text_token_ids
- async_omni.py, omni_stage.py, api_server.py, serving_speech.py: Route update requests through the async engine to the scheduler; new /v1/audio/speech/stream WebSocket endpoint
- qwen3_tts_talker.py: Consume streaming text embeddings during generation
- patch.py: EngineCoreRequestType.UPDATE enum addition
Test Plan
- tests/entrypoints/test_streaming_tts.py covering scheduler update routing, early buffer, flush, drain, pause/resume, external ID resolution, model runner merge semantics, output types, task types, async routing, and patch enum
- tests/e2e/online_serving/test_qwen3_tts_streaming.py covering: all-text-in-initial, chunked streaming, slow delivery with pause/resume, sequential requests, audio-not-error regression, and non-streaming fallback
Test Result
Unit tests and e2e tests passed locally.
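For readers unfamiliar with the early-buffer and append-mode merge semantics named in the test plan, here is a small sketch of how they might behave. The `UpdateRouter` class and `merge_update` function are hypothetical, and overwriting all non-append keys is an assumption; only the streaming_text_token_ids key and its append behaviour come from the PR description.

```python
from collections import defaultdict

# Key named in the PR description; it is extended rather than replaced.
APPEND_KEYS = {"streaming_text_token_ids"}


def merge_update(current: dict, update: dict) -> dict:
    """Return a new dict: APPEND_KEYS are extended, other keys overwritten."""
    merged = dict(current)
    for key, value in update.items():
        if key in APPEND_KEYS:
            merged[key] = list(merged.get(key, [])) + list(value)
        else:
            merged[key] = value  # assumed behaviour for all other keys
    return merged


class UpdateRouter:
    """Hypothetical holder for request state plus an early-arrival buffer."""

    def __init__(self):
        self.requests = {}
        self.early = defaultdict(list)

    def update(self, req_id: str, update: dict) -> None:
        if req_id not in self.requests:
            # The request is not registered yet: hold the update for later.
            self.early[req_id].append(update)
            return
        self.requests[req_id] = merge_update(self.requests[req_id], update)

    def register(self, req_id: str, info: dict) -> None:
        self.requests[req_id] = dict(info)
        # Flush buffered updates in arrival order.
        for buffered in self.early.pop(req_id, []):
            self.update(req_id, buffered)


router = UpdateRouter()
router.update("r1", {"streaming_text_token_ids": [3, 4]})  # arrives early
router.register("r1", {"streaming_text_token_ids": [1, 2], "voice": "a"})
after_register = dict(router.requests["r1"])
router.update("r1", {"streaming_text_token_ids": [5], "voice": "b"})
after_update = router.requests["r1"]
```

Flushing in arrival order matters here, since token IDs appended out of order would change the spoken text.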