[Feature] Streaming text input for Qwen3-TTS #1883

Open
thrashingstate wants to merge 2 commits into vllm-project:main from thrashingstate:feature/streaming-tts-input-v016

@thrashingstate

Note: Much of this code was generated using Claude Code — a thorough review would be much appreciated.

Purpose

Add true streaming text input support for Qwen3-TTS via an UPDATE_REQUEST mechanism. Resolves #1766.

Text token IDs arrive incrementally (e.g. from an LLM), get embedded on GPU, and are injected into a running TTS generation with zero voice discontinuity. The model pauses when the text queue runs low and resumes when more tokens arrive.
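
As a rough illustration of the pause/resume behaviour described above, here is a minimal, framework-free sketch. It is not the code in this PR: the class `StreamingTextRequest`, the `Status` enum, the `WAITING_FOR_TEXT` state name, and the `low_watermark` parameter are all hypothetical names chosen for the example.

```python
from __future__ import annotations

from collections import deque
from enum import Enum, auto


class Status(Enum):
    RUNNING = auto()
    WAITING_FOR_TEXT = auto()  # hypothetical name for the paused state
    FINISHED = auto()


class StreamingTextRequest:
    """Toy model of one TTS request fed by incremental text token IDs."""

    def __init__(self, low_watermark: int = 1) -> None:
        self.queue: deque[int] = deque()
        self.low_watermark = low_watermark  # assumed "runs low" threshold
        self.input_closed = False
        self.status = Status.WAITING_FOR_TEXT
        self.consumed: list[int] = []

    def _can_step(self) -> bool:
        # While more text may still arrive, keep `low_watermark` tokens in
        # reserve; once the input is closed, drain whatever is left.
        if self.input_closed:
            return len(self.queue) > 0
        return len(self.queue) > self.low_watermark

    def _update_status(self) -> None:
        if self.input_closed and not self.queue:
            self.status = Status.FINISHED
        elif self._can_step():
            self.status = Status.RUNNING
        else:
            self.status = Status.WAITING_FOR_TEXT

    def push(self, token_ids: list[int], final: bool = False) -> None:
        """Deliver more text tokens; may resume a paused request."""
        if self.status is Status.FINISHED:
            raise RuntimeError("request already finished")
        self.queue.extend(token_ids)
        self.input_closed = self.input_closed or final
        self._update_status()

    def step(self) -> bool:
        """Consume one text token; return False if paused or finished."""
        if self.status is not Status.RUNNING:
            return False
        self.consumed.append(self.queue.popleft())
        self._update_status()
        return True
```

In this sketch a request pauses as soon as only `low_watermark` tokens remain and the input is still open, and it resumes on the next `push`. The real scheduler may use a different threshold or signal.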

Key changes:

  • Scheduler (omni_ar_scheduler.py): Queue/flush/drain logic for pending additional_information updates on running requests, with early-arrival buffering for requests not yet registered and pause/resume on text starvation
  • Model runner (gpu_model_runner.py, gpu_ar_model_runner.py): Append-mode merge for streaming_text_token_ids
  • Entrypoints (async_omni.py, omni_stage.py, api_server.py, serving_speech.py): Route update requests through the async engine to the scheduler; new /v1/audio/speech/stream WebSocket endpoint
  • Model (qwen3_tts_talker.py): Consume streaming text embeddings during generation
  • Patch (patch.py): EngineCoreRequestType.UPDATE enum addition
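
To make the scheduler and model-runner bullets more concrete, the following is a minimal sketch of early-arrival buffering combined with an append-mode merge. It is an assumption-laden illustration, not the PR's implementation: `UpdateRouter`, `merge_update`, and the `voice` key are hypothetical; only the key name `streaming_text_token_ids` comes from the description above.

```python
from __future__ import annotations

from collections import defaultdict

# Keys whose values are appended rather than overwritten on update.
APPEND_KEYS = {"streaming_text_token_ids"}


def merge_update(current: dict, update: dict) -> dict:
    """Return a new dict: append for APPEND_KEYS, overwrite otherwise."""
    merged = dict(current)
    for key, value in update.items():
        if key in APPEND_KEYS:
            merged[key] = list(merged.get(key, [])) + list(value)
        else:
            merged[key] = value
    return merged


class UpdateRouter:
    """Routes updates to running requests, buffering early arrivals."""

    def __init__(self) -> None:
        self.running: dict[str, dict] = {}
        self.early: defaultdict[str, list[dict]] = defaultdict(list)

    def add_update(self, req_id: str, update: dict) -> None:
        if req_id in self.running:
            self.running[req_id] = merge_update(self.running[req_id], update)
        else:
            # The request is not registered yet: keep the update for later.
            self.early[req_id].append(update)

    def register(self, req_id: str, info: dict) -> None:
        """Register a request and flush buffered updates in arrival order."""
        self.running[req_id] = dict(info)
        for update in self.early.pop(req_id, []):
            self.running[req_id] = merge_update(self.running[req_id], update)

    def finish(self, req_id: str) -> dict:
        """Drop any leftover buffered updates and return the final state."""
        self.early.pop(req_id, None)
        return self.running.pop(req_id)
```

For example, an update sent before `register("r1", ...)` is held in `early` and merged once the request exists, so token order is preserved regardless of which message arrives first.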

Test Plan

  • 16 unit tests in tests/entrypoints/test_streaming_tts.py covering scheduler update routing, early buffer, flush, drain, pause/resume, external ID resolution, model runner merge semantics, output types, task types, async routing, and patch enum
  • 6 e2e tests in tests/e2e/online_serving/test_qwen3_tts_streaming.py covering: all-text-in-initial, chunked streaming, slow delivery with pause/resume, sequential requests, audio-not-error regression, and non-streaming fallback
# Unit tests
pytest tests/entrypoints/test_streaming_tts.py -v

# E2E tests (requires L4 GPU)
pytest tests/e2e/online_serving/test_qwen3_tts_streaming.py -v

Test Result

Unit tests and e2e tests passed locally.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25fc012eb2

Comment on lines +340 to +341
except WebSocketDisconnect:
    logger.info(f"[{_req_id}] Client disconnected")

[P1] Abort backend TTS request on WebSocket disconnect

When the WebSocket client disconnects, this branch only logs and exits, but it never aborts the in-flight request created by create_speech_streaming. In that scenario (e.g., mobile/network drop mid-audio), the engine can keep decoding audio with no consumer until it naturally stops, wasting GPU time and request capacity; add explicit request cancellation/abort in the disconnect path.
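
One possible shape for that fix is sketched below using plain asyncio. Everything here is a hypothetical stand-in rather than the project's API: `ClientDisconnected` takes the place of the framework's disconnect exception, and `FakeEngine` with its `generate` and `abort` methods is a stub invented for the example.

```python
import asyncio


class ClientDisconnected(Exception):
    """Stand-in for the web framework's disconnect exception."""


class FakeEngine:
    """Hypothetical engine stub that records aborted request IDs."""

    def __init__(self, n_chunks: int = 1000) -> None:
        self.n_chunks = n_chunks
        self.aborted = []

    async def generate(self, req_id: str):
        for i in range(self.n_chunks):
            await asyncio.sleep(0)  # yield control, like real decoding would
            yield f"chunk-{i}".encode()

    async def abort(self, req_id: str) -> None:
        self.aborted.append(req_id)


async def stream_audio(engine, req_id: str, send) -> None:
    """Forward audio chunks; abort the request unless it ran to completion."""
    finished = False
    try:
        async for chunk in engine.generate(req_id):
            await send(chunk)
        finished = True
    except ClientDisconnected:
        pass  # the existing "Client disconnected" log line would go here
    finally:
        # Runs on disconnect, cancellation, or any other error, so the
        # engine does not keep decoding audio that nobody will receive.
        if not finished:
            await engine.abort(req_id)
```

Placing the abort in `finally` rather than only in the `except` branch is a design choice of this sketch: it also covers task cancellation and unexpected errors, not just a clean disconnect.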

@lishunyang12
Collaborator

lishunyang12 commented Mar 13, 2026

Hi @thrashingstate. There is already a PR making the same effort: #1230. Unfortunately, I closed it because I didn't have much bandwidth to test it and couldn't find much benefit in it for the current TTS model. It would be particularly useful for a model that can accept speech input, but a TTS model doesn't. If you can provide thorough reasoning that proves its usefulness, the community will continue this effort.

@@ -0,0 +1,286 @@
# SPDX-License-Identifier: Apache-2.0
Collaborator

PR #1911 unifies the test case style for qwen-tts. Is it sufficient to cover the corresponding streaming scenarios?
For test case style and test levels, refer to: https://github.com/vllm-project/vllm-omni/blob/main/docs/contributing/ci/CI_5levels.md

@linyueqian linyueqian self-requested a review March 16, 2026 19:36

def _maybe_resume_request(self, req_id: str) -> None:
    """Resume a paused request if it was waiting for an update."""
    req = self.requests.get(req_id)
    if req is not None and req.status == RequestStatus.WAITING_FOR_CHUNK:
Contributor

Can we use the RequestStatus.WAITING_FOR_STREAMING_REQ state in vLLM?

Collaborator

Agreed

@hsliuustc0106
Collaborator

Please resolve the conflicts.

Collaborator

@Gaohan123 Gaohan123 left a comment

Could you please provide an e2e use case and update the docs?

@thrashingstate
Author

I've been busy with other tasks. I'll try to find time this week to address the feedback.

@linyueqian
Collaborator

@thrashingstate are there any updates? Thanks!

@amy-why-3459
Contributor

@gcanlin PTAL

@Sy0307
Contributor

Sy0307 commented Apr 21, 2026

I will take over this task for streaming text input for Qwen3 TTS. cc @linyueqian @amy-why-3459

Development

Successfully merging this pull request may close these issues.

[Feature]: Streaming Input for Qwen3-TTS

9 participants