[Feature][TTS] Streaming Text Input for Qwen3-TTS via WebSocket by lishunyang12 · Pull Request #1230 · vllm-project/vllm-omni

lishunyang12 · 2026-02-05T19:31:57Z

Summary

Add a WebSocket endpoint /v1/audio/speech/stream that accepts text input incrementally (e.g., from a real-time STT pipeline or LLM token stream), buffers and splits at sentence boundaries, and generates audio per sentence using the existing TTS pipeline.

This enables real-time text-to-speech workflows where text is produced progressively (speech-to-text, LLM token streaming, live captions) and audio needs to be generated as soon as complete sentences are available, rather than waiting for the entire input.

Scope: Streaming text input only. Each sentence produces a complete audio response. Streaming audio output (chunked PCM) is tracked separately in PR #1189.

Motivation

The current /v1/audio/speech REST endpoint requires the full text upfront. In real-time pipelines (e.g., STT → LLM → TTS), text arrives incrementally. Without streaming input support, clients must either:

Wait for the entire text before calling TTS (high latency), or
Manually implement sentence buffering and make multiple REST calls (complex, no session state).

This PR solves both issues with a single WebSocket session that handles buffering, sentence detection, and per-sentence generation automatically.

WebSocket Protocol

Transport: WebSocket (industry standard — used by OpenAI Realtime API, ElevenLabs, Azure TTS)

Client → Server

// 1. Session config (sent once, first message)
{"type": "session.config", "voice": "Vivian", "task_type": "CustomVoice", "split_granularity": "sentence"}

// 2. Text chunks (sent incrementally, any number of times)
{"type": "input.text", "text": "Hello, how are you? "}

// 3. End of input (flushes remaining buffer)
{"type": "input.done"}

Server → Client

// Per-sentence: metadata → binary audio → completion
{"type": "audio.start", "sentence_index": 0, "sentence_text": "Hello, how are you?", "format": "wav"}
<binary WebSocket frame: audio bytes>
{"type": "audio.done", "sentence_index": 0}

// Session complete
{"type": "session.done", "total_sentences": 3}

// Non-fatal error (session continues)
{"type": "error", "message": "..."}
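
The client side of this message flow can be sketched roughly as follows. This is a minimal sketch and not the example client shipped in the PR: the helper names (`session_config`, `text_chunk`, `input_done`, `stream_text`), the `ws://localhost:8000` address, and the use of the third-party `websockets` package are assumptions made for illustration.

```python
import json


def session_config(voice, task_type="CustomVoice", split_granularity="sentence"):
    """First message of a session (field names taken from the protocol above)."""
    return json.dumps({
        "type": "session.config",
        "voice": voice,
        "task_type": task_type,
        "split_granularity": split_granularity,
    })


def text_chunk(text):
    """One incremental piece of input text."""
    return json.dumps({"type": "input.text", "text": text})


def input_done():
    """Signals end of input so the server flushes its buffer."""
    return json.dumps({"type": "input.done"})


async def stream_text(chunks, url="ws://localhost:8000/v1/audio/speech/stream"):
    """Send chunks, then collect {sentence_index: audio bytes} until session.done."""
    import websockets  # third-party; imported lazily so the helpers work without it

    audio, current = {}, None
    async with websockets.connect(url) as ws:
        await ws.send(session_config("Vivian"))
        for chunk in chunks:
            await ws.send(text_chunk(chunk))
        await ws.send(input_done())
        async for frame in ws:
            if isinstance(frame, bytes):  # binary frame: audio for `current`
                audio[current] = frame
                continue
            event = json.loads(frame)
            if event["type"] == "audio.start":
                current = event["sentence_index"]
            elif event["type"] == "error":
                print("server error:", event["message"])
            elif event["type"] == "session.done":
                break
    return audio
```

With a server running, something like `asyncio.run(stream_text(["Hello, how are you? ", "Fine."]))` would return the audio per sentence index.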

Changes

New Files

File	Description
`vllm_omni/entrypoints/openai/text_splitter.py`	`SentenceSplitter` — incremental sentence boundary detector. Regex-based splitting at English `.!?` + whitespace and CJK fullwidth `。！？`. Configurable `boundary_re` and `min_sentence_length`. Exports `SPLIT_SENTENCE` and `SPLIT_CLAUSE` patterns.
`vllm_omni/entrypoints/openai/serving_speech_stream.py`	`OmniStreamingSpeechHandler` — WebSocket session handler. Manages config validation, idle/config timeouts (30s/10s), per-sentence audio generation, configurable split granularity, and error resilience (one sentence failure doesn't kill the session).
`examples/online_serving/qwen3_tts/streaming_speech_client.py`	Python WebSocket client example. Supports `--simulate-stt` mode (word-by-word with configurable delay), all 3 task types (CustomVoice, VoiceDesign, Base), saves per-sentence audio files.
`tests/entrypoints/openai_api/test_text_splitter.py`	Unit tests for `SentenceSplitter`: English/Chinese/mixed splitting, incremental accumulation, flush behavior, edge cases. (29 tests)
`tests/entrypoints/openai_api/test_serving_speech_stream.py`	WebSocket integration tests: session lifecycle, multi-sentence, incremental text, flush-on-done, empty input, invalid config, invalid JSON, unknown message types, generation failure recovery. (14 tests)
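
To make the splitter's role concrete, here is a minimal sketch of an incremental sentence boundary detector. It is not the PR's `SentenceSplitter`: the class name `SketchSplitter` is hypothetical, and the regex is an assumption reconstructed from the description above (English `.!?` followed by whitespace, or CJK fullwidth `。！？`).

```python
import re

# English .!? followed by whitespace, or CJK fullwidth sentence-final punctuation.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


class SketchSplitter:
    """Hypothetical incremental splitter: buffer text, emit complete sentences."""

    def __init__(self, boundary_re=SENTENCE_RE, min_sentence_length=2):
        self._re = boundary_re
        self._min = min_sentence_length
        self._buffer = ""

    def add_text(self, text):
        """Append a chunk and return every sentence completed so far."""
        self._buffer += text
        sentences, start = [], 0
        for match in self._re.finditer(self._buffer):
            candidate = self._buffer[start:match.start()].strip()
            if len(candidate) < self._min:
                continue  # too short (e.g. a lone "."): keep it with what follows
            sentences.append(candidate)
            start = match.end()
        self._buffer = self._buffer[start:]  # keep the unfinished remainder
        return sentences

    def flush(self):
        """Return whatever is still buffered (used on input.done)."""
        rest, self._buffer = self._buffer.strip(), ""
        return [rest] if rest else []
```

Feeding `"Hello wor"` and then `"ld. How are"` yields nothing for the first chunk and `["Hello world."]` for the second; the trailing `"How are"` stays buffered until more text or a flush arrives.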

Modified Files

File	Description
`vllm_omni/entrypoints/openai/serving_speech.py`	Refactor: Extracted `_generate_audio_bytes(request) → (bytes, media_type)` from `create_speech()`. The REST endpoint delegates to it; the WebSocket handler reuses it per sentence. No behavior change for existing callers.
`vllm_omni/entrypoints/openai/protocol/audio.py`	Added `StreamingSpeechSessionConfig` Pydantic model for WebSocket session configuration (mirrors `OpenAICreateSpeechRequest` fields minus `input`). Includes `split_granularity` field.
`vllm_omni/entrypoints/openai/api_server.py`	Added `@router.websocket("/v1/audio/speech/stream")` route and `OmniStreamingSpeechHandler` initialization in `omni_init_app_state()`.
`examples/online_serving/qwen3_tts/README.md`	Added streaming text input documentation section with protocol spec, parameters, and usage examples.

Design Decisions

Decision	Rationale
WebSocket (not SSE/HTTP chunked)	Bidirectional: client sends text incrementally, server sends audio back. Industry standard for real-time TTS (OpenAI, ElevenLabs, Azure).
Configurable split granularity	`"sentence"` (default): splits on `.!?。！？` — natural boundaries, better prosody. `"clause"`: also splits on CJK `，；` — lower latency, more frequent chunks. Configurable via `session.config`.
`min_sentence_length=2`	Prevents splitting on lone punctuation (`.`) while supporting short CJK sentences like `你好！` (3 chars).
`_generate_audio_bytes()` extraction	Clean separation of concerns. REST endpoint wraps in `Response`; WebSocket sends raw bytes. No code duplication.
Per-sentence error resilience	If generation fails for one sentence, an error is sent but the session continues for remaining sentences.
Idle + config timeouts	Prevents resource leaks from abandoned connections (30s idle, 10s for initial config).
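
The two timeouts in the last row could be enforced along these lines. This is only a sketch under assumptions: `receive` is a hypothetical stand-in for the handler's "wait for next client message" coroutine, and `run_session` is not the PR's handler; only the 10s/30s values come from the description above.

```python
import asyncio

CONFIG_TIMEOUT_S = 10.0  # time allowed for the first session.config message
IDLE_TIMEOUT_S = 30.0    # time allowed between subsequent messages


async def receive_with_timeout(receive, timeout):
    """Return the next message, or None if the client stays silent too long."""
    try:
        return await asyncio.wait_for(receive(), timeout=timeout)
    except asyncio.TimeoutError:
        return None


async def run_session(receive, config_timeout=CONFIG_TIMEOUT_S,
                      idle_timeout=IDLE_TIMEOUT_S):
    """Hypothetical session loop; returns the reason the session ended."""
    if await receive_with_timeout(receive, config_timeout) is None:
        return "config_timeout"
    while True:
        msg = await receive_with_timeout(receive, idle_timeout)
        if msg is None:
            return "idle_timeout"
        if msg.get("type") == "input.done":
            return "done"
```

Using a shorter timeout for the initial config than for idle gaps means a connection that never identifies itself is dropped quickly, while an active session tolerates pauses in the text stream.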

Test Results

Tested against a live server running Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice:

Unit tests: 29/29 text_splitter + 14/14 WebSocket integration tests pass
E2E: Sentence boundary detection, incremental buffering, flush-on-done, and audio generation all work correctly. First sentence produces valid 96KB WAV audio.
Known issue: the Stage-1 engine crashes after the first request due to a Prometheus metrics bug (#1477, "[Bug] Stage-1 engine crashes after first request: Prometheus counter receives negative token count"; fix in #1478, "[Bugfix]: initialize num_cached_tokens in generation scheduler to prevent metrics crash"). This is not caused by this PR and affects the REST endpoint equally.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3a53c3b43c

chatgpt-codex-connector · 2026-02-05T19:34:56Z

+            audio_bytes, _ = await self._speech_service._generate_audio_bytes(
+                request
+            )


Validate requested model before websocket generation

The websocket path never runs _check_model, so session.config.model is effectively ignored and audio is generated with whatever model is currently served. In contrast, the REST /v1/audio/speech path validates model selection first, so clients relying on model-specific behavior can get silent misrouting instead of an explicit error when they pass an unknown or mismatched model. Please add the same model check in the streaming flow before calling generation.


Fixed in 170e2d3 — added _check_model validation right after receiving session config. If the model does not match, an error is sent and the session is closed.

chatgpt-codex-connector · 2026-02-05T19:34:57Z

+_SENTENCE_BOUNDARY_RE = re.compile(
+    r"(?<=[.!?])\s+"  # English punctuation followed by whitespace


Split on sentence-ending punctuation at chunk end

The English boundary regex requires trailing whitespace ((?<=[.!?])\s+), so a chunk like "Hello world." is not emitted until another message adds whitespace or input.done is sent. That adds avoidable latency in real-time streaming and contradicts the stated behavior of splitting at punctuation when a sentence is complete. Consider treating end-of-buffer punctuation as a boundary as well.


Fixed in 170e2d3: updated the regex to (?<=[.!?])(?:\s+|$) so that it also matches sentence-ending punctuation at the end of the string, not only when it is followed by whitespace.
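
The difference between the two patterns discussed in this thread can be checked directly. The sketch below uses illustrative names (`WHITESPACE_ONLY`, `WITH_END_OF_STRING`, `has_boundary`) that are not from the PR. Note that the timeline of this PR also lists a later commit titled "fix: don't treat end-of-string as sentence boundary"; the last case shows one situation where end-of-string matching may split too early, although the PR does not state that this was the reason.

```python
import re

WHITESPACE_ONLY = re.compile(r"(?<=[.!?])\s+")
WITH_END_OF_STRING = re.compile(r"(?<=[.!?])(?:\s+|$)")


def has_boundary(pattern, buffered):
    """True if the pattern finds a sentence boundary in the buffered text."""
    return pattern.search(buffered) is not None


chunk = "Hello world."
print(has_boundary(WHITESPACE_ONLY, chunk))     # False: no trailing whitespace yet
print(has_boundary(WITH_END_OF_STRING, chunk))  # True: "$" matches right after "."

# Possible trade-off: a chunk that merely pauses after a dot also looks complete.
partial = "Pi is roughly 3."  # the next chunk might be "14"
print(has_boundary(WITH_END_OF_STRING, partial))  # True
```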

hsliuustc0106 · 2026-02-05T23:21:33Z

@linyueqian would this help ?

linyueqian · 2026-02-05T23:24:17Z

@hsliuustc0106: @linyueqian would this help?

This is on the roadmap and would be nice to have, especially for real-world, real-time scenarios. I will take a look at it later today.

Copilot

Pull request overview

This PR adds streaming text input support for Qwen3-TTS via a WebSocket endpoint /v1/audio/speech/stream. The implementation enables real-time text-to-speech workflows where text arrives incrementally (e.g., from STT, LLM streaming) and audio is generated on a per-sentence basis rather than waiting for complete input. This is distinct from PR #1189 which adds streaming audio output - this PR focuses solely on streaming text input.

Changes:

Implements WebSocket-based incremental text input with automatic sentence boundary detection for English and CJK languages
Refactors existing REST TTS endpoint to extract core generation logic into a reusable _generate_audio_bytes method
Adds comprehensive test coverage for sentence splitting logic and WebSocket session lifecycle

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`vllm_omni/entrypoints/openai/text_splitter.py`	New incremental sentence boundary detector supporting English (.!? + whitespace) and CJK fullwidth punctuation (。！？，；) with configurable min_sentence_length
`vllm_omni/entrypoints/openai/serving_speech_stream.py`	New WebSocket handler managing session lifecycle with config/idle timeouts (10s/30s), per-sentence audio generation, and error resilience
`vllm_omni/entrypoints/openai/serving_speech.py`	Refactored to extract `_generate_audio_bytes(request) → (bytes, media_type)` from `create_speech()` for reuse by WebSocket handler
`vllm_omni/entrypoints/openai/protocol/audio.py`	Added `StreamingSpeechSessionConfig` Pydantic model mirroring `OpenAICreateSpeechRequest` fields (minus input text)
`vllm_omni/entrypoints/openai/api_server.py`	Registered WebSocket route `/v1/audio/speech/stream` and initialized `OmniStreamingSpeechHandler` in app state
`examples/online_serving/qwen3_tts/streaming_speech_client.py`	Example Python client with STT simulation mode, per-sentence audio file saving, and support for all task types
`examples/online_serving/qwen3_tts/README.md`	Comprehensive documentation of WebSocket protocol, session parameters, and sentence detection rules
`tests/entrypoints/openai_api/test_text_splitter.py`	20+ unit tests covering English/CJK/mixed splitting, incremental accumulation, flush behavior, and edge cases
`tests/entrypoints/openai_api/test_serving_speech_stream.py`	Integration tests for session lifecycle, multi-sentence handling, config validation, error scenarios, and generation failure recovery


Copilot · 2026-02-05T23:29:43Z

+    max_new_tokens: int | None = None
+    ref_audio: str | None = None
+    ref_text: str | None = None
+    x_vector_only_mode: bool | None = None


StreamingSpeechSessionConfig lacks field validators for instructions length and max_new_tokens range that are present in the validation logic (_validate_tts_request). Consider adding @field_validator decorators to validate these constraints upfront when the session config is received, similar to how speed is validated with Field constraints. This would provide earlier feedback to clients instead of failing on the first sentence generation. For example: max length validator for instructions (500 chars), and range validator for max_new_tokens (1-4096).

Suggested change

-    x_vector_only_mode: bool | None = None
+    x_vector_only_mode: bool | None = None
+    @field_validator("instructions")
+    @classmethod
+    def validate_instructions(
+        cls,
+        v: str | None,
+    ) -> str | None:
+        if v is not None and len(v) > 500:
+            raise ValueError("instructions must be at most 500 characters long")
+        return v
+    @field_validator("max_new_tokens")
+    @classmethod
+    def validate_max_new_tokens(
+        cls,
+        v: int | None,
+    ) -> int | None:
+        if v is None:
+            return v
+        if not 1 <= v <= 4096:
+            raise ValueError("max_new_tokens must be between 1 and 4096")
+        return v

Added ge=1 validator on max_new_tokens. Kept the config model lean — pydantic field constraints handle the validation cleanly without needing separate @field_validator decorators.

Copilot · 2026-02-05T23:29:44Z

+        self._buffer += text
+        return self._extract_sentences()


The buffer can grow unbounded if text is sent continuously without sentence boundaries. Consider adding a maximum buffer size limit to prevent potential denial-of-service attacks or memory exhaustion. For example, if a client sends megabytes of text without punctuation, the buffer will continue to grow until the system runs out of memory. A reasonable limit (e.g., 10KB-100KB) would prevent this while still allowing for long sentences.

Fixed in 170e2d3 — added _MAX_BUFFER_SIZE = 100_000 (100K chars). add_text() now raises ValueError if the buffer exceeds this limit, preventing unbounded memory growth from clients sending text without sentence boundaries.

Copilot · 2026-02-05T23:29:44Z

+                if msg_type == "input.text":
+                    text = msg.get("text", "")
+                    sentences = splitter.add_text(text)


There is no validation on the length of individual text chunks or total accumulated text per session. A malicious client could send extremely large text chunks or accumulate unbounded text across multiple messages, potentially causing memory exhaustion or performance degradation. Consider adding limits such as: max text chunk size per message (e.g., 10KB), max total text per session (e.g., 100KB), or max number of sentences per session.

Addressed — the buffer size limit in the SentenceSplitter (100K chars) also caps total accumulated text per session. If a client sends too much text without sentence boundaries, ValueError is raised and the handler sends an error to the client.
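
A rough sketch of how a buffer limit can surface as a non-fatal protocol error is shown below. The names `BoundedBuffer` and `handle_input_text` are hypothetical and not taken from the PR; only the 100,000-character limit and the `ValueError` behaviour come from the replies above. In this sketch the limit applies to the text currently held in the buffer.

```python
MAX_BUFFER_SIZE = 100_000  # characters, as in the fix described above


class BoundedBuffer:
    """Hypothetical stand-in for the splitter's internal text buffer."""

    def __init__(self, max_size=MAX_BUFFER_SIZE):
        self._max = max_size
        self._text = ""

    def add_text(self, text):
        """Append text and return the buffered length; reject oversized input."""
        if len(self._text) + len(text) > self._max:
            raise ValueError(f"buffer would exceed {self._max} characters")
        self._text += text
        return len(self._text)


def handle_input_text(buffer, msg):
    """Return an error event for an oversized chunk, or None on success."""
    try:
        buffer.add_text(msg.get("text", ""))
    except ValueError as exc:
        return {"type": "error", "message": str(exc)}
    return None
```

Because the rejected chunk is never appended, the session can keep going after the error, which matches the "non-fatal error" semantics of the protocol.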

Copilot · 2026-02-05T23:29:44Z

+                if isinstance(message, bytes):
+                    # Binary frame: audio data
+                    filename = os.path.join(
+                        output_dir,
+                        f"sentence_{sentence_count:03d}.{response_format}",
+                    )
+                    with open(filename, "wb") as f:
+                        f.write(message)
+                    print(f"  Saved audio: {filename} ({len(message)} bytes)")
+                    sentence_count += 1


The sentence counter is incremented on receiving binary audio data, but the actual sentence index comes from the server in the audio.start message. This creates a potential mismatch if audio.start and binary frames arrive in different orders, or if generation fails for a sentence (where audio.done is still sent but no binary frame). Consider using msg['sentence_index'] from the audio.start message to name the file instead of a local counter.

Fixed in 170e2d3 — the client now tracks current_sentence_index from audio.start messages and uses it for file naming, instead of incrementing a counter on binary frames.
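
The index-tracking approach described in this reply might look like the following. It is a sketch, not the PR's client code: `name_audio_files` is a hypothetical helper that works on an already received list of frames instead of a live WebSocket.

```python
import json


def name_audio_files(frames, response_format="wav"):
    """Map each binary frame to a filename based on the preceding audio.start."""
    current_index = None
    files = {}
    for frame in frames:
        if isinstance(frame, bytes):
            if current_index is None:
                continue  # binary frame without audio.start: ignored in this sketch
            files[f"sentence_{current_index:03d}.{response_format}"] = frame
            continue
        event = json.loads(frame)
        if event.get("type") == "audio.start":
            current_index = event["sentence_index"]
        elif event.get("type") == "audio.done":
            current_index = None
    return files
```

If sentence 1 fails and produces no binary frame, the audio for sentence 2 is still saved as `sentence_002.wav`, whereas a local counter incremented per binary frame would have named it `sentence_001.wav`.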

Copilot · 2026-02-05T23:29:44Z

+# - CJK fullwidth: 。！？，；
+_SENTENCE_BOUNDARY_RE = re.compile(
+    r"(?<=[.!?])\s+"  # English punctuation followed by whitespace
+    r"|(?<=[。！？，；])"  # CJK fullwidth punctuation


Splitting on CJK commas (，) and semicolons (；) may result in very short fragments that don't constitute complete sentences. In Chinese, commas are often used within a single sentence to separate clauses, similar to English. Consider whether these should be treated as sentence boundaries or only split on the stronger punctuation marks (。！？). This could lead to choppy audio output with many small fragments. The tests show this behavior (e.g., "你好，" as a sentence), but it may not align with typical TTS use cases where complete sentences are preferred for natural prosody.

Suggested change

-# - CJK fullwidth: 。！？，；
-_SENTENCE_BOUNDARY_RE = re.compile(
-    r"(?<=[.!?])\s+"  # English punctuation followed by whitespace
-    r"|(?<=[。！？，；])"  # CJK fullwidth punctuation
+# - CJK fullwidth: 。！？
+_SENTENCE_BOUNDARY_RE = re.compile(
+    r"(?<=[.!?])\s+"  # English punctuation followed by whitespace
+    r"|(?<=[。！？])"  # CJK fullwidth sentence-final punctuation

Good point — removed CJK commas (，) and semicolons (；) from the default sentence boundary regex. They are clause-level, not sentence-level. Added a separate SPLIT_CLAUSE pattern and a split_granularity session config option so users can opt into finer-grained splitting if they prefer lower latency.
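
The effect of the two granularities on CJK text can be illustrated as follows. The names `SPLIT_SENTENCE` and `SPLIT_CLAUSE` appear in the PR, but the exact patterns below are assumptions reconstructed from this thread, and `GRANULARITY` and `split_complete` are hypothetical helpers that split fully received text (no incremental buffering).

```python
import re

# Assumed patterns: sentence-final punctuation only vs. clause punctuation as well.
SPLIT_SENTENCE = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")
SPLIT_CLAUSE = re.compile(r"(?<=[.!?])\s+|(?<=[。！？，；])")

GRANULARITY = {"sentence": SPLIT_SENTENCE, "clause": SPLIT_CLAUSE}


def split_complete(text, granularity="sentence"):
    """Split fully received text; empty pieces are dropped."""
    pattern = GRANULARITY[granularity]
    return [piece for piece in pattern.split(text) if piece]


text = "你好，世界。再见！"
print(split_complete(text, "sentence"))  # ['你好，世界。', '再见！']
print(split_complete(text, "clause"))    # ['你好，', '世界。', '再见！']
```

Clause-level splitting produces the first chunk after three characters instead of six, which is where the lower latency comes from, at the cost of shorter fragments.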

Copilot · 2026-02-05T23:29:45Z

+            try:
+                await self._send_error(websocket, f"Internal error: {e}")
+            except Exception:
+                pass


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

-                pass
+                logger.debug(
+                    "Failed to send error message to streaming speech client",
+                    exc_info=True,
+                )

Fixed in 170e2d3 — added logger.debug("Failed to send error message to streaming speech client", exc_info=True) to the bare except clause.

Copilot · 2026-02-05T23:29:45Z

+            sender_task.cancel()
+            try:
+                await sender_task
+            except asyncio.CancelledError:


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

-            except asyncio.CancelledError:
+            except asyncio.CancelledError:
+                # Task cancellation is expected during shutdown; safe to ignore.

Fixed in 170e2d3 — added explanatory comment: # Task cancellation is expected during shutdown.
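
For reference, the cancel-then-await pattern under discussion can also be written with `contextlib.suppress`, which makes the intent explicit without an empty `except` body. This is an alternative sketch, not what the PR does; `stop_task` is a hypothetical helper name.

```python
import asyncio
import contextlib


async def stop_task(task):
    """Cancel a background task and wait until it has actually finished."""
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        # Cancellation is expected here; other exceptions still propagate.
        await task
```

Awaiting the task after `cancel()` matters because cancellation only takes effect once the task gets a chance to run; without the `await`, cleanup inside the task may not have completed when the session is torn down.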

linyueqian · 2026-02-05T23:31:22Z

The streaming input approach makes sense for LLM→TTS pipelines. One suggestion: the buffering/split granularity should be configurable via session.config (e.g. split on sentence vs clause vs word count), not hardcoded to sentence-level regex. Different latency/quality tradeoffs depending on the use case.

Also, how does this relate to PR #986 which adds engine-level StreamingInput support? Could the two eventually be combined so that buffered chunks are fed incrementally into the model rather than as separate full generation calls?

congw729 · 2026-02-06T03:07:30Z

Hi, if you add a Markdown document under ./examples/*, please also run mkdocs serve to sync those changes to ./docs/ before merging this PR.

Please also add markers to your test functions. I think your files need two markers:

import pytest
pytestmark = [pytest.mark.core_model, pytest.mark.cpu]

You can also check how markers are used in #577.
For more markers, please see test markers.

Gaohan123

Streaming input is already supported in the vLLM main repo via PR vllm-project/vllm#28973. Did you refer to it, or did you develop a new one?

Adds /v1/audio/speech/stream WebSocket endpoint for incremental text input TTS. Text is buffered and split at sentence boundaries, with audio generated and returned per sentence.

New files:
- text_splitter.py: SentenceSplitter with regex boundary detection
- serving_speech_stream.py: OmniStreamingSpeechHandler session handler
- protocol/audio.py: StreamingSpeechSessionConfig Pydantic model
- streaming_speech_client.py: example Python WebSocket client
- test_text_splitter.py + test_serving_speech_stream.py: tests

Modified:
- api_server.py: register /v1/audio/speech/stream route + handler init
- serving_speech.py: extract _generate_audio_bytes() for reuse

Cherry-picked from vllm-project/vllm-omni PR vllm-project#1230 (lishunyang12).

Co-Authored-By: Claude <noreply@anthropic.com>

1. SentenceSplitter: add max_buffer_size (100K default) to prevent OOM
2. Regex fix: English boundary now matches .!? at end of string, not just when followed by whitespace (reduces latency)
3. Feature flag: /v1/audio/speech/stream gated by QWEN3_TTS_STREAMING_ENABLED env var (defaults "1")
4. Model validation: _receive_config now validates config.model against served models instead of silently ignoring
5. Field validators: instructions max_length=500, max_new_tokens 1-4096

Co-Authored-By: Claude <noreply@anthropic.com>

hsliuustc0106 · 2026-02-24T08:02:44Z

@vllm-omni-reviewer

lishunyang12 · 2026-02-24T15:27:27Z

@linyueqian: the buffering/split granularity should be configurable via session.config

Addressed in 170e2d3 — added split_granularity field to StreamingSpeechSessionConfig ("sentence" default, "clause" for finer CJK splitting). The SentenceSplitter also accepts a custom boundary_re for full control.

lishunyang12 · 2026-02-24T15:27:36Z

@congw729: please also run mkdocs serve to sync those editions to ./docs/ ... also please add markers to your test functions

Added pytest markers (openai, speech) to both test files in 170e2d3. Will sync docs via mkdocs before merge.

Signed-off-by: lishunyang <lishunyang12@163.com>

- Add model validation via _check_model before WebSocket generation
- Fix English sentence boundary regex to match .!? at end of string
- Remove CJK commas/semicolons from sentence boundary (clause-level)
- Add configurable split_granularity (sentence vs clause) in session config
- Add buffer size limit (100K chars) to prevent unbounded memory growth
- Add max_new_tokens field validator (ge=1)
- Fix sentence counter mismatch in client (use audio.start index)
- Add logging to silent except clauses
- Add pytest markers (openai, speech) to test files
- Add buffer overflow test

Signed-off-by: lishunyang <lishunyang12@163.com>

- Add missing request_id declaration in create_speech()
- Fix import order in test_serving_speech_stream.py (isort)
- Remove unused SPLIT_CLAUSE/SPLIT_SENTENCE imports
- Use `from re import Pattern` instead of deprecated `typing.Pattern`

Signed-off-by: lishunyang <lishunyang12@163.com>

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 · 2026-02-25T11:16:46Z

@Gaohan123: Streaming input is already supported in vLLM main repo of PR vllm-project/vllm#28973. Do you refer it or develop a new one?

Different layers: vllm#28973 is engine-level, session-based streaming (cumulative prompt concatenation across chunks, scheduler-integrated). This PR is an application-level WebSocket endpoint specifically for TTS sentence buffering: it splits incoming text at sentence boundaries and calls the existing TTS generation per sentence. No engine-level streaming input is involved here; each sentence is a standalone generation call.

That said, once vllm-omni picks up the engine-level StreamingInput from main (PR #986 was an earlier attempt, now closed), there could be a path to feed buffered chunks incrementally into the model instead of separate generation calls. But that's a separate optimization — the WebSocket protocol and sentence splitting logic here would still be needed as the transport/buffering layer on top.

lishunyang12 · 2026-02-25T11:17:00Z

@linyueqian: how does this relate to PR #986 which adds engine-level StreamingInput support? Could the two eventually be combined?

Yes, they could be layered. #986 (now closed) was porting vllm#28973's engine-level StreamingInput to vllm-omni. If that gets picked up again, the architecture would be:

This PR — WebSocket transport + sentence buffering (application layer)
Engine-level StreamingInput — feed buffered chunks incrementally into the model (engine layer)

Right now each sentence triggers a separate _generate_audio_bytes call. With engine-level streaming, the buffered chunks could be fed into a single session where the model builds on prior context incrementally. The WebSocket protocol and text_splitter from this PR would still serve as the ingestion layer — they'd just push chunks into a StreamingInput generator instead of making independent generation calls.

That's a natural follow-up once engine-level streaming lands in vllm-omni. Keeping them decoupled for now avoids blocking on that dependency.

lishunyang12 · 2026-02-25T12:14:39Z

E2E Test Results

Tested the streaming WebSocket endpoint (/v1/audio/speech/stream) against a live server running Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice.

What works:

WebSocket connection + session lifecycle (config → text → done → session.done)
Sentence boundary detection — incremental word-by-word input correctly detects . ! ? boundaries
Flush on input.done: text without terminating punctuation is correctly flushed as the final sentence
Audio generation — first sentence produces valid WAV audio (96KB)
Error recovery — per-sentence errors are reported gracefully, session continues
Bulk text splitting — multiple sentences in a single message correctly split and processed
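
The error-recovery behaviour in the list above (a failed sentence is reported, the session continues) can be sketched as a loop that wraps each generation call. This is an assumption-laden illustration, not the PR's handler: `speak_sentences` is a hypothetical function that collects events in a list instead of sending them over a WebSocket, and `generate` stands in for the per-sentence generation call.

```python
async def speak_sentences(sentences, generate):
    """Collect protocol events; a failed sentence yields an error, not an abort."""
    events = []
    for index, text in enumerate(sentences):
        events.append({"type": "audio.start", "sentence_index": index,
                       "sentence_text": text, "format": "wav"})
        try:
            events.append(await generate(text))  # raw audio bytes on success
        except Exception as exc:  # report and continue with the next sentence
            events.append({"type": "error",
                           "message": f"Generation failed for sentence {index}: {exc}"})
        events.append({"type": "audio.done", "sentence_index": index})
    events.append({"type": "session.done", "total_sentences": len(sentences)})
    return events
```

The resulting order for a failed sentence (audio.start, error, audio.done) matches what the E2E log for Test 1 shows for sentences 1 and 2.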

Known issue (not caused by this PR):
Stage-1 (code2wav) engine crashes after the first request due to a Prometheus metrics bug:

ValueError: Counters can only be incremented by non-negative amounts.

This kills the engine for all subsequent requests (EngineDeadError). Filed as a separate issue. This affects the REST endpoint equally — not specific to streaming.

Unit tests: All 29 text_splitter tests + 14 WebSocket integration tests pass after fixes pushed to this PR.

lishunyang12 · 2026-02-25T12:23:30Z

Full test logs

Server: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice with --omni --enforce-eager

Unit tests — text_splitter (29/29 pass):

$ python -m pytest tests/entrypoints/openai_api/test_text_splitter.py -v
============================= 29 passed in 0.17s ==============================

Unit tests — WebSocket integration (14/14 pass):

$ python -m pytest tests/entrypoints/openai_api/test_serving_speech_stream.py -v
============================= 14 passed in 1.55s ==============================

E2E streaming test (word-by-word input):

============================================================
TEST 1: Streaming input (word-by-word, 0.1s per token)
============================================================
[  0.06s] Sent: 'The '
[  0.16s] Sent: 'quick '
[  0.26s] Sent: 'brown '
[  0.36s] Sent: 'fox '
[  0.46s] Sent: 'jumps. '
[  0.56s] Sent: 'Over '
[  0.66s] Sent: 'the '
[  0.76s] Sent: 'lazy '
[  0.86s] Sent: 'dog! '
[  0.96s] Sent: 'What '
[  1.06s] Sent: 'happens '
[  1.16s] Sent: 'next'
[  1.26s] Sent: input.done
[  1.74s]   audio.start  #0  text='The quick brown fox jumps.'
[  7.37s]   Audio: 96044 bytes (sentence 0)
[  7.37s]   audio.done   #0
[  7.37s]   audio.start  #1  text='Over the lazy dog!'
[  7.42s]   ERROR: Generation failed for sentence 1: EngineDeadError (see #1477)
[  7.42s]   audio.done   #1
[  7.42s]   audio.start  #2  text='What happens next'
[  9.88s]   ERROR: Generation failed for sentence 2: EngineDeadError (see #1477)
[  9.88s]   audio.done   #2
[  9.88s]   session.done (3 sentences)

  Time to first audio: 7.37s
  Total time:          9.88s
  Sentences received:  1
  -> Saved streaming_sentence_0.wav (96044 bytes)

E2E bulk text test:

============================================================
TEST 2: Streaming input (bulk text, single message)
============================================================
[  0.00s] Sent all text (130 chars)
[  0.00s]   audio.start  #0  text='Welcome to the streaming TTS demo.'
[  1.61s]   audio.done   #0
[  1.61s]   audio.start  #1  text='This text is sent all at once.'
[  5.25s]   audio.done   #1
[  5.25s]   audio.start  #2  text='But the server splits it into sentences and streams audio back.'
[  8.55s]   audio.done   #2
[  8.55s]   session.done (3 sentences)

  Total time: 8.55s  |  Sentences: 3

E2E flush test (no punctuation):

============================================================
TEST 3: Streaming input (no punctuation, flush on done)
============================================================
[  0.00s] Sent text without punctuation
[  0.00s] Sent input.done — should trigger flush
[  0.00s]   audio.start  text='This text has no punctuation so it stays buffered'
[  6.07s]   audio.done
[  6.07s]   session.done (1 sentences)

  Total time: 6.07s

Note: Sentences 1+ in Test 1 fail due to a Stage-1 Prometheus metrics crash (#1477, fix in #1478) — not related to this PR. First sentence generates valid 96KB WAV audio successfully.

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 · 2026-02-25T12:33:55Z

@linyueqian: the buffering/split granularity should be configurable via session.config

Fully wired up now. split_granularity in session.config controls the splitting behavior:

"sentence" (default): splits on English .!? followed by whitespace and on CJK 。！？. Better prosody, higher latency to first audio.
"clause": also splits on CJK ，；. Lower latency, more frequent but shorter chunks.

{"type": "session.config", "voice": "Vivian", "split_granularity": "clause"}

The SentenceSplitter accepts a custom boundary_re, so adding more granularity options (e.g. word count) in the future is straightforward.

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 · 2026-03-11T02:27:55Z

Closed due to limited bandwidth and low priority

lishunyang12 requested a review from hsliuustc0106 as a code owner February 5, 2026 19:31

chatgpt-codex-connector Bot reviewed Feb 5, 2026


hsliuustc0106 requested a review from Copilot February 5, 2026 23:21

Copilot started reviewing on behalf of hsliuustc0106 February 5, 2026 23:21

Copilot AI reviewed Feb 5, 2026


Gaohan123 reviewed Feb 10, 2026


lishunyang12 mentioned this pull request Feb 12, 2026

[RFC]: Qwen3-TTS Production Ready - February Milestone #938

Open

lishunyang12 added 3 commits February 24, 2026 23:47

add streaming text input support for qwen3 tts

8478fdb

Signed-off-by: lishunyang <lishunyang12@163.com>

fix precommit

93b5ae9

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 force-pushed the tts branch from 170e2d3 to 8e96e62 Compare February 24, 2026 15:47

lishunyang12 added 2 commits February 24, 2026 23:53

fix: move pytestmark after imports to fix E402 lint error

ec8c2cc

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 added 2 commits February 25, 2026 20:25

fix: don't treat end-of-string as sentence boundary

d7c2333

Signed-off-by: lishunyang <lishunyang12@163.com>

fix: add WebSocket type annotation to test endpoint

46ca73d

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 force-pushed the tts branch from b705638 to 46ca73d Compare February 25, 2026 12:25

Merge branch 'main' into tts

9f0c43c

fix: wire split_granularity config to SentenceSplitter

9179d63

Signed-off-by: lishunyang <lishunyang12@163.com>

lishunyang12 added 2 commits February 25, 2026 20:35

docs: add split_granularity to streaming TTS documentation

a5fc1e0

Signed-off-by: lishunyang <lishunyang12@163.com>

docs: add streaming WebSocket section to speech API docs

b734ac3

Signed-off-by: lishunyang <lishunyang12@163.com>

This was referenced Feb 25, 2026

[RFC] Streaming Audio Output for WebSocket TTS #1479

Closed

[RFC] Multi-Stage Abort / Barge-in for Omni Models #1480

Closed

[RFC] Full-Duplex Voice Pipeline (STT → LLM → Streaming TTS) #1481

Closed

Sy0307 mentioned this pull request Mar 6, 2026

[Feat][Qwen3-TTS] Support streaming audio output for websocket #1719

Merged

linyueqian mentioned this pull request Mar 10, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

lishunyang12 closed this Mar 11, 2026

lishunyang12 mentioned this pull request Mar 13, 2026

[Feature] Streaming text input for Qwen3-TTS #1883

Open

5 tasks

lishunyang12 reopened this Mar 20, 2026

lishunyang12 closed this Mar 20, 2026

