support qwen3 tts streaming output by gerayking · Pull Request #1189 · vllm-project/vllm-omni

gerayking · 2026-02-04T04:16:58Z

Purpose

Add streaming output support for Qwen3-TTS model to enable real-time audio generation with lower latency. #938

Key Changes

Streaming Generation Interface (modeling_qwen3_tts.py):
- Added StreamingChunkOutput dataclass for streaming chunk output
- Added AsyncDecodingPipeline class for asynchronous audio decoding in background thread
- Implemented generate_streaming_iter() in Qwen3TTSTalkerForConditionalGeneration for token-level streaming
High-Level Streaming API (qwen3_tts.py):
- Added forward_streaming() method in Qwen3TTSModelForGeneration
- Added streaming variants: generate_custom_voice_streaming(), generate_voice_design_streaming(), generate_voice_clone_streaming()
- Implemented streaming state management for request handling
Examples and Tests:
- Added streaming test cases in examples/offline_inference/qwen3_tts/end2end.py
- Added test_streaming_vs_nonstreaming.py to verify streaming and non-streaming outputs produce identical tokens

Test Plan

# Run streaming vs non-streaming consistency test
python tests/model_executor/models/qwen3_tts/test_streaming_vs_nonstreaming.py

# Run end2end streaming example
python examples/offline_inference/qwen3_tts/end2end.py --streaming

Test Result

============================================================
COMPARISON RESULTS (TOKEN_ID LEVEL):
============================================================
Non-streaming tokens (raw): 88
Streaming tokens (raw):     89
EOS token ID: 4198
Non-streaming tokens (no EOS): 88
Streaming tokens (no EOS):     88
✓ Token counts match (excluding EOS)!
✓ All 88 tokens are IDENTICAL across all 16 codebooks!

============================================================
TEST PASSED!
============================================================

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 22d078aa39

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

hsliuustc0106

could this work with online serving?

hsliuustc0106 · 2026-02-04T04:27:26Z

        # Convert result to OmniOutput format
        return self.make_omni_output(result, **kwargs)

+    def forward_streaming(


does every tts model need this function?

Externally we only expose forward. If a model needs to support streaming output, it should additionally implement forward_streaming; otherwise forward alone is sufficient.

hsliuustc0106 · 2026-02-04T04:27:42Z

+    # ==================== Streaming Generation Methods ====================
+
+    @torch.inference_mode()
+    def generate_custom_voice_streaming(


same question

hsliuustc0106 · 2026-02-04T13:36:13Z

I think we need to discuss a general abstraction of streaming output such that every tts model can inherit from it. @Gaohan123 @linyueqian @amy-why-3459

Copilot

Pull request overview

Adds streaming audio output support for Qwen3-TTS to enable lower-latency, chunked waveform generation.

Changes:

Introduces token-level streaming generation in the talker and an async background decoding pipeline to turn codec chunks into audio.
Adds high-level streaming APIs and per-request streaming state handling in the Qwen3-TTS vLLM model wrapper.
Extends the offline example and adds a streaming-vs-nonstreaming consistency script/test.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.

Show a summary per file

File	Description
`vllm_omni/model_executor/models/qwen3_tts/tokenizer_12hz/modeling_qwen3_tts_tokenizer_v2.py`	Disables cache during decoder transformer forward; adds invalid-token truncation for chunked decoding.
`vllm_omni/model_executor/models/qwen3_tts/qwen3_tts.py`	Adds streaming request path (`forward_streaming`) and public streaming generation helpers.
`vllm_omni/model_executor/models/qwen3_tts/modeling_qwen3_tts.py`	Implements streaming token iterator, async decoding thread pipeline, and streaming audio generator.
`tests/model_executor/models/qwen3_tts/test_streaming_vs_nonstreaming.py`	Adds a codec-token consistency check script (currently placed as a pytest-discoverable test).
`tests/model_executor/models/qwen3_tts/__init__.py`	Initializes the new Qwen3-TTS test package.
`examples/offline_inference/qwen3_tts/end2end.py`	Adds a streaming test query and a streaming test runner that writes chunk WAVs + timing JSON.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-02-04T13:45:10Z

+            audio_tensor = output.multimodal_output["audio"]
+            audio_samplerate = output.multimodal_output["sr"].item()
+


In streaming mode, multimodal_output["audio"] (and potentially sr) may become a list due to multimodal accumulation across iterations. This loop assumes a tensor and will break after the first chunk (audio_tensor.float(), sr.item()). Update the example to handle list values (e.g., take the last element) so it works with accumulated streaming outputs.

Suggested change

audio_tensor = output.multimodal_output["audio"]

audio_samplerate = output.multimodal_output["sr"].item()

# In streaming mode, multimodal_output["audio"] / ["sr"] may be lists

audio_value = output.multimodal_output["audio"]

if isinstance(audio_value, list) and len(audio_value) > 0:

audio_tensor = audio_value[-1]

else:

audio_tensor = audio_value

sr_value = output.multimodal_output.get("sr", audio_samplerate)

if isinstance(sr_value, list) and len(sr_value) > 0:

sr_item = sr_value[-1]

else:

sr_item = sr_value

if hasattr(sr_item, "item"):

audio_samplerate = int(sr_item.item())

else:

audio_samplerate = int(sr_item)

Copilot · 2026-02-04T13:45:10Z

+    model = Qwen3TTSModel.from_pretrained(
+        MODEL_PATH,
+        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+        device_map=device,


device_map is forwarded to transformers.AutoModel.from_pretrained(...), where it expects a string like "auto"/"cuda:0" or a device mapping dict. Passing a torch.device object here is likely to error. Use device_map="auto" (or a proper dict), or move the model to the desired device via .to(device) after loading.

Suggested change

device_map=device,

device_map="auto",

isn't device map Union[torch.device, dict, str]? Think this is wrong

Copilot · 2026-02-04T13:45:11Z

+        request_id = kwargs.get("request_id", "default")
+


Using request_id = kwargs.get("request_id", "default") risks mixing state across concurrent requests whenever request_id is missing/empty. Please require request_id (raise) or generate a unique id (e.g., uuid4) instead of defaulting to a constant.

Copilot · 2026-02-04T13:45:13Z

+            )
+
+        self._validate_languages(languages)
+


generate_voice_design_streaming accepts text: str | list[str], but the underlying streaming implementation only supports batch size 1 (assert in _prepare_talker_inputs). Please validate len(texts)==1 and raise a clear error, or add batching support.

Suggested change

if len(texts) != 1:

raise ValueError(

"generate_voice_design_streaming currently supports only batch size 1; "

f"got batch size {len(texts)}. For batched synthesis, use the non-streaming "

"generate_voice_design API instead."

)

marksverdhai

Nice work on the streaming primitives, @gerayking. The model-level token streaming iterator and async decode pipeline are well structured. We've been running a streaming TTS setup in production on our fork, so I wanted to share some context on how our approaches compare and flag a few things.

How our fork approaches streaming vs this PR

We took a slightly different architecture. Rather than a threaded AsyncDecodingPipeline, we do synchronous inline decoding — the codec-to-audio decode happens in the same thread as the generator, without queues. The tradeoff is less potential parallelism between token generation and audio decoding, but simpler concurrency and no queue-related edge cases (like the bounded stop() put).

At the token generation level, we have a single generate_streaming() generator on the ForConditionalGeneration class rather than a separate talker-level iterator. We also factored sampling and repetition penalty into standalone helper functions (_sample_token(), _apply_repetition_penalty() using vectorized torch.where) instead of inlining them in the loop.

For context stripping when decoding chunks, we use a proportional ratio (context_frames / total_decoded_frames * total_decoded_samples) rather than a fixed upsample_rate. The transformer-based decoder doesn't always produce exactly frames * upsample_rate samples, so the proportional approach adapts to the actual decoder output and avoids subtle audio glitches at chunk boundaries.

The bigger architectural difference is in the integration layers — which brings me to the main gaps I see in this PR:

Scheduler integration

forward_streaming() returns OmniOutput with "finished": torch.tensor(is_finished), but the generation scheduler (omni_generation_scheduler.py) doesn't know to look for this field. Without a corresponding scheduler change, the request will be marked FINISHED_STOPPED after the first forward() call, so subsequent chunks will never be generated in the online serving path.

In our fork we added an is_final flag to OmniOutput.multimodal_outputs and patched the scheduler to keep requests in RUNNING status until is_final=True. The scheduler reads is_final from the pooler output and only transitions to FINISHED_STOPPED when it's true. This required handling deserialization edge cases too — is_final can arrive as a bool, list, tensor, or scalar after ZMQ transport.

Is this PR intentionally scoped to offline inference only, or is the scheduler/serving integration planned as a follow-up?

Serving layer

Related — there are no changes to serving_speech.py or any entrypoint. For online serving you'd need a StreamingResponse path that yields chunked audio. In our fork we added _stream_progressive_audio() as a FastAPI StreamingResponse that yields PCM bytes or WAV (with a placeholder max-size WAV header). One non-obvious thing: the output processor accumulates all tensors across iterations (not just new ones), so you need a cursor to track which chunks have already been yielded.

Stride-0 tensor serialization

One production bug we hit: when chunks cross certain boundaries, the audio tensor from the decoder can have stride-0 dimensions (from expand() or numpy views). Calling .contiguous() alone doesn't fix stride-0 — you need an explicit .clone(). This crashes during ZMQ serialization in the connector layer. Something to watch for when wiring this into online serving.

Streaming state cleanup

In forward_streaming(), the streaming state is only cleaned up on a subsequent call after is_finished=True. If the caller stops iterating once it sees finished=True (as the docstring suggests), the generator and accumulated audio chunks will leak in self._streaming_state. Cleaning up immediately when the generator signals completion would prevent this.

Minor

print(f"chunk_size : {chunk_size}, {runtime_additional_information}") on line 142 of qwen3_tts.py looks like a debug print that should be removed or converted to logger.debug().

What this PR does well

The decoder-side fixes in this PR are solid and independently valuable:

codec_valid_max truncation in chunked_decode — prevents out-of-bounds embedding lookups
use_cache=False in the tokenizer 12Hz decoder — prevents KV cache leaks
suppress_tokens on the code predictor — prevents generating tokens >= 2048 that would crash the decoder

These are fixes we don't have in our fork yet and plan to adopt.

gerayking · 2026-02-08T13:45:29Z

Nice work on the streaming primitives, @gerayking. The model-level token streaming iterator and async decode pipeline are well structured. We've been running a streaming TTS setup in production on our fork, so I wanted to share some context on how our approaches compare and flag a few things.

How our fork approaches streaming vs this PR

We took a slightly different architecture. Rather than a threaded AsyncDecodingPipeline, we do synchronous inline decoding — the codec-to-audio decode happens in the same thread as the generator, without queues. The tradeoff is less potential parallelism between token generation and audio decoding, but simpler concurrency and no queue-related edge cases (like the bounded stop() put).

At the token generation level, we have a single generate_streaming() generator on the ForConditionalGeneration class rather than a separate talker-level iterator. We also factored sampling and repetition penalty into standalone helper functions (_sample_token(), _apply_repetition_penalty() using vectorized torch.where) instead of inlining them in the loop.

For context stripping when decoding chunks, we use a proportional ratio (context_frames / total_decoded_frames * total_decoded_samples) rather than a fixed upsample_rate. The transformer-based decoder doesn't always produce exactly frames * upsample_rate samples, so the proportional approach adapts to the actual decoder output and avoids subtle audio glitches at chunk boundaries.

The bigger architectural difference is in the integration layers — which brings me to the main gaps I see in this PR:

Scheduler integration

forward_streaming() returns OmniOutput with "finished": torch.tensor(is_finished), but the generation scheduler (omni_generation_scheduler.py) doesn't know to look for this field. Without a corresponding scheduler change, the request will be marked FINISHED_STOPPED after the first forward() call, so subsequent chunks will never be generated in the online serving path.

In our fork we added an is_final flag to OmniOutput.multimodal_outputs and patched the scheduler to keep requests in RUNNING status until is_final=True. The scheduler reads is_final from the pooler output and only transitions to FINISHED_STOPPED when it's true. This required handling deserialization edge cases too — is_final can arrive as a bool, list, tensor, or scalar after ZMQ transport.

Is this PR intentionally scoped to offline inference only, or is the scheduler/serving integration planned as a follow-up?

Serving layer

Related — there are no changes to serving_speech.py or any entrypoint. For online serving you'd need a StreamingResponse path that yields chunked audio. In our fork we added _stream_progressive_audio() as a FastAPI StreamingResponse that yields PCM bytes or WAV (with a placeholder max-size WAV header). One non-obvious thing: the output processor accumulates all tensors across iterations (not just new ones), so you need a cursor to track which chunks have already been yielded.

Stride-0 tensor serialization

One production bug we hit: when chunks cross certain boundaries, the audio tensor from the decoder can have stride-0 dimensions (from expand() or numpy views). Calling .contiguous() alone doesn't fix stride-0 — you need an explicit .clone(). This crashes during ZMQ serialization in the connector layer. Something to watch for when wiring this into online serving.

Streaming state cleanup

In forward_streaming(), the streaming state is only cleaned up on a subsequent call after is_finished=True. If the caller stops iterating once it sees finished=True (as the docstring suggests), the generator and accumulated audio chunks will leak in self._streaming_state. Cleaning up immediately when the generator signals completion would prevent this.

Minor

print(f"chunk_size : {chunk_size}, {runtime_additional_information}") on line 142 of qwen3_tts.py looks like a debug print that should be removed or converted to logger.debug().

What this PR does well

The decoder-side fixes in this PR are solid and independently valuable:

codec_valid_max truncation in chunked_decode — prevents out-of-bounds embedding lookups

use_cache=False in the tokenizer 12Hz decoder — prevents KV cache leaks

suppress_tokens on the code predictor — prevents generating tokens >= 2048 that would crash the decoder

These are fixes we don't have in our fork yet and plan to adopt.

Thanks for the detailed review and sharing your production experience!

Scope clarification: This PR is intentionally scoped to offline inference only. Scheduler integration and online serving changes are planned as follow-up work.

Will fix in this PR:

Remove debug print() on line 142
Add .clone().contiguous() for stride-0 tensor safety
Clean up streaming state immediately on completion

X-Skoprio · 2026-02-08T22:34:02Z

Hi, is this going to be online anytime soon ? Thank you !

Signed-off-by: gerayking <399geray@gmail.com>

gerayking · 2026-02-10T08:06:14Z

Hi, is this going to be online anytime soon ? Thank you !

Qwen3-TTS online serving streaming output will be supported in another PR. I’ll push it soon.

…ress_tokens guard Three production-ready fixes for the 12Hz tokenizer decoder: 1. codec_valid_max truncation: prevents out-of-bounds embedding lookups when tokens >= codebook_size (including EOS tokens) reach the decoder 2. use_cache=False: prevents KV cache accumulation in the ConvTranspose1d decoder's pre_transformer, which caused memory leaks during chunked decode 3. suppress_tokens: added via the streaming pipeline code to block tokens >= 2048 that crash the codec decoder Cherry-picked from vllm-project/vllm-omni PR vllm-project#1189 (gerayking). Co-Authored-By: Claude <noreply@anthropic.com>

Adds StreamingChunkOutput, AsyncDecodingPipeline (background thread + bounded queue), generate_streaming_iter(), and forward_streaming() to the Qwen3-TTS model layer. Includes streaming variants for all three voice paths: custom voice, voice design, and voice clone. Key components: - StreamingChunkOutput: dataclass for incremental codec chunks - AsyncDecodingPipeline: background thread decodes codec→audio with context overlap, communicates via bounded queues - generate_streaming_iter(): yields StreamingChunkOutput per chunk - forward_streaming(): top-level entry that returns OmniOutput with intermediate chunks and a "finished" flag - generate_custom_voice_streaming/voice_design_streaming/voice_clone_streaming Includes test (test_streaming_vs_nonstreaming.py) for token consistency (88 tokens across 16 codebooks) and updated end2end example. Cherry-picked from vllm-project/vllm-omni PR vllm-project#1189 (gerayking). Co-Authored-By: Claude <noreply@anthropic.com>

Extends OmniRequestOutput with finished flag and streaming_audio_chunk for progressive audio delivery. Updates the scheduler to check the "finished" key in multimodal_outputs — keeping requests in RUNNING state when finished=False instead of prematurely transitioning to FINISHED_STOPPED. Modified files: - omni_generation_scheduler.py: streaming-aware update_from_output() - output_processor.py: handle intermediate streaming outputs - outputs.py: OmniRequestOutput.finished + streaming_audio_chunk fields - async_omni.py: yield intermediate OmniRequestOutput in generate() Cherry-picked from vllm-project/vllm-omni PR vllm-project#1189 (gerayking). Co-Authored-By: Claude <noreply@anthropic.com>

1. Replace debug print() with logger.warning() in mel_spectrogram() 2. Fix request_id collision: use uuid4 instead of "default" fallback 3. Remove dead deferred-cleanup branch in forward_streaming() — the immediate cleanup at is_finished already handles state deletion 4. Add batch-size-1 guards to all three streaming generation methods (generate_custom_voice_streaming, generate_voice_design_streaming, generate_voice_clone_streaming) — the underlying model asserts batch=1 but the public API accepted lists silently 5. Remove no-op assignment in scheduler update_from_output() Co-Authored-By: Claude <noreply@anthropic.com>

hsliuustc0106 · 2026-02-23T01:00:28Z

I think there are some bugs in this PR, we should not have changed so many files :)

hsliuustc0106 · 2026-02-23T14:19:29Z

🤖 Code Review: PR #1189 — Streaming Audio Output 🔴

Verdict: Request Changes

Summary: Adds streaming audio output for TTS. However, the PR is ~132KB and bundles unrelated changes: CI infrastructure (pytest markers, nightly tests), ComfyUI integration app, GitHub workflows, and CFG parallel mixin docs. The PR must be split before meaningful review of the streaming core is possible. The CI changes and ComfyUI integration are entirely separate concerns that increase merge conflict risk and make bisecting regressions impossible.

Key Concerns:

⚠️ PR must be split into at least 3 PRs: (1) Streaming TTS core, (2) CI infrastructure (pytest markers, nightly config, buildkite changes), (3) ComfyUI integration (new apps/ComfyUI-vLLM-Omni/ directory). Bundling makes review difficult and bisecting regressions impossible.
⚠️ Cannot fully assess the streaming implementation due to diff size. Key unanswered questions: How are audio chunks delivered to the client? Does it integrate with existing chunked_decode() on the 12Hz tokenizer? What is the chunk boundary strategy?
⚠️ Interaction with PR feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder #1205 (CUDA Graph) needs clarification -- both modify the decode path and could conflict. Streaming chunk size must align with CUDA graph capture sizes.
⚠️ First chunk latency measurement against the 200ms target cannot be verified without isolating the streaming core.

Detailed Feedback

📄 `.buildkite/pipeline.yml`

🟡 MEDIUM (architecture)

CI pipeline changes (switching from explicit test paths to pytest marker-based selection with '-m core_model and cpu') are a good pattern but are completely unrelated to streaming audio. Should be in a separate PR.

📄 `.buildkite/test-nightly.yaml`

🟡 MEDIUM (architecture)

New nightly test config for H100 is reasonable but unrelated to streaming audio. Should be in the CI infrastructure PR.

📄 `.github/workflows/build_wheel.yml`

🟡 MEDIUM (architecture)

Workflow modifications are unrelated to streaming audio output. Should be in the CI infrastructure PR.

📄 `apps/ComfyUI-vLLM-Omni/`

🟠 HIGH (architecture)

Entirely new ComfyUI integration application. This is a separate feature that should not be bundled in a streaming audio PR. Creates unnecessary merge conflicts and makes the PR unreviewable at 132KB.

📄 `vllm_omni/entrypoints/openai/serving_speech.py`

🟠 HIGH (architecture)

Cannot fully assess streaming integration with create_speech() due to the PR's size. Key questions: (1) How are audio chunks delivered -- chunked HTTP transfer, SSE, or WebSocket? (2) Does it integrate with the existing chunked_decode() on the 12Hz tokenizer? (3) What is the chunk boundary strategy and does it match CUDA graph PR #1205 assumptions?

Suggestions

Split the PR immediately into 3 separate PRs: streaming core, CI infrastructure, ComfyUI integration.
Ensure the streaming chunk size aligns with CUDA graph capture sizes from PR feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder #1205.
Add a benchmark measuring first-chunk latency specifically (PR Add benchmark for v1/audio/speech non-streaming #1408 notes this metric requires streaming).
Clarify the audio chunk delivery mechanism (chunked HTTP, SSE, WebSocket).
After splitting, rebase the streaming core on top of PR feat(qwen3-tts): Add CUDA Graph support for speech tokenizer decoder #1205 to avoid decode path conflicts.

Part of Issue #938 — Qwen3-TTS Production Ready

Review generated by vLLM-Omni PR Review Agent

hsliuustc0106 · 2026-02-24T08:03:02Z

@vllm-omni-reviewer

lishunyang12 · 2026-02-25T09:12:17Z

The size of PR make it difficult to review and test, can we downsize the scope?

Edit: most of changes are not from this PR.

lishunyang12 · 2026-02-25T10:24:31Z

Looks like the branch was forked from an older main and carries ~28 already-merged commits (CI, ComfyUI, docs, etc.) — that's where the 149 files come from. The actual streaming work is only ~6 files.

git rebase origin/main should clean this up. If that gets messy, cherry-picking your streaming commits onto a fresh branch from main works too.

gerayking · 2026-02-25T11:36:18Z

Looks like the branch was forked from an older main and carries ~28 already-merged commits (CI, ComfyUI, docs, etc.) — that's where the 149 files come from. The actual streaming work is only ~6 files.

git rebase origin/main should clean this up. If that gets messy, cherry-picking your streaming commits onto a fresh branch from main works too.

sry, I will fix it tonight.

hsliuustc0106 · 2026-02-25T16:05:09Z

@vllm-omni-reviewer

hsliuustc0106 · 2026-02-26T12:45:51Z

check #1438 for details

gerayking requested a review from hsliuustc0106 as a code owner February 4, 2026 04:16

chatgpt-codex-connector Bot reviewed Feb 4, 2026

View reviewed changes

Comment thread vllm_omni/model_executor/models/qwen3_tts/modeling_qwen3_tts.py Outdated

gerayking force-pushed the feature/support_qwen3_tts_streaming_output branch 7 times, most recently from 0edb11a to b8aa200 Compare February 4, 2026 12:32

hsliuustc0106 reviewed Feb 4, 2026

View reviewed changes

hsliuustc0106 requested a review from Copilot February 4, 2026 13:35

Copilot started reviewing on behalf of hsliuustc0106 February 4, 2026 13:35 View session

Copilot AI reviewed Feb 4, 2026

View reviewed changes

Gaohan123 mentioned this pull request Feb 5, 2026

[RFC]: Qwen3-TTS Production Ready - February Milestone #938

Open

amy-why-3459 mentioned this pull request Feb 5, 2026

[RFC]: Omni Model 2026 Q1 Roadmap #1191

Open

1 task

linyueqian mentioned this pull request Feb 5, 2026

[RFC]: General TTS Model Implementation #1225

Open

1 task

This was referenced Feb 5, 2026

[Feature][TTS] Streaming Text Input for Qwen3-TTS via WebSocket #1230

Closed

[Feat][Qwen3-tts]: Add Gradio demo for online serving #1231

Merged

hsliuustc0106 mentioned this pull request Feb 6, 2026

[RFC]: vLLM-Omni 2026 Q1 Roadmap #677

Open

38 tasks

marksverdhai reviewed Feb 6, 2026

View reviewed changes

xuanweifu and others added 3 commits February 9, 2026 19:34

support qwen3 tts streaming output

fd59210

Signed-off-by: gerayking <399geray@gmail.com>

fixed max_new_tokens

505b90d

Signed-off-by: gerayking <399geray@gmail.com>

fix(streaming): prevent stride-0 tensor crash and memory leak

b021f2e

gerayking force-pushed the feature/support_qwen3_tts_streaming_output branch from 0894b02 to 9719de5 Compare February 12, 2026 06:01

lishunyang12 mentioned this pull request Feb 25, 2026

[RFC] Streaming Audio Output for WebSocket TTS #1479

Closed

gerayking force-pushed the feature/support_qwen3_tts_streaming_output branch from 9719de5 to b021f2e Compare February 25, 2026 15:15

fix stats

ef289c8

hsliuustc0106 closed this Feb 26, 2026

		audio_tensor = output.multimodal_output["audio"]
		audio_samplerate = output.multimodal_output["sr"].item()

-            audio_tensor = output.multimodal_output["audio"]
-            audio_samplerate = output.multimodal_output["sr"].item()
+            # In streaming mode, multimodal_output["audio"] / ["sr"] may be lists
+            audio_value = output.multimodal_output["audio"]
+            if isinstance(audio_value, list) and len(audio_value) > 0:
+                audio_tensor = audio_value[-1]
+            else:
+                audio_tensor = audio_value
+            sr_value = output.multimodal_output.get("sr", audio_samplerate)
+            if isinstance(sr_value, list) and len(sr_value) > 0:
+                sr_item = sr_value[-1]
+            else:
+                sr_item = sr_value
+            if hasattr(sr_item, "item"):
+                audio_samplerate = int(sr_item.item())
+            else:
+                audio_samplerate = int(sr_item)

+        if len(texts) != 1:
+            raise ValueError(
+                "generate_voice_design_streaming currently supports only batch size 1; "
+                f"got batch size {len(texts)}. For batched synthesis, use the non-streaming "
+                "generate_voice_design API instead."
+            )

Conversation

gerayking commented Feb 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Key Changes

Test Plan

Test Result

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

hsliuustc0106 left a comment

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

gerayking Feb 10, 2026

Choose a reason for hiding this comment

Uh oh!

hsliuustc0106 Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hsliuustc0106 commented Feb 4, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

marksverdhei Feb 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Feb 4, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

marksverdhai left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

How our fork approaches streaming vs this PR

Scheduler integration

Serving layer

Stride-0 tensor serialization

Streaming state cleanup

Minor

What this PR does well

Uh oh!

gerayking commented Feb 8, 2026

How our fork approaches streaming vs this PR

Scheduler integration

Serving layer

Stride-0 tensor serialization

Streaming state cleanup

Minor

What this PR does well

Uh oh!

X-Skoprio commented Feb 8, 2026

Uh oh!

gerayking commented Feb 10, 2026

Uh oh!

hsliuustc0106 commented Feb 23, 2026

Uh oh!

gerayking commented Feb 4, 2026 •

edited

Loading

marksverdhai left a comment •

edited

Loading

📄 `.buildkite/pipeline.yml`

📄 `.buildkite/test-nightly.yaml`

📄 `.github/workflows/build_wheel.yml`

📄 `apps/ComfyUI-vLLM-Omni/`

📄 `vllm_omni/entrypoints/openai/serving_speech.py`

lishunyang12 commented Feb 25, 2026 •

edited

Loading