feat: reconcile HT TTS features onto upstream main by marksverdhei · Pull Request #17 · heiervang-technologies/ht-vllm-omni

marksverdhei · 2026-02-23T07:07:03Z

Purpose

Reconcile all HT-specific TTS features from the old ht branch onto the current upstream main architecture. Upstream PR vllm-project#1161 replaced the monolithic modeling_qwen3_tts.py with a disaggregated two-stage pipeline (qwen3_tts_talker.py + qwen3_tts_code2wav.py). This PR ports all viable HT features to the new architecture, dropping those superseded by upstream.

Features ported

Feature	Status
HTTP streaming for `/v1/audio/speech`	Ported — bridges upstream's progressive pipeline to `StreamingResponse`
`speaker_embedding` API parameter	Ported — exposes upstream's internal `ref_spk_embedding` at API level
CUDA graph decoder wrapper	Ported — self-contained module, integrated into both Code2Wav and Talker lazy-load paths
`tts-stream` bash tool	Copied — client-side, architecture-independent
`stream_tts_play.py` Python client	Copied — client-side, architecture-independent
`tts-test.sh` test script	Copied — client-side, architecture-independent
HT fork branding	Updated — README, logo, PR template

Features dropped (superseded by upstream)

Feature	Reason
SDPA attention fallback	New talker uses vLLM's Qwen3Model which handles attention backends automatically
Manual KV-cached `generate_codes()`	Upstream's `_LocalPredictorKVCache` is the vLLM-native equivalent
Regional `torch.compile` on code predictor	Upstream explicitly sets `CUDAGraphMode.NONE` for the code predictor
Model-level `generate_streaming()`	Upstream's two-stage async_chunk pipeline already produces progressive audio chunks

Test Plan

Build check: Verify no import errors after all changes
Non-streaming TTS: scripts/tts-test.sh "Hello world" — verify audio output works unchanged
Streaming TTS: scripts/tts-stream "Hello streaming world" — verify progressive audio playback
Speaker embedding: Use examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py to extract embedding, then POST with speaker_embedding param
CUDA graph decoder: Run TTS with CUDA, check logs for graph capture, verify audio quality matches non-graph path
Regression: Ensure non-streaming path still works identically to upstream

Test Result

Pending server-side validation. All Python files pass ast.parse() syntax validation. Branch is cleanly rebased on upstream main.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan. Please providing the test scripts & test commands.
The test results.
(Optional) The necessary documentation update.
(HT fork) If this PR adds or changes HT-specific functionality, update the HT Fork Changes section in README.md.

🤖 Generated with Claude Code

marksverdhei · 2026-02-23T07:13:15Z

Self-Review Findings

Fixed (commit `8afc413`)

Critical: speaker_embedding data flow was completely broken
The serving layer stored the embedding as params["speaker_embedding"] but the talker model reads it from voice_clone_prompt["ref_spk_embedding"] inside _build_prompt_embeds(). Every request with speaker_embedding would have raised ValueError("Base requires ref_audio."). Fixed by building a proper voice_clone_prompt dict with the tensor.

Streaming format validation missing
stream=True with response_format="mp3" (or flac/opus/aac) would silently emit raw PCM with wrong content-type. Now rejects unsupported formats early.

Speed + streaming
speed != 1.0 with stream=True silently ignored the speed parameter. Now returns a validation error.

Minor fixes: misleading ref_audio field description (said "file path" but file paths are rejected), unused Tuple import.

Remaining Notes (non-blocking)

Issue	Severity	Notes
WAV streaming uses `0x7FFFFFFF` placeholder header	Low	Standard pattern for streaming WAV (used by icecast). Client scripts use `pcm` format, avoiding this.
No `GeneratorExit` handling in streaming generator	Low	FastAPI/Starlette handles cancellation at the ASGI level; resources are cleaned up when the response generator is garbage collected.
CUDA graph warmup runs decoder 3x per size (extra eager pass)	Low	~seconds of extra GPU time at startup, acceptable.
`total_upsample` is numpy int64	Low	Works correctly in all arithmetic/slicing contexts.

🤖 Generated with Claude Code

marksverdhei · 2026-02-23T07:38:13Z

Code Review: PR #17 — Reconcile HT TTS Features

Overall, the work is solid and well-structured. The streaming architecture, speaker embedding flow, and CUDA graph wrapper are all thoughtfully designed. Findings organized by severity below.

CRITICAL

1. Broken import in `speaker_embedding_interpolation.py` — references deleted module

File: examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py:63-68

from vllm_omni.model_executor.models.qwen3_tts.modeling_qwen3_tts import (
    Qwen3TTSSpeakerEncoder,
)

modeling_qwen3_tts.py no longer exists — it was removed during the upstream disaggregation (PR vllm-project#1161). Qwen3TTSSpeakerEncoder now lives in qwen3_tts_talker.py. This import will always fail with ModuleNotFoundError. The fallback branch (line 82) also references modeling_qwen3_tts.py for remote download, which may or may not exist in the remote checkpoint.

Fix:

from vllm_omni.model_executor.models.qwen3_tts.qwen3_tts_talker import (
    Qwen3TTSSpeakerEncoder,
)

2. CUDA graph wrapper: double-warmup overhead

File: vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py:95-127

warmup() does a full forward pass for every capture size (lines 105-112), then _capture_graph_for_size() does another warmup run for the same size before capturing (lines 150-152). Every size gets three forward passes total (two warmups + one capture) when two would suffice. Not a correctness bug but doubles warmup time and GPU memory churn. The outer warmup loop should be removed, or _capture_graph_for_size should skip its inner warmup.

IMPORTANT

3. `_extract_audio_output` mutates dict in-place via `.pop()`

File: vllm_omni/entrypoints/openai/serving_speech.py:567

audio_output["audio"] = audio_output.pop("model_outputs")

Mutates the multimodal_output dict on the OmniRequestOutput object. For non-streaming this is fine, but for streaming, the same output object may be examined multiple times as the generator yields. .pop() happens to be idempotent here (subsequent calls find "audio" already set), but mutation of shared state during iteration is risky. Consider a shallow copy or assign without popping.

4. Streaming cursor logic: single-tensor first-chunk case may miscount

File: vllm_omni/entrypoints/openai/serving_speech.py:504-507

else:
    # Single tensor or first chunk
    new_tensors = [audio_data]
    chunks_yielded = 1

If the upstream output processor continues returning single tensors (not accumulated into a list) on subsequent iterations, chunks_yielded resets to 1 each time, re-yielding the same audio. The comment says "first chunk" but there's no guard preventing this from repeating. Warrants verification that the upstream output processor always accumulates into a list after the first output.

5. `tts-stream` saves raw PCM even when `--save output.wav` is specified

File: scripts/tts-stream:110-114

The script always requests "response_format": "pcm" (line 82) and pipes via tee "$SAVE". If the user specifies --save output.wav, they get a raw PCM file with a .wav extension — no WAV header. By contrast, stream_tts_play.py correctly wraps PCM data in a WAV header when saving. The bash script should either request WAV format or add a header when saving.

6. Speaker embedding dimension validation is too permissive

File: vllm_omni/entrypoints/openai/serving_speech.py:158-159

if len(request.speaker_embedding) < 64 or len(request.speaker_embedding) > 8192:

The ECAPA-TDNN speaker encoder produces 1024-dim embeddings (confirmed in the example at line 182). The [64, 8192] range is extremely permissive — a 200-dim or 5000-dim vector would silently proceed and produce garbage. Consider tightening or at least logging a warning when dimension != 1024.

7. CUDA graph only enabled for 12Hz tokenizer (undocumented)

Both qwen3_tts_code2wav.py:110 and qwen3_tts_talker.py:1106-1112 check hasattr(decoder, "enable_cudagraph") which only exists on Qwen3TTSTokenizerV2Decoder (12Hz). Gracefully skips for 25Hz, but should be documented.

MINOR / NITS

8. Redundant `import torch` at line 416 in `serving_speech.py`

torch is already imported at the top (line 11). The inline import is a leftover from before the fix commit.

9. `Dict`, `List`, `Optional` from `typing` — deprecated in favor of builtins

File: cuda_graph_decoder_wrapper.py:12 — rest of codebase uses dict, list, int | None.

10. Copyright header attributes HT code to "Alibaba Qwen team"

11. `stream_tts_play.py` hardcodes model to Base variant

Defaults to Qwen/Qwen3-TTS-12Hz-0.6B-Base while tts-stream defaults to 1.7B-CustomVoice. The inconsistency could confuse users.

12. Import ordering in `serving_speech.py:11-13`

import torch before import numpy as np — should be alphabetical per PEP 8/isort.

DESIGN OBSERVATIONS (positive notes)

Speaker embedding data flow: The fix in 8afc413 correctly transforms the flat speaker_embedding into the nested voice_clone_prompt.ref_spk_embedding structure. Clean fix.
Streaming architecture: Cursor-based approach with 0x7FFFFFFF WAV header placeholder is standard practice. Good design.
CUDA graph integration: Wrapping at the decoder level with transparent fallback for unsupported sizes/batch-sizes is clean. Capture sizes [25-300] cover typical chunks well.
SSRF protection: _check_ssrf properly blocks private/internal IPs for ref_audio URLs. Good security.

Summary

Severity	Count	Key Items
Critical	2	Broken import in example script; double-warmup overhead
Important	5	Dict mutation during streaming; PCM-as-WAV save bug; permissive embedding validation
Minor/Nit	5	Redundant import; deprecated typing; copyright; style

Most pressing: #1 (broken import — will crash at runtime) and #5 (tts-stream saves headerless WAV). The CUDA graph double-warmup (#2) is wasteful but not a correctness issue. Streaming cursor logic (#4) warrants verification against the upstream output processor behavior.

marksverdhei changed the base branch from main to ht February 23, 2026 07:21

marksverdhei force-pushed the ht branch 2 times, most recently from 7a1c549 to e9b2c56 Compare February 27, 2026 13:20

marksverdhei merged commit 47ba1ad into ht Feb 27, 2026

marksverdhei force-pushed the ht branch from e9b2c56 to 39db6bf Compare February 27, 2026 14:44

marksverdhei deleted the ht-rebase-reconcile branch March 6, 2026 10:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: reconcile HT TTS features onto upstream main#17

feat: reconcile HT TTS features onto upstream main#17
marksverdhei merged 0 commit into
htfrom
ht-rebase-reconcile

marksverdhei commented Feb 23, 2026

Uh oh!

marksverdhei commented Feb 23, 2026

Uh oh!

marksverdhei commented Feb 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marksverdhei commented Feb 23, 2026

Purpose

Features ported

Features dropped (superseded by upstream)

Test Plan

Test Result

Uh oh!

marksverdhei commented Feb 23, 2026

Self-Review Findings

Fixed (commit 8afc413)

Remaining Notes (non-blocking)

Uh oh!

marksverdhei commented Feb 23, 2026

Code Review: PR #17 — Reconcile HT TTS Features

CRITICAL

1. Broken import in speaker_embedding_interpolation.py — references deleted module

2. CUDA graph wrapper: double-warmup overhead

IMPORTANT

3. _extract_audio_output mutates dict in-place via .pop()

4. Streaming cursor logic: single-tensor first-chunk case may miscount

5. tts-stream saves raw PCM even when --save output.wav is specified

6. Speaker embedding dimension validation is too permissive

7. CUDA graph only enabled for 12Hz tokenizer (undocumented)

MINOR / NITS

8. Redundant import torch at line 416 in serving_speech.py

9. Dict, List, Optional from typing — deprecated in favor of builtins

10. Copyright header attributes HT code to "Alibaba Qwen team"

11. stream_tts_play.py hardcodes model to Base variant

12. Import ordering in serving_speech.py:11-13

DESIGN OBSERVATIONS (positive notes)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fixed (commit `8afc413`)

1. Broken import in `speaker_embedding_interpolation.py` — references deleted module

3. `_extract_audio_output` mutates dict in-place via `.pop()`

5. `tts-stream` saves raw PCM even when `--save output.wav` is specified

8. Redundant `import torch` at line 416 in `serving_speech.py`

9. `Dict`, `List`, `Optional` from `typing` — deprecated in favor of builtins

11. `stream_tts_play.py` hardcodes model to Base variant

12. Import ordering in `serving_speech.py:11-13`