Skip to content

feat: reconcile HT TTS features onto upstream main#17

Merged
marksverdhei merged 0 commit into
htfrom
ht-rebase-reconcile
Feb 27, 2026
Merged

feat: reconcile HT TTS features onto upstream main#17
marksverdhei merged 0 commit into
htfrom
ht-rebase-reconcile

Conversation

@marksverdhei

Copy link
Copy Markdown

Purpose

Reconcile all HT-specific TTS features from the old ht branch onto the current upstream main architecture. Upstream PR vllm-project#1161 replaced the monolithic modeling_qwen3_tts.py with a disaggregated two-stage pipeline (qwen3_tts_talker.py + qwen3_tts_code2wav.py). This PR ports all viable HT features to the new architecture, dropping those superseded by upstream.

Features ported

Feature Status
HTTP streaming for /v1/audio/speech Ported — bridges upstream's progressive pipeline to StreamingResponse
speaker_embedding API parameter Ported — exposes upstream's internal ref_spk_embedding at API level
CUDA graph decoder wrapper Ported — self-contained module, integrated into both Code2Wav and Talker lazy-load paths
tts-stream bash tool Copied — client-side, architecture-independent
stream_tts_play.py Python client Copied — client-side, architecture-independent
tts-test.sh test script Copied — client-side, architecture-independent
HT fork branding Updated — README, logo, PR template

Features dropped (superseded by upstream)

Feature Reason
SDPA attention fallback New talker uses vLLM's Qwen3Model which handles attention backends automatically
Manual KV-cached generate_codes() Upstream's _LocalPredictorKVCache is the vLLM-native equivalent
Regional torch.compile on code predictor Upstream explicitly sets CUDAGraphMode.NONE for the code predictor
Model-level generate_streaming() Upstream's two-stage async_chunk pipeline already produces progressive audio chunks

Test Plan

  • Build check: Verify no import errors after all changes
  • Non-streaming TTS: scripts/tts-test.sh "Hello world" — verify audio output works unchanged
  • Streaming TTS: scripts/tts-stream "Hello streaming world" — verify progressive audio playback
  • Speaker embedding: Use examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py to extract embedding, then POST with speaker_embedding param
  • CUDA graph decoder: Run TTS with CUDA, check logs for graph capture, verify audio quality matches non-graph path
  • Regression: Ensure non-streaming path still works identically to upstream

Test Result

Pending server-side validation. All Python files pass ast.parse() syntax validation. Branch is cleanly rebased on upstream main.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please providing the test scripts & test commands.
  • The test results.
  • (Optional) The necessary documentation update.
  • (HT fork) If this PR adds or changes HT-specific functionality, update the HT Fork Changes section in README.md.

🤖 Generated with Claude Code

@marksverdhei

Copy link
Copy Markdown
Author

Self-Review Findings

Fixed (commit 8afc413)

Critical: speaker_embedding data flow was completely broken
The serving layer stored the embedding as params["speaker_embedding"] but the talker model reads it from voice_clone_prompt["ref_spk_embedding"] inside _build_prompt_embeds(). Every request with speaker_embedding would have raised ValueError("Base requires ref_audio."). Fixed by building a proper voice_clone_prompt dict with the tensor.

Streaming format validation missing
stream=True with response_format="mp3" (or flac/opus/aac) would silently emit raw PCM with wrong content-type. Now rejects unsupported formats early.

Speed + streaming
speed != 1.0 with stream=True silently ignored the speed parameter. Now returns a validation error.

Minor fixes: misleading ref_audio field description (said "file path" but file paths are rejected), unused Tuple import.

Remaining Notes (non-blocking)

Issue Severity Notes
WAV streaming uses 0x7FFFFFFF placeholder header Low Standard pattern for streaming WAV (used by icecast). Client scripts use pcm format, avoiding this.
No GeneratorExit handling in streaming generator Low FastAPI/Starlette handles cancellation at the ASGI level; resources are cleaned up when the response generator is garbage collected.
CUDA graph warmup runs decoder 3x per size (extra eager pass) Low ~seconds of extra GPU time at startup, acceptable.
total_upsample is numpy int64 Low Works correctly in all arithmetic/slicing contexts.

🤖 Generated with Claude Code

@marksverdhei marksverdhei changed the base branch from main to ht February 23, 2026 07:21
@marksverdhei

Copy link
Copy Markdown
Author

Code Review: PR #17 — Reconcile HT TTS Features

Overall, the work is solid and well-structured. The streaming architecture, speaker embedding flow, and CUDA graph wrapper are all thoughtfully designed. Findings organized by severity below.


CRITICAL

1. Broken import in speaker_embedding_interpolation.py — references deleted module

File: examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py:63-68

from vllm_omni.model_executor.models.qwen3_tts.modeling_qwen3_tts import (
    Qwen3TTSSpeakerEncoder,
)

modeling_qwen3_tts.py no longer exists — it was removed during the upstream disaggregation (PR vllm-project#1161). Qwen3TTSSpeakerEncoder now lives in qwen3_tts_talker.py. This import will always fail with ModuleNotFoundError. The fallback branch (line 82) also references modeling_qwen3_tts.py for remote download, which may or may not exist in the remote checkpoint.

Fix:

from vllm_omni.model_executor.models.qwen3_tts.qwen3_tts_talker import (
    Qwen3TTSSpeakerEncoder,
)

2. CUDA graph wrapper: double-warmup overhead

File: vllm_omni/model_executor/models/qwen3_tts/cuda_graph_decoder_wrapper.py:95-127

warmup() does a full forward pass for every capture size (lines 105-112), then _capture_graph_for_size() does another warmup run for the same size before capturing (lines 150-152). Every size gets three forward passes total (two warmups + one capture) when two would suffice. Not a correctness bug but doubles warmup time and GPU memory churn. The outer warmup loop should be removed, or _capture_graph_for_size should skip its inner warmup.


IMPORTANT

3. _extract_audio_output mutates dict in-place via .pop()

File: vllm_omni/entrypoints/openai/serving_speech.py:567

audio_output["audio"] = audio_output.pop("model_outputs")

Mutates the multimodal_output dict on the OmniRequestOutput object. For non-streaming this is fine, but for streaming, the same output object may be examined multiple times as the generator yields. .pop() happens to be idempotent here (subsequent calls find "audio" already set), but mutation of shared state during iteration is risky. Consider a shallow copy or assign without popping.

4. Streaming cursor logic: single-tensor first-chunk case may miscount

File: vllm_omni/entrypoints/openai/serving_speech.py:504-507

else:
    # Single tensor or first chunk
    new_tensors = [audio_data]
    chunks_yielded = 1

If the upstream output processor continues returning single tensors (not accumulated into a list) on subsequent iterations, chunks_yielded resets to 1 each time, re-yielding the same audio. The comment says "first chunk" but there's no guard preventing this from repeating. Warrants verification that the upstream output processor always accumulates into a list after the first output.

5. tts-stream saves raw PCM even when --save output.wav is specified

File: scripts/tts-stream:110-114

The script always requests "response_format": "pcm" (line 82) and pipes via tee "$SAVE". If the user specifies --save output.wav, they get a raw PCM file with a .wav extension — no WAV header. By contrast, stream_tts_play.py correctly wraps PCM data in a WAV header when saving. The bash script should either request WAV format or add a header when saving.

6. Speaker embedding dimension validation is too permissive

File: vllm_omni/entrypoints/openai/serving_speech.py:158-159

if len(request.speaker_embedding) < 64 or len(request.speaker_embedding) > 8192:

The ECAPA-TDNN speaker encoder produces 1024-dim embeddings (confirmed in the example at line 182). The [64, 8192] range is extremely permissive — a 200-dim or 5000-dim vector would silently proceed and produce garbage. Consider tightening or at least logging a warning when dimension != 1024.

7. CUDA graph only enabled for 12Hz tokenizer (undocumented)

Both qwen3_tts_code2wav.py:110 and qwen3_tts_talker.py:1106-1112 check hasattr(decoder, "enable_cudagraph") which only exists on Qwen3TTSTokenizerV2Decoder (12Hz). Gracefully skips for 25Hz, but should be documented.


MINOR / NITS

8. Redundant import torch at line 416 in serving_speech.py

torch is already imported at the top (line 11). The inline import is a leftover from before the fix commit.

9. Dict, List, Optional from typing — deprecated in favor of builtins

File: cuda_graph_decoder_wrapper.py:12 — rest of codebase uses dict, list, int | None.

10. Copyright header attributes HT code to "Alibaba Qwen team"

File: cuda_graph_decoder_wrapper.py:1# Copyright 2026 The Alibaba Qwen team. — this is HT-authored code, copyright attribution is incorrect.

11. stream_tts_play.py hardcodes model to Base variant

Defaults to Qwen/Qwen3-TTS-12Hz-0.6B-Base while tts-stream defaults to 1.7B-CustomVoice. The inconsistency could confuse users.

12. Import ordering in serving_speech.py:11-13

import torch before import numpy as np — should be alphabetical per PEP 8/isort.


DESIGN OBSERVATIONS (positive notes)

  • Speaker embedding data flow: The fix in 8afc413 correctly transforms the flat speaker_embedding into the nested voice_clone_prompt.ref_spk_embedding structure. Clean fix.
  • Streaming architecture: Cursor-based approach with 0x7FFFFFFF WAV header placeholder is standard practice. Good design.
  • CUDA graph integration: Wrapping at the decoder level with transparent fallback for unsupported sizes/batch-sizes is clean. Capture sizes [25-300] cover typical chunks well.
  • SSRF protection: _check_ssrf properly blocks private/internal IPs for ref_audio URLs. Good security.

Summary

Severity Count Key Items
Critical 2 Broken import in example script; double-warmup overhead
Important 5 Dict mutation during streaming; PCM-as-WAV save bug; permissive embedding validation
Minor/Nit 5 Redundant import; deprecated typing; copyright; style

Most pressing: #1 (broken import — will crash at runtime) and #5 (tts-stream saves headerless WAV). The CUDA graph double-warmup (#2) is wasteful but not a correctness issue. Streaming cursor logic (#4) warrants verification against the upstream output processor behavior.

@marksverdhei marksverdhei force-pushed the ht branch 2 times, most recently from 7a1c549 to e9b2c56 Compare February 27, 2026 13:20
@marksverdhei marksverdhei merged commit 47ba1ad into ht Feb 27, 2026
@marksverdhei marksverdhei deleted the ht-rebase-reconcile branch March 6, 2026 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant