feat: reconcile HT TTS features onto upstream main #17
Self-Review Findings

Fixed (commit 8afc413)
Critical: Streaming format validation missing
Speed + streaming
Minor fixes: misleading

Remaining Notes (non-blocking)

🤖 Generated with Claude Code
Code Review: PR #17 (Reconcile HT TTS Features)

Overall, the work is solid and well-structured. The streaming architecture, speaker embedding flow, and CUDA graph wrapper are all thoughtfully designed. Findings are organized by severity below.

CRITICAL
1. Broken import in

| Severity | Count | Key Items |
|---|---|---|
| Critical | 2 | Broken import in example script; double-warmup overhead |
| Important | 5 | Dict mutation during streaming; PCM-as-WAV save bug; permissive embedding validation |
| Minor/Nit | 5 | Redundant import; deprecated typing; copyright; style |
Most pressing: #1 (broken import — will crash at runtime) and #5 (tts-stream saves headerless WAV). The CUDA graph double-warmup (#2) is wasteful but not a correctness issue. Streaming cursor logic (#4) warrants verification against the upstream output processor behavior.
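Finding #5 concerns raw PCM being written to a `.wav` file without a header. A minimal sketch of one possible fix, using only Python's standard `wave` module, follows. The function name `pcm_to_wav_bytes` and the defaults (24 kHz, mono, 16-bit) are assumptions for illustration; the PR does not state the actual audio format, so these values would need to match whatever the server emits.

```python
import io
import wave


def pcm_to_wav_bytes(
    pcm: bytes,
    sample_rate: int = 24000,  # assumed; not stated in the PR
    channels: int = 1,         # assumed mono
    sample_width: int = 2,     # assumed 16-bit samples (2 bytes)
) -> bytes:
    """Wrap raw PCM samples in a RIFF/WAVE container."""
    frame_size = channels * sample_width
    if len(pcm) % frame_size != 0:
        raise ValueError("PCM length is not a whole number of frames")

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)  # header sizes are patched when the writer closes
    return buf.getvalue()
```

Because the header stores the total data length, a streaming client would typically buffer the PCM chunks and call a helper like this once the stream ends, rather than writing the header up front.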
7a1c549 to e9b2c56
Purpose
Reconcile all HT-specific TTS features from the old `ht` branch onto the current upstream `main` architecture. Upstream PR vllm-project#1161 replaced the monolithic `modeling_qwen3_tts.py` with a disaggregated two-stage pipeline (`qwen3_tts_talker.py` + `qwen3_tts_code2wav.py`). This PR ports all viable HT features to the new architecture, dropping those superseded by upstream.

Features ported
- `/v1/audio/speech` `StreamingResponse`
- `speaker_embedding` API parameter
- `ref_spk_embedding` at API level
- `tts-stream` bash tool
- `stream_tts_play.py` Python client
- `tts-test.sh` test script

Features dropped (superseded by upstream)
- `generate_codes()`
- `_LocalPredictorKVCache` is the vLLM-native equivalent
- `torch.compile` on code predictor
- `CUDAGraphMode.NONE` for the code predictor
- `generate_streaming()`

Test Plan
- `scripts/tts-test.sh "Hello world"`: verify audio output works unchanged
- `scripts/tts-stream "Hello streaming world"`: verify progressive audio playback
- `examples/online_serving/qwen3_tts/speaker_embedding_interpolation.py` to extract embedding, then POST with `speaker_embedding` param

Test Result
Pending server-side validation. All Python files pass `ast.parse()` syntax validation. Branch is cleanly rebased on upstream `main`.

Essential Elements of an Effective PR Description Checklist
README.md.

🤖 Generated with Claude Code
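The `ast.parse()` syntax validation mentioned under Test Result can be sketched as below. The helper names `syntax_error_in` and `check_tree` are hypothetical, not taken from the PR. This kind of check only confirms that each file parses. It does not resolve imports, so it would not catch a broken import like the one flagged in the review.

```python
import ast
from pathlib import Path
from typing import Optional


def syntax_error_in(source: str, filename: str = "<string>") -> Optional[str]:
    """Return a short message if `source` fails to parse, else None."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def check_tree(root: str) -> list[str]:
    """Collect syntax errors for every .py file under `root`."""
    errors = []
    for path in sorted(Path(root).rglob("*.py")):
        message = syntax_error_in(path.read_text(encoding="utf-8"), str(path))
        if message is not None:
            errors.append(message)
    return errors
```

Note that a file containing `import does_not_exist` passes this check, since the import is only resolved at runtime.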