[Feat][Qwen3-TTS] Better Qwen3-TTS online serving demo #1857
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 29be07e26b
    usable = len(message) - (len(message) % 2)
    if usable > 0:
        yield np.frombuffer(message[:usable], dtype=np.int16).copy()
Preserve PCM byte alignment across WebSocket frames
This drops trailing odd bytes on each WebSocket frame instead of carrying them into the next frame, so PCM samples are lost whenever the server splits audio on odd byte boundaries. The streaming endpoint explicitly allows chunk boundaries that are not sample-aligned (the WebSocket tests include 3-byte and 1-byte frames), so this path can deterministically corrupt/shorten audio in WebSocket transport; the function should buffer leftover bytes between frames like stream_pcm_chunks() does for HTTP chunks.
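The buffering suggested in the review could look roughly like the sketch below. It is not the PR's code: the function name `iter_aligned_pcm` and its `sample_width` parameter are hypothetical, and it assumes 16-bit PCM arriving as `bytes` frames. It yields sample-aligned byte chunks and carries any odd trailing byte into the next frame instead of discarding it.

```python
from typing import Iterable, Iterator


def iter_aligned_pcm(frames: Iterable[bytes], sample_width: int = 2) -> Iterator[bytes]:
    """Yield sample-aligned chunks, carrying leftover bytes between frames.

    Hypothetical helper for illustration; sample_width=2 assumes int16 PCM.
    """
    leftover = b""
    for frame in frames:
        # Prepend whatever did not fit into a whole sample last time.
        data = leftover + frame
        usable = len(data) - (len(data) % sample_width)
        if usable > 0:
            yield data[:usable]
        # Keep the partial sample (0 or 1 byte for int16) for the next frame.
        leftover = data[usable:]
    # A partial sample left at end of stream cannot be decoded and is dropped.


# Frames split on odd byte boundaries, as in the 3-byte and 1-byte test case.
frames = [b"\x01\x00\x02", b"\x00", b"\x03\x00\x04", b"\x00"]
chunks = list(iter_aligned_pcm(frames))
assert b"".join(chunks) == b"\x01\x00\x02\x00\x03\x00\x04\x00"
```

Each yielded chunk has an even length, so it could be wrapped with `np.frombuffer(chunk, dtype=np.int16).copy()` exactly as in the snippet under review.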
Signed-off-by: linyueqian <linyueqian@outlook.com>
…TS playback

Gradio's built-in gr.Audio(streaming=True) plays each yielded chunk as a separate audio blob, causing audible gaps between chunks. This replaces it with a custom Web Audio API AudioWorklet player (inspired by KoljaB/RealtimeVoiceChat) that maintains a FIFO buffer queue and plays samples at the audio clock rate, eliminating inter-chunk gaps entirely.

Key changes:
- AudioWorklet-based player with FIFO queue for gap-free streaming
- Same-origin FastAPI proxy endpoint (/proxy/v1/audio/speech) to avoid CORS
- Browser-side fetch() with ReadableStream feeds PCM directly to worklet
- Live streaming stats dashboard: TTFP, RTF, speed, audio duration
- RTF bar with color-coded speed indicator (green/amber/red)
- RTF frozen at stream-end time (fixes 1.0x bug from playback wait)
- vLLM blue theme (primary #4A90D9) across all Gradio components
- Non-streaming mode falls back to standard gr.Audio component
- HTTP streaming default (full text synthesis, no sentence splitting gaps)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
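The FIFO queue idea behind the player can be modeled in a few lines. The actual player runs as a Web Audio AudioWorklet in the browser; the Python sketch below only illustrates the queueing logic under that assumption, and the class `PcmFifo` and its methods are hypothetical names, not code from this PR. Chunks of any size are pushed as they arrive, while the consumer pulls fixed-size blocks and gets silence on underrun rather than a gap between separate blobs.

```python
from collections import deque
from typing import Deque, List


class PcmFifo:
    """FIFO of sample chunks, drained in fixed-size blocks like an audio callback."""

    def __init__(self) -> None:
        self._chunks: Deque[List[int]] = deque()
        self._offset = 0  # read position inside the oldest chunk

    def push(self, samples: List[int]) -> None:
        # Producer side: called whenever a network chunk has been decoded.
        if samples:
            self._chunks.append(list(samples))

    def pull(self, n: int) -> List[int]:
        """Return exactly n samples, padding with silence (0) on underrun."""
        out: List[int] = []
        while len(out) < n and self._chunks:
            head = self._chunks[0]
            take = min(n - len(out), len(head) - self._offset)
            out.extend(head[self._offset:self._offset + take])
            self._offset += take
            if self._offset == len(head):
                # Oldest chunk fully consumed; move on to the next one.
                self._chunks.popleft()
                self._offset = 0
        out.extend([0] * (n - len(out)))
        return out

    def clear(self) -> None:
        # Drop everything queued, e.g. when a new generation starts.
        self._chunks.clear()
        self._offset = 0


fifo = PcmFifo()
fifo.push([1, 2, 3])
fifo.push([4, 5])
assert fifo.pull(4) == [1, 2, 3, 4]   # spans two chunks without a gap
assert fifo.pull(4) == [5, 0, 0, 0]   # underrun is padded with silence
```

Because the consumer reads across chunk boundaries, playback continuity depends only on the queue not running empty, not on how the server happened to split the stream.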
- Delete gradio_fastrtc_demo.py (superseded by AudioWorklet player in gradio_demo.py which provides gapless streaming without WebRTC overhead)
- Remove build_ws_config() and stream_pcm_chunks_ws() from tts_common.py
- Drop --enforce-eager from run_server.sh and run_gradio_demo.sh so the stage config controls CUDA graph usage (Stage 0 gets graphs, Stage 1 eager)
- Fix --ip → --host in run_gradio_demo.sh

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
ab988ce to 214239d
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
- Add per-task-type examples (CustomVoice/VoiceDesign/Base) that toggle with task selection and pre-fill relevant fields including ref audio URL
- Add Reset button that stops playback and clears inputs
- Remove player Ready state empty margin
- Use Gradio's native theme for consistent styling across all components
- Header with vLLM-Omni logo and "Served by" branding
- Lowercase "speaker" label

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
Looks good! A few suggestions: 1. The "--share" parameter is no longer used in uvicorn, so we can consider removing it.
- Pre-download ref_audio URL in proxy before forwarding to vLLM, so TTFP only measures synthesis time, not audio download latency
- Store large payloads (uploaded ref audio) server-side with request ID to avoid Gradio textbox truncating 1MB+ base64 strings
- Base examples use ref_audio_url (not file upload) for simplicity
- Abort previous fetch and double-clear worklet buffer on new generation
- Fix .then() JS event chain (removes Blocks.svelte map error)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
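Storing large payloads server-side under a request ID might be sketched as follows. This is an illustration only, assuming an in-memory dict is acceptable for a demo; `PayloadStore`, `put`, and `pop` are hypothetical names, not the PR's implementation. The point is that only the short ID passes through the Gradio textbox, while the large base64 string stays on the server until the proxy looks it up.

```python
import uuid
from typing import Dict, Optional


class PayloadStore:
    """In-memory store mapping a short request ID to a large request payload."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def put(self, payload: dict) -> str:
        # The ID is a 32-character hex string, small enough for any text field.
        request_id = uuid.uuid4().hex
        self._items[request_id] = payload
        return request_id

    def pop(self, request_id: str) -> Optional[dict]:
        # One-shot lookup: the payload is removed once the proxy has consumed it.
        return self._items.pop(request_id, None)


store = PayloadStore()
rid = store.put({"ref_audio": "<large base64 string>"})
payload = store.pop(rid)
assert payload == {"ref_audio": "<large base64 string>"}
assert store.pop(rid) is None  # a second lookup finds nothing
```

Removing the entry on first read keeps memory bounded for a demo; a longer-lived service would probably also want expiry for IDs that are never consumed.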
…1857) Signed-off-by: linyueqian <linyueqian@outlook.com>
Summary
- /v1/audio/speech/stream endpoint from [Feat][Qwen3-TTS] Support streaming audio output for websocket #1719
- build_ws_config() and stream_pcm_chunks_ws() helpers to tts_common.py for reuse across demos
- fetch_voices() returning empty list when server returns {"voices": []}

Test plan
python gradio_fastrtc_demo.py --api-base http://localhost:8000