[Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to save VRAM #2429
Pull request overview
Optimizes the Qwen3-TTS Talker stage’s SpeechTokenizer GPU memory usage by moving only the encoder to CUDA (since Talker only encodes reference audio) and dropping the unused decoder to reduce VRAM footprint during voice-clone requests.
Changes:
- Move `tok.model.encoder` to GPU instead of the full tokenizer model.
- Explicitly release the tokenizer decoder module in Talker (`tok.model.decoder = None`).
- Refine the error message to reflect encoder-only device movement.
    )
    # Prefer GPU for encoder if available; otherwise keep CPU.
    # Only move encoder to GPU; decoder is unused by Talker (it only
    # calls tok.encode()) and would waste ~218 MiB bf16 VRAM.
The comment hard-codes a VRAM saving estimate (~218 MiB), but the PR description and measurements cite ~331 MiB saved. To avoid future confusion (and because this can vary by checkpoint/dtype), consider either updating the number to match the verified measurement or removing the specific MiB figure from the comment.
Suggested change:
    - # calls tok.encode()) and would waste ~218 MiB bf16 VRAM.
    + # calls tok.encode()) and would otherwise consume additional bf16 VRAM.
I tested it on the 0.6B model with and without this PR; it appears to save some memory during requests. cc @linyueqian
lishunyang12
left a comment
looks good, small and safe optimization. +1 to linyueqian's point about documenting the encode-only constraint on the cached instance.
…ve ~331 MiB VRAM

The Talker stage lazily loads the full SpeechTokenizer (encoder + decoder) onto GPU for voice-clone requests, but only calls tok.encode() which uses the encoder exclusively. The decoder sits idle on GPU.

Free the decoder before moving the encoder to device so it is never allocated on GPU.

Verified bit-identical encode output (0/400 codes differ) and successful e2e voice clone (HTTP 200) on H20 with Qwen3-TTS-12Hz-1.7B-Base.

Signed-off-by: Sy03 <1370724210@qq.com>
…AM (vllm-project#2429) Signed-off-by: Sy03 <1370724210@qq.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Summary
The Talker stage lazily loads the full `Qwen3TTSTokenizerV2Model` (encoder + decoder) onto GPU when processing voice-clone requests (`_ensure_speech_tokenizer_loaded`). However, Talker only calls `tok.encode()` -> `model.encoder.encode()` -- the decoder is never used. This wastes ~331 MiB of GPU memory (bf16) for the duration of the session.

This PR moves only the encoder sub-module to GPU and immediately frees the decoder weights, saving ~331 MiB VRAM with zero functional impact.
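The idea can be illustrated with a minimal sketch. `ToyTokenizerModel` and `keep_encoder_only` are hypothetical stand-ins, not the classes or helpers touched by this PR; the sketch only assumes a PyTorch module with separate `encoder` and `decoder` sub-modules and an encode path that uses the encoder alone.

```python
import torch
from torch import nn


class ToyTokenizerModel(nn.Module):
    """Hypothetical stand-in for a tokenizer model with encoder and decoder halves."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(16, 8)
        self.decoder = nn.Linear(8, 16)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Only the encoder is referenced on this path.
        return self.encoder(x)


def keep_encoder_only(model: nn.Module, device: torch.device) -> nn.Module:
    # Drop the decoder first so its weights are never copied to the device.
    model.decoder = None
    # Move just the encoder sub-module instead of calling model.to(device).
    model.encoder.to(device)
    return model


# Fall back to CPU when no GPU is present so the sketch still runs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = keep_encoder_only(ToyTokenizerModel(), device)
codes = model.encode(torch.randn(2, 16, device=device))
print(codes.shape, model.decoder is None)
```

Assigning `None` to a registered sub-module keeps the attribute name but leaves no parameters behind, so later calls such as `model.parameters()` only see the encoder.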
Motivation
Qwen3-TTS uses a 2-stage pipeline where the SpeechTokenizer serves different roles in each stage:

| Stage | Loader | Uses |
|---|---|---|
| Talker | `_ensure_speech_tokenizer_loaded()` | encoder (`tok.encode()` for ref audio) |
| Code2Wav | `_ensure_speech_tokenizer_loaded()` | decoder (`tok.model.decoder`) |

The existing code does `tok.model.to(dev)`, which moves both encoder and decoder to GPU. Since Talker never touches the decoder, those weights sit idle on the GPU.

Changes
`vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py` -- 3-line change in `_ensure_speech_tokenizer_loaded()`:
- `tok.model.to(dev)` -> `tok.model.encoder.to(dev)` (only move encoder to GPU)
- `tok.model.decoder = None` (free decoder weights from CPU memory as well)

Why this is safe
- `tok.encode()` only calls `self.encoder.encode()` -- verified in `Qwen3TTSTokenizerV2Model.encode()`. The decoder is not referenced anywhere in the encode path.
- `tok.model.dtype` still works -- after setting `decoder = None`, the model still has encoder parameters, so `nn.Module.dtype` resolves correctly.
- Code2Wav is unaffected -- it runs in a separate EngineCore process with its own `_ensure_speech_tokenizer_loaded()`.
- Encode output is bit-identical -- verified on H20 GPU with `Qwen3-TTS-12Hz-1.7B-Base`, 0/400 codes differ between baseline and optimized.

Test Plan
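A bit-identical check of the kind reported as "0/400 codes differ" could be sketched as below. This is not the script used for this PR: `count_mismatches` is a hypothetical helper, and the code values are placeholder data standing in for the baseline and optimized encoder outputs.

```python
from typing import Sequence


def count_mismatches(baseline: Sequence[int], candidate: Sequence[int]) -> int:
    """Count positions where two equally long code sequences disagree."""
    if len(baseline) != len(candidate):
        raise ValueError("code sequences must have the same length")
    return sum(a != b for a, b in zip(baseline, candidate))


# Placeholder data: in a real check these would be the codes produced by
# tok.encode() before and after the change, for the same reference audio.
baseline_codes = [(7 * i) % 2048 for i in range(400)]
optimized_codes = list(baseline_codes)

mismatches = count_mismatches(baseline_codes, optimized_codes)
print(f"{mismatches}/{len(baseline_codes)} codes differ")
```

Comparing discrete codes for exact equality, rather than with a tolerance, is what makes the "bit-identical" claim meaningful.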
Verified on H20 (98 GB) with `Qwen3-TTS-12Hz-1.7B-Base`:

VRAM measurement
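One way to take such a measurement, shown only as a sketch, is to read `torch.cuda.memory_allocated()` before and after moving modules to the GPU. How the figures for this PR were actually collected is not stated here; `allocated_mib`, `mib_used_by`, and `build` are hypothetical helpers, and the toy layers are far smaller than the real tokenizer.

```python
import torch
from torch import nn


def allocated_mib() -> float:
    """MiB currently held by PyTorch tensors on the default GPU (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    torch.cuda.synchronize()
    return torch.cuda.memory_allocated() / 2**20


def mib_used_by(move) -> float:
    """Return how much allocated memory grows while the result of move() is alive."""
    before = allocated_mib()
    kept = move()  # hold a reference so the tensors stay allocated while we read
    after = allocated_mib()
    del kept
    return after - before


def build() -> nn.ModuleDict:
    # Toy encoder/decoder pair; sizes are arbitrary placeholders.
    return nn.ModuleDict(
        {"encoder": nn.Linear(1024, 1024), "decoder": nn.Linear(1024, 1024)}
    )


device = "cuda" if torch.cuda.is_available() else "cpu"
full_mib = mib_used_by(lambda: build().to(device))
encoder_only_mib = mib_used_by(lambda: build()["encoder"].to(device))
print(f"full: {full_mib:.1f} MiB, encoder only: {encoder_only_mib:.1f} MiB")
```

On a machine without CUDA both numbers are 0.0, so the sketch is only informative when run on a GPU.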
E2E tests: