[Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to VRAM #2429

Merged
linyueqian merged 4 commits into vllm-project:main from Sy0307:perf/qwen3-tts-talker-free-decoder
Apr 5, 2026
Contributor

@Sy0307 Sy0307 commented Apr 1, 2026

Summary

The Talker stage lazily loads the full Qwen3TTSTokenizerV2Model (encoder + decoder) onto GPU when processing voice-clone requests (_ensure_speech_tokenizer_loaded). However, Talker only calls tok.encode() -> model.encoder.encode() -- the decoder is never used. This wastes ~331 MiB of GPU memory (bf16) for the duration of the session.

This PR moves only the encoder sub-module to GPU and immediately frees the decoder weights, saving ~331 MiB VRAM with zero functional impact.

Motivation

Qwen3-TTS uses a 2-stage pipeline where the SpeechTokenizer serves different roles in each stage:

| Stage | Component | Uses encoder | Uses decoder |
|---|---|---|---|
| Stage 0 (Talker) | _ensure_speech_tokenizer_loaded() | Yes (tok.encode() for ref audio) | No |
| Stage 1 (Code2Wav) | _ensure_speech_tokenizer_loaded() | No | Yes (tok.model.decoder) |

The existing code does tok.model.to(dev), which moves both encoder and decoder to GPU. Since Talker never touches the decoder, those weights sit idle on the GPU.

Changes

vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py -- 3-line change in _ensure_speech_tokenizer_loaded():

  • tok.model.to(dev) -> tok.model.encoder.to(dev) (only move encoder to GPU)
  • Add tok.model.decoder = None (free decoder weights from CPU memory as well)
  • Update error message to reflect the narrower scope
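The shape of this change can be sketched with a stand-in module. `ToyTokenizerModel`, its layer sizes, and the `load_encoder_only` helper are hypothetical and are not the real `Qwen3TTSTokenizerV2Model` or the code in `qwen3_tts_talker.py`; only the `encoder` / `decoder` attribute names come from this PR. The sketch drops the decoder, then moves only the encoder, and falls back to CPU when CUDA is unavailable.

```python
import torch
from torch import nn


class ToyTokenizerModel(nn.Module):
    """Hypothetical stand-in with the same encoder/decoder split."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)


def load_encoder_only(model: nn.Module, device: torch.device) -> nn.Module:
    # Drop the decoder first so its weights are never copied to the device.
    model.decoder = None
    # Move only the sub-module that the encode path needs.
    model.encoder.to(device)
    return model


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = load_encoder_only(ToyTokenizerModel(), device)
print(model.decoder, next(model.encoder.parameters()).device)
```

Assigning `None` to a registered sub-module keeps the attribute present, so later code that accidentally reaches for `model.decoder` fails on `None` rather than on a missing attribute.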

Why this is safe

  1. tok.encode() only calls self.encoder.encode() -- verified in Qwen3TTSTokenizerV2Model.encode(). The decoder is not referenced anywhere in the encode path.

  2. tok.model.dtype still works -- after setting decoder = None, the model still has encoder parameters, so the dtype lookup (which is derived from the model's parameters) resolves correctly.

  3. Code2Wav is unaffected -- it runs in a separate EngineCore process with its own _ensure_speech_tokenizer_loaded().

  4. Encode output is bit-identical -- verified on H20 GPU with Qwen3-TTS-12Hz-1.7B-Base, 0/400 codes differ between baseline and optimized.
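Points 1 and 2 can be illustrated on a toy module. This is a minimal sketch under assumptions: `ToyTokenizerModel` and its `encode` method are hypothetical stand-ins, and the dtype is read from the first remaining parameter, which is assumed (not confirmed here) to mirror how `tok.model.dtype` is resolved.

```python
import torch
from torch import nn


class ToyTokenizerModel(nn.Module):
    """Hypothetical stand-in; encode() touches only the encoder."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return self.encoder(x)


model = ToyTokenizerModel().to(torch.bfloat16)
model.decoder = None

# Parameter iteration skips the removed sub-module...
remaining = [name for name, _ in model.named_parameters()]
# ...so a dtype derived from the first parameter is still defined.
dtype = next(model.parameters()).dtype
# The encode path still runs because it never references the decoder.
out = model.encode(torch.zeros(1, 8, dtype=dtype))
print(remaining, dtype, tuple(out.shape))
```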

Test Plan

Verified on H20 (98 GB) with Qwen3-TTS-12Hz-1.7B-Base:

# Bit-exact comparison: baseline (full model on GPU) vs optimized (encoder only)
tok1 = Qwen3TTSTokenizer.from_pretrained(dir, torch_dtype=torch.bfloat16)
tok1.model.to(dev)  # baseline
codes1 = tok1.encode(wav, sr=sr).audio_codes[0]

tok2 = Qwen3TTSTokenizer.from_pretrained(dir, torch_dtype=torch.bfloat16)
tok2.model.encoder.to(dev); tok2.model.decoder = None  # optimized
codes2 = tok2.encode(wav, sr=sr).audio_codes[0]

torch.equal(codes1, codes2)  # True -- bit-identical
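A mismatch count in the "N/M codes differ" form can be produced alongside `torch.equal`. The helper below is a hypothetical sketch, not part of the PR; the tensor shape and value range are arbitrary placeholders standing in for real `audio_codes`.

```python
import torch


def count_mismatches(a: torch.Tensor, b: torch.Tensor) -> tuple[int, int]:
    """Return (number of differing elements, total elements)."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
    # Move to CPU so tensors produced on different devices can be compared.
    a, b = a.cpu(), b.cpu()
    return int((a != b).sum().item()), a.numel()


# Placeholder data: integer codes with an arbitrary shape and range.
baseline = torch.randint(0, 2048, (25, 16))
optimized = baseline.clone()

differing, total = count_mismatches(baseline, optimized)
print(f"{differing}/{total} codes differ")
```

Reporting the count rather than a single boolean makes a regression easier to size if the two paths ever diverge.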

VRAM measurement

| Config | VRAM | Saved |
|---|---|---|
| Full model on GPU (baseline) | 504.1 MiB | -- |
| Encoder only on GPU (this PR) | 173.3 MiB | 330.8 MiB |
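As a rough cross-check, a sub-module's weight footprint can be estimated from its tensors before anything is moved to the GPU. The `weight_mib` helper below is a hypothetical sketch: it counts parameters and buffers only (no allocator overhead or activations), so it is not how the figures above were measured. The last lines simply redo the subtraction from the table.

```python
import torch
from torch import nn


def weight_mib(module: nn.Module) -> float:
    """Size of a module's parameters and buffers in MiB."""
    tensors = list(module.parameters()) + list(module.buffers())
    return sum(t.numel() * t.element_size() for t in tensors) / 2**20


# Toy layer: 1024 * 1024 bf16 weights at 2 bytes each.
layer = nn.Linear(1024, 1024, bias=False).to(torch.bfloat16)
toy_mib = weight_mib(layer)
print(toy_mib)

# Saving reported in the table: baseline minus encoder-only.
baseline_mib, encoder_only_mib = 504.1, 173.3
saved_mib = round(baseline_mib - encoder_only_mib, 1)
print(saved_mib)
```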

E2E tests:

python -m pytest tests/e2e/online_serving/test_qwen3_tts.py -v --timeout=300
python -m pytest tests/e2e/online_serving/test_qwen3_tts_base.py -v --timeout=300

@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 1, 2026 19:31
@Sy0307 Sy0307 changed the title [Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to save ~331 MiB VRAM [Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to VRAM Apr 1, 2026
@Sy0307 Sy0307 force-pushed the perf/qwen3-tts-talker-free-decoder branch from b84468b to b0b04a1 Compare April 1, 2026 19:36
@hsliuustc0106 hsliuustc0106 requested a review from Copilot April 1, 2026 22:13
@hsliuustc0106
Collaborator

@linyueqian

Contributor

Copilot AI left a comment


Pull request overview

Optimizes the Qwen3-TTS Talker stage’s SpeechTokenizer GPU memory usage by moving only the encoder to CUDA (since Talker only encodes reference audio) and dropping the unused decoder to reduce VRAM footprint during voice-clone requests.

Changes:

  • Move tok.model.encoder to GPU instead of the full tokenizer model.
  • Explicitly release the tokenizer decoder module in Talker (tok.model.decoder = None).
  • Refine the error message to reflect encoder-only device movement.


)
# Prefer GPU for encoder if available; otherwise keep CPU.
# Only move encoder to GPU; decoder is unused by Talker (it only
# calls tok.encode()) and would waste ~218 MiB bf16 VRAM.

Copilot AI Apr 1, 2026


The comment hard-codes a VRAM saving estimate (~218 MiB), but the PR description and measurements cite ~331 MiB saved. To avoid future confusion (and because this can vary by checkpoint/dtype), consider either updating the number to match the verified measurement or removing the specific MiB figure from the comment.

Suggested change
- # calls tok.encode()) and would waste ~218 MiB bf16 VRAM.
+ # calls tok.encode()) and would otherwise consume additional bf16 VRAM.

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py Outdated
@hsliuustc0106
Collaborator

image

I tested it on the 0.6B model w/o this PR; it looks like it helps save some memory during requests. cc @linyueqian

Collaborator

@lishunyang12 lishunyang12 left a comment


looks good, small and safe optimization. +1 to linyueqian's point about documenting the encode-only constraint on the cached instance.

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py Outdated
…ve ~331 MiB VRAM

The Talker stage lazily loads the full SpeechTokenizer (encoder + decoder)
onto GPU for voice-clone requests, but only calls tok.encode() which uses
the encoder exclusively. The decoder sits idle on GPU.

Free the decoder before moving the encoder to device so it is never
allocated on GPU. Verified bit-identical encode output (0/400 codes differ)
and successful e2e voice clone (HTTP 200) on H20 with Qwen3-TTS-12Hz-1.7B-Base.

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 force-pushed the perf/qwen3-tts-talker-free-decoder branch from b0b04a1 to 3ec2dc3 Compare April 4, 2026 06:10
@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 5, 2026
Collaborator

@linyueqian linyueqian left a comment


LGTM

@linyueqian linyueqian enabled auto-merge (squash) April 5, 2026 18:44
@linyueqian linyueqian merged commit f6cfacd into vllm-project:main Apr 5, 2026
6 of 8 checks passed