[Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to VRAM by Sy0307 · Pull Request #2429 · vllm-project/vllm-omni

Sy0307 · 2026-04-01T19:31:55Z

Summary

The Talker stage lazily loads the full Qwen3TTSTokenizerV2Model (encoder + decoder) onto GPU when processing voice-clone requests (_ensure_speech_tokenizer_loaded). However, Talker only calls tok.encode() -> model.encoder.encode() -- the decoder is never used. This wastes ~331 MiB of GPU memory (bf16) for the duration of the session.

This PR moves only the encoder sub-module to GPU and immediately frees the decoder weights, saving ~331 MiB VRAM with zero functional impact.

Motivation

Qwen3-TTS uses a 2-stage pipeline where the SpeechTokenizer serves different roles in each stage:

Stage	Component	Uses encoder	Uses decoder
Stage 0 (Talker)	`_ensure_speech_tokenizer_loaded()`	Yes (`tok.encode()` for ref audio)	No
Stage 1 (Code2Wav)	`_ensure_speech_tokenizer_loaded()`	No	Yes (`tok.model.decoder`)

The existing code does tok.model.to(dev), which moves both encoder and decoder to GPU. Since Talker never touches the decoder, those weights sit idle on the GPU.
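To make the cost of an idle decoder concrete, the weight footprint of each sub-module can be estimated by summing parameter sizes. The sketch below uses a hypothetical `ToyTokenizerModel` with arbitrary layer sizes, not the real Qwen3TTSTokenizerV2Model; only the accounting pattern is meant to carry over.

```python
import torch
from torch import nn


class ToyTokenizerModel(nn.Module):
    """Hypothetical stand-in with an encoder and a decoder sub-module."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(256, 256)
        self.decoder = nn.Linear(256, 512)


def param_bytes(module):
    """Sum the storage size of all parameters in a module."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


model = ToyTokenizerModel().to(torch.bfloat16)  # bf16 uses 2 bytes per element
enc_bytes = param_bytes(model.encoder)
dec_bytes = param_bytes(model.decoder)
full_bytes = param_bytes(model)

# model.to(dev) would copy full_bytes; model.encoder.to(dev) copies only enc_bytes.
print(f"encoder={enc_bytes} B, decoder={dec_bytes} B, full={full_bytes} B")
```

Parameter bytes are a lower bound on device memory; buffers and allocator overhead are not counted here.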

Changes

vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py -- 3-line change in _ensure_speech_tokenizer_loaded():

tok.model.to(dev) -> tok.model.encoder.to(dev) (only move encoder to GPU)
Add tok.model.decoder = None (free decoder weights from CPU memory as well)
Update error message to reflect the narrower scope
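The change described above can be sketched on a toy module. This is an illustration under assumptions, not the code from the PR: `ToyTokenizerModel` and `keep_encoder_only` are hypothetical names, and the real method also handles loading and error reporting.

```python
import torch
from torch import nn


class ToyTokenizerModel(nn.Module):
    """Hypothetical stand-in for a tokenizer model with two sub-modules."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 8)
        self.decoder = nn.Linear(8, 16)


def keep_encoder_only(model, device):
    """Drop the decoder, then move only the encoder to the target device."""
    # Assigning None to a registered sub-module is allowed by nn.Module and
    # releases the reference to its weights.
    model.decoder = None
    model.encoder.to(device)
    return model


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = keep_encoder_only(ToyTokenizerModel(), device)

# Only encoder parameters remain registered on the model.
remaining = [name for name, _ in model.named_parameters()]
print(remaining)
```

Dropping the decoder before the move means its weights are never copied to the device in the first place.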

Why this is safe

tok.encode() only calls self.encoder.encode() -- verified in Qwen3TTSTokenizerV2Model.encode(). The decoder is not referenced anywhere in the encode path.
tok.model.dtype still works -- after setting decoder = None, the model retains its encoder parameters, so its dtype property resolves correctly from them.
Code2Wav is unaffected -- it runs in a separate EngineCore process with its own _ensure_speech_tokenizer_loaded().
Encode output is bit-identical -- verified on H20 GPU with Qwen3-TTS-12Hz-1.7B-Base, 0/400 codes differ between baseline and optimized.
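The first two safety arguments can be checked on a toy module. The sketch below assumes a hypothetical `ToyCodec` whose `encode()` touches only the encoder, mirroring the structure described above; it says nothing about the real model's internals.

```python
import torch
from torch import nn


class ToyCodec(nn.Module):
    """Hypothetical codec: encode() uses the encoder, decode() the decoder."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.decoder = nn.Linear(4, 8)

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)


codec = ToyCodec().to(torch.bfloat16)
x = torch.ones(2, 8, dtype=torch.bfloat16)

with torch.no_grad():
    before = codec.encode(x)
    codec.decoder = None          # drop the unused sub-module
    after = codec.encode(x)       # the encode path is unaffected

# The dtype can still be resolved from the remaining encoder parameters.
remaining_dtype = next(codec.parameters()).dtype

# The decode path is now unusable, which is the constraint to document.
try:
    codec.decode(after)
    decode_failed = False
except TypeError:
    decode_failed = True
```

The failing `decode()` call is the flip side of the optimization: any instance treated this way is encode-only.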

Test Plan

Verified on H20 (98 GB) with Qwen3-TTS-12Hz-1.7B-Base:

# Bit-exact comparison: baseline (full model on GPU) vs optimized (encoder only).
# Qwen3TTSTokenizer, model_dir, dev, wav, and sr are assumed to be defined by the test script.
import torch

tok1 = Qwen3TTSTokenizer.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
tok1.model.to(dev)  # baseline: encoder and decoder on GPU
codes1 = tok1.encode(wav, sr=sr).audio_codes[0]

tok2 = Qwen3TTSTokenizer.from_pretrained(model_dir, torch_dtype=torch.bfloat16)
tok2.model.encoder.to(dev)  # optimized: encoder only
tok2.model.decoder = None   # drop the unused decoder
codes2 = tok2.encode(wav, sr=sr).audio_codes[0]

assert torch.equal(codes1, codes2)  # bit-identical
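A figure such as "0/400 codes differ" can be produced with an element-wise comparison. The helper below is a hypothetical sketch; the tensors are synthetic placeholders, and the 25 x 16 shape is arbitrary, chosen only so the total is 400.

```python
import torch


def count_mismatches(a, b):
    """Return (number of differing elements, total elements) for two code tensors."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
    return int((a != b).sum().item()), a.numel()


# Synthetic stand-ins for the baseline and optimized audio codes.
codes_baseline = torch.arange(400).reshape(25, 16)
codes_optimized = codes_baseline.clone()

differing, total = count_mismatches(codes_baseline, codes_optimized)
print(f"{differing}/{total} codes differ")
```

Reporting the count alongside `torch.equal` is useful when a comparison fails, since it shows how far apart the two outputs are.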

VRAM measurement

Config	VRAM	Saved
Full model on GPU (baseline)	504.1 MiB	--
Encoder only on GPU (this PR)	173.3 MiB	330.8 MiB
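The PR does not state how these figures were collected. One possible approach, shown here as an assumption rather than the author's method, is to read `torch.cuda.memory_allocated()` before and after loading. The helper names are hypothetical, and the last lines only redo the arithmetic of the "Saved" column.

```python
import torch

MIB = 1024 ** 2


def allocated_mib():
    """MiB currently allocated by PyTorch on the default CUDA device (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    torch.cuda.synchronize()
    return torch.cuda.memory_allocated() / MIB


def vram_delta_mib(load_fn):
    """Call load_fn and report how many MiB of CUDA memory its result holds."""
    before = allocated_mib()
    obj = load_fn()  # keep the result alive so its memory is still counted
    after = allocated_mib()
    return after - before, obj


# Arithmetic behind the "Saved" column, using the figures from the table.
baseline_mib, encoder_only_mib = 504.1, 173.3
saved_mib = round(baseline_mib - encoder_only_mib, 1)
saved_fraction = saved_mib / baseline_mib
print(f"saved {saved_mib} MiB ({saved_fraction:.1%} of baseline)")
```

`memory_allocated()` counts tensors owned by PyTorch's allocator only, so numbers from tools that report whole-process usage may differ.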

E2E tests:

python -m pytest tests/e2e/online_serving/test_qwen3_tts.py -v --timeout=300
python -m pytest tests/e2e/online_serving/test_qwen3_tts_base.py -v --timeout=300

hsliuustc0106 · 2026-04-01T22:13:26Z

@linyueqian

Copilot

Pull request overview

Optimizes the Qwen3-TTS Talker stage’s SpeechTokenizer GPU memory usage by moving only the encoder to CUDA (since Talker only encodes reference audio) and dropping the unused decoder to reduce VRAM footprint during voice-clone requests.

Changes:

Move tok.model.encoder to GPU instead of the full tokenizer model.
Explicitly release the tokenizer decoder module in Talker (tok.model.decoder = None).
Refine the error message to reflect encoder-only device movement.

Copilot · 2026-04-01T22:15:56Z

        )
-        # Prefer GPU for encoder if available; otherwise keep CPU.
+        # Only move encoder to GPU; decoder is unused by Talker (it only
+        # calls tok.encode()) and would waste ~218 MiB bf16 VRAM.


The comment hard-codes a VRAM saving estimate (~218 MiB), but the PR description and measurements cite ~331 MiB saved. To avoid future confusion (and because this can vary by checkpoint/dtype), consider either updating the number to match the verified measurement or removing the specific MiB figure from the comment.

Suggested change

-        # calls tok.encode()) and would waste ~218 MiB bf16 VRAM.
+        # calls tok.encode()) and would otherwise consume additional bf16 VRAM.

hsliuustc0106 · 2026-04-02T03:13:30Z

I tested this on the 0.6B model with and without this PR, and it appears to save some memory during requests. cc @linyueqian

lishunyang12

looks good, small and safe optimization. +1 to linyueqian's point about documenting the encode-only constraint on the cached instance.

…ve ~331 MiB VRAM

The Talker stage lazily loads the full SpeechTokenizer (encoder + decoder) onto GPU for voice-clone requests, but only calls tok.encode(), which uses the encoder exclusively. The decoder sits idle on GPU.

Free the decoder before moving the encoder to device so it is never allocated on GPU.

Verified bit-identical encode output (0/400 codes differ) and successful e2e voice clone (HTTP 200) on H20 with Qwen3-TTS-12Hz-1.7B-Base.

Signed-off-by: Sy03 <1370724210@qq.com>

linyueqian

LGTM

…AM (vllm-project#2429)

Signed-off-by: Sy03 <1370724210@qq.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Sy0307 requested a review from hsliuustc0106 as a code owner April 1, 2026 19:31

Sy0307 changed the title ~~[Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to save ~331 MiB VRAM~~ [Perf][Qwen3-TTS] Free unused decoder in Talker SpeechTokenizer to VRAM Apr 1, 2026

Sy0307 force-pushed the perf/qwen3-tts-talker-free-decoder branch from b84468b to b0b04a1 Compare April 1, 2026 19:36

hsliuustc0106 requested a review from Copilot April 1, 2026 22:13

Copilot started reviewing on behalf of hsliuustc0106 April 1, 2026 22:13

Copilot AI reviewed Apr 1, 2026

linyueqian reviewed Apr 2, 2026

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py Outdated

lishunyang12 reviewed Apr 2, 2026

Comment thread vllm_omni/model_executor/models/qwen3_tts/qwen3_tts_talker.py Outdated

Sy0307 force-pushed the perf/qwen3-tts-talker-free-decoder branch from b0b04a1 to 3ec2dc3 Compare April 4, 2026 06:10

Sy0307 added 2 commits April 4, 2026 14:16

Merge branch 'main' into perf/qwen3-tts-talker-free-decoder

9f0fab0

Merge branch 'main' into perf/qwen3-tts-talker-free-decoder

068a6fa

linyueqian added the "ready" label (label to trigger buildkite CI) Apr 5, 2026

linyueqian approved these changes Apr 5, 2026

Merge branch 'main' into perf/qwen3-tts-talker-free-decoder

44b0d88

linyueqian enabled auto-merge (squash) April 5, 2026 18:44

linyueqian merged commit f6cfacd into vllm-project:main Apr 5, 2026
6 of 8 checks passed

linyueqian merged 4 commits into vllm-project:main from Sy0307:perf/qwen3-tts-talker-free-decoder
