[Perf][Fish Speech] Free unused DAC codec components to save VRAM#2430
Conversation
Force-pushed from a9d3c68 to f0c1dd1.
Pull request overview
Optimizes Fish Speech S2 Pro’s DAC codec memory footprint by pruning stage-specific unused codec components after loading, reducing GPU VRAM usage for both the encoder (voice cloning) and decoder (waveform synthesis) stages.
Changes:
- In the DAC encoder loader, drops the codec `decoder` module, since the encode path never calls it.
- In the DAC decoder loader, drops the codec `encoder` plus the quantizer's encode-only submodules (`pre_module`, `downsample`), since the decode path never calls them.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `vllm_omni/model_executor/models/fish_speech/fish_speech_dac_decoder.py` | Removes encode-only DAC components in the decode stage to cut GPU memory usage. |
| `vllm_omni/model_executor/models/fish_speech/dac_encoder.py` | Removes decode-only DAC components in the encode stage to cut GPU memory usage. |
```python
device = self.vllm_config.device_config.device
codec = codec.to(device=device, dtype=torch.float32)
codec.eval()
# Decode path only uses quantizer.decode() + decoder; the encoder
# and quantizer's encode-only components (pre_module, downsample)
# are never called and would waste ~1,067 MiB GPU memory.
codec.encoder = None
codec.quantizer.pre_module = None
codec.quantizer.downsample = None
```
These components are freed only after codec.to(device=...) has already moved the full codec (including encoder/pre_module/downsample) onto the GPU. That can still cause a peak-VRAM OOM during loading and adds unnecessary host→device transfer time. Consider removing the unused submodules before calling .to(...) so they are never transferred/allocated on the target device.
Suggested change:

```diff
-device = self.vllm_config.device_config.device
-codec = codec.to(device=device, dtype=torch.float32)
-codec.eval()
-# Decode path only uses quantizer.decode() + decoder; the encoder
-# and quantizer's encode-only components (pre_module, downsample)
-# are never called and would waste ~1,067 MiB GPU memory.
-codec.encoder = None
-codec.quantizer.pre_module = None
-codec.quantizer.downsample = None
+# Decode path only uses quantizer.decode() + decoder; the encoder
+# and quantizer's encode-only components (pre_module, downsample)
+# are never called and would waste ~1,067 MiB GPU memory.
+if hasattr(codec, "encoder"):
+    codec.encoder = None
+quantizer = getattr(codec, "quantizer", None)
+if quantizer is not None:
+    if hasattr(quantizer, "pre_module"):
+        quantizer.pre_module = None
+    if hasattr(quantizer, "downsample"):
+        quantizer.downsample = None
+device = self.vllm_config.device_config.device
+codec = codec.to(device=device, dtype=torch.float32)
+codec.eval()
```
```python
codec = codec.to(device=device, dtype=dtype)
codec.eval()
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None
```
codec.decoder is removed only after the full codec has been moved to the target device via .to(...). If device is CUDA, this still allocates/transfers the decoder weights and can OOM on smaller GPUs despite being freed immediately after. Consider setting codec.decoder = None (or otherwise pruning) before .to(...) so the decoder is never moved/allocated on the target device.
Suggested change:

```diff
-codec = codec.to(device=device, dtype=dtype)
-codec.eval()
-# Encoder path only uses encoder + quantizer.forward(); the decoder
-# is never called and would waste ~208 MiB GPU memory.
-codec.decoder = None
+# Encoder path only uses encoder + quantizer.forward(); the decoder
+# is never called and would waste ~208 MiB GPU memory. Prune it before
+# moving the model to the target device to avoid unnecessary allocation.
+codec.decoder = None
+codec = codec.to(device=device, dtype=dtype)
+codec.eval()
```
```python
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None
```
The PR description’s E2E test command references tests/e2e/online_serving/test_fish_speech.py, but that file doesn’t exist in this repo (Fish Speech tests appear under tests/entrypoints/openai_api/test_serving_speech.py). Please update the Test Plan in the PR description to point at the correct test(s).
lishunyang12
left a comment
left a couple comments -- nice savings for a small change.
… VRAM

Fish Speech S2 Pro loads the full DAC codec (encoder + quantizer + decoder) into GPU in both stages, but each stage only uses a subset:

- Encoder stage (dac_encoder.py): only uses encoder + quantizer.forward() -> decoder is unused, wasting ~208 MiB
- Decoder stage (fish_speech_dac_decoder.py): only uses quantizer.decode() + decoder -> encoder, quantizer.pre_module, and quantizer.downsample are unused, wasting ~1,067 MiB

Free the unused components before moving to device so they are never allocated on GPU. Verified bit-identical output and successful e2e encode/decode with real codec.pth weights on H20.

Signed-off-by: Sy03 <1370724210@qq.com>
Force-pushed from f0c1dd1 to 7bd0dd7.
linyueqian
left a comment
LGTM. Clean, minimal change with solid VRAM savings (~1.2 GiB combined across both processes).
The pruning is correctly placed before .to(device) in both files, so unused components are never transferred to GPU. The component names match build_dac_codec() in dac_utils.py, and since encoder/decoder stages run in separate EngineCore processes, the _codec_cache is not shared.
Note to other reviewers: the existing Copilot and inline comments about "pruning after .to()" misread the diff -- the pruning already happens before .to(). The gc.collect() / torch.cuda.empty_cache() suggestion is also unnecessary since these are CPU-only tensors being dereferenced before any device transfer.
Minor: PR description references tests/e2e/online_serving/test_fish_speech.py which doesn't exist -- actual test is tests/model_executor/models/test_fish_speech_regressions.py.
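Regarding the note above that gc.collect() / torch.cuda.empty_cache() is unnecessary: a quick, hypothetical sanity check (toy modules, not part of this PR) is to prune on CPU and confirm the pruned weights never show up in the CUDA allocator:

```python
import torch
import torch.nn as nn

# Prune on CPU first, then move to the device. The pruned weights are never
# allocated on the GPU, so there is nothing for gc.collect() /
# torch.cuda.empty_cache() to reclaim afterwards.
model = nn.ModuleDict({
    "used": nn.Linear(4096, 4096),
    "unused": nn.Linear(4096, 4096),
})
model["unused"] = nn.Identity()  # drop the unused half while it is still on CPU
if torch.cuda.is_available():
    model = model.to("cuda")
    # Reflects only the weights that were actually kept (~64 MiB here).
    print(torch.cuda.memory_allocated())
```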
…lm-project#2430) Signed-off-by: Sy03 <1370724210@qq.com>
Summary
Fish Speech S2 Pro loads the full DAC codec (~1,789 MiB fp32) into GPU in both the encoder stage (voice cloning) and decoder stage (waveform synthesis), but each stage only uses a subset of the codec's components. This PR frees the unused components after loading, saving ~1,275 MiB total across both stages.
Motivation
The DAC codec has three main components with asymmetric sizes:
| Component | Used by encode path (`quantizer.forward()`) | Used by decode path (`quantizer.decode()`) |
|---|---|---|
| `encoder` | ✓ | |
| `decoder` | | ✓ |
| `quantizer.pre_module` (8-layer transformer) | ✓ | |
| `quantizer.post_module` (8-layer transformer) | ✓ | ✓ |
| `quantizer.downsample` | ✓ | |
| `quantizer.upsample` | ✓ | ✓ |
| `quantizer.codebooks` | ✓ | ✓ |

The encode path (`codec.encode()`) calls `encoder` -> `quantizer.forward()`, which uses all quantizer sub-components.

The decode path (`quantizer.decode()` -> `decoder()`) only uses `semantic_quantizer.from_codes`, `quantizer.from_codes`, `post_module`, `upsample`, and `decoder`. It never touches `encoder`, `quantizer.pre_module`, or `quantizer.downsample`.
Changes

- `dac_encoder.py` -- 1 line after `codec.eval()`:
- `fish_speech_dac_decoder.py` -- 3 lines after `codec.eval()`:
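For reference, the added lines, condensed from the diff context quoted in the review threads above (`codec` is the DAC codec object built by each loader):

```python
# dac_encoder.py (encode stage): the decode path is never used here.
codec.decoder = None

# fish_speech_dac_decoder.py (decode stage): the encoder and the quantizer's
# encode-only submodules are never used here.
codec.encoder = None
codec.quantizer.pre_module = None
codec.quantizer.downsample = None
```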
Why this is safe

Both encode and decode paths were verified to produce bit-identical output after freeing unused components (same model instance, same weights):
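The original verification snippet is not reproduced here; a minimal stand-in illustrating the same kind of check (toy module with random init, not the real DAC codec) might look like this:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(8, 8)
        self.unused = nn.Linear(8, 8)  # analogous to the pruned codec parts

    def forward(self, x):
        return self.used(x)  # forward never touches self.unused

model = Toy().eval()
x = torch.randn(2, 8)
with torch.no_grad():
    ref = model(x)
    model.unused = None  # prune the never-called submodule
    assert torch.equal(model(x), ref)  # output stays bit-identical
```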
The freed components are never referenced in the active code path -- PyTorch's forward pass only traverses explicitly called sub-modules.
The encoder and decoder stages run in separate EngineCore processes, so the global `_codec_cache` in `dac_encoder.py` is not shared between them.

VRAM Measurements (H20, fp32)
| Stage | Components freed | VRAM saved |
|---|---|---|
| Encoder stage (`dac_encoder.py`) | `decoder` | ~208 MiB |
| Decoder stage (`fish_speech_dac_decoder.py`) | `encoder`, `quantizer.pre_module`, `quantizer.downsample` | ~1,067 MiB |
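The MiB figures above can be approximated by summing parameter sizes of the pruned submodules; a hypothetical helper (not part of the PR, parameters only, ignoring buffers):

```python
import torch.nn as nn

def param_mib(module: nn.Module) -> float:
    # fp32 footprint of a submodule, in MiB, from its parameter tensors.
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 2**20
```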
Test Plan

Bit-exact verification (no model weights needed -- uses random init):
E2E (requires Fish Speech S2 Pro model):