[Perf][Fish Speech] Free unused DAC codec components to save VRAM #2430

Merged
linyueqian merged 2 commits into vllm-project:main from Sy0307:perf/fish-speech-dac-free-unused-components on Apr 5, 2026

Conversation

@Sy0307 Sy0307 (Contributor) commented Apr 1, 2026

Summary

Fish Speech S2 Pro loads the full DAC codec (~1,789 MiB in fp32) onto the GPU in both the encoder stage (voice cloning) and the decoder stage (waveform synthesis), but each stage uses only a subset of the codec's components. This PR frees the unused components after loading, saving ~1,275 MiB in total across the two stages.

Motivation

The DAC codec has three top-level components (encoder, quantizer, decoder), and the quantizer itself splits into several sub-modules; their sizes are highly asymmetric:

| Component | Params | Size (fp32) | Used by encoder | Used by decoder |
| --- | --- | --- | --- | --- |
| encoder | 76.9M | 293.2 MiB | Yes | No |
| decoder | 54.1M | 206.4 MiB | No | Yes |
| quantizer.pre_module (8-layer transformer) | 109.1M | 416.1 MiB | Yes (forward()) | No |
| quantizer.post_module (8-layer transformer) | 109.1M | 416.1 MiB | Yes (forward()) | Yes (decode()) |
| quantizer.downsample | 21.0M | 80.1 MiB | Yes (forward()) | No |
| quantizer.upsample | 21.0M | 80.1 MiB | Yes (forward()) | Yes (decode()) |
| quantizer.codebooks | 0.3M | 1.1 MiB | Yes | Yes |
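
(The fp32 sizes follow directly from the parameter counts: params × 4 bytes, e.g. 76.9M params × 4 B ≈ 293 MiB.)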

The encode path (codec.encode()) calls encoder -> quantizer.forward(), which uses all quantizer sub-components.

The decode path (quantizer.decode() -> decoder()) only uses semantic_quantizer.from_codes, quantizer.from_codes, post_module, upsample, and decoder. It never touches encoder, quantizer.pre_module, or quantizer.downsample.
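
Put together, the stage-level call paths look roughly like this (a hypothetical sketch based on the description above, not the repo's exact code):

def encode_stage(codec, wav, lengths):            # voice cloning
    latents = codec.encoder(wav)
    codes, _ = codec.quantizer(latents)           # pre/post_module, down/upsample all run
    return codes

def decode_stage(codec, codes):                   # waveform synthesis
    z = codec.quantizer.decode(codes)             # from_codes -> post_module -> upsample
    return codec.decoder(z)                       # encoder, pre_module, downsample: never touched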

Changes

dac_encoder.py -- 1 line after codec.eval():

codec.decoder = None  # saves ~208 MiB

fish_speech_dac_decoder.py -- 3 lines after codec.eval():

codec.encoder = None                  # saves ~293 MiB
codec.quantizer.pre_module = None     # saves ~416 MiB
codec.quantizer.downsample = None     # saves ~80 MiB

Why this is safe

Both encode and decode paths were verified to produce bit-identical output after freeing unused components (same model instance, same weights):

ENC bit-identical: True    (same codes after freeing decoder)
DEC bit-identical: True    (max_diff = 0.0 after freeing encoder+pre+down)

The freed components are never referenced in the active code path -- PyTorch's forward pass only traverses explicitly called sub-modules.
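
This is easy to sanity-check outside the codec (a minimal, self-contained sketch in plain PyTorch, not repo code):

import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)   # registered, but never called in forward()

    def forward(self, x):
        return self.used(x)             # only explicitly called children execute

m = Toy().eval()
x = torch.randn(1, 4)
before = m(x)
m.unused = None                         # nn.Module allows assigning None over a child
assert torch.equal(before, m(x))        # output unchanged; freed weights can be GC'd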

The encoder and decoder stages run in separate EngineCore processes, so the global _codec_cache in dac_encoder.py is not shared between them.

VRAM Measurements (H20, fp32)

| Stage | Before | After | Saved |
| --- | --- | --- | --- |
| Encoder stage (dac_encoder.py) | 1,789.5 MiB | 1,581.8 MiB | 207.6 MiB (12%) |
| Decoder stage (fish_speech_dac_decoder.py) | 1,789.5 MiB | 722.0 MiB | 1,067.4 MiB (60%) |
| Combined (both processes) | 3,578.9 MiB | 2,303.8 MiB | 1,275.1 MiB (36%) |
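
A quick way to reproduce the per-stage numbers is an allocator probe around codec loading (a sketch; build_dac_codec is the repo helper from dac_utils.py, its import omitted here, and the rest is standard torch.cuda API):

import torch

def vram_mib() -> float:
    torch.cuda.synchronize()
    return torch.cuda.memory_allocated() / 2**20

base = vram_mib()
codec = build_dac_codec()               # weights still on CPU here
codec.encoder = None                    # decoder-stage pruning, before .to()
codec.quantizer.pre_module = None
codec.quantizer.downsample = None
codec = codec.to("cuda", dtype=torch.float32).eval()
print(f"codec VRAM: {vram_mib() - base:.1f} MiB")   # ~722 MiB for the decoder stage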

Test Plan

Bit-exact verification (no model weights needed -- uses random init):

import torch
# build_dac_codec is the repo helper defined in dac_utils.py (see review thread);
# adjust the import path to your checkout if it differs.
from vllm_omni.model_executor.models.fish_speech.dac_utils import build_dac_codec

dev = "cuda"
wav = torch.randn(1, 1, 44_100, device=dev)      # random waveform; shapes illustrative
fl = torch.tensor([wav.shape[-1]], device=dev)   # audio lengths

codec = build_dac_codec(); codec.to(dev, dtype=torch.float32).eval()
codes_base, _ = codec.encode(wav, fl)            # baseline encode
codec.decoder = None                             # free decoder
codes_opt, _ = codec.encode(wav, fl)             # optimized encode
assert torch.equal(codes_base, codes_opt)        # True

codec2 = build_dac_codec(); codec2.to(dev, dtype=torch.float32).eval()
z_base = codec2.quantizer.decode(codes_base)
audio_base = codec2.decoder(z_base)              # baseline decode
codec2.encoder = None; codec2.quantizer.pre_module = None
codec2.quantizer.downsample = None               # free encode-only quantizer components
z_opt = codec2.quantizer.decode(codes_base)
audio_opt = codec2.decoder(z_opt)                # optimized decode
assert torch.equal(audio_base, audio_opt)        # True (max_diff = 0.0)

E2E (requires Fish Speech S2 Pro model):

python -m pytest tests/e2e/online_serving/test_fish_speech.py -v --timeout=300

@Sy0307 Sy0307 requested a review from hsliuustc0106 as a code owner April 1, 2026 19:33
@Sy0307 Sy0307 changed the title from "[Perf][Fish Speech] Free unused DAC codec components to save ~1.2 GiB VRAM" to "[Perf][Fish Speech] Free unused DAC codec components to save VRAM" Apr 1, 2026
@Sy0307 Sy0307 force-pushed the perf/fish-speech-dac-free-unused-components branch from a9d3c68 to f0c1dd1 on April 1, 2026 19:37
@hsliuustc0106 hsliuustc0106 (Collaborator) commented:

@claude


Copilot AI left a comment


Pull request overview

Optimizes Fish Speech S2 Pro’s DAC codec memory footprint by pruning stage-specific unused codec components after loading, reducing GPU VRAM usage for both the encoder (voice cloning) and decoder (waveform synthesis) stages.

Changes:

  • In the DAC encoder loader, drops the codec decoder module since the encode path never calls it.
  • In the DAC decoder loader, drops the codec encoder plus quantizer encode-only submodules (pre_module, downsample) since the decode path never calls them.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| vllm_omni/model_executor/models/fish_speech/fish_speech_dac_decoder.py | Removes encode-only DAC components in the decode stage to cut GPU memory usage. |
| vllm_omni/model_executor/models/fish_speech/dac_encoder.py | Removes decode-only DAC components in the encode stage to cut GPU memory usage. |


Comment on lines +144 to +152
device = self.vllm_config.device_config.device
codec = codec.to(device=device, dtype=torch.float32)
codec.eval()
# Decode path only uses quantizer.decode() + decoder; the encoder
# and quantizer's encode-only components (pre_module, downsample)
# are never called and would waste ~1,067 MiB GPU memory.
codec.encoder = None
codec.quantizer.pre_module = None
codec.quantizer.downsample = None

Copilot AI Apr 1, 2026


These components are freed only after codec.to(device=...) has already moved the full codec (including encoder/pre_module/downsample) onto the GPU. That can still cause a peak-VRAM OOM during loading and adds unnecessary host→device transfer time. Consider removing the unused submodules before calling .to(...) so they are never transferred/allocated on the target device.

Suggested change

Before:

device = self.vllm_config.device_config.device
codec = codec.to(device=device, dtype=torch.float32)
codec.eval()
# Decode path only uses quantizer.decode() + decoder; the encoder
# and quantizer's encode-only components (pre_module, downsample)
# are never called and would waste ~1,067 MiB GPU memory.
codec.encoder = None
codec.quantizer.pre_module = None
codec.quantizer.downsample = None

After:

# Decode path only uses quantizer.decode() + decoder; the encoder
# and quantizer's encode-only components (pre_module, downsample)
# are never called and would waste ~1,067 MiB GPU memory.
if hasattr(codec, "encoder"):
    codec.encoder = None
quantizer = getattr(codec, "quantizer", None)
if quantizer is not None:
    if hasattr(quantizer, "pre_module"):
        quantizer.pre_module = None
    if hasattr(quantizer, "downsample"):
        quantizer.downsample = None
device = self.vllm_config.device_config.device
codec = codec.to(device=device, dtype=torch.float32)
codec.eval()

Comment on lines +57 to +61
codec = codec.to(device=device, dtype=dtype)
codec.eval()
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None

Copilot AI Apr 1, 2026


codec.decoder is removed only after the full codec has been moved to the target device via .to(...). If device is CUDA, this still allocates/transfers the decoder weights and can OOM on smaller GPUs despite being freed immediately after. Consider setting codec.decoder = None (or otherwise pruning) before .to(...) so the decoder is never moved/allocated on the target device.

Suggested change

Before:

codec = codec.to(device=device, dtype=dtype)
codec.eval()
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None

After:

# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory. Prune it before
# moving the model to the target device to avoid unnecessary allocation.
codec.decoder = None
codec = codec.to(device=device, dtype=dtype)
codec.eval()

Comment on lines +59 to +61
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None

Copilot AI Apr 1, 2026


The PR description’s E2E test command references tests/e2e/online_serving/test_fish_speech.py, but that file doesn’t exist in this repo (Fish Speech tests appear under tests/entrypoints/openai_api/test_serving_speech.py). Please update the Test Plan in the PR description to point at the correct test(s).

@lishunyang12 lishunyang12 (Collaborator) left a comment


left a couple comments -- nice savings for a small change.

Outdated comment thread on vllm_omni/model_executor/models/fish_speech/dac_encoder.py
Outdated comment thread on vllm_omni/model_executor/models/fish_speech/fish_speech_dac_decoder.py
… VRAM

Fish Speech S2 Pro loads the full DAC codec (encoder + quantizer + decoder)
into GPU in both stages, but each stage only uses a subset:

- Encoder stage (dac_encoder.py): only uses encoder + quantizer.forward()
  -> decoder is unused, wasting ~208 MiB
- Decoder stage (fish_speech_dac_decoder.py): only uses quantizer.decode()
  + decoder -> encoder, quantizer.pre_module, and quantizer.downsample are
  unused, wasting ~1,067 MiB

Free the unused components before moving to device so they are never
allocated on GPU. Verified bit-identical output and successful e2e
encode/decode with real codec.pth weights on H20.

Signed-off-by: Sy03 <1370724210@qq.com>
@Sy0307 Sy0307 force-pushed the perf/fish-speech-dac-free-unused-components branch from f0c1dd1 to 7bd0dd7 on April 4, 2026 06:12
@linyueqian linyueqian (Collaborator) left a comment


LGTM. Clean, minimal change with solid VRAM savings (~1.2 GiB combined across both processes).

The pruning is correctly placed before .to(device) in both files, so unused components are never transferred to GPU. The component names match build_dac_codec() in dac_utils.py, and since encoder/decoder stages run in separate EngineCore processes, the _codec_cache is not shared.

Note to other reviewers: the existing Copilot and inline comments about "pruning after .to()" misread the diff -- the pruning already happens before .to(). The gc.collect() / torch.cuda.empty_cache() suggestion is also unnecessary since these are CPU-only tensors being dereferenced before any device transfer.

Minor: PR description references tests/e2e/online_serving/test_fish_speech.py which doesn't exist -- actual test is tests/model_executor/models/test_fish_speech_regressions.py.

@linyueqian linyueqian added the ready label (to trigger buildkite CI) Apr 5, 2026
@linyueqian linyueqian enabled auto-merge (squash) April 5, 2026 20:20
@linyueqian linyueqian merged commit 8b57c62 into vllm-project:main Apr 5, 2026
8 checks passed
skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026