[Perf][Fish Speech] Free unused DAC codec components to save VRAM#2430
Conversation
Force-pushed from a9d3c68 to f0c1dd1.
Pull request overview
Optimizes Fish Speech S2 Pro’s DAC codec memory footprint by pruning stage-specific unused codec components after loading, reducing GPU VRAM usage for both the encoder (voice cloning) and decoder (waveform synthesis) stages.
Changes:
- In the DAC encoder loader, drops the codec `decoder` module, since the encode path never calls it.
- In the DAC decoder loader, drops the codec `encoder` plus the quantizer's encode-only submodules (`pre_module`, `downsample`), since the decode path never calls them.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `vllm_omni/model_executor/models/fish_speech/fish_speech_dac_decoder.py` | Removes encode-only DAC components in the decode stage to cut GPU memory usage. |
| `vllm_omni/model_executor/models/fish_speech/dac_encoder.py` | Removes decode-only DAC components in the encode stage to cut GPU memory usage. |
```python
device = self.vllm_config.device_config.device
codec = codec.to(device=device, dtype=torch.float32)
codec.eval()
# Decode path only uses quantizer.decode() + decoder; the encoder
# and quantizer's encode-only components (pre_module, downsample)
# are never called and would waste ~1,067 MiB GPU memory.
codec.encoder = None
codec.quantizer.pre_module = None
codec.quantizer.downsample = None
```
These components are freed only after codec.to(device=...) has already moved the full codec (including encoder/pre_module/downsample) onto the GPU. That can still cause a peak-VRAM OOM during loading and adds unnecessary host→device transfer time. Consider removing the unused submodules before calling .to(...) so they are never transferred/allocated on the target device.
Suggested change:

```diff
-device = self.vllm_config.device_config.device
-codec = codec.to(device=device, dtype=torch.float32)
-codec.eval()
-# Decode path only uses quantizer.decode() + decoder; the encoder
-# and quantizer's encode-only components (pre_module, downsample)
-# are never called and would waste ~1,067 MiB GPU memory.
-codec.encoder = None
-codec.quantizer.pre_module = None
-codec.quantizer.downsample = None
+# Decode path only uses quantizer.decode() + decoder; the encoder
+# and quantizer's encode-only components (pre_module, downsample)
+# are never called and would waste ~1,067 MiB GPU memory.
+if hasattr(codec, "encoder"):
+    codec.encoder = None
+quantizer = getattr(codec, "quantizer", None)
+if quantizer is not None:
+    if hasattr(quantizer, "pre_module"):
+        quantizer.pre_module = None
+    if hasattr(quantizer, "downsample"):
+        quantizer.downsample = None
+device = self.vllm_config.device_config.device
+codec = codec.to(device=device, dtype=torch.float32)
+codec.eval()
```
```python
codec = codec.to(device=device, dtype=dtype)
codec.eval()
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None
```
codec.decoder is removed only after the full codec has been moved to the target device via .to(...). If device is CUDA, this still allocates/transfers the decoder weights and can OOM on smaller GPUs despite being freed immediately after. Consider setting codec.decoder = None (or otherwise pruning) before .to(...) so the decoder is never moved/allocated on the target device.
Suggested change:

```diff
-codec = codec.to(device=device, dtype=dtype)
-codec.eval()
-# Encoder path only uses encoder + quantizer.forward(); the decoder
-# is never called and would waste ~208 MiB GPU memory.
-codec.decoder = None
+# Encoder path only uses encoder + quantizer.forward(); the decoder
+# is never called and would waste ~208 MiB GPU memory. Prune it before
+# moving the model to the target device to avoid unnecessary allocation.
+codec.decoder = None
+codec = codec.to(device=device, dtype=dtype)
+codec.eval()
```
```python
# Encoder path only uses encoder + quantizer.forward(); the decoder
# is never called and would waste ~208 MiB GPU memory.
codec.decoder = None
```
The PR description’s E2E test command references tests/e2e/online_serving/test_fish_speech.py, but that file doesn’t exist in this repo (Fish Speech tests appear under tests/entrypoints/openai_api/test_serving_speech.py). Please update the Test Plan in the PR description to point at the correct test(s).
lishunyang12
left a comment
left a couple comments -- nice savings for a small change.
… VRAM

Fish Speech S2 Pro loads the full DAC codec (encoder + quantizer + decoder) into GPU in both stages, but each stage only uses a subset:

- Encoder stage (dac_encoder.py): only uses encoder + quantizer.forward() -> decoder is unused, wasting ~208 MiB
- Decoder stage (fish_speech_dac_decoder.py): only uses quantizer.decode() + decoder -> encoder, quantizer.pre_module, and quantizer.downsample are unused, wasting ~1,067 MiB

Free the unused components before moving to device so they are never allocated on GPU. Verified bit-identical output and successful e2e encode/decode with real codec.pth weights on H20.

Signed-off-by: Sy03 <1370724210@qq.com>
Force-pushed from f0c1dd1 to 7bd0dd7.
linyueqian
left a comment
LGTM. Clean, minimal change with solid VRAM savings (~1.2 GiB combined across both processes).
The pruning is correctly placed before .to(device) in both files, so unused components are never transferred to GPU. The component names match build_dac_codec() in dac_utils.py, and since encoder/decoder stages run in separate EngineCore processes, the _codec_cache is not shared.
Note to other reviewers: the existing Copilot and inline comments about "pruning after .to()" misread the diff -- the pruning already happens before .to(). The gc.collect() / torch.cuda.empty_cache() suggestion is also unnecessary since these are CPU-only tensors being dereferenced before any device transfer.
Minor: PR description references tests/e2e/online_serving/test_fish_speech.py which doesn't exist -- actual test is tests/model_executor/models/test_fish_speech_regressions.py.
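Regarding the note above that gc.collect() / torch.cuda.empty_cache() is unnecessary: a quick, hypothetical sanity check (toy modules, not part of this PR) is to prune on CPU and confirm the pruned weights never show up in the CUDA allocator:

```python
import torch
import torch.nn as nn

# Prune on CPU first, then move to the device. The pruned weights are never
# allocated on the GPU, so there is nothing for gc.collect() /
# torch.cuda.empty_cache() to reclaim afterwards.
model = nn.ModuleDict({
    "used": nn.Linear(4096, 4096),
    "unused": nn.Linear(4096, 4096),
})
model["unused"] = nn.Identity()  # drop the unused half while it is still on CPU
if torch.cuda.is_available():
    model = model.to("cuda")
    # Reflects only the weights that were actually kept (~64 MiB here).
    print(torch.cuda.memory_allocated())
```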
…lm-project#2430) Signed-off-by: Sy03 <1370724210@qq.com>
Summary
Fish Speech S2 Pro loads the full DAC codec (~1,789 MiB fp32) into GPU in both the encoder stage (voice cloning) and decoder stage (waveform synthesis), but each stage only uses a subset of the codec's components. This PR frees the unused components after loading, saving ~1,275 MiB total across both stages.
Motivation
The DAC codec has three main components with asymmetric sizes:
| Component | Used by encode path (`quantizer.forward()`) | Used by decode path (`quantizer.decode()`) |
|---|---|---|
| `encoder` | ✓ | |
| `decoder` | | ✓ |
| `quantizer.pre_module` (8-layer transformer) | ✓ | |
| `quantizer.post_module` (8-layer transformer) | ✓ | ✓ |
| `quantizer.downsample` | ✓ | |
| `quantizer.upsample` | ✓ | ✓ |
| `quantizer.codebooks` | ✓ | ✓ |

The encode path (`codec.encode()`) calls `encoder` -> `quantizer.forward()`, which uses all quantizer sub-components.

The decode path (`quantizer.decode()` -> `decoder()`) only uses `semantic_quantizer.from_codes`, `quantizer.from_codes`, `post_module`, `upsample`, and `decoder`. It never touches `encoder`, `quantizer.pre_module`, or `quantizer.downsample`.
Changes

- `dac_encoder.py` -- 1 line after `codec.eval()`:
- `fish_speech_dac_decoder.py` -- 3 lines after `codec.eval()`:
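For reference, the added lines, condensed from the diff context quoted in the review threads above (`codec` is the DAC codec object built by each loader):

```python
# dac_encoder.py (encode stage): the decode path is never used here.
codec.decoder = None

# fish_speech_dac_decoder.py (decode stage): the encoder and the quantizer's
# encode-only submodules are never used here.
codec.encoder = None
codec.quantizer.pre_module = None
codec.quantizer.downsample = None
```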
Why this is safe

Both encode and decode paths were verified to produce bit-identical output after freeing unused components (same model instance, same weights):
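The original verification snippet is not reproduced here; a minimal stand-in illustrating the same kind of check (toy module with random init, not the real DAC codec) might look like this:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(8, 8)
        self.unused = nn.Linear(8, 8)  # analogous to the pruned codec parts

    def forward(self, x):
        return self.used(x)  # forward never touches self.unused

model = Toy().eval()
x = torch.randn(2, 8)
with torch.no_grad():
    ref = model(x)
    model.unused = None  # prune the never-called submodule
    assert torch.equal(model(x), ref)  # output stays bit-identical
```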
The freed components are never referenced in the active code path -- PyTorch's forward pass only traverses explicitly called sub-modules.
The encoder and decoder stages run in separate EngineCore processes, so the global `_codec_cache` in `dac_encoder.py` is not shared between them.

VRAM Measurements (H20, fp32)
| Stage | Components freed | VRAM saved |
|---|---|---|
| Encoder stage (`dac_encoder.py`) | `decoder` | ~208 MiB |
| Decoder stage (`fish_speech_dac_decoder.py`) | `encoder`, `quantizer.pre_module`, `quantizer.downsample` | ~1,067 MiB |
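The MiB figures above can be approximated by summing parameter sizes of the pruned submodules; a hypothetical helper (not part of the PR, parameters only, ignoring buffers):

```python
import torch.nn as nn

def param_mib(module: nn.Module) -> float:
    # fp32 footprint of a submodule, in MiB, from its parameter tensors.
    return sum(p.numel() * p.element_size() for p in module.parameters()) / 2**20
```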
Test Plan

Bit-exact verification (no model weights needed -- uses random init):
E2E (requires Fish Speech S2 Pro model):