[Model]: Add HyperCLOVAX Audio Decoder (BigVGAN) support to vllm-omni#2512
KilJaeeun wants to merge 1 commit into vllm-project:main
… vllm-omni
- Add HyperCLOVAXAudioPipeline: CosyVoice2 FSQ discrete unit → BigVGAN vocoder
- Add supporting layers: EcapaTDNN speaker encoder, activations, constants
- Register HyperCLOVAXAudioPipeline in diffusion model registry
- Add pydub>=0.25.1 dependency for audio I/O

Co-Authored-By: Hyunjoon Cho <with1015@github.com>
Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
61deed5 to a8f5551
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61deed5946
    "HyperCLOVAXVisionPipeline": (
        "hyperclovax_vision",
        "pipeline_hyperclovax_vision",
        "HyperCLOVAXVisionPipeline",
    ),
Restore removed pipeline registrations in model registry
This registry rewrite drops multiple existing model-class mappings (e.g., WanVACE, LTX2 variants, FluxKontext, Helios, Flux2, HunyuanVideo15, MagiHuman, OmniVoice, DreamIDOmni) while adding HyperCLOVAX entries; those pipeline classes still exist under vllm_omni/diffusion/models/*, so configs that previously worked will now fail in initialize_model with Model class ... not found and lose their pre/post-process hooks. Unless this is an explicit deprecation pass, this is a backward-incompatible regression introduced by this commit.
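One way to avoid this kind of regression is to merge new entries into the existing mapping rather than rewriting it, and to check that no key disappeared. The sketch below is illustrative only: `extend_registry`, `dropped_entries`, and the `ExistingPipeline` placeholder are hypothetical names, the audio entry's tuple values are assumed from the file paths in this PR, and the real registry in vllm_omni/diffusion/registry.py may be structured differently.

```python
from typing import Dict, List, Tuple

# (package directory, module file, class name), following the tuple shape
# visible in the diff excerpt. Treat this layout as an assumption.
RegistryEntry = Tuple[str, str, str]


def extend_registry(
    existing: Dict[str, RegistryEntry],
    additions: Dict[str, RegistryEntry],
) -> Dict[str, RegistryEntry]:
    """Return a new registry with `additions` merged in.

    Existing entries are never dropped, and overwriting one is an error.
    """
    clashes = sorted(set(existing) & set(additions))
    if clashes:
        raise ValueError(f"refusing to overwrite registered pipelines: {clashes}")
    merged = dict(existing)
    merged.update(additions)
    return merged


def dropped_entries(
    before: Dict[str, RegistryEntry],
    after: Dict[str, RegistryEntry],
) -> List[str]:
    """List the keys present in `before` that are missing from `after`."""
    return sorted(set(before) - set(after))


# Placeholder standing in for the mappings that already exist upstream.
existing = {
    "ExistingPipeline": ("existing", "pipeline_existing", "ExistingPipeline"),
}
# The vision entry is copied from the diff; the audio entry is assumed.
additions = {
    "HyperCLOVAXVisionPipeline": (
        "hyperclovax_vision",
        "pipeline_hyperclovax_vision",
        "HyperCLOVAXVisionPipeline",
    ),
    "HyperCLOVAXAudioPipeline": (
        "hyperclovax_audio",
        "pipeline_hyperclovax_audio",
        "HyperCLOVAXAudioPipeline",
    ),
}

merged = extend_registry(existing, additions)
missing = dropped_entries(existing, merged)
```

A check like `dropped_entries` could run in a unit test against the registry from the base branch, so that a rebase which silently removes mappings fails fast instead of surfacing at model initialization.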
    if len(units.size()) == 2 and units.size(0) == 1:
        return DiffusionOutput(output=None, error="the underlying decoder does not support batch inference yet")
Reject all batched token tensors before unsqueeze
The batch guard is inverted: it returns an error only when units is 2-D with size(0) == 1, but true batched inputs ([B, T] with B > 1) pass through and are then unsqueezed to 3-D, which later embedding/conv code does not support. This turns an intended user-facing validation error into a downstream shape/runtime failure for pre-batched token inputs.
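A corrected guard might look like the sketch below. It works on a plain shape tuple so it runs without PyTorch. `batch_guard_error` is a hypothetical helper, and the sketch assumes that `[T]` and `[1, T]` are the only shapes the decoder should accept; the pipeline's actual contract may differ.

```python
from typing import Optional, Sequence

BATCH_ERROR = "the underlying decoder does not support batch inference yet"


def batch_guard_error(shape: Sequence[int]) -> Optional[str]:
    """Return an error message for unsupported shapes, or None if acceptable.

    Assumed contract: a single sequence is either [T] or [1, T].
    Anything with a real batch dimension ([B, T] with B > 1) or with more
    than two dimensions is rejected before any unsqueeze happens.
    """
    if len(shape) == 1:
        return None  # unbatched token sequence [T]
    if len(shape) == 2 and shape[0] == 1:
        return None  # single sequence with an explicit batch dim [1, T]
    return BATCH_ERROR


# With a tensor, the call would be batch_guard_error(tuple(units.size())).
ok_flat = batch_guard_error((120,))
ok_single = batch_guard_error((1, 120))
rejected = batch_guard_error((4, 120))
```

In the pipeline, a non-None result would be returned as `DiffusionOutput(output=None, error=...)`, matching the error path in the excerpt above.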
    for l_i in layer:
        remove_weight_norm(l_i)
Remove weight norm from inner convs, not wrapper modules
weight_norm is applied to inner conv/deconv layers, but this code calls remove_weight_norm on wrapper modules (l_i) instead. That raises ValueError, gets swallowed by remove_weight_norm()'s broad handler, and leaves parametrizations intact; the fallback path in from_pretrained for checkpoints without weight norm can then still fail to load.
Summary
Add HyperCLOVAX Audio Decoder (unit-BigVGAN vocoder) diffusion pipeline to vllm-omni.
This is a clean rebase of #869 onto current upstream/main.
#613 (vision decoder + full pipeline) stacks on top of this PR.
Changes
New Files
vllm_omni/diffusion/models/hyperclovax_audio/
- pipeline_hyperclovax_audio.py: CosyVoice2 FSQ discrete unit tokens → BigVGAN waveform
- hyperclovax_audio_decoder.py: BigVGAN decoder + EcapaTDNN speaker conditioning
- ecapa_tdnn.py: ECAPA-TDNN speaker encoder
- activations.py: Snake activation
- constants.py: Mel filterbank constants
Modified Files
- vllm_omni/diffusion/registry.py: Register HyperCLOVAXAudioPipeline
- pyproject.toml: Add pydub>=0.25.1
PR dependency
Co-Authored-By: Hyunjoon Cho <with1015@github.com>
Test Plan