[Model]: Add HyperCLOVAX Audio Decoder (BigVGAN) support to vllm-omni #2512

Closed
KilJaeeun wants to merge 1 commit into vllm-project:main from KilJaeeun:feat/hyperclovax-audio

@KilJaeeun commented Apr 6, 2026

Summary

Add HyperCLOVAX Audio Decoder (unit-BigVGAN vocoder) diffusion pipeline to vllm-omni.

This is a clean rebase of #869 onto current upstream/main.
#613 (vision decoder + full pipeline) stacks on top of this PR.

Changes

New Files

vllm_omni/diffusion/models/hyperclovax_audio/

  • pipeline_hyperclovax_audio.py — CosyVoice2 FSQ discrete unit tokens → BigVGAN waveform
  • hyperclovax_audio_decoder.py — BigVGAN decoder + EcapaTDNN speaker conditioning
  • ecapa_tdnn.py — ECAPA-TDNN speaker encoder
  • activations.py — Snake activation
  • constants.py — Mel filterbank constants
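For readers unfamiliar with the Snake activation listed above: BigVGAN uses the periodic activation snake(x) = x + (1/α)·sin²(αx). The following is a minimal scalar sketch of that formula only. The function name and the scalar `alpha` argument are illustrative assumptions; the `activations.py` in this PR presumably operates on tensors with a learnable α and may differ in detail.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Scalar Snake activation: x + (1/alpha) * sin(alpha * x) ** 2.

    `alpha` controls the frequency of the periodic term; it is a plain
    float here purely for illustration.
    """
    if alpha == 0:
        raise ValueError("alpha must be non-zero")
    return x + (math.sin(alpha * x) ** 2) / alpha
```

The periodic term vanishes wherever sin(αx) = 0, so the function equals the identity at those points and adds a non-negative bump elsewhere.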

Modified Files

  • vllm_omni/diffusion/registry.py — Register HyperCLOVAXAudioPipeline
  • pyproject.toml — Add pydub>=0.25.1

PR dependency

Co-Authored-By: Hyunjoon Cho <with1015@github.com>

Test Plan

  • HyperCLOVAXAudioPipeline loads from HF hub
  • S2S pipeline generates audio output (see tests/e2e/online_serving/test_hcx_omni.py)

… vllm-omni

- Add HyperCLOVAXAudioPipeline: CosyVoice2 FSQ discrete unit → BigVGAN vocoder
- Add supporting layers: EcapaTDNN speaker encoder, activations, constants
- Register HyperCLOVAXAudioPipeline in diffusion model registry
- Add pydub>=0.25.1 dependency for audio I/O

Co-Authored-By: Hyunjoon Cho <with1015@github.com>

Signed-off-by: jaeeun.kil <jaeeun.kil@navercorp.com>
@KilJaeeun force-pushed the feat/hyperclovax-audio branch from 61deed5 to a8f5551 on April 6, 2026 04:18

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61deed5946

Comment on lines +113 to 117
"HyperCLOVAXVisionPipeline": (
    "hyperclovax_vision",
    "pipeline_hyperclovax_vision",
    "HyperCLOVAXVisionPipeline",
),

[P1] Restore removed pipeline registrations in model registry

This registry rewrite drops multiple existing model-class mappings (e.g., WanVACE, LTX2 variants, FluxKontext, Helios, Flux2, HunyuanVideo15, MagiHuman, OmniVoice, DreamIDOmni) while adding HyperCLOVAX entries; those pipeline classes still exist under vllm_omni/diffusion/models/*, so configs that previously worked will now fail in initialize_model with Model class ... not found and lose their pre/post-process hooks. Unless this is an explicit deprecation pass, this is a backward-incompatible regression introduced by this commit.
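One way to catch this kind of regression is to compare registry keys before and after a change. The sketch below is illustrative only: `dropped_registrations` and the `ExistingPipeline` entry are hypothetical, and the tuple shape is assumed from the `HyperCLOVAXVisionPipeline` entry quoted above, not from the actual layout of `registry.py`.

```python
# Each entry is assumed to be three strings, matching the shape of the
# HyperCLOVAXVisionPipeline entry quoted in this review.
RegistryEntry = tuple[str, str, str]


def dropped_registrations(
    before: dict[str, RegistryEntry],
    after: dict[str, RegistryEntry],
) -> list[str]:
    """Return registry keys that exist in `before` but are missing in `after`."""
    return sorted(set(before) - set(after))


# Hypothetical snapshot of the registry on main.
before = {
    "ExistingPipeline": ("existing", "pipeline_existing", "ExistingPipeline"),
}
# Registry after a rewrite that adds the new entry but drops the old one.
after = {
    "HyperCLOVAXVisionPipeline": (
        "hyperclovax_vision",
        "pipeline_hyperclovax_vision",
        "HyperCLOVAXVisionPipeline",
    ),
}

dropped = dropped_registrations(before, after)
# A purely additive change keeps every old key, so nothing is dropped.
additive = dropped_registrations(before, {**before, **after})
```

A non-empty `dropped` list signals that the change removed existing mappings and was not purely additive.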

Comment on lines +229 to +230
if len(units.size()) == 2 and units.size(0) == 1:
    return DiffusionOutput(output=None, error="the underlying decoder does not support batch inference yet")

[P2] Reject all batched token tensors before unsqueeze

The batch guard is inverted: it returns an error only when units is 2-D with size(0) == 1, but true batched inputs ([B, T] with B > 1) pass through and are then unsqueezed to 3-D, which later embedding/conv code does not support. This turns an intended user-facing validation error into a downstream shape/runtime failure for pre-batched token inputs.
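A corrected guard could look roughly like the sketch below. It works on a plain shape tuple (for example `tuple(units.size())`) so that it stays self-contained. The helper name `batch_guard_error` is hypothetical, and treating a `[1, T]` input as acceptable is an assumption; a stricter variant would reject every 2-D input and accept only `[T]`.

```python
from typing import Optional

BATCH_ERROR = "the underlying decoder does not support batch inference yet"


def batch_guard_error(shape: tuple[int, ...]) -> Optional[str]:
    """Return an error message for unsupported unit shapes, or None if OK.

    Accepted: a single sequence [T], or (by assumption) a batch of one [1, T].
    Rejected: true batches [B, T] with B > 1 and anything of higher rank.
    """
    if len(shape) == 1:
        return None  # single unbatched sequence
    if len(shape) == 2 and shape[0] == 1:
        return None  # batch of one; the caller should skip the unsqueeze here
    return BATCH_ERROR
```

With this ordering the validation error is raised before any unsqueeze, so a pre-batched input never reaches the embedding or convolution code with an unexpected rank.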

Comment on lines +470 to +471
for l_i in layer:
    remove_weight_norm(l_i)

[P1] Remove weight norm from inner convs, not wrapper modules

weight_norm is applied to inner conv/deconv layers, but this code calls remove_weight_norm on wrapper modules (l_i) instead. That raises ValueError, gets swallowed by remove_weight_norm()'s broad handler, and leaves parametrizations intact; the fallback path in from_pretrained for checkpoints without weight norm can then still fail to load.
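The sketch below shows one way to strip weight norm from the inner layers without knowing the wrapper structure. It assumes the legacy hook-based `torch.nn.utils.weight_norm` API, which is consistent with the `ValueError` described above; if the PR uses `torch.nn.utils.parametrizations.weight_norm` instead, the detection and removal calls would differ. `ConvWrapper` and `strip_weight_norm` are hypothetical names, not code from this PR.

```python
import torch.nn as nn
from torch.nn.utils import remove_weight_norm, weight_norm


class ConvWrapper(nn.Module):
    """Hypothetical wrapper: weight norm lives on the inner conv, not on the wrapper."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = weight_norm(nn.Conv1d(4, 4, kernel_size=3, padding=1))


def strip_weight_norm(root: nn.Module) -> int:
    """Remove legacy weight norm from every submodule that carries it.

    Walks all submodules and only calls remove_weight_norm where the
    weight_g / weight_v parameters exist, so no exception needs swallowing.
    Returns the number of modules that were changed.
    """
    removed = 0
    for sub in root.modules():
        if hasattr(sub, "weight_g") and hasattr(sub, "weight_v"):
            remove_weight_norm(sub)
            removed += 1
    return removed


layers = nn.ModuleList([ConvWrapper(), ConvWrapper()])
removed = strip_weight_norm(layers)
```

Checking for the weight-norm parameters before calling `remove_weight_norm` avoids relying on a broad exception handler, so a failure to remove anything shows up as a zero count instead of being silently ignored.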
