Conversation

@Manalelaidouni
Contributor

What does this PR do?

This PR aims at integrating Vocos model to transformers.

Vocos is a neural vocoder designed for high-quality audio synthesis in TTS pipelines and related tasks; it outperforms HiFi-GAN while being significantly faster. It has two main variants:

  • VocosModel can be used as a standalone vocoder in an audio generation pipeline; the goal is to use it as a drop-in vocoder in the YuE model. It can also be used together with VocosFeatureExtractor to synthesize audio from mel-spectrogram features (see the usage sketch at the end of this description).
  • VocosWithEncodecModel: integrates the EnCodec neural audio codec into Vocos for end-to-end audio compression and reconstruction.

This is a continuation of integrating model components for the new YuE model (mentioned in #36784).
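
A rough usage sketch of the mel workflow (the checkpoint id, output field name, and exact keyword arguments below are placeholders at this stage of the PR, not the final API):

import numpy as np
import torch
from transformers import VocosFeatureExtractor, VocosModel

# placeholder checkpoint id; the converted checkpoints are discussed later in this thread
checkpoint = "hf-audio/vocos-mel-24khz"
feature_extractor = VocosFeatureExtractor.from_pretrained(checkpoint)
model = VocosModel.from_pretrained(checkpoint)

# dummy 1-second waveform at 24 kHz, turned into mel-spectrogram features by the feature extractor
audio = np.random.randn(24000).astype(np.float32)
inputs = feature_extractor(audio=audio, sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    waveform = model(**inputs).audio  # reconstructed audio waveform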

Who can review?

Anyone in the community is free to review the PR once the tests have passed.
@ArthurZucker @eustlb @ylacombe

@Manalelaidouni Manalelaidouni marked this pull request as draft July 14, 2025 22:50
Collaborator

@ArthurZucker ArthurZucker left a comment

Nice! My main comment is to remove the hidden states post processing!

@ArthurZucker ArthurZucker requested a review from eustlb July 16, 2025 13:33
@Manalelaidouni Manalelaidouni marked this pull request as ready for review July 22, 2025 13:07
@Manalelaidouni Manalelaidouni marked this pull request as draft July 22, 2025 13:26
@Manalelaidouni Manalelaidouni marked this pull request as ready for review July 22, 2025 15:29
@Manalelaidouni
Contributor Author

Manalelaidouni commented Jul 22, 2025

Thanks for reviewing! The failing tests seem unrelated to my changes, but I realized that the latest datasets 4.0.0 loads different audio samples than earlier versions, which was causing the integration tests to fail in CI.

Collaborator

@ArthurZucker ArthurZucker left a comment

Sorry for my late review!

@ArthurZucker
Collaborator

If you can merge main and address the small comment, we can merge!

@ebezzam
Contributor

ebezzam commented Oct 10, 2025

run-slow: vocos

@github-actions
Contributor

This comment contains run-slow, running the specified jobs:

models: ['models/vocos']
quantizations: [] ...

@Manalelaidouni
Contributor Author

@Manalelaidouni Thanks! Tests are passing again on my machine and I will try GitHub Actions soon.

Also, I flipped back to the functional ISTFT you used to have 🙈 because @eustlb and I had the same idea of putting it in audio utils.

Nice, great idea actually. Now it looks like the model is good to go, right?

@ebezzam ebezzam requested a review from eustlb October 10, 2025 14:43
Contributor

@eustlb eustlb left a comment

Here we see that different code paths in the processor/model are used in different situations, and they cannot be mixed:

  • if we use the mel-spectrogram inputs, then we should not have an adaptive layer norm
  • if we use the non-mel-spectrogram inputs, then we should have an adaptive layer norm.

We should therefore have two models, Vocos and VocosEncodec.
VocosEncodec should be defined using modular, simply replacing the norm with a VocosAdaptiveLayerNorm.

Moreover, it should have an embedding layer that does the audio codes → inputs_embeds preparation currently done in the processing, under torch.no_grad(). I do get that we would duplicate the weights needed for the codebook embedding in both the processor and the model, yet it's only 4 MB, and I'd rather fix the model inputs as token ids so that it can be used without the processor whenever you already have EnCodec codebook token ids. Moreover, by passing codes directly, we do not need to pass the bandwidth id to VocosEncodec, which makes no sense to pass in the forward since it can be inferred directly from the input_ids shape.

VocosEncodec would simply have in its modular something like:

class VocosEncodecModel(VocosModel):
    def __init__(self, config):
        super().__init__(config)
        self.embed_tokens = nn.Embedding(config.num_codebooks * config.codebook_size, config.hidden_size)
        # offsets shift each codebook's ids into its own slice of the shared embedding table
        self.register_buffer("offsets", torch.arange(config.num_codebooks) * config.codebook_size, persistent=False)
        self.norms = nn.ModuleList(...)
        del self.norm

    def forward(
        self,
        input_ids=None,  # shape (batch_size, seq_len, num_codebooks)
        inputs_embeds=None,
    ):
        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

        if inputs_embeds is None:
            num_codebooks = input_ids.shape[2]
            if num_codebooks not in self.config.supported_num_codebooks:
                raise ValueError(f"{num_codebooks}...{self.config.supported_num_codebooks}")
            # pick the layer norm matching the number of codebooks (replaces the bandwidth id)
            self.norm = self.norms[self.config.supported_num_codebooks.index(num_codebooks)]
            inputs_embeds = self.embed_tokens(input_ids + self.offsets[:num_codebooks])

        return super().forward(input_features=inputs_embeds)

Comment on lines +44 to +46

The original code can be found [here](https://github.com/gemelo-ai/vocos) and original checkpoints [here](https://huggingface.co/charactr/vocos-mel-24khz) and [here](https://huggingface.co/charactr/vocos-encodec-24khz).

Contributor

We'd want (ideally before merging, if the authors are responsive) to merge the converted checkpoints directly into their repos, which is way better than having them under hf-audio.

# load the Bark model and processor
bark_id = "suno/bark-small"
bark_processor = BarkProcessor.from_pretrained(bark_id)
bark = BarkModel.from_pretrained(bark_id, device_map="auto")
Contributor

device_map auto fails here

Contributor

Hmm works for me. What error do you get?

Comment on lines 617 to 693
def istft(input, n_fft: int, padding=None, **kwargs) -> "torch.Tensor":
    """
    Performs the Inverse Short-Time Fourier Transform (ISTFT) on STFT coefficients to reconstruct audio in the time domain.
    Adds support for `same` padding as in Vocos:
    https://github.com/gemelo-ai/vocos/blob/c859e3b7b534f3776a357983029d34170ddd6fc3/vocos/spectral_ops.py#L7
    Otherwise falls back to PyTorch's built-in ISTFT implementation `torch.istft`.

    Args:
        input (`torch.Tensor`): Complex-valued STFT coefficients of shape (batch_size, freq_bins, time_frames).
        n_fft (`int`): Size of the FFT.
        padding (`str`, *optional*): Padding mode. Either "center" or "same".
        **kwargs: Additional arguments passed to torch.istft or used for "same" padding:
            - win_length (`int`, *optional*): Window length. Defaults to n_fft.
            - hop_length (`int`, *optional*): Hop length. Defaults to n_fft // 4.
            - window (`torch.Tensor`, *optional*): Window function. Defaults to a Hann window.
            - center (`bool`, *optional*): Used only for the "center" padding mode.

    Returns:
        `torch.Tensor`: Reconstructed audio waveform.

    It computes the ISTFT differently depending on the padding:
        if `center`: uses PyTorch's built-in ISTFT implementation, which uses `center=True` by default.
        if `same`: uses a custom overlap-add ISTFT, since the PyTorch version fails the
            Nonzero Overlap-Add (NOLA) condition when center is False. See issue: https://github.com/pytorch/pytorch/issues/62323
    """
    requires_backends(istft, ["torch"])

    if padding == "center" or padding is None:
        # user may provide center=False in kwargs; pop it so it is not passed to torch.istft twice
        center = kwargs.pop("center", True)
        audio = torch.istft(
            input,
            n_fft=n_fft,
            center=center,
            **kwargs,
        )

    elif padding == "same":
        win_length = kwargs.get("win_length", n_fft)
        hop_length = kwargs.get("hop_length", n_fft // 4)
        window = kwargs.get("window", torch.hann_window(win_length))

        _, _, num_time_frames = input.shape
        pad = (win_length - hop_length) // 2
        # the inverse FFT of each frame
        inverse_fft = torch.fft.irfft(input, n=n_fft, dim=1, norm="backward")
        inverse_fft = inverse_fft * window[None, :, None]

        # combine the overlapping frames with windowing and normalize by the sum of squared window values across
        # overlapping frames to make sure the reconstruction of the audio is accurate
        output_length = (num_time_frames - 1) * hop_length + win_length
        audio = F.fold(
            inverse_fft,
            output_size=(1, output_length),
            kernel_size=(1, win_length),
            stride=(1, hop_length),
        )[:, 0, 0, pad:-pad]
        window_sqrt = window.square().expand(1, num_time_frames, -1).transpose(1, 2)
        norm = F.fold(
            window_sqrt,
            output_size=(1, output_length),
            kernel_size=(1, win_length),
            stride=(1, hop_length),
        ).squeeze()[pad:-pad]

        if torch.any(norm <= 1e-11):
            raise ValueError(
                "Normalization tensor `norm` contains values ≤ 1e-11, which would cause division by zero. "
                "Check the n_fft, hop_length and padding parameters."
            )
        audio = audio / norm

    else:
        raise ValueError(f"Unsupported padding mode: {padding}. Supported modes are 'center' and 'same'.")

    return audio
Contributor

Nice initiative, but let's revert and simply add a TODO in the original code. Such an important function would require proper testing, work that has to be done when refactoring audio_utils.

Contributor

I've put it in the modeling file!

FYI, while the mel variant uses "center", the EnCodec variant and Xcodec2 use "same".
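
For reference, a minimal sketch of calling the helper above with the two padding modes (the shapes and STFT parameters are made up for illustration, and this assumes the function is importable as istft wherever it ends up living):

import torch

# fake complex STFT coefficients of shape (batch_size, freq_bins = n_fft // 2 + 1, time_frames)
n_fft, hop_length = 1024, 256
stft_coeffs = torch.randn(1, n_fft // 2 + 1, 20, dtype=torch.complex64)

# mel variant: "center" padding delegates to torch.istft with center=True
audio_center = istft(stft_coeffs, n_fft=n_fft, padding="center", hop_length=hop_length)

# EnCodec variant (and Xcodec2): "same" padding goes through the custom overlap-add path
audio_same = istft(stft_coeffs, n_fft=n_fft, padding="same", hop_length=hop_length)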

Comment on lines 233 to 234
audio_spectrogram: Optional[torch.FloatTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
Contributor

And what if the user provides both? We silently use input_features. There is no reason to provide both, so such a use case has no reason to be integrated into the model's forward signature. audio_spectrogram and input_features are both input features (setting the renaming aside, let's keep it to input_features for now).

Comment on lines 49 to 50
use_adaptive_norm (`bool`, *optional*, defaults to `False`):
Whether to use adaptive layer normalization.
Contributor

So we can have bandwidths that are set while the adaptive norm is not used? That seems overcomplicated, especially when looking at config.json: the user will see bandwidths that are not used. Let's aim for explicitness: either bandwidths are specified, in which case we use them, or it is None and therefore we do not use them.
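
A minimal sketch of that convention (the class name and keyword are assumptions here, and the 1.5/3.0/6.0/12.0 kbps values only mirror the bandwidths of the original vocos-encodec-24khz checkpoint):

from transformers import VocosConfig

# mel variant: no bandwidths, so no adaptive norm is built
mel_config = VocosConfig(bandwidths=None)

# EnCodec variant: bandwidths are given, so the adaptive layer norm is used
encodec_config = VocosConfig(bandwidths=[1.5, 3.0, 6.0, 12.0])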

Contributor

Agreed, removing this.

audio_spectrogram: Optional[torch.FloatTensor] = None,
input_features: Optional[torch.FloatTensor] = None,
bandwidth: Optional[float] = None,
**kwargs: Unpack[TransformersKwargs],
Contributor

Why do we handle kwargs here? And why should it be TransformersKwargs?

Contributor

Because of the padding_mask that could be passed in if we do something like:

inputs = processor(audio=audios, return_tensors="pt")
output = model(**inputs)  # `inputs.padding_mask` would be passed in

Is it ok to have an unused padding_mask input to forward for the convenience of model(**inputs)?

The padding_mask is useful for trimming individual audios (at the output) in the case of batch processing.

Contributor

@eustlb eustlb Oct 23, 2025

Yep, of course I know what a padding_mask is :) In this case we would use a padding_mask kwarg directly if necessary. As you can see, padding_mask is not a TransformersKwargs. To the best of my knowledge there is nothing forcing us to return a padding_mask from the processor.

If padding_mask is not supported, then the processor should not return padding_mask.

Contributor

@ebezzam ebezzam Oct 23, 2025

Whoops, sorry, I know you do 😄 I meant more to explain why it's here but not used.

Do you think it could be useful for the batch usage below, outside of modeling code (to remove known padding)?

inputs = processor(audio=audio, bandwidth=bandwidth, sampling_rate=sampling_rate)
outputs = model(**inputs)
audio_vocos = outputs.audio

# use padding mask to extract audio with same length as original `audio`
for i in range(audio_vocos.shape[0]):
    # remove padding
    padding_mask = inputs.padding_mask[i].bool()
    valid_audio = audio_vocos[i][padding_mask]

Because "same" padding (used by the EnCodec approach) actually pads before the audio, which is also accounted for in the processor for both the audio and the padding mask, just truncating to the original length is not enough.

Contributor

IMO it should be taken as an input, returned in the output, and documented, i.e. using outputs.padding_mask directly, something similar to that.
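
A minimal sketch of that suggestion (the forward signature and the output field are assumptions, not the final API):

inputs = processor(audio=audios, sampling_rate=sampling_rate, return_tensors="pt")
outputs = model(input_features=inputs.input_features, padding_mask=inputs.padding_mask)

# the mask is echoed back on the output, so trimming padded batches needs no extra bookkeeping
for waveform, mask in zip(outputs.audio, outputs.padding_mask):
    valid_audio = waveform[mask.bool()]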

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, vocos

@Manalelaidouni
Copy link
Contributor Author

@eustlb @ebezzam I'll work on splitting this back into two models, Vocos and VocosEncodec.

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, vocos, vocos_encodec

@ebezzam
Contributor

ebezzam commented Oct 21, 2025

@Manalelaidouni thanks! I've started the splitting. Before you start/continue, let me first confirm something with @eustlb about the EnCodec variant.

@Manalelaidouni
Contributor Author

Hey @ebezzam, are you done with the changes so I can handle the rest?
