Add Vocos model #39403
Open: Manalelaidouni wants to merge 127 commits into huggingface:main from Manalelaidouni:add-vocos-model (base: main)
Commits (127, showing changes from all commits)
- 7b502bb add working vocos (Manalelaidouni)
- 30a17e7 update vocos (Manalelaidouni)
- 33a715e refactor vocos head (Manalelaidouni)
- f6026d9 fix docstring (Manalelaidouni)
- 7fa04e0 nit (Manalelaidouni)
- 910c500 Merge branch 'huggingface:main' into add-vocos-model (Manalelaidouni)
- 09cf23f fix output mismatch (Manalelaidouni)
- d909e64 update checkpoint conversions (Manalelaidouni)
- 6709a97 add working vocos (Manalelaidouni)
- 01913f5 update vocos (Manalelaidouni)
- c42d8b9 refactor vocos head (Manalelaidouni)
- b987b13 fix docstring (Manalelaidouni)
- ea73b18 nit (Manalelaidouni)
- 1fab4f5 fix output mismatch (Manalelaidouni)
- e81574a update checkpoint conversions (Manalelaidouni)
- 324b7c7 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 3f469f7 Merge branch 'add-vocos-model' of https://github.com/Manalelaidouni/t… (Manalelaidouni)
- 35d7545 fix adaptive layer norm (Manalelaidouni)
- 5559fd1 fix conflict (Manalelaidouni)
- 229464a fix auto (Manalelaidouni)
- 3f123cb nit (Manalelaidouni)
- b0fc62d Merge branch 'huggingface:main' into add-vocos-model (Manalelaidouni)
- c15786d add VocosProcessor and refactor (Manalelaidouni)
- 8373fd7 nit (Manalelaidouni)
- a869548 clean up (Manalelaidouni)
- e21abe1 Update docs. (ebezzam)
- 643d41d add bark example to docs (Manalelaidouni)
- 30f2a66 make fixture file shorter (Manalelaidouni)
- 208cc5c Update docs. (ebezzam)
- 191f2a7 Fix tensor device for tests. (ebezzam)
- 688e915 Nits (ebezzam)
- 948a4c0 recreate fixtures (Manalelaidouni)
- 2f7db93 add torch backend + batching support (Manalelaidouni)
- 93c3fde add tests for both numpy and torch backends (Manalelaidouni)
- e93807a add batch integration test for mel and encodec (Manalelaidouni)
- e6e0820 make tests pass (Manalelaidouni)
- 33bddda update fixtures and tests (Manalelaidouni)
- d9b2017 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 5ad61b9 add torchaudio backend + edit batching tests (Manalelaidouni)
- a8bbd0a nits (Manalelaidouni)
- e3c5ae1 update test and fixtures (Manalelaidouni)
- ad3fdc1 Merge remote-tracking branch 'upstream/main' into add-vocos-model (Manalelaidouni)
- 0b2305a cleanup (Manalelaidouni)
- ced9a22 update feature extractor (Manalelaidouni)
- f5e6463 Merge remote-tracking branch 'upstream/main' into add-vocos-model (Manalelaidouni)
- cf4c993 Fix small typo. (ebezzam)
- 7e784ee Small fixes for passing integration tests. (ebezzam)
- 22e7488 More Transformers compatible version, with naming and more amenable t… (ebezzam)
- c79686b refactor processor (Manalelaidouni)
- 6746a14 update feature extractor torchaudio + spectogram_batch (Manalelaidouni)
- 388235a nits (Manalelaidouni)
- f7e1ce1 edit skipped tests reason (Manalelaidouni)
- cd001a1 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- a576daf Nits. (ebezzam)
- 5d72339 Updated expected outputs. (ebezzam)
- 98c10d2 Add slow decorator. (ebezzam)
- 86d5ee3 Fix import. (ebezzam)
- 8fca95f Format. (ebezzam)
- 704828d Move slow decorators to methods. (ebezzam)
- b77ac67 Simplify feature extraction and more intuitive names. (ebezzam)
- 2dde484 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 07b1b34 make original vs hf feature extractor match on gpu (Manalelaidouni)
- 62f1cc0 Merge branch 'add-vocos-model' of github.com:Manalelaidouni/transform… (ebezzam)
- 65cb11f Simplify to just torch support. (ebezzam)
- e1a1537 Standardize model inputs. (ebezzam)
- 88a6d89 Update docs (ebezzam)
- 6d2a5c1 Merge branch 'main' into add-vocos-model (ebezzam)
- 6c61575 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 51117cb clean up (Manalelaidouni)
- e6e486f Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 6c1cac7 Add gpu decorator. (ebezzam)
- 57502fc undo warning (Manalelaidouni)
- e3db2fb nits (Manalelaidouni)
- 70e8f7e Add pad_to_multiple_of and use corresponding hop_length. (ebezzam)
- d0c7306 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 68ef6e1 pad only batch (Manalelaidouni)
- 0a03739 reproduce fixtures (Manalelaidouni)
- 8007e8a minor correction (Manalelaidouni)
- e8917f9 Address comments. (ebezzam)
- 4afdef3 Reintroduce slow/gpu decorators. (ebezzam)
- 532b33d change to old fixtures (Manalelaidouni)
- 7c94471 New istft utils and nits. (ebezzam)
- 84bbfc2 Update docs/source/en/model_doc/vocos.md (ebezzam)
- 037a67e Update src/transformers/models/vocos/configuration_vocos.py (ebezzam)
- 6ae6d20 Flatten backbone and nits. (ebezzam)
- b5e9fa5 Update mel conversion for flattening. (ebezzam)
- 34e5848 Cleaner mel vocos model. (ebezzam)
- ec04346 Base simpler Vocos with encodec. (ebezzam)
- e9c4635 Update convert for encodec variant and integration tests. (ebezzam)
- 90aace1 Nits (ebezzam)
- d2a51c8 Nits before modular. (ebezzam)
- db96400 From modular. (ebezzam)
- 5f13298 Revert to input_features. (ebezzam)
- 4a5c2f1 Processor only for encodec variant. (ebezzam)
- 82d1d46 Make style (ebezzam)
- b342d4f Add codebook weights to Encodec model. (ebezzam)
- 5490a72 Update modular and nits. (ebezzam)
- f1a6459 Make style (ebezzam)
- 5300652 Merge upstream/main into add-vocos-model (Manalelaidouni)
- 6dec301 Merge branch 'add-vocos-model' of https://github.com/Manalelaidouni/t… (Manalelaidouni)
- 20fe380 update feature extractor (Manalelaidouni)
- becbec8 update vocos modeling (Manalelaidouni)
- 5ecd7ec update vocos encodec modeling (Manalelaidouni)
- 8766aec correct config (Manalelaidouni)
- 72b7f7a update fixtures (Manalelaidouni)
- 0b6cf12 update and add pad test to feature extractor (Manalelaidouni)
- e30a043 update model tests (Manalelaidouni)
- 17f7a61 add processor tests (Manalelaidouni)
- decf1d3 update processor (Manalelaidouni)
- 4dbd064 nits (Manalelaidouni)
- d764c62 update docs (Manalelaidouni)
- 4174445 allow EncodecModel as audio_tokenizer (Manalelaidouni)
- c5012cb Merge remote-tracking branch 'upstream/main' into add-vocos-model (Manalelaidouni)
- ea16ce0 correct weight initialization (Manalelaidouni)
- 56c4a0c fix modeling + feature extractor tests (Manalelaidouni)
- 4c410ba update modular (Manalelaidouni)
- 5a86436 ruff styling (Manalelaidouni)
- bd47278 update modular 2 (Manalelaidouni)
- b7bac40 nits (Manalelaidouni)
- d55b0f1 update auto mapping + tests (Manalelaidouni)
- 0150a11 add codebook_weights buffer to initialization (Manalelaidouni)
- c504d3b allow unused config attribute (Manalelaidouni)
- 4cc6166 Merge branch 'main' into add-vocos-model (Manalelaidouni)
- 4b0aec4 skip training tests (Manalelaidouni)
- 39af5a1 update docs (Manalelaidouni)
- 3e8df70 add test decorators (Manalelaidouni)
- 373a5d2 Merge branch 'main' into add-vocos-model (Manalelaidouni)
docs/source/en/model_doc/vocos.md (new file, 113 lines)
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2023-06-01 and added to Hugging Face Transformers on 2026-01-23.*

# Vocos

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The Vocos model was proposed in [**Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis**](https://huggingface.co/papers/2306.00814) by Hubert Siuzdak.

Vocos is a GAN-based neural vocoder designed for high-quality audio synthesis in text-to-speech (TTS) pipelines and related tasks. Traditional time-domain vocoders rely on transposed convolutions for upsampling, which degrades temporal resolution across layers and introduces aliasing artifacts into the synthesized speech. Instead, Vocos represents audio signals in the time-frequency domain: it is trained to predict the complex Short-Time Fourier Transform (STFT) coefficients, magnitude and phase, and uses the inverse STFT (ISTFT) for upsampling, which maintains the same temporal resolution throughout the network and converts directly to a speech waveform.

Vocos delivers the same high audio quality as time-domain vocoders while achieving roughly 30× faster inference on CPU and outperforming HiFi-GAN in both VISQOL and PESQ scores.
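
The core idea, predicting a magnitude and a phase for every STFT bin and inverting them in a single step, can be sketched with PyTorch's built-in `torch.istft`. This is an illustrative sketch, not the model's internal implementation; the STFT parameters (`n_fft=1024`, `hop_length=256`) and the random head outputs are assumptions for the example:

```python
import torch

# Hypothetical head outputs: magnitude and phase for each STFT bin,
# shaped (batch, n_fft // 2 + 1, num_frames). Random here for illustration.
n_fft, hop_length, num_frames = 1024, 256, 550
magnitude = torch.rand(1, n_fft // 2 + 1, num_frames)
phase = torch.rand(1, n_fft // 2 + 1, num_frames) * 2 * torch.pi

# Combine into a complex spectrogram and invert in one shot: every frame
# maps back to hop_length samples, so temporal resolution is fixed by the
# STFT parameters rather than by a stack of upsampling layers.
spectrogram = torch.polar(magnitude, phase)  # magnitude * exp(i * phase)
waveform = torch.istft(
    spectrogram,
    n_fft=n_fft,
    hop_length=hop_length,
    win_length=n_fft,
    window=torch.hann_window(n_fft),
)
# With the default center=True, length = (num_frames - 1) * hop_length
print(waveform.shape)  # torch.Size([1, 140544])
```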

The abstract of the paper states the following:

*Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally-intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.*

Vocos is available in two variants:

- `VocosModel`: the mel-spectrogram based vocoder documented in this card.
- `VocosEncodecModel`: the EnCodec based vocoder, documented [here](https://huggingface.co/docs/transformers/model_doc/vocos_encodec).

You can find demos in this [post](https://gemelo-ai.github.io/vocos/). The original implementation can be found [here](https://github.com/gemelo-ai/vocos) and the original checkpoint is available [here](https://huggingface.co/charactr/vocos-mel-24khz).

This model was contributed by [Manal El Aidouni](https://huggingface.co/Manel) and [Eric Bezzam](https://huggingface.co/bezzam).

## Usage

You can extract mel-spectrogram features from an audio sample with `VocosFeatureExtractor` and feed them to `VocosModel` to generate high-quality audio. You can also plug `VocosModel` in as a standalone vocoder component within a larger audio generation pipeline (for example the [YuE](https://github.com/multimodal-art-projection/YuE) model).

```python
from datasets import load_dataset, Audio
from transformers import VocosFeatureExtractor, VocosModel
from scipy.io.wavfile import write as write_wav

# load model and feature extractor
model_id = "hf-audio/vocos-mel-24khz"
feature_extractor = VocosFeatureExtractor.from_pretrained(model_id)
model = VocosModel.from_pretrained(model_id, device_map="auto")
sampling_rate = feature_extractor.sampling_rate

# load an audio sample and resample it to the model's sampling rate
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
audio = ds[0]["audio"]["array"]

inputs = feature_extractor(audio=audio, sampling_rate=sampling_rate).to(model.device)
print(inputs.input_features.shape)  # (batch_size, num_mel_bins, num_frames) [1, 100, 550]

outputs = model(**inputs)
audio = outputs.audio
print(audio.shape)  # (batch_size, time) [1, 140544]

# save audio to file
write_wav("vocos.wav", sampling_rate, audio[0].detach().cpu().numpy())
```

When processing multiple audio files as a batch, you can remove the padding from the reconstructed audio using the `attention_mask` returned in the output:

```python
# audio1 and audio2 are 1-D waveform arrays at the model's sampling rate
inputs = feature_extractor(audio=[audio1, audio2], sampling_rate=sampling_rate, return_tensors="pt").to(model.device)

outputs = model(**inputs)
reconstructed_audio, attention_mask = outputs.audio, outputs.attention_mask

unpadded_audios = [
    reconstructed_audio[i][attention_mask[i].bool()].detach().cpu().numpy()
    for i in range(reconstructed_audio.shape[0])
]
```
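
If the reconstructed batch is right-padded (an assumption worth verifying for your inputs; the boolean-mask version above makes no such assumption), an equivalent way to trim is by the valid length of each item:

```python
# number of valid output samples per batch item
lengths = attention_mask.sum(dim=-1).tolist()
unpadded_audios = [
    reconstructed_audio[i, : lengths[i]].detach().cpu().numpy()
    for i in range(reconstructed_audio.shape[0])
]
```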

## VocosConfig

[[autodoc]] VocosConfig

## VocosFeatureExtractor

[[autodoc]] VocosFeatureExtractor

## VocosModel

[[autodoc]] VocosModel
- forward
docs/source/en/model_doc/vocos_encodec.md (new file, 182 lines)
<!--Copyright 2026 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
*This model was released on 2023-06-01 and added to Hugging Face Transformers on 2026-01-23.*

# VocosEncodec

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The VocosEncodec model is the EnCodec variant of the Vocos model that was proposed in [**Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis**](https://huggingface.co/papers/2306.00814) by Hubert Siuzdak.

Vocos is a GAN-based neural vocoder designed for high-quality audio synthesis in text-to-speech (TTS) pipelines and related tasks. Traditional time-domain vocoders rely on transposed convolutions for upsampling, which degrades temporal resolution across layers and introduces aliasing artifacts into the synthesized speech. Instead, Vocos represents audio signals in the time-frequency domain: it is trained to predict the complex Short-Time Fourier Transform (STFT) coefficients, magnitude and phase, and uses the inverse STFT (ISTFT) for upsampling, which maintains the same temporal resolution throughout the network and converts directly to a speech waveform.

Vocos delivers the same high audio quality as time-domain vocoders while achieving roughly 30× faster inference on CPU and outperforming HiFi-GAN in both VISQOL and PESQ scores.

The abstract of the paper states the following:

*Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally-intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.*

Vocos is available in two variants:

- `VocosModel`: the mel-spectrogram based vocoder, documented [here](https://huggingface.co/docs/transformers/model_doc/vocos).
- `VocosEncodecModel`: the EnCodec based vocoder documented in this card.

You can find demos in this [post](https://gemelo-ai.github.io/vocos/). The original code can be found [here](https://github.com/gemelo-ai/vocos) and the original checkpoint is available [here](https://huggingface.co/charactr/vocos-encodec-24khz).

This model was contributed by [Manal El Aidouni](https://huggingface.co/Manel) and [Eric Bezzam](https://huggingface.co/bezzam).

## Usage

Recent work has increasingly adopted learned neural audio codec features. Vocos supports [EnCodec](https://huggingface.co/docs/transformers/main/en/model_doc/encodec)-based reconstruction for high-quality audio generation through `VocosEncodecProcessor`: the EnCodec neural audio codec encodes the input audio into discrete tokens using Residual Vector Quantization (RVQ), and these codes are then converted into embeddings that serve as input to `VocosEncodecModel`.

A target `bandwidth` value is required by `VocosEncodecProcessor`. The supported bandwidths are [1.5, 3, 6, 12] kbps; the selected bandwidth determines the number of quantizers/codebooks used by EnCodec's RVQ, namely [2, 4, 8, 16] quantizers respectively.
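
As a quick sanity check on that mapping, the quantizer count follows directly from the target bitrate. The sketch below assumes EnCodec's 24 kHz configuration (75 frames per second and 1024-entry, i.e. 10-bit, codebooks, so 750 bps per quantizer):

```python
# Assumed EnCodec 24 kHz constants: 75 frames/s, 1024-entry (10-bit) codebooks
frame_rate = 75
bits_per_codebook = 10  # log2(1024)

for bandwidth_kbps in [1.5, 3.0, 6.0, 12.0]:
    num_quantizers = int(bandwidth_kbps * 1000 / (frame_rate * bits_per_codebook))
    print(f"{bandwidth_kbps} kbps -> {num_quantizers} quantizers")
# 1.5 kbps -> 2, 3.0 kbps -> 4, 6.0 kbps -> 8, 12.0 kbps -> 16
```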

```python
from datasets import load_dataset, Audio
from transformers import VocosEncodecModel, VocosEncodecProcessor
from scipy.io.wavfile import write as write_wav

bandwidth = 6.0

# load model and processor
model_id = "hf-audio/vocos-encodec-24khz"
processor = VocosEncodecProcessor.from_pretrained(model_id)
model = VocosEncodecModel.from_pretrained(model_id, device_map="auto")
sampling_rate = processor.feature_extractor.sampling_rate

# load an audio sample and resample it to the model's sampling rate
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
ds = ds.cast_column("audio", Audio(sampling_rate=sampling_rate))
audio = ds[0]["audio"]["array"]

inputs = processor(audio=audio, bandwidth=bandwidth, sampling_rate=sampling_rate).to(model.device)
print(inputs.input_features.shape)  # (batch_size, codebook_dim, num_frames) [1, 128, 440]

outputs = model(**inputs)
audio = outputs.audio
print(audio.shape)  # (batch_size, time) [1, 140800]

# save audio to file
write_wav("vocos_encodec.wav", sampling_rate, audio[0].detach().cpu().numpy())
```

### Reconstructing audio from quantized RVQ codes

The EnCodec variant can also process precomputed RVQ codes directly. You can provide quantized audio codes as input to `VocosEncodecProcessor`, which converts them into embeddings for `VocosEncodecModel`.

```python
import torch
from transformers import VocosEncodecModel, VocosEncodecProcessor

model = VocosEncodecModel.from_pretrained("hf-audio/vocos-encodec-24khz")
processor = VocosEncodecProcessor.from_pretrained("hf-audio/vocos-encodec-24khz")

# 8 codebooks, 200 frames (8 codebooks corresponds to a 6 kbps bandwidth)
audio_codes = torch.randint(low=0, high=1024, size=(8, 200))
inputs = processor(codes=audio_codes, bandwidth=6.0)
audio = model(**inputs).audio
```

### Reconstructing audio from Bark tokens

Bark is a text-to-speech model that encodes input text into discrete EnCodec RVQ codes and then uses EnCodec to convert those codes into an audio waveform. The Vocos vocoder is often paired with Bark in place of EnCodec's decoder for better audio quality.

Below is an example using the Transformers implementation of [Bark](./bark) to generate quantized codes from text, then decoding them with `VocosEncodecProcessor` and `VocosEncodecModel`:

```python
from transformers import VocosEncodecModel, VocosEncodecProcessor, BarkProcessor, BarkModel
from transformers.models.bark.generation_configuration_bark import BarkSemanticGenerationConfig, BarkCoarseGenerationConfig, BarkFineGenerationConfig
from scipy.io.wavfile import write as write_wav

# load the Bark model and processor
bark_id = "suno/bark-small"
bark_processor = BarkProcessor.from_pretrained(bark_id)
bark = BarkModel.from_pretrained(bark_id, device_map="auto")

text_prompt = "We've been messing around with this new model called Vocos."
bark_inputs = bark_processor(text_prompt, return_tensors="pt").to(bark.device)

# build generation configs for each stage
semantic_generation_config = BarkSemanticGenerationConfig(**bark.generation_config.semantic_config)
coarse_generation_config = BarkCoarseGenerationConfig(**bark.generation_config.coarse_acoustics_config)
fine_generation_config = BarkFineGenerationConfig(**bark.generation_config.fine_acoustics_config)

# generate the RVQ codes
semantic_tokens = bark.semantic.generate(
    **bark_inputs,
    semantic_generation_config=semantic_generation_config,
)

coarse_tokens = bark.coarse_acoustics.generate(
    semantic_tokens,
    semantic_generation_config=semantic_generation_config,
    coarse_generation_config=coarse_generation_config,
    codebook_size=bark.generation_config.codebook_size,
)

fine_tokens = bark.fine_acoustics.generate(
    coarse_tokens,
    semantic_generation_config=semantic_generation_config,
    coarse_generation_config=coarse_generation_config,
    fine_generation_config=fine_generation_config,
    codebook_size=bark.generation_config.codebook_size,
)

codes = fine_tokens.squeeze(0)  # (8 codebooks, num_frames)

# reconstruct audio from the codes with Vocos
vocos_id = "hf-audio/vocos-encodec-24khz"
processor = VocosEncodecProcessor.from_pretrained(vocos_id)
model = VocosEncodecModel.from_pretrained(vocos_id, device_map="auto")
sampling_rate = processor.feature_extractor.sampling_rate

# generate audio
inputs = processor(codes=codes.to("cpu"), bandwidth=6.0).to(model.device)
audio = model(**inputs).audio

# save audio to file
write_wav("vocos_bark.wav", sampling_rate, audio[0].detach().cpu().numpy())
```

## VocosEncodecConfig

[[autodoc]] VocosEncodecConfig

## VocosEncodecProcessor

[[autodoc]] VocosEncodecProcessor
- __call__

## VocosEncodecModel

[[autodoc]] VocosEncodecModel
- forward