In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Figures (from the paper): VITS at training and VITS at inference.
Clone the repo:
git clone git@github.com:SameerSri72/Sanskrit-text-to-speech.git
The following steps assume you have navigated to the repository root folder after cloning.
NOTE: This is tested under python3.11 with a conda env. For other python versions, you might encounter version conflicts.
Dependencies: PyTorch 2.0. For the full list of required packages, please refer to requirements.txt.
# install required packages (for pytorch 2.0)
conda create -n vits python=3.11
conda activate vits
pip install -r requirements.txt
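After installing the requirements, a quick optional sanity check is to confirm the PyTorch version and GPU visibility from Python. This is only a minimal sketch, not part of the repo:

```python
# Optional sanity check for the training environment.
import torch

print("PyTorch version:", torch.__version__)         # expect a 2.x release
print("CUDA available:", torch.cuda.is_available())  # training is much faster on a GPU
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```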
There are three options you can choose from: LJ Speech, VCTK, and custom dataset.
- LJ Speech: LJ Speech dataset. Used for single speaker TTS.
- VCTK: VCTK dataset. Used for multi-speaker TTS.
- Custom dataset: You can use your own dataset. (I have used Sanskrit text and audio files here; the data is only available to me locally and cannot be uploaded publicly due to copyright.)
- create a folder with wav files (if your audio is not already at the target sampling rate, see the resampling sketch after the config example below)
- create a configuration file in configs. Change the following fields in custom_base.json:
{
  "data": {
    "training_files": "filelists/custom_audio_text_train_filelist.txt.cleaned", // path to the cleaned training filelist
    "validation_files": "filelists/custom_audio_text_val_filelist.txt.cleaned", // path to the cleaned validation filelist
    "text_cleaners": ["sanskrit_cleaners"], // text cleaner
    "bits_per_sample": 16, // bit depth of the wav files
    "sampling_rate": 22050, // sampling rate if you resampled your wav files
    ...
    "n_speakers": 0, // number of speakers in your dataset if you use the multi-speaker setting
    "cleaned_text": true // if you already cleaned your text (see text_phonemizer.ipynb), set this to true
  },
  ...
}
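If your source audio is not already at the sampling rate set in the config (22050 Hz in the example above), resample it before training. Below is a minimal sketch using librosa and soundfile; treat these two packages and the folder paths as assumptions (any resampling tool works):

```python
# Minimal resampling sketch (assumes librosa and soundfile are installed).
# Converts every wav in a source folder to 22050 Hz, 16-bit PCM.
from pathlib import Path
import librosa
import soundfile as sf

SRC = Path("/path/to/raw_wavs")        # hypothetical source folder
DST = Path("/path/to/custom_dataset")  # folder your filelists will point to
DST.mkdir(parents=True, exist_ok=True)

for wav_path in SRC.glob("*.wav"):
    audio, _ = librosa.load(wav_path, sr=22050)                        # load and resample
    sf.write(str(DST / wav_path.name), audio, 22050, subtype="PCM_16")  # write 16-bit wav
```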
- install espeak-ng (optional)
NOTE: This is required for preprocess.py and the inference.ipynb notebook to work. If you don't need them, you can skip this step. Please refer to espeak-ng.
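For context, espeak-ng is only exercised by phoneme-based cleaners (such as english_cleaners2); the Sanskrit cleaner does not phonemize. A minimal sketch of what such a cleaner does under the hood, assuming the phonemizer package with the espeak backend is installed:

```python
# Minimal phonemization sketch (only relevant for phoneme-based cleaners,
# e.g. english_cleaners2; not needed for the Sanskrit pipeline).
from phonemizer import phonemize

text = "Text to speech converts written words into audio."
phonemes = phonemize(
    text,
    language="en-us",  # espeak-ng language/voice code
    backend="espeak",  # requires the espeak-ng binary installed system-wide
    strip=True,
)
print(phonemes)
```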
- preprocess text
You can do this step by step (no phonemization is needed for Sanskrit):
- create a dataset of text files. See text_dataset.ipynb
- phonemize or just clean up the text. Please refer to text_phonemizer.ipynb
- create filelists and their cleaned versions with a train/test split. See text_split.ipynb (a minimal sketch is also shown after this list)
- rename or create a link to the dataset folder. Please refer to text_split.ipynb
ln -s /path/to/custom_dataset DUMMY3
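The notebooks cover the filelist step end to end; the sketch below only illustrates the single-speaker VITS filelist format (one `wav_path|text` line per utterance) with a hypothetical transcripts dict and a simple train/validation split:

```python
# Minimal filelist sketch (hypothetical data; the real workflow is in
# text_dataset.ipynb, text_phonemizer.ipynb and text_split.ipynb).
import random

# wav path (relative to the DUMMY3 link) -> already-cleaned transcript
transcripts = {
    "DUMMY3/utt_0001.wav": "cleaned sanskrit text for utterance one",
    "DUMMY3/utt_0002.wav": "cleaned sanskrit text for utterance two",
    # ...
}

items = [f"{path}|{text}" for path, text in transcripts.items()]
random.shuffle(items)
split = int(0.95 * len(items))  # e.g. 95% train / 5% validation

# Assumes the filelists/ folder from the repo; names match the config example above.
with open("filelists/custom_audio_text_train_filelist.txt.cleaned", "w", encoding="utf-8") as f:
    f.write("\n".join(items[:split]) + "\n")
with open("filelists/custom_audio_text_val_filelist.txt.cleaned", "w", encoding="utf-8") as f:
    f.write("\n".join(items[split:]) + "\n")
```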
# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base
# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base
# Sanskrit
python train.py -c configs/sans_base.json -m sans_base
# Custom dataset (multi-speaker)
python train_ms.py -c configs/custom_base.json -m custom_base
After training, refer to infer.ipynb for inference.
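For orientation, here is a condensed inference sketch following the pattern of the original VITS inference notebook; the config name, checkpoint path, and input text are placeholders for your own run, and infer.ipynb remains the reference:

```python
# Condensed inference sketch in the style of the original VITS inference notebook.
# Config, checkpoint path, and text below are placeholders; adapt to your run.
import torch
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/sans_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model)
net_g.eval()
utils.load_checkpoint("logs/sans_base/G_100000.pth", net_g, None)  # placeholder checkpoint

# Text must go through the same cleaners used for training.
# If add_blank is true in your config, the notebook also intersperses blank
# tokens (commons.intersperse); see infer.ipynb for that variant.
seq = text_to_sequence("your cleaned sanskrit text here", hps.data.text_cleaners)
x = torch.LongTensor(seq).unsqueeze(0)
x_lengths = torch.LongTensor([x.size(1)])

with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1.0)[0][0, 0].cpu().numpy()
# `audio` is a float waveform at hps.data.sampling_rate; save it with any wav writer.
```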