
End-to-end text-to-speech synthesis for Sanskrit using the VITS model.


SameerSri72/Sanskrit-text-to-speech


VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

This is an excerpt from the original paper by the following authors:

Jaehyeon Kim, Jungil Kong, and Juhee Son

In our recent paper, we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.

Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Figures: VITS at training and VITS at inference.

Installation:

Clone the repo

git clone git@github.com:SameerSri72/Sanskrit-text-to-speech.git

Setting up the conda env

This assumes you have navigated to the root folder after cloning it.

NOTE: This is tested under python3.11 with a conda env. For other Python versions, you might encounter version conflicts.

PyTorch 2.0: please refer to requirements.txt.

# install required packages (for pytorch 2.0)
conda create -n vits python=3.11
conda activate vits
pip install -r requirements.txt
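
As a quick, optional sanity check before training, the following minimal Python sketch (run inside the activated env) confirms the PyTorch install and GPU visibility; the exact version printed depends on your machine:

import torch
print(torch.__version__)          # expect a 2.x release
print(torch.cuda.is_available())  # True if a CUDA GPU is visible for training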

Download datasets

There are three options you can choose from: LJ Speech, VCTK, and custom dataset.

  1. LJ Speech: LJ Speech dataset. Used for single speaker TTS.
  2. VCTK: VCTK dataset. Used for multi-speaker TTS.
  3. Custom dataset: You can use your own dataset. (I have used Sanskrit text and audio files here; the data is only available with me locally and cannot be uploaded publicly due to copyright.)

Custom dataset

  1. create a folder with wav files
  2. create a configuration file in configs. Change the following fields in custom_base.json:
{
  "data": {
    "training_files": "filelists/custom_audio_text_train_filelist.txt.cleaned", // path to the cleaned training filelist (a sample filelist line is sketched after these steps)
    "validation_files": "filelists/custom_audio_text_val_filelist.txt.cleaned", // path to the cleaned validation filelist
    "text_cleaners": ["sanskrit_cleaners"], // text cleaner
    "bits_per_sample": 16, // bit depth of wav files
    "sampling_rate": 22050, // sampling rate if you resampled your wav files
    ...
    "n_speakers": 0, // number of speakers in your dataset if you use the multi-speaker setting
    "cleaned_text": true // if you already cleaned your text (see text_phonemizer.ipynb), set this to true
  },
  ...
}
  3. install espeak-ng (optional)

NOTE: This is required for preprocess.py and the inference.ipynb notebook to work. If you don't need it, you can skip this step. Please refer to espeak-ng.

  4. preprocess text

You can do this step by step (no phonemization is needed for Sanskrit). Link your dataset folder so that the wav paths in the filelists resolve:

ln -s /path/to/custom_dataset DUMMY3
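
Since the config above references a sanskrit_cleaners entry, the cleaner itself is typically defined in text/cleaners.py of the codebase. The following is only an illustrative sketch of what such a cleaner might do (whitespace normalization and simple punctuation mapping); the real function in this repo may differ:

import re

def sanskrit_cleaners(text):
    # Illustrative sketch only; the actual cleaner in text/cleaners.py may differ.
    text = text.replace('॥', '।')      # hypothetical: normalize double danda to single danda
    text = re.sub(r'\s+', ' ', text)    # collapse runs of whitespace
    return text.strip()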
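
For reference, the filelists named in the config above are plain text files with one pipe-separated entry per line: a wav path followed by the (cleaned) transcript in the single-speaker setting, with an extra speaker-id column in the multi-speaker filelists used by train_ms.py. A hypothetical cleaned Sanskrit filelist line (the filename is made up; the path points at the DUMMY3 symlink created above) might look like:

DUMMY3/sans_0001.wav|हे भगवन् त्वम् आत्मानम् लोकानुग्रहार्थम् ब्रह्मरूपेण उत्पिपादयिषितम् स्वस्वरूपम् आत्मनैव वेत्सि जानासि ।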

Training Examples

# LJ Speech
python train.py -c configs/ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vctk_base.json -m vctk_base

# Sanskrit
python train.py -c configs/sans_base.json -m sans_base

# Custom dataset (multi-speaker)
python train_ms.py -c configs/custom_base.json -m custom_base

Inference Example

After training, refer to infer.ipynb.
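
For a rough idea of what the notebook does, here is a minimal inference sketch adapted from the upstream VITS inference notebook; the module names come from the VITS codebase, and the checkpoint path logs/sans_base/G_latest.pth is a placeholder for whichever generator checkpoint your training run produced:

import torch
from scipy.io.wavfile import write

import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

hps = utils.get_hparams_from_file("configs/sans_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model).cuda()
net_g.eval()
utils.load_checkpoint("logs/sans_base/G_latest.pth", net_g, None)  # placeholder checkpoint path

text = "हे भगवन् त्वम् आत्मानम् लोकानुग्रहार्थम् ब्रह्मरूपेण उत्पिपादयिषितम् स्वस्वरूपम् आत्मनैव वेत्सि जानासि ।"
seq = text_to_sequence(text, hps.data.text_cleaners)
if hps.data.add_blank:
    seq = commons.intersperse(seq, 0)  # interleave blanks between symbols, as done during training
x = torch.LongTensor(seq).unsqueeze(0).cuda()
x_lengths = torch.LongTensor([x.size(1)]).cuda()

with torch.no_grad():
    # noise_scale / noise_scale_w control the diversity of pitch and rhythm (the one-to-many mapping)
    audio = net_g.infer(x, x_lengths, noise_scale=.667, noise_scale_w=0.8,
                        length_scale=1)[0][0, 0].cpu().numpy()
write("sample.wav", hps.data.sampling_rate, audio)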

Generated Sample Audio from Sanskrit Text:

Sanskrit Text: हे भगवन् त्वम् आत्मानम् लोकानुग्रहार्थम् ब्रह्मरूपेण उत्पिपादयिषितम् स्वस्वरूपम् आत्मनैव वेत्सि जानासि ।

Sanskrit Text: निरन्तरासु नीरन्ध्रासु अन्तरे मध्ये वातो यासाम् तादृश्यः या वृष्टयः तासु अन्तरवातवृष्टिषु

Acknowledgements

  • This repo is based on VITS
  • Also thanks to Daniil Robnikov for his work.
