VITS Fast Fine-tuning

This repo is a pipeline of VITS finetuning for fast speaker adaptation TTS, and any-to-any voice conversion. Forked from here to be adapted to Spanish. Ideally all changes should be done adding spanish as an option and not as a replacement, so that it then could be PRd to the main repo.

TODO list:

on checkpoint ./pretrained_models/G_trilingual.pth (actually our spanish model replacing G_trilingual) model has no iteration so setting to default 0
on checkpoint ./pretrained_models/G_trilingual.pth (actually our spanish model replacing G_trilingual) model has learning_rate so setting to default 0.001
on checkpoint ./pretrained_models/G_trilingual.pth (actually our spanish model replacing G_trilingual) model has optimizer but it is not a string

on checkpoint ./pretrained_models/D_trilingual.pth model has iteration but it is not a string
on checkpoint ./pretrained_models/D_trilingual.pth model has learning_rate but it is not a string
on checkpoint ./pretrained_models/D_trilingual.pth model has no optimizer  so setting to default AdamW

From above: Appareantly the model from coqui is not directly compatible with this, but using it runs training as if it was a new model, so, two options:
- Find a way to adapt the coqui model to this
- First do the CSS10 Dataset item above, once that is done use it to retrain a compatible model from scratch, using it as the base user voice. Attempting here. Instead of that will try to train with original repo for VITS model or its adaptation used to train the Chinese-Japanese models on repo from fork.
Ask if data_utils.py line 337 range second argument was correctly changed from 0 to -1, before it was missing checking the lowest bucket that on our data was empty (or maybe that bucket CAN NOT be empty). It seems to be okay to change that

中文文档请点击这里

VITS Fast Fine-tuning

This repo will guide you to add your own character voices, or even your own voice, into an existing VITS TTS model to make it able to do the following tasks in less than 1 hour:

Any-to-any voice conversion between you & any characters you added & preset characters
English, Japanese & Chinese Text-to-Speech synthesis with the characters you added & preset characters

Welcome to play around with the base model, a Trilingual Anime VITS!

Currently Supported Tasks:

Convert user's voice to characters listed here
Chinese, English, Japanese TTS with user's voice
Chinese, English, Japanese TTS with custom characters!

Currently Supported Characters for TTS & VC:

Umamusume Pretty Derby (Used as base model pretraining)
Sanoba Witch (Used as base model pretraining)
Genshin Impact (Used as base model pretraining)
Any character you wish as long as you have their voices!

Fine-tuning

It's recommended to perform fine-tuning on Google Colab because the original VITS has some dependencies that are difficult to configure.

How long does it take?

Install dependencies (2 min)
Record at least 20 your own voice, the content to read will be presented in UI, less than 20 words per sentence. (5~10 min)
Upload your character voices, which should be a .zip file, it's file structure should be like:

Your-zip-file.zip
├───Character_name_1
├   ├───xxx.wav
├   ├───...
├   ├───yyy.mp3
├   └───zzz.wav
├───Character_name_2
├   ├───xxx.wav
├   ├───...
├   ├───yyy.mp3
├   └───zzz.wav
├───...
├
└───Character_name_n
    ├───xxx.wav
    ├───...
    ├───yyy.mp3
    └───zzz.wav

Note that the format & name of the audio files does not matter as long as they are audio files.
Audio quality requirements: >=2s, <=10s per audio, background noise should be as less as possible. Audio quantity requirements: at least 10 per character, better if 20+ per character.
You can either choose to perform step 2, 3, or both, depending on your needs.

Fine-tune (30 min)
After everything is done, download the fine-tuned model & model config

Inference or Usage (Currently support Windows only)

Remember to download your fine-tuned model!
Download the latest release
Put your model & config file into the folder inference, make sure to rename the model to G_latest.pth and config file to finetune_speaker.json
The file structure should be as follows:

inference
├───inference.exe
├───...
├───finetune_speaker.json
└───G_latest.pth

run inference.exe, the browser should pop up automatically.

Name		Name	Last commit message	Last commit date
Latest commit History 132 Commits
.idea		.idea
configs		configs
monotonic_align		monotonic_align
ntbk		ntbk
text		text
user_voice		user_voice
LICENSE		LICENSE
README.md		README.md
README_EN.md		README_EN.md
README_ZH.md		README_ZH.md
VC_inference.py		VC_inference.py
attentions.py		attentions.py
commons.py		commons.py
data_utils.py		data_utils.py
demucs_denoise.py		demucs_denoise.py
download_model.py		download_model.py
finetune_speaker.py		finetune_speaker.py
losses.py		losses.py
mel_processing.py		mel_processing.py
models.py		models.py
models_infer.py		models_infer.py
modules.py		modules.py
preprocess.py		preprocess.py
requirements.txt		requirements.txt
requirements_infer.txt		requirements_infer.txt
transforms.py		transforms.py
user_voice_collect.py		user_voice_collect.py
utils.py		utils.py
video_transcribe.py		video_transcribe.py
voice_upload.py		voice_upload.py
whisper_transcribe.py		whisper_transcribe.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VITS Fast Fine-tuning

Currently Supported Tasks:

Currently Supported Characters for TTS & VC:

Fine-tuning

How long does it take?

Inference or Usage (Currently support Windows only)

About

Releases

Packages

Languages

License

lopezjuanma96/VITS-fast-fine-tuning

Folders and files

Latest commit

History

Repository files navigation

VITS Fast Fine-tuning

Currently Supported Tasks:

Currently Supported Characters for TTS & VC:

Fine-tuning

How long does it take?

Inference or Usage (Currently support Windows only)

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages