git clone https://github.com/litagin02/Style-Bert-VITS2.git
cd Style-Bert-VITS2
python -m venv venv
venv\Scripts\activate
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
Then download the necessary models and the default TTS model, and set the global paths.
python initialize.py [--skip_default_models] [--dataset_root <path>] [--assets_root <path>]
Optional:
--skip_default_models
: Skip downloading the default voice models (use this if you only have to train your own models).--dataset_root
: Default:Data
. Root directory of the training dataset. The training dataset of{model_name}
should be placed in{dataset_root}/{model_name}
.--assets_root
: Default:model_assets
. Root directory of the model assets (for inference). In training, the model assets will be saved to{assets_root}/{model_name}
, and in inference, we load all the models from{assets_root}
.
The following audio formats are supported: ".wav", ".flac", ".mp3", ".ogg", ".opus", ".m4a".
python slice.py --model_name <model_name> [-i <input_dir>] [-m <min_sec>] [-M <max_sec>] [--time_suffix]
Required:
model_name
: Name of the speaker (to be used as the name of the trained model).
Optional:
input_dir
: Path to the directory containing the audio files to slice (default:inputs
)min_sec
: Minimum duration of the sliced audio files in seconds (default: 2).max_sec
: Maximum duration of the sliced audio files in seconds (default: 12).--time_suffix
: Make the filename end with -start_ms-end_ms when saving wav.
python transcribe.py --model_name <model_name>
Required:
model_name
: Name of the speaker (to be used as the name of the trained model).
Optional
--initial_prompt
: Initial prompt to use for the transcription (default value is specific to Japanese).--device
:cuda
orcpu
(default:cuda
).--language
:jp
,en
, oren
(default:jp
).--model
: Whisper model, default:large-v3
--compute_type
: default:bfloat16
. Only used if not--use_hf_whisper
.--use_hf_whisper
: Use Hugging Face's whisper model instead of default faster-whisper (HF whisper is faster but requires more VRAM).--batch_size
: Batch size (default: 16). Only used if--use_hf_whisper
.--num_beams
: Beam size (default: 1).--no_repeat_ngram_size
: N-gram size for no repeat (default: 10).
python preprocess_all.py -m <model_name> [--use_jp_extra] [-b <batch_size>] [-e <epochs>] [-s <save_every_steps>] [--num_processes <num_processes>] [--normalize] [--trim] [--val_per_lang <val_per_lang>] [--log_interval <log_interval>] [--freeze_EN_bert] [--freeze_JP_bert] [--freeze_ZH_bert] [--freeze_style] [--freeze_decoder] [--yomi_error <yomi_error>]
Required:
model_name
: Name of the speaker (to be used as the name of the trained model).
Optional:
--batch_size
,-b
: Batch size (default: 2).--epochs
,-e
: Number of epochs (default: 100).--save_every_steps
,-s
: Save every steps (default: 1000).--num_processes
: Number of processes (default: half of the number of CPU cores).--normalize
: Loudness normalize audio.--trim
: Trim silence.--freeze_EN_bert
: Freeze English BERT.--freeze_JP_bert
: Freeze Japanese BERT.--freeze_ZH_bert
: Freeze Chinese BERT.--freeze_style
: Freeze style vector.--freeze_decoder
: Freeze decoder.--use_jp_extra
: Use JP-Extra model.--val_per_lang
: Validation data per language (default: 0).--log_interval
: Log interval (default: 200).--yomi_error
: How to handle yomi errors (default:raise
: raise an error after preprocessing all texts,skip
: skip the texts with errors,use
: use the texts with errors by ignoring unknown characters).
Training settings are automatically loaded from the above process.
If NOT using JP-Extra model:
python train_ms.py [--repo_id <username>/<repo_name>]
If using JP-Extra model:
python train_ms_jp_extra.py [--repo_id <username>/<repo_name>] [--skip_default_style]
Optional:
--repo_id
: Hugging Face repository ID to upload the trained model to. You should have logged in usinghuggingface-cli login
before running this command.--skip_default_style
: Skip making the default style vector. Use this if you want to resume training (since the default style vector has been already made).