Skip to content

Latest commit

 

History

History
148 lines (115 loc) · 6.78 KB

README.md

File metadata and controls

148 lines (115 loc) · 6.78 KB

NaturalSpeech 2

  • This code is a unofficial implementation of NaturalSpeech 2.
  • The algorithm is based on the following paper:
Shen, K., Ju, Z., Tan, X., Liu, Y., Leng, Y., He, L., ... & Bian, J. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv preprint arXiv:2304.09116.

Modifications from Paper

  • The structure is derived from NaturalSpeech 2, but I made several modifications.
  • The audio codec has been changed to HifiCodec from AcademiCodec.
    • This is done to reduce the time spent training a separate audio codec.
    • The model uses 22.05Khz audio, but no audio resampling is applied.
    • To maintain similarity with the paper, it may be better to apply Google's SoundStream instead of HifiCodec, but I couldn't apply SoundStream to this repository because official pyTorch source code or pretrained model was not provided.
    • Meta's Encodec 24K version was also tested, but it could not be trained.
  • About CE-RVQ
    • I have observed that the quality of the model with CE-RVQ applied is decreased, so it has been removed from the current implementation.
    • However, this does not necessarily mean that CE-RVQ is ineffective.
    • It could be due to the change of codec in this repository or the presence of bugs in the implemented module.
    • This aspect requires further validation and investigation.
  • Information on the segment length σ of the speech prompt during training was not found in the paper and was arbitrarily set.
    • The σ = 3, 5, and 10 seconds used in the evaluation of paper are too long to apply to both the variance predictor and diffusion during training.
    • To ensure stability in pattern usage, half the length of the shortest pattern used in each training is set as σ for each training.
  • The target duration is obtained through Alignment learning framework (ALF), rather than being brought in externally.
    • Using external modules such as Montreal Force Alignment (MFA) may have benefits in terms of training speed or stability, but I prioritized simplifying the training process.
  • Padding is applied between tokens like 'A <P> B <P> C ....'
    • I could not verify whether there was a difference in performance depending on its usage.

Supported dataset

Hyper parameters

Before proceeding, please set the pattern, inference, and checkpoint paths in Hyper_Parameters.yaml according to your environment.

  • Sound

    • Setting basic sound parameters.
  • Tokens

    • The number of token.
    • After pattern generating, you can see which tokens are included in the dataset at Token_Path.
  • Audio_Codec

    • Setting the audio codec.
    • This repository is using HifiCodec, so only the size of the latents output from HifiCodec's encoder is set for reference in other modules.
  • Train

    • Setting the parameters of training.
  • Inference_Batch_Size

    • Setting the batch size when inference
  • Inference_Path

    • Setting the inference path
  • Checkpoint_Path

    • Setting the checkpoint path
  • Log_Path

    • Setting the tensorboard log path
  • Use_Mixed_Precision

    • Setting using mixed precision
  • Use_Multi_GPU

    • Setting using multi gpu
    • By the nvcc problem, Only linux supports this option.
    • If this is True, device parameter is also multiple like 0,1,2,3.
    • And you have to change the training command also: please check multi_gpu.sh.
  • Device

    • Setting which GPU devices are used in multi-GPU enviornment.
    • Or, if using only CPU, please set '-1'. (But, I don't recommend while training.)

Generate pattern

python Pattern_Generate.py [parameters]

Parameters

  • -lj
    • The path of LJSpeech dataset
  • -vctk
    • The path of VCTK dataset
  • -libri
    • The path of LbiriTTS dataset
  • -hp
    • The path of hyperparameter.

About phonemizer

  • To phoneme string generate, this repository uses phonimizer library.
  • Please refer here to install phonemizer and backend
  • In Windows, you need more setting to use phonemizer.
    • Please refer here
    • In conda enviornment, the following commands are useful.
      conda env config vars set PHONEMIZER_ESPEAK_PATH='C:\Program Files\eSpeak NG'
      conda env config vars set PHONEMIZER_ESPEAK_LIBRARY='C:\Program Files\eSpeak NG\libespeak-ng.dll'

Run

Command

Single GPU

python Train.py -hp <path> -s <int>
  • -hp <path>

    • The hyper paramter file path
    • This is required.
  • -s <int>

    • The resume step parameter.
    • Default is 0.
    • If value is 0, model try to search the latest checkpoint.

Multi GPU

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 OMP_NUM_THREADS=32 python -m torch.distributed.launch --nproc_per_node=8 Train.py --hyper_parameters Hyper_Parameters.yaml --port 54322

Checkpoint

Dataset SR Link
VCTK 22050 Google drive
  • This checkpoint was trained in a single GPU environment (RTX4090 x 1) with the VCTK dataset.
  • It has limited quality compared to the official demo, and there are issues with the generation for unseen reference.
  • While checking the loss flow, I observed the possibility of loss decreasing as training progresses, but it doesn't guarantee an improvement in quality.
  • Unfortunately, I don't have personal resources for testing beyond the current state, so I'm releasing the checkpoint after discontinuation.