Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion

Code for the paper "Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion"

Shaojin Ding, Ricardo Gutierrez-Osuna

In INTERSPEECH 2019

This is a PyTorch implementation, based on the VQ-VAE-WaveRNN implementation at https://github.com/mkotha/WaveRNN.

Dataset: VCTK

Preparation

The preparation is similar to that at https://github.com/mkotha/WaveRNN. We repeat it here for convenience.

Requirements

  • Python 3.6 or newer
  • PyTorch with CUDA enabled
  • librosa
  • apex if you want to use FP16 (it probably doesn't work that well).

Create config.py

cp config.py.example config.py

Preparing VCTK

You can skip this section if you don't need a multi-speaker dataset.

  1. Download and uncompress the VCTK dataset.
  2. python preprocess_multispeaker.py /path/to/dataset/VCTK-Corpus/wav48 /path/to/output/directory
  3. In config.py, set multi_speaker_data_path to point to the output directory (see the example below).
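
A minimal sketch of the corresponding config.py entry (the path is illustrative; use the output directory from step 2):

# config.py
multi_speaker_data_path = '/path/to/output/directory'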

Usage

To run Group Latent Embedding:

$ python wavernn.py -m vqvae_group --num-group 41 --num-sample 10

The -m option tells the script which model to train. By default, it trains a vanilla VQ-VAE model.

Trained models are saved under the model_checkpoints directory.

By default, the script picks up the latest snapshot and continues training from there. To train a new model from scratch, use the --scratch option.

Every 50k steps, the model is run to generate test audio outputs. The output goes under the model_outputs directory.

When the -g option is given, the script produces the output using the saved model, rather than training it.
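
For example, to generate with the group latent embedding model trained above (illustrative; the flags simply mirror the training command):

$ python wavernn.py -m vqvae_group --num-group 41 --num-sample 10 -g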

--num-group specifies the number of groups, and --num-sample specifies the number of atoms in each group. Note that num-group times num-sample must equal the total number of atoms in the embedding dictionary (n_classes in class VectorQuantGroup in vector_quant.py).
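
As a quick sanity check, here is a sketch of that constraint (the value 410 for n_classes is an assumption matching the example command above; check vector_quant.py in your checkout):

# Illustrative check: --num-group * --num-sample must equal the dictionary size.
num_group = 41    # --num-group
num_sample = 10   # --num-sample
n_classes = 410   # assumed n_classes in VectorQuantGroup (vector_quant.py)
assert num_group * num_sample == n_classes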

Acknowledgement

The code is based on mkotha/WaveRNN.

Cite the work

@inproceedings{Ding2019,
  author={Shaojin Ding and Ricardo Gutierrez-Osuna},
  title={{Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={724--728},
  doi={10.21437/Interspeech.2019-1198},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1198}
}
