Commit b5aad6a

Better model building instructions

1 parent 55513e8 · commit b5aad6a

4 files changed: +30 −20 lines

dnn/datasets.txt

Lines changed: 3 additions & 13 deletions

```diff
@@ -1,8 +1,6 @@
-The following datasets can be used to train a language-independent LPCNet model.
-A good choice is to include all the data from these datasets, except for
-hi_fi_tts for which only a small subset is recommended (since it's very large
-but has few speakers). Note that this data typically needs to be resampled
-before it can be used.
+The following datasets can be used to train a language-independent FARGAN model
+and a Deep REDundancy (DRED) model. Note that this data typically needs to be
+resampled before it can be used.
 
 https://www.openslr.org/resources/30/si_lk.tar.gz
 https://www.openslr.org/resources/32/af_za.tar.gz
@@ -61,7 +59,6 @@ https://www.openslr.org/resources/83/welsh_english_female.zip
 https://www.openslr.org/resources/83/welsh_english_male.zip
 https://www.openslr.org/resources/86/yo_ng_female.zip
 https://www.openslr.org/resources/86/yo_ng_male.zip
-https://www.openslr.org/resources/109/hi_fi_tts_v0.tar.gz
 
 The corresponding citations for all these datasets are:
 
@@ -164,10 +161,3 @@ The corresponding citations for all these datasets are:
 doi = {10.21437/Interspeech.2020-1096},
 url = {http://dx.doi.org/10.21437/Interspeech.2020-1096},
 }
-
-@article{bakhturina2021hi,
-title={{Hi-Fi Multi-Speaker English TTS Dataset}},
-author={Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
-journal={arXiv preprint arXiv:2104.01497},
-year={2021}
-}
```
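The note that this data typically needs resampling can be sanity-checked after the fact: a raw 16-bit mono PCM file at 16 kHz holds exactly 32000 bytes per second. A minimal sketch (the helper name is hypothetical, not from the commit):

```python
# Sketch: check the duration of a raw 16-bit mono PCM file,
# assuming 16 kHz sample rate (2 bytes per sample). The function
# name is a placeholder for illustration only.
import os

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2

def pcm_duration_seconds(path):
    """Duration in seconds of a raw 16-bit mono PCM file."""
    n_bytes = os.path.getsize(path)
    if n_bytes % BYTES_PER_SAMPLE != 0:
        raise ValueError("odd byte count: not 16-bit PCM?")
    return n_bytes / (BYTES_PER_SAMPLE * SAMPLE_RATE)
```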

dnn/torch/rdovae/README.md

Lines changed: 14 additions & 7 deletions

````diff
@@ -8,24 +8,31 @@ skip straight to the Inference section.
 
 ## Data preparation
 
+First, fetch all the data from the datasets.txt file using:
+```
+./download_datasets.sh
+```
+
+Then concatenate and resample the data into a single 16-kHz file:
+```
+./process_speech.sh
+```
+The script will produce an all_speech.pcm speech file in raw 16-bit PCM format.
+
+
 For data preparation you need to build Opus as detailed in the top-level README.
 You will need to use the --enable-dred configure option.
 The build will produce an executable named "dump_data".
 To prepare the training data, run:
 ```
-./dump_data -train in_speech.pcm out_features.f32 out_speech.pcm
+./dump_data -train all_speech.pcm all_features.f32 /dev/null
 ```
-Where the in_speech.pcm speech file is a raw 16-bit PCM file sampled at 16 kHz.
-The speech data used for training the model can be found at:
-https://media.xiph.org/lpcnet/speech/tts_speech_negative_16k.sw
-The out_speech.pcm file isn't needed for DRED, but it is needed to train
-the FARGAN vocoder (see dnn/torch/fargan/ for details).
 
 ## Training
 
 To perform training, run the following command:
 ```
-python ./train_rdovae.py --cuda-visible-devices 0 --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 out_features.f32 output_dir
+python ./train_rdovae.py --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 all_features.f32 output_dir
 ```
 The final model will be in output_dir/checkpoints/chechpoint_400.pth.
````
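The all_features.f32 file written by dump_data is a headerless stream of little-endian 32-bit floats. A minimal sketch for loading it into per-frame tuples; the per-frame feature count here is a placeholder assumption, not taken from the commit (the real width depends on the dump_data build):

```python
# Sketch: read a headerless float32 feature file such as
# all_features.f32. NB_FEATURES is a hypothetical frame width
# used only for illustration.
import struct

NB_FEATURES = 36  # assumption, not specified in this commit

def read_features(path, n_features=NB_FEATURES):
    """Return a list of frames, each a tuple of n_features floats."""
    with open(path, "rb") as f:
        data = f.read()
    floats = struct.unpack("<%df" % (len(data) // 4), data)
    # Drop any trailing partial frame rather than guessing at padding.
    n_frames = len(floats) // n_features
    return [floats[i * n_features:(i + 1) * n_features]
            for i in range(n_frames)]
```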

dnn/torch/rdovae/download_datasets.sh

Lines changed: 6 additions & 0 deletions

```diff
@@ -0,0 +1,6 @@
+mkdir datasets
+cd datasets
+for i in `grep https ../../../datasets.txt`
+do
+wget $i
+done
```
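The download loop above selects every line of datasets.txt containing an https URL. The same selection step in Python, as a small sketch (the function name is illustrative only):

```python
# Sketch: the `grep https` step of the download script in Python —
# collect dataset URLs from a datasets.txt-style file.
def dataset_urls(path):
    """Return all lines containing an https URL, whitespace-stripped."""
    with open(path) as f:
        return [line.strip() for line in f if "https" in line]
```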

dnn/torch/rdovae/process_speech.sh

Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+#!/bin/sh
+
+cd datasets
+
+#parallel -j +2 'unzip -n {}' ::: *.zip
+
+find . -name "*.wav" | parallel -k -j 20 'sox --no-dither {} -t sw -r 16000 -c 1 -' > ../all_speech.sw
```
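The find | parallel | sox pipeline converts every .wav to raw 16 kHz mono 16-bit samples and concatenates them into one stream (note it writes all_speech.sw, while the README refers to all_speech.pcm; both denote raw 16-bit PCM). The concatenation part can be sketched in Python with the stdlib wave module — assuming, unlike sox, that the inputs are already 16 kHz mono 16-bit, since this sketch does no resampling or dithering:

```python
# Sketch: single-process analogue of the concatenation step,
# assuming inputs are already 16 kHz mono 16-bit WAV files.
# Unlike the sox pipeline, no resampling or dithering is done.
import wave
from pathlib import Path

def concatenate_wavs(src_dir, out_path):
    """Append the raw sample data of every .wav under src_dir to out_path."""
    with open(out_path, "wb") as out:
        for wav_path in sorted(Path(src_dir).rglob("*.wav")):
            with wave.open(str(wav_path), "rb") as w:
                out.write(w.readframes(w.getnframes()))
```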
