Commit b5aad6a

Better model building instructions

1 parent 55513e8 · commit b5aad6a

4 files changed: +30 −20 lines

dnn/datasets.txt

Lines changed: 3 additions & 13 deletions

```diff
@@ -1,8 +1,6 @@
-The following datasets can be used to train a language-independent LPCNet model.
-A good choice is to include all the data from these datasets, except for
-hi_fi_tts for which only a small subset is recommended (since it's very large
-but has few speakers). Note that this data typically needs to be resampled
-before it can be used.
+The following datasets can be used to train a language-independent FARGAN model
+and a Deep REDundancy (DRED) model. Note that this data typically needs to be
+resampled before it can be used.
 
 https://www.openslr.org/resources/30/si_lk.tar.gz
 https://www.openslr.org/resources/32/af_za.tar.gz
@@ -61,7 +59,6 @@ https://www.openslr.org/resources/83/welsh_english_female.zip
 https://www.openslr.org/resources/83/welsh_english_male.zip
 https://www.openslr.org/resources/86/yo_ng_female.zip
 https://www.openslr.org/resources/86/yo_ng_male.zip
-https://www.openslr.org/resources/109/hi_fi_tts_v0.tar.gz
 
 The corresponding citations for all these datasets are:
 
@@ -164,10 +161,3 @@ The corresponding citations for all these datasets are:
 doi = {10.21437/Interspeech.2020-1096},
 url = {http://dx.doi.org/10.21437/Interspeech.2020-1096},
 }
-
-@article{bakhturina2021hi,
-title={{Hi-Fi Multi-Speaker English TTS Dataset}},
-author={Bakhturina, Evelina and Lavrukhin, Vitaly and Ginsburg, Boris and Zhang, Yang},
-journal={arXiv preprint arXiv:2104.01497},
-year={2021}
-}
```
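The note that this data typically needs resampling can be sanity-checked after the fact: a raw 16-bit mono PCM file at 16 kHz holds exactly 32000 bytes per second. A minimal sketch (the helper name is hypothetical, not from the commit):

```python
# Sketch: check the duration of a raw 16-bit mono PCM file,
# assuming 16 kHz sample rate (2 bytes per sample). The function
# name is a placeholder for illustration only.
import os

SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2

def pcm_duration_seconds(path):
    """Duration in seconds of a raw 16-bit mono PCM file."""
    n_bytes = os.path.getsize(path)
    if n_bytes % BYTES_PER_SAMPLE != 0:
        raise ValueError("odd byte count: not 16-bit PCM?")
    return n_bytes / (BYTES_PER_SAMPLE * SAMPLE_RATE)
```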

dnn/torch/rdovae/README.md

Lines changed: 14 additions & 7 deletions

````diff
@@ -8,24 +8,31 @@ skip straight to the Inference section.
 
 ## Data preparation
 
+First, fetch all the data from the datasets.txt file using:
+```
+./download_datasets.sh
+```
+
+Then concatenate and resample the data into a single 16-kHz file:
+```
+./process_speech.sh
+```
+The script will produce an all_speech.pcm speech file in raw 16-bit PCM format.
+
+
 For data preparation you need to build Opus as detailed in the top-level README.
 You will need to use the --enable-dred configure option.
 The build will produce an executable named "dump_data".
 To prepare the training data, run:
 ```
-./dump_data -train in_speech.pcm out_features.f32 out_speech.pcm
+./dump_data -train all_speech.pcm all_features.f32 /dev/null
 ```
-Where the in_speech.pcm speech file is a raw 16-bit PCM file sampled at 16 kHz.
-The speech data used for training the model can be found at:
-https://media.xiph.org/lpcnet/speech/tts_speech_negative_16k.sw
-The out_speech.pcm file isn't needed for DRED, but it is needed to train
-the FARGAN vocoder (see dnn/torch/fargan/ for details).
 
 ## Training
 
 To perform training, run the following command:
 ```
-python ./train_rdovae.py --cuda-visible-devices 0 --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 out_features.f32 output_dir
+python ./train_rdovae.py --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 all_features.f32 output_dir
 ```
 The final model will be in output_dir/checkpoints/chechpoint_400.pth.
````
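The all_features.f32 file written by dump_data is a headerless stream of little-endian 32-bit floats. A minimal sketch for loading it into per-frame tuples; the per-frame feature count here is a placeholder assumption, not taken from the commit (the real width depends on the dump_data build):

```python
# Sketch: read a headerless float32 feature file such as
# all_features.f32. NB_FEATURES is a hypothetical frame width
# used only for illustration.
import struct

NB_FEATURES = 36  # assumption, not specified in this commit

def read_features(path, n_features=NB_FEATURES):
    """Return a list of frames, each a tuple of n_features floats."""
    with open(path, "rb") as f:
        data = f.read()
    floats = struct.unpack("<%df" % (len(data) // 4), data)
    # Drop any trailing partial frame rather than guessing at padding.
    n_frames = len(floats) // n_features
    return [floats[i * n_features:(i + 1) * n_features]
            for i in range(n_frames)]
```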

dnn/torch/rdovae/download_datasets.sh

Lines changed: 6 additions & 0 deletions

```diff
@@ -0,0 +1,6 @@
+mkdir datasets
+cd datasets
+for i in `grep https ../../../datasets.txt`
+do
+wget $i
+done
```
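The download loop above selects every line of datasets.txt containing an https URL. The same selection step in Python, as a small sketch (the function name is illustrative only):

```python
# Sketch: the `grep https` step of the download script in Python —
# collect dataset URLs from a datasets.txt-style file.
def dataset_urls(path):
    """Return all lines containing an https URL, whitespace-stripped."""
    with open(path) as f:
        return [line.strip() for line in f if "https" in line]
```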

dnn/torch/rdovae/process_speech.sh

Lines changed: 7 additions & 0 deletions

```diff
@@ -0,0 +1,7 @@
+#!/bin/sh
+
+cd datasets
+
+#parallel -j +2 'unzip -n {}' ::: *.zip
+
+find . -name "*.wav" | parallel -k -j 20 'sox --no-dither {} -t sw -r 16000 -c 1 -' > ../all_speech.sw
```
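The find | parallel | sox pipeline converts every .wav to raw 16 kHz mono 16-bit samples and concatenates them into one stream (note it writes all_speech.sw, while the README refers to all_speech.pcm; both denote raw 16-bit PCM). The concatenation part can be sketched in Python with the stdlib wave module — assuming, unlike sox, that the inputs are already 16 kHz mono 16-bit, since this sketch does no resampling or dithering:

```python
# Sketch: single-process analogue of the concatenation step,
# assuming inputs are already 16 kHz mono 16-bit WAV files.
# Unlike the sox pipeline, no resampling or dithering is done.
import wave
from pathlib import Path

def concatenate_wavs(src_dir, out_path):
    """Append the raw sample data of every .wav under src_dir to out_path."""
    with open(out_path, "wb") as out:
        for wav_path in sorted(Path(src_dir).rglob("*.wav")):
            with wave.open(str(wav_path), "rb") as w:
                out.write(w.readframes(w.getnframes()))
```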
