[TTS][refactor] Part 1 - nemo.collections.tts.data (NVIDIA#6099)

* [TTS] refactor nemo.collections.tts.data
* update tutorials
* update line number
* update the year of the copyright header

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
hsiehjackson · Jun 2, 2023 · commit 2d48084 · 1 parent 364cd89

Showing 60 changed files with 607 additions and 656 deletions.
diff --git a/docs/source/tts/api.rst b/docs/source/tts/api.rst
@@ -86,14 +86,14 @@ To read more about them, see the `Base Classes <./intro.html#Base Classes>`__ se
 
 Dataset Processing Classes
 --------------------------
-.. autoclass:: nemo.collections.tts.torch.data.MixerTTSXDataset
+.. autoclass:: nemo.collections.tts.data.tts_dataset.MixerTTSXDataset
     :show-inheritance:
     :members:
 
-.. autoclass:: nemo.collections.tts.torch.data.TTSDataset
+.. autoclass:: nemo.collections.tts.data.tts_dataset.TTSDataset
     :show-inheritance:
     :members:
 
-.. autoclass:: nemo.collections.tts.torch.data.VocoderDataset
+.. autoclass:: nemo.collections.tts.data.tts_dataset.VocoderDataset
     :show-inheritance:
     :members:
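
Note: Python code that imports these classes directly needs the same path change as the autoclass directives above. A minimal sketch of a fallback import follows, assuming a script has to run against installs from both before and after this commit. The helper name `import_first_available` and the list `TTS_DATA_MODULES` are hypothetical and not part of NeMo; only the two module paths come from this diff.

```python
import importlib


def import_first_available(candidates, attr):
    """Return `attr` from the first module in `candidates` that provides it.

    Hypothetical helper for illustration; it is not a NeMo API.
    """
    errors = []
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
        except ImportError as exc:
            # Module (or one of its parents) is missing: try the next candidate.
            errors.append(f"{module_path}: {exc}")
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
        errors.append(f"{module_path}: no attribute {attr!r}")
    raise ImportError("; ".join(errors))


# New path introduced by this commit first, pre-refactor path as fallback.
TTS_DATA_MODULES = [
    "nemo.collections.tts.data.tts_dataset",
    "nemo.collections.tts.torch.data",
]
```

With NeMo installed, `TTSDataset = import_first_available(TTS_DATA_MODULES, "TTSDataset")` would pick whichever location exists. Code that only targets releases containing this commit can import from `nemo.collections.tts.data.tts_dataset` directly.
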
diff --git a/docs/source/tts/configs.rst b/docs/source/tts/configs.rst
@@ -16,14 +16,16 @@ Dataset Configuration
 
 Training, validation, and test parameters are specified using the ``model.train_ds``, ``model.validation_ds``, and ``model.test_ds`` sections in the configuration file, respectively. Depending on the task, there may be arguments specifying the sample rate of the audio files, supplementary data such as speech/text alignment priors and speaker IDs, etc., the threshold to trim leading and trailing silence from an audio signal, pitch normalization parameters, and so on. You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command-line at runtime.
 
-Any initialization parameter that is accepted for the class `nemo.collections.tts.torch.data.TTSDataset <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/torch/data.py#L78>`_  can be set in the config file. Refer to the `Dataset Processing Classes <./api.html#Datasets>`__ section of the API for a list of datasets classes and their respective parameters. An example TTS train and validation configuration should look similar to the following:
+Any initialization parameter that is accepted for the class `nemo.collections.tts.data.tts_dataset.TTSDataset
+<https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/data/tts_dataset.py#L80>`_  can be set in the config
+file. Refer to the `Dataset Processing Classes <./api.html#Datasets>`__ section of the API for a list of datasets classes and their respective parameters. An example TTS train and validation configuration should look similar to the following:
 
 .. code-block:: yaml
 
   model:
     train_ds:
       dataset:
-        _target_: nemo.collections.tts.torch.data.TTSDataset
+        _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
         manifest_filepath: ???
         sample_rate: 44100
         sup_data_path: ???
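
Note: the `_target_` value is a dotted import path, which is why every config in this commit has to follow the module move. The sketch below shows the general idea of resolving such a string to a class and calling it with the remaining keys. It is a simplified stand-in written with `importlib`, not the resolver NeMo or Hydra actually uses, and the function names `resolve_target` and `instantiate` here are illustrative only.

```python
import importlib


def resolve_target(target):
    """Split 'package.module.ClassName' and return the named attribute."""
    module_path, _, attr = target.rpartition(".")
    if not module_path:
        raise ValueError(f"expected a dotted path, got {target!r}")
    # Raises ImportError if the module path is stale, e.g. after a refactor.
    module = importlib.import_module(module_path)
    return getattr(module, attr)


def instantiate(cfg):
    """Call the class named by cfg['_target_'] with the other keys as kwargs."""
    kwargs = {key: value for key, value in cfg.items() if key != "_target_"}
    return resolve_target(cfg["_target_"])(**kwargs)
```

Under this model, a config that still points at `nemo.collections.tts.torch.data.TTSDataset` fails at import time once the old module is gone, so user-maintained configs need the same one-line change shown in the YAML above.
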

diff --git a/docs/source/tts/datasets.rst b/docs/source/tts/datasets.rst
@@ -1,7 +1,7 @@
 Data Preprocessing
 ==================
 
-NeMo TTS recipes support most of public TTS datasets that consist of multiple languages, multiple emotions, and multiple speakers. Current recipes covered English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), while the support for many other languages is under planning. NeMo provides corpus-specific data preprocessing scripts, as shown in the directory of `scripts/data_processing/tts/ <https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/>`_, to convert common public TTS datasets into the format expected by the dataloaders as defined in `nemo/collections/tts/torch/data.py <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/torch/data.py>`_. The ``nemo_tts`` collection expects each dataset to consist of a set of utterances in individual audio files plus a ``JSON`` manifest that describes the dataset, with information about one utterance per line. The audio files can be of any format supported by `Pydub <https://github.com/jiaaro/pydub>`_, though we recommend ``WAV`` files as they are the default and have been most thoroughly tested. NeMo supports any original sampling rates of audios, although our scripts of extracting supplementary data and model training all specify the common target sampling rates as either 44100 Hz or 22050 Hz. If the original sampling rate mismatches the target sampling rate, the `feature preprocess <https://github.com/NVIDIA/NeMo/blob/stable/nemo/collections/asr/parts/preprocessing/features.py#L124>`_ can automatically resample the original sampling rate into the target one.
+NeMo TTS recipes support most of public TTS datasets that consist of multiple languages, multiple emotions, and multiple speakers. Current recipes covered English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), while the support for many other languages is under planning. NeMo provides corpus-specific data preprocessing scripts, as shown in the directory of `scripts/data_processing/tts/ <https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/>`_, to convert common public TTS datasets into the format expected by the dataloaders as defined in `nemo/collections/tts/data/tts_dataset.py <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/data/tts_dataset.py>`_. The ``nemo_tts`` collection expects each dataset to consist of a set of utterances in individual audio files plus a ``JSON`` manifest that describes the dataset, with information about one utterance per line. The audio files can be of any format supported by `Pydub <https://github.com/jiaaro/pydub>`_, though we recommend ``WAV`` files as they are the default and have been most thoroughly tested. NeMo supports any original sampling rates of audios, although our scripts of extracting supplementary data and model training all specify the common target sampling rates as either 44100 Hz or 22050 Hz. If the original sampling rate mismatches the target sampling rate, the `feature preprocess <https://github.com/NVIDIA/NeMo/blob/stable/nemo/collections/asr/parts/preprocessing/features.py#L124>`_ can automatically resample the original sampling rate into the target one.
 
 There should be one ``JSON`` manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice versa. Each line of the manifest should be in the following format:
 

diff --git a/examples/tts/conf/aligner.yaml b/examples/tts/conf/aligner.yaml
@@ -60,7 +60,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -86,7 +86,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/de/fastpitch_align_22050_grapheme.yaml b/examples/tts/conf/de/fastpitch_align_22050_grapheme.yaml
@@ -70,7 +70,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -104,7 +104,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/de/fastpitch_align_22050_mix.yaml b/examples/tts/conf/de/fastpitch_align_22050_mix.yaml
@@ -85,7 +85,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -119,7 +119,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/de/fastpitch_align_44100_grapheme.yaml b/examples/tts/conf/de/fastpitch_align_44100_grapheme.yaml
@@ -70,7 +70,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -104,7 +104,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/de/fastpitch_align_44100_phoneme.yaml b/examples/tts/conf/de/fastpitch_align_44100_phoneme.yaml
@@ -70,7 +70,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -101,7 +101,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/es/fastpitch_align_44100.yaml b/examples/tts/conf/es/fastpitch_align_44100.yaml
@@ -57,7 +57,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -88,7 +88,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/es/fastpitch_align_44100_ipa.yaml b/examples/tts/conf/es/fastpitch_align_44100_ipa.yaml
@@ -70,7 +70,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -101,7 +101,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/es/fastpitch_align_44100_ipa_multi.yaml b/examples/tts/conf/es/fastpitch_align_44100_ipa_multi.yaml
@@ -67,7 +67,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -97,7 +97,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/fastpitch_align_44100.yaml b/examples/tts/conf/fastpitch_align_44100.yaml
@@ -79,7 +79,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -111,7 +111,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/fastpitch_align_ipa.yaml b/examples/tts/conf/fastpitch_align_ipa.yaml
@@ -81,7 +81,7 @@ model:
       use_stresses: true
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -112,7 +112,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/fastpitch_align_v1.05.yaml b/examples/tts/conf/fastpitch_align_v1.05.yaml
@@ -80,7 +80,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -112,7 +112,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/fastpitch_ssl.yaml b/examples/tts/conf/fastpitch_ssl.yaml
@@ -66,7 +66,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.FastPitchSSLDataset
+      _target_: nemo.collections.tts.data.tts_dataset.FastPitchSSLDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       ssl_content_emb_type: ${ssl_content_emb_type}
@@ -90,7 +90,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.FastPitchSSLDataset
+      _target_: nemo.collections.tts.data.tts_dataset.FastPitchSSLDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       ssl_content_emb_type: ${ssl_content_emb_type}

diff --git a/examples/tts/conf/hifigan/model/train_ds/train_ds.yaml b/examples/tts/conf/hifigan/model/train_ds/train_ds.yaml
@@ -1,5 +1,5 @@
 dataset:
-  _target_: "nemo.collections.tts.torch.data.VocoderDataset"
+  _target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
   manifest_filepath: ${train_dataset}
   sample_rate: ${sample_rate}
   n_segments: ${train_n_segments}

diff --git a/examples/tts/conf/hifigan/model/train_ds/train_ds_finetune.yaml b/examples/tts/conf/hifigan/model/train_ds/train_ds_finetune.yaml
@@ -1,5 +1,5 @@
 dataset:
-  _target_: "nemo.collections.tts.torch.data.VocoderDataset"
+  _target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
   manifest_filepath: ${train_dataset}
   sample_rate: ${sample_rate}
   n_segments: ${train_n_segments}

diff --git a/examples/tts/conf/hifigan/model/validation_ds/val_ds.yaml b/examples/tts/conf/hifigan/model/validation_ds/val_ds.yaml
@@ -1,5 +1,5 @@
 dataset:
-  _target_: "nemo.collections.tts.torch.data.VocoderDataset"
+  _target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
   manifest_filepath: ${validation_datasets}
   sample_rate: ${sample_rate}
   n_segments: ${val_n_segments}

diff --git a/examples/tts/conf/hifigan/model/validation_ds/val_ds_finetune.yaml b/examples/tts/conf/hifigan/model/validation_ds/val_ds_finetune.yaml
@@ -1,5 +1,5 @@
 dataset:
-  _target_: "nemo.collections.tts.torch.data.VocoderDataset"
+  _target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
   manifest_filepath: ${validation_datasets}
   sample_rate: ${sample_rate}
   n_segments: ${val_n_segments}

diff --git a/examples/tts/conf/mixer-tts-x.yaml b/examples/tts/conf/mixer-tts-x.yaml
@@ -75,7 +75,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.MixerTTSXDataset
+      _target_: nemo.collections.tts.data.tts_dataset.MixerTTSXDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -104,7 +104,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.MixerTTSXDataset
+      _target_: nemo.collections.tts.data.tts_dataset.MixerTTSXDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/mixer-tts.yaml b/examples/tts/conf/mixer-tts.yaml
@@ -80,7 +80,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${train_dataset}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}
@@ -108,7 +108,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: nemo.collections.tts.torch.data.TTSDataset
+      _target_: nemo.collections.tts.data.tts_dataset.TTSDataset
       manifest_filepath: ${validation_datasets}
       sample_rate: ${model.sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/rad-tts_dec.yaml b/examples/tts/conf/rad-tts_dec.yaml
@@ -68,7 +68,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${train_dataset}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}
@@ -114,7 +114,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${validation_datasets}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/rad-tts_dec_ipa.yaml b/examples/tts/conf/rad-tts_dec_ipa.yaml
@@ -71,7 +71,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${train_dataset}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}
@@ -117,7 +117,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${validation_datasets}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/rad-tts_feature_pred.yaml b/examples/tts/conf/rad-tts_feature_pred.yaml
@@ -67,7 +67,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${train_dataset}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}
@@ -113,7 +113,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${validation_datasets}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}

diff --git a/examples/tts/conf/rad-tts_feature_pred_ipa.yaml b/examples/tts/conf/rad-tts_feature_pred_ipa.yaml
@@ -69,7 +69,7 @@ model:
 
   train_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${train_dataset}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}
@@ -115,7 +115,7 @@ model:
 
   validation_ds:
     dataset:
-      _target_: "nemo.collections.tts.torch.data.TTSDataset"
+      _target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
       manifest_filepath: ${validation_datasets}
       sample_rate: ${sample_rate}
       sup_data_path: ${sup_data_path}
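
Note: the YAML changes above all follow one pattern, so custom configs kept outside the repository can be updated mechanically. The following is a rough sketch of such a rewrite, assuming only the four classes that this diff shows moving (`TTSDataset`, `MixerTTSXDataset`, `VocoderDataset`, `FastPitchSSLDataset`). Other names from the old module are left untouched because their new location is not shown here. `migrate_targets` is a hypothetical helper, not a script shipped with NeMo.

```python
import re

# Classes whose move to tts_dataset.py is visible in this diff.
MOVED_CLASSES = ("TTSDataset", "MixerTTSXDataset", "VocoderDataset", "FastPitchSSLDataset")

_OLD_TARGET = re.compile(
    r"nemo\.collections\.tts\.torch\.data\.(" + "|".join(MOVED_CLASSES) + r")\b"
)


def migrate_targets(text):
    """Rewrite old dataset paths in `text`.

    Returns a (new_text, number_of_replacements) tuple.
    """
    return _OLD_TARGET.subn(r"nemo.collections.tts.data.tts_dataset.\1", text)
```

To apply it to a file, read the text, call `migrate_targets`, and write the result back only when the count is non-zero. Reviewing the resulting diff before committing is advisable, since the pattern also matches the path inside comments and documentation strings.
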