[TTS][refactor] Part 1 - nemo.collections.tts.data (NVIDIA#6099)
* [TTS] refactor nemo.collections.tts.data
* update tutorials
* update line number
* update the year of the copyright header

Signed-off-by: Xuesong Yang <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
XuesongYang authored and hsiehjackson committed Jun 2, 2023
1 parent 364cd89 commit 2d48084
Showing 60 changed files with 607 additions and 656 deletions.
6 changes: 3 additions & 3 deletions docs/source/tts/api.rst
@@ -86,14 +86,14 @@ To read more about them, see the `Base Classes <./intro.html#Base Classes>`__ se

Dataset Processing Classes
--------------------------
-.. autoclass:: nemo.collections.tts.torch.data.MixerTTSXDataset
+.. autoclass:: nemo.collections.tts.data.tts_dataset.MixerTTSXDataset
:show-inheritance:
:members:

-.. autoclass:: nemo.collections.tts.torch.data.TTSDataset
+.. autoclass:: nemo.collections.tts.data.tts_dataset.TTSDataset
:show-inheritance:
:members:

-.. autoclass:: nemo.collections.tts.torch.data.VocoderDataset
+.. autoclass:: nemo.collections.tts.data.tts_dataset.VocoderDataset
:show-inheritance:
:members:
6 changes: 4 additions & 2 deletions docs/source/tts/configs.rst
@@ -16,14 +16,16 @@ Dataset Configuration

Training, validation, and test parameters are specified using the ``model.train_ds``, ``model.validation_ds``, and ``model.test_ds`` sections in the configuration file, respectively. Depending on the task, there may be arguments specifying the sample rate of the audio files, supplementary data such as speech/text alignment priors and speaker IDs, etc., the threshold to trim leading and trailing silence from an audio signal, pitch normalization parameters, and so on. You may also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command-line at runtime.

-Any initialization parameter that is accepted for the class `nemo.collections.tts.torch.data.TTSDataset <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/torch/data.py#L78>`_ can be set in the config file. Refer to the `Dataset Processing Classes <./api.html#Datasets>`__ section of the API for a list of datasets classes and their respective parameters. An example TTS train and validation configuration should look similar to the following:
+Any initialization parameter that is accepted for the class `nemo.collections.tts.data.tts_dataset.TTSDataset
+<https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/data/tts_dataset.py#L80>`_ can be set in the config
+file. Refer to the `Dataset Processing Classes <./api.html#Datasets>`__ section of the API for a list of datasets classes and their respective parameters. An example TTS train and validation configuration should look similar to the following:

.. code-block:: yaml
model:
train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ???
sample_rate: 44100
sup_data_path: ???
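
The `_target_` value in the configuration above is a dotted import path, which is why moving the dataset classes to a new module requires touching every config. As a rough illustration only, the sketch below resolves such a path with the standard library. It is a simplified stand-in and not the mechanism NeMo itself uses; `resolve_target` and `instantiate` are hypothetical helpers, and the demonstration uses a standard-library class so that it runs without NeMo installed.

```python
import importlib


def resolve_target(dotted_path: str):
    """Return the object named by a dotted path such as 'package.module.ClassName'."""
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.attr', got {dotted_path!r}")
    # Raises ModuleNotFoundError if the module has moved or does not exist.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


def instantiate(config: dict):
    """Build an object from a mapping: `_target_` names the class, other keys are kwargs."""
    kwargs = {key: value for key, value in config.items() if key != "_target_"}
    cls = resolve_target(config["_target_"])
    return cls(**kwargs)


if __name__ == "__main__":
    # Stand-in target from the standard library (assumption: NeMo is not installed here).
    fraction = instantiate(
        {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
    )
    print(fraction)
```

Under this reading, a config that still points at the old module path would fail at import time rather than silently loading a different class.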
2 changes: 1 addition & 1 deletion docs/source/tts/datasets.rst
@@ -1,7 +1,7 @@
Data Preprocessing
==================

-NeMo TTS recipes support most of public TTS datasets that consist of multiple languages, multiple emotions, and multiple speakers. Current recipes covered English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), while the support for many other languages is under planning. NeMo provides corpus-specific data preprocessing scripts, as shown in the directory of `scripts/data_processing/tts/ <https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/>`_, to convert common public TTS datasets into the format expected by the dataloaders as defined in `nemo/collections/tts/torch/data.py <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/torch/data.py>`_. The ``nemo_tts`` collection expects each dataset to consist of a set of utterances in individual audio files plus a ``JSON`` manifest that describes the dataset, with information about one utterance per line. The audio files can be of any format supported by `Pydub <https://github.com/jiaaro/pydub>`_, though we recommend ``WAV`` files as they are the default and have been most thoroughly tested. NeMo supports any original sampling rates of audios, although our scripts of extracting supplementary data and model training all specify the common target sampling rates as either 44100 Hz or 22050 Hz. If the original sampling rate mismatches the target sampling rate, the `feature preprocess <https://github.com/NVIDIA/NeMo/blob/stable/nemo/collections/asr/parts/preprocessing/features.py#L124>`_ can automatically resample the original sampling rate into the target one.
+NeMo TTS recipes support most of public TTS datasets that consist of multiple languages, multiple emotions, and multiple speakers. Current recipes covered English (en-US), German (de-DE), Spanish (es-ES), and Mandarin Chinese (zh-CN), while the support for many other languages is under planning. NeMo provides corpus-specific data preprocessing scripts, as shown in the directory of `scripts/data_processing/tts/ <https://github.com/NVIDIA/NeMo/tree/stable/scripts/dataset_processing/tts/>`_, to convert common public TTS datasets into the format expected by the dataloaders as defined in `nemo/collections/tts/data/tts_dataset.py <https://github.com/NVIDIA/NeMo/tree/stable/nemo/collections/tts/data/tts_dataset.py>`_. The ``nemo_tts`` collection expects each dataset to consist of a set of utterances in individual audio files plus a ``JSON`` manifest that describes the dataset, with information about one utterance per line. The audio files can be of any format supported by `Pydub <https://github.com/jiaaro/pydub>`_, though we recommend ``WAV`` files as they are the default and have been most thoroughly tested. NeMo supports any original sampling rates of audios, although our scripts of extracting supplementary data and model training all specify the common target sampling rates as either 44100 Hz or 22050 Hz. If the original sampling rate mismatches the target sampling rate, the `feature preprocess <https://github.com/NVIDIA/NeMo/blob/stable/nemo/collections/asr/parts/preprocessing/features.py#L124>`_ can automatically resample the original sampling rate into the target one.

There should be one ``JSON`` manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice versa. Each line of the manifest should be in the following format:

4 changes: 2 additions & 2 deletions examples/tts/conf/aligner.yaml
@@ -60,7 +60,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -86,7 +86,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/de/fastpitch_align_22050_grapheme.yaml
@@ -70,7 +70,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -104,7 +104,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/de/fastpitch_align_22050_mix.yaml
@@ -85,7 +85,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -119,7 +119,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/de/fastpitch_align_44100_grapheme.yaml
@@ -70,7 +70,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -104,7 +104,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/de/fastpitch_align_44100_phoneme.yaml
@@ -70,7 +70,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -101,7 +101,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/es/fastpitch_align_44100.yaml
@@ -57,7 +57,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -88,7 +88,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/es/fastpitch_align_44100_ipa.yaml
@@ -70,7 +70,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -101,7 +101,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/es/fastpitch_align_44100_ipa_multi.yaml
@@ -67,7 +67,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -97,7 +97,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/fastpitch_align_44100.yaml
@@ -79,7 +79,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -111,7 +111,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/fastpitch_align_ipa.yaml
@@ -81,7 +81,7 @@ model:
use_stresses: true
train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -112,7 +112,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/fastpitch_align_v1.05.yaml
@@ -80,7 +80,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -112,7 +112,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/fastpitch_ssl.yaml
@@ -66,7 +66,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.FastPitchSSLDataset
+_target_: nemo.collections.tts.data.tts_dataset.FastPitchSSLDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
ssl_content_emb_type: ${ssl_content_emb_type}
@@ -90,7 +90,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.FastPitchSSLDataset
+_target_: nemo.collections.tts.data.tts_dataset.FastPitchSSLDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
ssl_content_emb_type: ${ssl_content_emb_type}
2 changes: 1 addition & 1 deletion examples/tts/conf/hifigan/model/train_ds/train_ds.yaml
@@ -1,5 +1,5 @@
dataset:
-_target_: "nemo.collections.tts.torch.data.VocoderDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
manifest_filepath: ${train_dataset}
sample_rate: ${sample_rate}
n_segments: ${train_n_segments}
@@ -1,5 +1,5 @@
dataset:
-_target_: "nemo.collections.tts.torch.data.VocoderDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
manifest_filepath: ${train_dataset}
sample_rate: ${sample_rate}
n_segments: ${train_n_segments}
2 changes: 1 addition & 1 deletion examples/tts/conf/hifigan/model/validation_ds/val_ds.yaml
@@ -1,5 +1,5 @@
dataset:
-_target_: "nemo.collections.tts.torch.data.VocoderDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
manifest_filepath: ${validation_datasets}
sample_rate: ${sample_rate}
n_segments: ${val_n_segments}
@@ -1,5 +1,5 @@
dataset:
-_target_: "nemo.collections.tts.torch.data.VocoderDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.VocoderDataset"
manifest_filepath: ${validation_datasets}
sample_rate: ${sample_rate}
n_segments: ${val_n_segments}
4 changes: 2 additions & 2 deletions examples/tts/conf/mixer-tts-x.yaml
@@ -75,7 +75,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.MixerTTSXDataset
+_target_: nemo.collections.tts.data.tts_dataset.MixerTTSXDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -104,7 +104,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.MixerTTSXDataset
+_target_: nemo.collections.tts.data.tts_dataset.MixerTTSXDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/mixer-tts.yaml
@@ -80,7 +80,7 @@ model:

train_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${train_dataset}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
@@ -108,7 +108,7 @@ model:

validation_ds:
dataset:
-_target_: nemo.collections.tts.torch.data.TTSDataset
+_target_: nemo.collections.tts.data.tts_dataset.TTSDataset
manifest_filepath: ${validation_datasets}
sample_rate: ${model.sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/rad-tts_dec.yaml
@@ -68,7 +68,7 @@ model:

train_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${train_dataset}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
@@ -114,7 +114,7 @@ model:

validation_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${validation_datasets}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/rad-tts_dec_ipa.yaml
@@ -71,7 +71,7 @@ model:

train_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${train_dataset}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
@@ -117,7 +117,7 @@ model:

validation_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${validation_datasets}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/rad-tts_feature_pred.yaml
@@ -67,7 +67,7 @@ model:

train_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${train_dataset}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
@@ -113,7 +113,7 @@ model:

validation_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${validation_datasets}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
4 changes: 2 additions & 2 deletions examples/tts/conf/rad-tts_feature_pred_ipa.yaml
@@ -69,7 +69,7 @@ model:

train_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${train_dataset}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
@@ -115,7 +115,7 @@ model:

validation_ds:
dataset:
-_target_: "nemo.collections.tts.torch.data.TTSDataset"
+_target_: "nemo.collections.tts.data.tts_dataset.TTSDataset"
manifest_filepath: ${validation_datasets}
sample_rate: ${sample_rate}
sup_data_path: ${sup_data_path}
Expand Down
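
Every configuration hunk shown here applies the same substitution: the module prefix `nemo.collections.tts.torch.data` becomes `nemo.collections.tts.data.tts_dataset`, with the class name unchanged. For configs kept outside the repository, a sketch along the following lines could apply the same rename. It assumes a plain textual rewrite is enough and only covers the four class names visible in this diff; `migrate_targets` is a hypothetical helper, not part of NeMo.

```python
import re

OLD_PREFIX = "nemo.collections.tts.torch.data"
NEW_PREFIX = "nemo.collections.tts.data.tts_dataset"

# Class names that this diff shows moving to the new module (assumed complete
# only for the hunks visible here).
MOVED_CLASSES = (
    "TTSDataset",
    "MixerTTSXDataset",
    "VocoderDataset",
    "FastPitchSSLDataset",
)

# Match the old prefix followed by one of the moved class names as a whole word.
_PATTERN = re.compile(
    re.escape(OLD_PREFIX) + r"\.(" + "|".join(MOVED_CLASSES) + r")\b"
)


def migrate_targets(text: str) -> str:
    """Rewrite old dataset import paths in config text to the new module path."""
    return _PATTERN.sub(lambda match: f"{NEW_PREFIX}.{match.group(1)}", text)


if __name__ == "__main__":
    sample = '_target_: "nemo.collections.tts.torch.data.VocoderDataset"\n'
    print(migrate_targets(sample), end="")
```

Restricting the pattern to known class names keeps the rewrite from touching paths this diff does not show being moved, and running it twice leaves already migrated text unchanged.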