
Commit

fix tarred dataset len when num shards is not divisible by workers (#4553)

* fix tarred dataset len when num shards is not divisible by workers

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* update error reporting on invalid `shard_strategy`

* update NLP/PC tarred dataset docstring

* add `shard_strategy` to NLP/PC `@dataclass`

* update NLP/PC tarred dataset docstring

* add `shard_strategy` to NLP/PC docs

* revert test with Dataloader returning the actual data length

* make dataloader return actual num of samples, set `limit_train_batches` on `setup_*`

* update `shard_strategy` docstrings

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* update `tarred_dataset` documentation

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* fix style

* update documentation

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* updated docstrings

Signed-off-by: Iztok Lebar Bajec <[email protected]>

Co-authored-by: PeganovAnton <[email protected]>
itzsimpl and PeganovAnton authored Jul 26, 2022
1 parent faf8ad8 commit 7890979
Showing 14 changed files with 395 additions and 104 deletions.
66 changes: 36 additions & 30 deletions docs/source/asr/datasets.rst
@@ -1,7 +1,7 @@
Datasets
========

NeMo has scripts to convert several common ASR datasets into the format expected by the ``nemo_asr`` collection. You can get started
with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below.

If the user has their own data and wants to preprocess it to use with NeMo ASR models, refer to the `Preparing Custom ASR Data`_ section.
@@ -13,8 +13,8 @@ If the user already has a dataset that you want to convert to a tarred format, r
LibriSpeech
-----------

Run the following scripts to download the LibriSpeech data and convert it into the format expected by `nemo_asr`. At least 250GB free
space is required.

.. code-block:: bash
@@ -37,18 +37,18 @@ Fisher English Training Speech

Run these scripts to convert the Fisher English Training Speech data into a format expected by the ``nemo_asr`` collection.

In brief, the following scripts convert the ``.sph`` files to ``.wav``, slice those files into smaller audio samples, match the
smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets
(with one manifest each).

.. note::
- 106 GB of space is required to run the ``.wav`` conversion
- an additional 105 GB is required for the slicing and matching
- ``sph2pipe`` is required in order to run the ``.wav`` conversion

**Instructions**

The following scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure
that looks similar to the following:

.. code-block:: bash
@@ -67,7 +67,7 @@ that looks similar to the following:
├── fe_03_p2_sph3
└── ...
The transcripts that will be used are located in the ``fe_03_p<1,2>_transcripts/data/trans`` directory. The audio files (``.sph``)
are located in the remaining directories in an ``audio`` subdirectory.

#. Convert the audio files from ``.sph`` to ``.wav`` by running:
@@ -78,7 +78,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
python fisher_audio_to_wav.py \
--data_root=<fisher_root> --dest_root=<conversion_target_dir>
This will place the unsliced ``.wav`` files in ``<conversion_target_dir>/LDC200[4,5]S13-Part[1,2]/audio-wav/``. It will take several
minutes to run.

#. Process the transcripts and slice the audio data.
@@ -90,7 +90,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
--dest_root=<processing_target_dir> \
--remove_noises
This script splits the full dataset into train, validation, and test sets, and places the audio slices in the corresponding folders
in the destination directory. One manifest is written out per set, which includes each slice's transcript, duration, and path.

This will likely take around 20 minutes to run. Once finished, delete the 10 minute long ``.wav`` files.
@@ -100,8 +100,8 @@ are located in the remaining directories in an ``audio`` subdirectory.

Run the following script to convert the HUB5 data into a format expected by the ``nemo_asr`` collection.

Similar to the Fisher dataset processing scripts, this script converts the ``.sph`` files to ``.wav``, slices the audio files and
transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). The resulting segments
are all written out to an audio directory and the corresponding transcripts are written to a manifest JSON file.

.. note::
@@ -123,7 +123,7 @@ You can optionally include ``--min_slice_duration=<num_seconds>`` if you would l
AN4 Dataset
-----------

This is a small dataset recorded and distributed by Carnegie Mellon University. It consists of recordings of people spelling out
addresses, names, etc. Information about this dataset can be found on the `official CMU site <http://www.speech.cs.cmu.edu/databases/an4/>`_.

#. `Download and extract the dataset <http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz>`_ (which is labeled "NIST's Sphere audio (.sph) format (64M)").
@@ -153,14 +153,14 @@ After the script finishes, the ``data`` folder should contain a ``data_aishell``
Aishell-2
---------

To process the AIShell-2 dataset, in the command below, set the data folder of AIShell-2 using ``--audio_folder`` and where to push
these files using ``--dest_folder``. In order to generate files in the supported format of ``nemo_asr``, run:

.. code-block:: bash
python process_aishell2_data.py --audio_folder=<data directory> --dest_folder=<destination directory>
After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory.

Preparing Custom ASR Data
-------------------------
@@ -171,7 +171,7 @@ The audio files can be of any format supported by `Pydub <https://github.com/jia
WAV files as they are the default and have been most thoroughly tested.

There should be one manifest file per dataset that will be passed in; therefore, if the user wants separate training and validation
datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice
versa.

Each line of the manifest should be in the following format:
@@ -210,16 +210,22 @@ of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single
``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below).

.. note::
For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly
inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``,
``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special
tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.
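
As an illustrative sketch (``tarred_audio_filepaths`` is assumed here as the name of the config field that receives the shard
specification; substitute the field used by your model config), the following two overrides are equivalent:

.. code-block:: bash

    # plain brace expansion
    model.train_ds.tarred_audio_filepaths='/data/shard_{1..64}.tar'

    # special-tag form, safe inside SLURM scripts where { and } may be consumed by the shell
    model.train_ds.tarred_audio_filepaths='/data/shard__OP_1..64_CL_'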

As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length
of the manifest after filtering is the correct size of the dataset for reporting training progress.

The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with
multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime.
Note that this strategy, on specific occasions (when the number of shards is not divisible by ``world_size``), will not sample
the entire dataset. As an alternative, the ``replicate`` strategy preallocates the entire set of shards to every worker and does not
change it during runtime. The benefit of this strategy is that it allows each worker to sample data points from the entire dataset
independently of the others. Note, though, that more than one worker may sample the same shard, and even sample the same data points!
As such, there is no guarantee that all samples in the dataset will be sampled at least once during one epoch. For these reasons,
it is not advisable to use tarred datasets as validation or test datasets.
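
As a minimal sketch (assuming the training dataset config lives under ``model.train_ds``, as in the bucketing examples further
below), the strategy can be selected with a Hydra-style override:

.. code-block:: bash

    # default: a fixed subset of shards is pre-allocated per worker; some shards may go
    # unused if their number is not divisible by world_size
    model.train_ds.tarred_shard_strategy=scatter

    # alternative: every worker may sample every shard; duplicate samples across workers
    # are possible within an epoch
    model.train_ds.tarred_shard_strategy=replicate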

For more information about the individual tarred datasets and the parameters available, including shuffling options,
see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ section.
@@ -228,7 +234,7 @@ see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ secti
If using multiple workers, the number of shards should be divisible by the world size to ensure an even
split among workers. If it is not divisible, a warning will be logged and training will proceed, but it will likely hang at the last epoch.
In addition, if using distributed processing, each shard must have the same number of entries after filtering is
applied such that each worker ends up with the same number of files. We currently do not check for this in any dataloader, but the user's
program may hang if the shards are uneven.
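
A quick shell-level sanity check along these lines might look like the following (an illustrative sketch; adjust the shard glob
and the worker count to your setup):

.. code-block:: bash

    # count the shards on disk and compare against the number of data-parallel workers
    num_shards=$(ls /data/shard_*.tar | wc -l)
    world_size=8
    if [ $(( num_shards % world_size )) -ne 0 ]; then
        echo "WARNING: ${num_shards} shards do not split evenly across ${world_size} workers"
    fi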

Conversion to Tarred Datasets
@@ -262,9 +268,9 @@ The files in the target directory should look similar to the following:
├── metadata.yaml
└── tarred_audio_manifest.json
Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that
filenames are unique in the tarred dataset; the filepaths do not contain "-sub", and forward slashes in each ``audio_filepath`` are
simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav``
in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``.

Bucketing Datasets
@@ -325,9 +331,9 @@ Currently bucketing feature is just supported for tarred datasets.
Upsampling Datasets
-------------------

Buckets may also be 'weighted' to allow multiple runs through a target dataset during each training epoch. This can be beneficial in cases when a dataset is composed of several component sets of unequal sizes and one desires to mitigate bias towards the larger sets through oversampling.

Weighting is managed with the `bucketing_weights` parameter. After passing your composite tarred datasets in the format described above for bucketing, pass a list of integers (one per bucket) to indicate how many times a manifest should be read during training.

For example, by passing `[2,1,1,3]` to the code below:

@@ -363,7 +369,7 @@ If using adaptive bucketing, note that the same batch size will be assigned to e
model.train_ds.bucketing_weights=[2,1,1,3]
model.train_ds.bucketing_batch_size=[4,4,4,2]
All instances of data from `bucket4` will still be trained with a batch size of 2 while all others would have a batch size of 4. As with standard bucketing, this requires `batch_size` to be set to 1.
If `bucketing_batch_size` is not specified, all datasets will be passed with the same fixed batch size as specified by the `batch_size` parameter.
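
For instance, a sketch of the non-adaptive case described above, where ``bucketing_batch_size`` is left unset and every bucket is
read with the same fixed batch size:

.. code-block:: bash

    model.train_ds.bucketing_weights=[2,1,1,3]
    model.train_ds.batch_size=8   # applied uniformly to all buckets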

It is recommended to set bucketing strategies to `fully_randomized` during multi-GPU training to prevent possible dataset bias during training.
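
A minimal sketch, assuming the strategy is exposed through a ``bucketing_strategy`` field of the training dataset config (the field
name here is an assumption; check your model's config for the exact key):

.. code-block:: bash

    model.train_ds.bucketing_strategy=fully_randomized
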
28 changes: 22 additions & 6 deletions docs/source/nlp/punctuation_and_capitalization.rst
@@ -3,7 +3,7 @@
Punctuation and Capitalization Model
====================================

Automatic Speech Recognition (ASR) systems typically generate text with no punctuation and capitalization of the words.
There are two issues with non-punctuated ASR output:

- it could be difficult to read and understand
@@ -35,7 +35,7 @@ For each word in the input text, the Punctuation and Capitalization model:
- predicts a punctuation mark that should follow the word (if any). By default, the model supports commas, periods, and question marks.
- predicts if the word should be capitalized or not

In the Punctuation and Capitalization model, we are jointly training two token-level classifiers on top of a pre-trained
language model, such as `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-punct-devlin2018bert`.

.. note::
@@ -85,7 +85,7 @@ NeMo Data Format

The Punctuation and Capitalization model expects the data in the following format:

The training and evaluation data is divided into 2 files:
- ``text.txt``
- ``labels.txt``

@@ -108,10 +108,10 @@ spaces. Each label in ``labels.txt`` file consists of 2 symbols:
- the second symbol determines if a word needs to be capitalized or not (where ``U`` indicates that the word should be
upper cased, and ``O`` - no capitalization needed)

By default, the following punctuation marks are considered: commas, periods, and question marks; the remaining punctuation marks were
removed from the data. This can be changed by introducing new labels in the ``labels.txt`` files.

Each line of the ``labels.txt`` file should follow the format ``[LABEL] [SPACE] [LABEL] [SPACE] [LABEL]``. For example,
labels for the above ``text.txt`` file should be:

::
@@ -120,7 +120,7 @@ labels for the above ``text.txt`` file should be:
OU OO OO OO ...
...

The complete list of all possible labels used in this tutorial is:

- ``OO``
- ``.O``
@@ -588,6 +588,22 @@ For convenience, items of data config are described in 4 tables:
- ``1``
- The size of shuffle buffer of `webdataset <https://github.com/webdataset/webdataset>`_. The number of batches
which are permuted.
* - **shard_strategy**
- string
- ``scatter``
- Tarred dataset shard distribution strategy chosen as a string value during DDP training. Accepted values are ``scatter`` and ``replicate``.
``scatter``: each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime. When the total
number of shards is not divisible by ``world_size``, some shards (at most ``world_size - 1``) will not be used.
``replicate``: each node gets the entire set of shards available in the tarred dataset, which are permanently pre-allocated and never
changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently
of other nodes, and reduces dependence on the value of ``tar_shuffle_n``.

.. warning::
The ``replicate`` strategy allows every node to sample the entire set of available tar files, and therefore more than one node may
sample the same tar file, and even sample the same data points! As such, there is no guarantee that all samples in the dataset will be
sampled at least once during one epoch. The ``scatter`` strategy, on the other hand, will not sample the entire dataset when the number
of shards is not divisible by ``world_size``. For these reasons it is not advisable to use tarred datasets as
validation or test datasets.
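
For example, the strategy could be switched with a Hydra-style override (an illustrative sketch; the exact path to the dataset
config section may differ in your setup):

.. code-block:: bash

    model.train_ds.shard_strategy=replicate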

.. _pytorch-dataloader-parameters-label:
