
Commit

fix tarred dataset len when num shards is not divisible by workers (#4553)

* fix tarred dataset len when num shards is not divisible by workers

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* update error reporting on invalid `shard_strategy`

* update NLP/PC tarred dataset docstring

* add `shard_strategy` to NLP/PC `@dataclass`

* update NLP/PC tarred dataset docstring

* add `shard_strategy` to NLP/PC docs

* revert test with Dataloader returning the actual data length

* make dataloader return actual num of samples, set `limit_train_batches` on `setup_*`

* update `shard_strategy` docstrings

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* update `tarred_dataset` documentation

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* fix style

* update documentation

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* updated docstrings

Signed-off-by: Iztok Lebar Bajec <[email protected]>

Co-authored-by: PeganovAnton <[email protected]>
itzsimpl and PeganovAnton authored Jul 26, 2022
1 parent faf8ad8 commit 7890979
Showing 14 changed files with 395 additions and 104 deletions.
66 changes: 36 additions & 30 deletions docs/source/asr/datasets.rst
@@ -1,7 +1,7 @@
Datasets
========

NeMo has scripts to convert several common ASR datasets into the format expected by the ``nemo_asr`` collection. You can get started
with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below.

If the user has their own data and wants to preprocess it to use with NeMo ASR models, refer to the `Preparing Custom ASR Data`_ section.
@@ -13,8 +13,8 @@ If the user already has a dataset that you want to convert to a tarred format, r
LibriSpeech
-----------

Run the following scripts to download the LibriSpeech data and convert it into the format expected by `nemo_asr`. At least 250GB free
space is required.

.. code-block:: bash
@@ -37,18 +37,18 @@ Fisher English Training Speech

Run these scripts to convert the Fisher English Training Speech data into a format expected by the ``nemo_asr`` collection.

In brief, the following scripts convert the ``.sph`` files to ``.wav``, slice those files into smaller audio samples, match the
smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets
(with one manifest each).

.. note::
- 106 GB of space is required to run the ``.wav`` conversion
- an additional 105 GB is required for the slicing and matching
- ``sph2pipe`` is required in order to run the ``.wav`` conversion

**Instructions**

The following scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure
that looks similar to the following:

.. code-block:: bash
@@ -67,7 +67,7 @@ that looks similar to the following:
├── fe_03_p2_sph3
└── ...
The transcripts that will be used are located in the ``fe_03_p<1,2>_transcripts/data/trans`` directory. The audio files (``.sph``)
are located in the remaining directories in an ``audio`` subdirectory.

#. Convert the audio files from ``.sph`` to ``.wav`` by running:
@@ -78,7 +78,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
python fisher_audio_to_wav.py \
--data_root=<fisher_root> --dest_root=<conversion_target_dir>
This will place the unsliced ``.wav`` files in ``<conversion_target_dir>/LDC200[4,5]S13-Part[1,2]/audio-wav/``. It will take several
minutes to run.

#. Process the transcripts and slice the audio data.
@@ -90,7 +90,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
--dest_root=<processing_target_dir> \
--remove_noises
This script splits the full dataset into train, validation, and test sets, and places the audio slices in the corresponding folders
in the destination directory. One manifest is written out per set, which includes each slice's transcript, duration, and path.

This will likely take around 20 minutes to run. Once finished, delete the 10 minute long ``.wav`` files.
@@ -100,8 +100,8 @@ are located in the remaining directories in an ``audio`` subdirectory.

Run the following script to convert the HUB5 data into a format expected by the ``nemo_asr`` collection.

Similar to the Fisher dataset processing scripts, this script converts the ``.sph`` files to ``.wav``, slices the audio files and
transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). The resulting segments
are all written out to an audio directory and the corresponding transcripts are written to a manifest JSON file.

.. note::
@@ -123,7 +123,7 @@ You can optionally include ``--min_slice_duration=<num_seconds>`` if you would l
AN4 Dataset
-----------

This is a small dataset recorded and distributed by Carnegie Mellon University. It consists of recordings of people spelling out
addresses, names, etc. Information about this dataset can be found on the `official CMU site <http://www.speech.cs.cmu.edu/databases/an4/>`_.

#. `Download and extract the dataset <http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz>`_ (which is labeled "NIST's Sphere audio (.sph) format (64M)").
@@ -153,14 +153,14 @@ After the script finishes, the ``data`` folder should contain a ``data_aishell``
Aishell-2
---------

To process the AIShell-2 dataset, in the command below, set the data folder of AIShell-2 using ``--audio_folder`` and where to push
these files using ``--dest_folder``. In order to generate files in the supported format of ``nemo_asr``, run:

.. code-block:: bash
python process_aishell2_data.py --audio_folder=<data directory> --dest_folder=<destination directory>
After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory.

Preparing Custom ASR Data
-------------------------
@@ -171,7 +171,7 @@ The audio files can be of any format supported by `Pydub <https://github.com/jia
WAV files as they are the default and have been most thoroughly tested.

There should be one manifest file per dataset that will be passed in; therefore, if the user wants separate training and validation
datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice
versa.

Each line of the manifest should be in the following format:
@@ -210,16 +210,22 @@ of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single
``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below).

.. note::
For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly
inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``,
``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special
tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.
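
As an illustrative sketch (``tarred_audio_filepaths`` is assumed here as the name of the config field that receives the shard
specification; substitute the field used by your model config), the following two overrides are equivalent:

.. code-block:: bash

    # plain brace expansion
    model.train_ds.tarred_audio_filepaths='/data/shard_{1..64}.tar'

    # special-tag form, safe inside SLURM scripts where { and } may be consumed by the shell
    model.train_ds.tarred_audio_filepaths='/data/shard__OP_1..64_CL_'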

As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length
of the manifest after filtering is the correct size of the dataset for reporting training progress.

The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with
multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime.
Note that this strategy, on specific occasions (when the number of shards is not divisible by ``world_size``), will not sample
the entire dataset. As an alternative, the ``replicate`` strategy preallocates the entire set of shards to every worker and does not
change it during runtime. The benefit of this strategy is that it allows each worker to sample data points from the entire dataset
independently of the others. Note, though, that more than one worker may sample the same shard, and even sample the same data points!
As such, there is no guarantee that all samples in the dataset will be sampled at least once during one epoch. For these reasons,
it is not advisable to use tarred datasets as validation or test datasets.
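
As a minimal sketch (assuming the training dataset config lives under ``model.train_ds``, as in the bucketing examples further
below), the strategy can be selected with a Hydra-style override:

.. code-block:: bash

    # default: a fixed subset of shards is pre-allocated per worker; some shards may go
    # unused if their number is not divisible by world_size
    model.train_ds.tarred_shard_strategy=scatter

    # alternative: every worker may sample every shard; duplicate samples across workers
    # are possible within an epoch
    model.train_ds.tarred_shard_strategy=replicate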

For more information about the individual tarred datasets and the parameters available, including shuffling options,
see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ section.
@@ -228,7 +234,7 @@ see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ secti
If using multiple workers, the number of shards should be divisible by the world size to ensure an even
split among workers. If it is not divisible, a warning will be logged and training will proceed, but it will likely hang at the last epoch.
In addition, if using distributed processing, each shard must have the same number of entries after filtering is
applied such that each worker ends up with the same number of files. We currently do not check for this in any dataloader, but the user's
program may hang if the shards are uneven.
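
A quick shell-level sanity check along these lines might look like the following (an illustrative sketch; adjust the shard glob
and the worker count to your setup):

.. code-block:: bash

    # count the shards on disk and compare against the number of data-parallel workers
    num_shards=$(ls /data/shard_*.tar | wc -l)
    world_size=8
    if [ $(( num_shards % world_size )) -ne 0 ]; then
        echo "WARNING: ${num_shards} shards do not split evenly across ${world_size} workers"
    fi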

Conversion to Tarred Datasets
@@ -262,9 +268,9 @@ The files in the target directory should look similar to the following:
├── metadata.yaml
└── tarred_audio_manifest.json
Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that
filenames are unique in the tarred dataset; the filepaths do not contain "-sub", and forward slashes in each ``audio_filepath`` are
simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav``
in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``.

Bucketing Datasets
@@ -325,9 +331,9 @@ Currently bucketing feature is just supported for tarred datasets.
Upsampling Datasets
-------------------

Buckets may also be 'weighted' to allow multiple runs through a target dataset during each training epoch. This can be beneficial in cases when a dataset is composed of several component sets of unequal sizes and one desires to mitigate bias towards the larger sets through oversampling.

Weighting is managed with the `bucketing_weights` parameter. After passing your composite tarred datasets in the format described above for bucketing, pass a list of integers (one per bucket) to indicate how many times a manifest should be read during training.

For example, by passing `[2,1,1,3]` to the code below:

@@ -363,7 +369,7 @@ If using adaptive bucketing, note that the same batch size will be assigned to e
model.train_ds.bucketing_weights=[2,1,1,3]
model.train_ds.bucketing_batch_size=[4,4,4,2]
All instances of data from `bucket4` will still be trained with a batch size of 2 while all others would have a batch size of 4. As with standard bucketing, this requires `batch_size` to be set to 1.
If `bucketing_batch_size` is not specified, all datasets will be passed with the same fixed batch size as specified by the `batch_size` parameter.
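
For instance, a sketch of the non-adaptive case described above, where ``bucketing_batch_size`` is left unset and every bucket is
read with the same fixed batch size:

.. code-block:: bash

    model.train_ds.bucketing_weights=[2,1,1,3]
    model.train_ds.batch_size=8   # applied uniformly to all buckets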

It is recommended to set bucketing strategies to `fully_randomized` during multi-GPU training to prevent possible dataset bias during training.
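
A minimal sketch, assuming the strategy is exposed through a ``bucketing_strategy`` field of the training dataset config (the field
name here is an assumption; check your model's config for the exact key):

.. code-block:: bash

    model.train_ds.bucketing_strategy=fully_randomized
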
28 changes: 22 additions & 6 deletions docs/source/nlp/punctuation_and_capitalization.rst
@@ -3,7 +3,7 @@
Punctuation and Capitalization Model
====================================

Automatic Speech Recognition (ASR) systems typically generate text with no punctuation and capitalization of the words.
There are two issues with non-punctuated ASR output:

- it could be difficult to read and understand
@@ -35,7 +35,7 @@ For each word in the input text, the Punctuation and Capitalization model:
- predicts a punctuation mark that should follow the word (if any). By default, the model supports commas, periods, and question marks.
- predicts if the word should be capitalized or not

In the Punctuation and Capitalization model, we are jointly training two token-level classifiers on top of a pre-trained
language model, such as `BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-punct-devlin2018bert`.

.. note::
@@ -85,7 +85,7 @@ NeMo Data Format

The Punctuation and Capitalization model expects the data in the following format:

The training and evaluation data is divided into 2 files:
- ``text.txt``
- ``labels.txt``

@@ -108,10 +108,10 @@ spaces. Each label in ``labels.txt`` file consists of 2 symbols:
- the second symbol determines if a word needs to be capitalized or not (where ``U`` indicates that the word should be
upper cased, and ``O`` - no capitalization needed)

By default, the following punctuation marks are considered: commas, periods, and question marks; the remaining punctuation marks were
removed from the data. This can be changed by introducing new labels in the ``labels.txt`` files.

Each line of the ``labels.txt`` file should follow the format ``[LABEL] [SPACE] [LABEL] [SPACE] [LABEL]``. For example,
labels for the above ``text.txt`` file should be:

::
@@ -120,7 +120,7 @@ labels for the above ``text.txt`` file should be:
OU OO OO OO ...
...

The complete list of all possible labels used in this tutorial is:

- ``OO``
- ``.O``
@@ -588,6 +588,22 @@ For convenience, items of data config are described in 4 tables:
- ``1``
- The size of shuffle buffer of `webdataset <https://github.com/webdataset/webdataset>`_. The number of batches
which are permuted.
* - **shard_strategy**
- string
- ``scatter``
- Tarred dataset shard distribution strategy chosen as a string value during DDP training. Accepted values are ``scatter`` and ``replicate``.
``scatter``: each node gets a unique set of shards, which are permanently pre-allocated and never changed at runtime. When the total
number of shards is not divisible by ``world_size``, some shards (at most ``world_size - 1``) will not be used.
``replicate``: each node gets the entire set of shards available in the tarred dataset, which are permanently pre-allocated and never
changed at runtime. The benefit of replication is that it allows each node to sample data points from the entire dataset independently
of other nodes, and reduces dependence on the value of ``tar_shuffle_n``.

.. warning::
The ``replicate`` strategy allows every node to sample the entire set of available tar files, and therefore more than one node may
sample the same tar file, and even sample the same data points! As such, there is no guarantee that all samples in the dataset will be
sampled at least once during one epoch. The ``scatter`` strategy, on the other hand, will not sample the entire dataset when the number
of shards is not divisible by ``world_size``. For these reasons it is not advisable to use tarred datasets as
validation or test datasets.
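
For example, the strategy could be switched with a Hydra-style override (an illustrative sketch; the exact path to the dataset
config section may differ in your setup):

.. code-block:: bash

    model.train_ds.shard_strategy=replicate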

.. _pytorch-dataloader-parameters-label:
