Add --force_codec to tarred dataset creation examples (NVIDIA#8227)
Signed-off-by: Piotr Żelasko <[email protected]>
pzelasko committed Jan 23, 2024
1 parent e1d1f59 commit be75c7c
Showing 2 changed files with 5 additions and 1 deletion.
3 changes: 3 additions & 0 deletions docs/source/asr/datasets.rst
@@ -265,8 +265,11 @@ You can easily convert your existing NeMo-compatible ASR datasets using the
--num_shards=<number of tarfiles that will contain the audio> \
--max_duration=<float representing maximum duration of audio samples> \
--min_duration=<float representing minimum duration of audio samples> \
--force_codec=flac \
--shuffle --shuffle_seed=0
.. note:: For extra reduction of storage space at the cost of lossy (but high-quality) compression, you may use ``--force_codec=opus`` instead.

This script shuffles the entries in the given manifest (if ``--shuffle`` is set, which we recommend), filters
audio files according to ``min_duration`` and ``max_duration``, and tars the remaining audio files to the directory
``--target_dir`` in ``n`` shards, along with separate manifest and metadata files.
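
The shuffle, filter, and shard steps described above can be sketched in plain Python. This is only an illustrative sketch, not the conversion script's actual implementation: the function name ``plan_shards`` is hypothetical, the round-robin shard assignment is an assumption, and manifest entries are assumed to be dictionaries with ``audio_filepath`` and ``duration`` keys. No audio is read and no tar files are written.

```python
import random


def plan_shards(entries, num_shards, min_duration=None, max_duration=None,
                shuffle=True, shuffle_seed=0):
    """Hypothetical helper: decide which manifest entries go into which shard."""
    entries = list(entries)

    # Shuffle with a fixed seed so the shard layout is reproducible.
    if shuffle:
        random.Random(shuffle_seed).shuffle(entries)

    # Drop entries whose duration falls outside [min_duration, max_duration].
    kept = [
        e for e in entries
        if (min_duration is None or e["duration"] >= min_duration)
        and (max_duration is None or e["duration"] <= max_duration)
    ]

    # Assumption: round-robin assignment; the real script may split differently.
    return [kept[i::num_shards] for i in range(num_shards)]


if __name__ == "__main__":
    manifest = [
        {"audio_filepath": f"utt{i}.wav", "duration": float(i)} for i in range(10)
    ]
    shards = plan_shards(manifest, num_shards=3, min_duration=2.0, max_duration=7.0)
    for idx, shard in enumerate(shards):
        print(idx, [e["audio_filepath"] for e in shard])
```

Seeding the shuffle mirrors the role of ``--shuffle_seed`` in the command above: the same seed and manifest give the same shard contents on every run.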
@@ -42,6 +42,7 @@
--min_duration=<float representing minimum duration of audio samples> \
--shuffle --shuffle_seed=1 \
--sort_in_shards \
--force_codec=flac \
--workers=-1
@@ -56,7 +57,7 @@
--shuffle --shuffle_seed=1 \
--sort_in_shards \
--workers=-1 \
--concat_manifest_paths \
--concat_manifest_paths
<space separated paths to 1 or more manifest files to concatenate into the original tarred dataset>
3) Writing an empty metadata file