Tar codec #7867

Merged — nithinraok merged 10 commits into main from tar_codec on Nov 18, 2023
Conversation

@nithinraok (Collaborator) commented Nov 6, 2023:

Add webdataset support for VocoderDataset

Collection: TTS

Changelog

  • Added TarredVocoderDataset class for dataset fetching
  • A sampler is not supported with IterableDataset
  • Added conditional code to calculate the correct number of max steps when webdataset is used (see the sketch after this list)
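
A minimal sketch of the max-steps calculation implied above (not the exact code in this PR): because an IterableDataset exposes no length, the number of optimizer steps has to be derived from the known sample count of the tarred data. The function and argument names here are assumptions for illustration.

import math

def compute_max_steps(num_samples: int, batch_size: int, world_size: int,
                      accumulate_grad_batches: int, max_epochs: int) -> int:
    # Each rank sees num_samples / world_size samples per epoch; gradient
    # accumulation further reduces the number of optimizer steps.
    steps_per_epoch = math.ceil(num_samples / (batch_size * world_size * accumulate_grad_batches))
    return steps_per_epoch * max_epochs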

Known issue:

  • Haven't tested passing multiple tarred datasets. Will push a PR later to support it.

Usage

  • Add is_sharded=True under train_ds in the config to load a webdataset (see the sketch below)
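
A hedged usage sketch (field names follow the get_dataset() snippet discussed below; the exact layout of the audio codec example config may differ):

from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "train_ds": {
        "dataset": {
            "_target_": "nemo.collections.tts.data.vocoder_dataset.VocoderDataset",
            "is_sharded": True,  # toggles instantiation of TarredVocoderDataset
        },
        "dataloader_params": {"batch_size": 16, "num_workers": 4},
    }
})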

Before your PR is Ready for review

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Comment on lines 480 to 493
@staticmethod
def get_dataset(cfg):
    if 'is_sharded' in cfg.dataset:
        is_sharded = cfg.dataset.is_sharded
        del cfg.dataset.is_sharded
    else:
        is_sharded = False

    if is_sharded:
        cfg.dataset._target_ = 'nemo.collections.tts.data.vocoder_dataset.TarredVocoderDataset'
        dataset = instantiate(cfg.dataset)
        sampler = None
    else:
        dataset = instantiate(cfg.dataset)
        sampler = dataset.get_sampler(cfg.dataloader_params.batch_size)
    return dataset, sampler
@rlangman (Collaborator) commented Nov 6, 2023:

Is this the same outcome as defining this in TarredVocoderDataset?:

def get_sampler(self, batch_size):
    return None

@nithinraok (Collaborator, Author) replied:

Good point, I will create a simpler case here, but I created this function to set the target so that existing config files can be reused and to provide a toggle for sharding.

Collaborator replied:

Given that these are experimental recipes that are undergoing breaking changes and have no public models, I think it should be OK to require config updates. The tarred dataset also requires different inputs to instantiate it.

@nithinraok (Collaborator, Author) replied:

I am against creating one new config for each feature.

Collaborator replied:

Do you mean you are against creating separate config files for new features in general (eg. sampling rate, new architectures)? Or specifically for needing separate config for tarred vs non-tarred datasets?

The former of creating lots of different overlapping configs is expected across NeMo. The latter can be addressed in several different ways, if needed.

@nithinraok (Collaborator, Author) replied:

We don't want to create multiple configs with the same model architecture but different settings (like tarred/sampling_rate etc.).

Collaborator replied:

For the same model architecture with different settings, the main options I have heard before are:

  1. Create different example configs.
  2. Have one example config and provide web documentation and tutorials clearly indicating to users how to manually change fields for different valid configs (eg. sample rate, language).

From past discussions around this, generally (1) was preferable, especially for changes like sample rate which are relatively complicated and error prone if you attempt to update all needed parameters manually. A combination of (1) and (2) is needed if you have too many configs to support, such as when having to support multiple sample rates and languages at the same time.

@rlangman (Collaborator) commented Nov 14, 2023:

In this particular case, there are 2 problems with the existing approach.

  • The model class takes the data loader class/config as input; however, the constructor requires passing global_rank and world_size, which are known only to the model class and trainer.
  • Different data loaders require significantly different inputs to instantiate (and this will diverge more as we add additional sampling strategies).

Perhaps the cleanest way to address these would be to change it so the model is expected to receive a subset of dataset parameters, and then add a module which maps the parameters to the logic to create each dataset.

For example:

In .yaml

train_ds:
  dataset_type: "vocoder_tarred"
  dataset_args:
    batch_size: ${batch_size}
    ... 

In vocoder_dataset.py

def create_dataset(dataset_type, dataset_args, global_rank, world_size, ...):
    if dataset_type == "vocoder":
        return VocoderDataset(**dataset_args)
    elif dataset_type == "vocoder_tarred":
        return TarredVocoderDataset(**dataset_args, global_rank=global_rank, world_size=world_size)
    ...

@nithinraok (Collaborator, Author) replied:

Good catch, missed it. Updated to use those as well.

@nithinraok (Collaborator, Author) replied:

For logging, testing, and validation we rely only on non-tarred datasets.

Comment on lines +52 to +54
self.world_size = 1
if trainer is not None:
    self.world_size = trainer.num_nodes * trainer.num_devices
Collaborator commented:

In general we need to figure out how to make the model class agnostic to the data loader. Especially given that the logic will need to be kept consistent across all vocoder recipes (audio codec, hifigan, and univnet at least).

If we think upgrading webdataset version is a blocker to streamline the code, then we can document that with some TODOs and leave these workarounds. Otherwise, we should redesign the contract between the dataset and the model classes so they work with both.
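
A minimal sketch of the rank bookkeeping being discussed, assuming PyTorch Lightning Trainer attributes; this is illustrative, not the exact code in the PR:

# Derive distributed info from the trainer before constructing a sharded dataset.
world_size = 1
global_rank = 0
if trainer is not None:
    world_size = trainer.num_nodes * trainer.num_devices
    global_rank = trainer.global_rank

# These values can then be forwarded to a dataset factory such as the
# create_dataset() sketch proposed earlier in this thread.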

Comment on lines +457 to +352
self._dataset.rename(audio=VALID_FILE_FORMATS, key='__key__')
.to_tuple('audio', 'key')
Collaborator commented:

Nitpick: it is not very important for an audio-only dataset, but the code would be more readable and extensible using the original dictionary format instead of converting to a tuple.
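
A hedged sketch of what keeping the dictionary format might look like (VALID_FILE_FORMATS and audio_tar_filepaths are assumed from the surrounding PR code; this is not the code actually merged):

import webdataset as wd

dataset = (
    wd.WebDataset(audio_tar_filepaths, nodesplitter=None)
    .rename(audio=VALID_FILE_FORMATS, key='__key__')
    # Keep each sample as a dict so extra fields (e.g. speaker id) can be added
    # later without changing downstream unpacking code.
    .map(lambda sample: {'audio': sample['audio'], 'key': sample['key']})
)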

@github-actions github-actions bot removed the ASR label Nov 8, 2023
@nithinraok (Collaborator, Author): jenkins

@nithinraok (Collaborator, Author): jenkins

(1 similar comment)


@nithinraok force-pushed the tar_codec branch 2 times, most recently from a7a4dab to 9c52b7a on November 15, 2023 at 18:42
@nithinraok (Collaborator, Author): jenkins

Comment on lines +266 to +264
The benefit of replication is that it allows each node to sample data points from the entire
dataset independently of other nodes, and reduces dependence on value of `shuffle_n`.
Collaborator commented:

If I am not mistaken, shuffle_n is the size of the shuffle buffer as the gpu reads in data from its shards. Allocating more shards to the gpu would not really reduce its dependence on shuffle_n. Intuitively, I would expect it to become more important (larger dataset requires more shuffling).

Somewhat related, wd.WebDataset() has a shardshuffle=True parameter that ensures that the shard list itself is fully shuffled.

@nithinraok (Collaborator, Author) replied:

From webdataset Docs:
https://webdataset.github.io/webdataset/gettingstarted/
shuffle(n): shuffle the dataset with a buffer of size n; also shuffles shards (see below)

Collaborator replied:

Based on the thread linked below, it sounds like the comment is effectively saying "larger datasets are less sensitive to how they are shuffled".

@nithinraok (Collaborator, Author) replied:

Updated to use both buffer and initial arguments to have more control over shuffling:

self._dataset = wd.WebDataset(audio_tar_filepaths, nodesplitter=None)

if shuffle_n > 0:
    self._dataset = self._dataset.shuffle(shuffle_n)
Collaborator commented:

Should we do something like dataset.shuffle(size=shuffle_n, initial=shuffle_n)? The initial buffer size of 100 is really small, which slowed convergence a lot on my small test datasets (but might be irrelevant on very large datasets).

@nithinraok (Collaborator, Author) replied:

If the dataset is small, then yes, the number of samples per worker becomes small and shuffling might not have much effect; it wouldn't be a problem for larger sets, where the number of samples per worker is large. As per this discussion, it looks like if we want a true shuffle over all shards, it's better to call unbatch() on the loader and shuffle with a larger buffer.
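
A hedged sketch combining the options discussed above (argument names follow the reviewer's suggestion and webdataset's shuffle filter; the exact signature may vary by webdataset version):

import webdataset as wd

# Shuffle the shard list itself, then apply a sample-level shuffle buffer with an
# explicit initial fill so early batches are already well mixed.
dataset = wd.WebDataset(audio_tar_filepaths, nodesplitter=None, shardshuffle=True)
if shuffle_n > 0:
    dataset = dataset.shuffle(shuffle_n, initial=shuffle_n)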

Comment on lines 74 to 135
sample_rate: int,
sample_rate: int = 24000,
Collaborator commented:

Nitpick: other classes in the TTS collection either provide no default sample_rate or default to 22050.

@nithinraok (Collaborator, Author) replied:

Default for our audio codec models is 24000. Did we have a 22050 Hz model?

@rlangman (Collaborator) commented Nov 17, 2023:

No, but this is a data loader for all TTS vocoders. We have never used 24 kHz in TTS before (and when I brought up the question before of whether we should do so to synchronize with SpeechLM models, the consensus was to stick to 22.05 kHz). So I suppose I would favor not having a default sample rate.

@nithinraok (Collaborator, Author) replied:

updated

@nithinraok (Collaborator, Author): jenkins

Nithin Rao Koluguri added 9 commits on November 17, 2023 at 14:43, all signed off by Nithin Rao Koluguri.
@nithinraok (Collaborator, Author): jenkins

@rlangman (Collaborator) left a review comment:

LGTM. Thanks a lot for the feature and changes.

@nithinraok merged commit d81beac into main on Nov 18, 2023
15 checks passed
@nithinraok deleted the tar_codec branch on November 18, 2023 at 02:47
erhoo82 pushed a commit to erhoo82/NeMo that referenced this pull request Dec 2, 2023
erhoo82 pushed a commit to erhoo82/NeMo that referenced this pull request Dec 2, 2023
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024