Tar codec #7867

Merged — nithinraok merged 10 commits into main from tar_codec on Nov 18, 2023
Conversation

@nithinraok (Collaborator) commented Nov 6, 2023:

Add webdataset support for VocoderDataset

Collection: TTS

Changelog

  • Added TarredVocoderDataset class for dataset fetching
  • A sampler is not supported with IterableDataset
  • Added conditional code to calculate the correct number of max steps when webdataset is used (see the sketch after this list)
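
A minimal sketch of the max-steps calculation implied above (not the exact code in this PR): because an IterableDataset exposes no length, the number of optimizer steps has to be derived from the known sample count of the tarred data. The function and argument names here are assumptions for illustration.

import math

def compute_max_steps(num_samples: int, batch_size: int, world_size: int,
                      accumulate_grad_batches: int, max_epochs: int) -> int:
    # Each rank sees num_samples / world_size samples per epoch; gradient
    # accumulation further reduces the number of optimizer steps.
    steps_per_epoch = math.ceil(num_samples / (batch_size * world_size * accumulate_grad_batches))
    return steps_per_epoch * max_epochs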

Known issue:

  • Haven't tested passing multiple tarred datasets. Will push a PR later to support it.

Usage

  • Add is_sharded=True under train_ds in the config to load a webdataset (see the sketch below)
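
A hedged usage sketch (field names follow the get_dataset() snippet discussed below; the exact layout of the audio codec example config may differ):

from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "train_ds": {
        "dataset": {
            "_target_": "nemo.collections.tts.data.vocoder_dataset.VocoderDataset",
            "is_sharded": True,  # toggles instantiation of TarredVocoderDataset
        },
        "dataloader_params": {"batch_size": 16, "num_workers": 4},
    }
})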

Before your PR is Ready for review

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

Comment on lines 480 to 493
@staticmethod
def get_dataset(cfg):
    if 'is_sharded' in cfg.dataset:
        is_sharded = cfg.dataset.is_sharded
        del cfg.dataset.is_sharded
    else:
        is_sharded = False

    if is_sharded:
        cfg.dataset._target_ = 'nemo.collections.tts.data.vocoder_dataset.TarredVocoderDataset'
        dataset = instantiate(cfg.dataset)
        sampler = None
    else:
        dataset = instantiate(cfg.dataset)
        sampler = dataset.get_sampler(cfg.dataloader_params.batch_size)
    return dataset, sampler
@rlangman (Collaborator) commented Nov 6, 2023:

Is this the same outcome as defining this in TarredVocoderDataset?:

def get_sampler(self, batch_size):
    return None

@nithinraok (Collaborator, Author) replied:

Good point, I will create a simpler case here, but I created this function to set the target so that existing config files can be reused and to provide a toggle for sharding.

Collaborator replied:

Given that these are experimental recipes that are undergoing breaking changes and have no public models, I think it should be OK to require config updates. The tarred dataset also requires different inputs to instantiate it.

@nithinraok (Collaborator, Author) replied:

I am against creating one new config for each feature.

Collaborator replied:

Do you mean you are against creating separate config files for new features in general (eg. sampling rate, new architectures)? Or specifically for needing separate config for tarred vs non-tarred datasets?

The former of creating lots of different overlapping configs is expected across NeMo. The latter can be addressed in several different ways, if needed.

@nithinraok (Collaborator, Author) replied:

We don't want to create multiple configs with the same model architecture but different settings (like tarred/sampling_rate etc.).

Collaborator replied:

For the same model architecture with different settings, the main options I have heard before are:

  1. Create different example configs.
  2. Have one example config and provide web documentation and tutorials clearly indicating to users how to manually change fields for different valid configs (eg. sample rate, language).

From past discussions around this, generally (1) was preferable, especially for changes like sample rate which are relatively complicated and error prone if you attempt to update all needed parameters manually. A combination of (1) and (2) is needed if you have too many configs to support, such as when having to support multiple sample rates and languages at the same time.

@rlangman (Collaborator) commented Nov 14, 2023:

In this particular case, there are 2 problems with the existing approach.

  • The model class takes the data loader class/config as input; however, the constructor requires passing global_rank and world_size, which are known only to the model class and trainer.
  • Different data loaders require significantly different inputs to instantiate (and this will diverge more as we add additional sampling strategies).

Perhaps the cleanest way to address these would be to change it so the model is expected to receive a subset of dataset parameters, and then add a module which maps the parameters to the logic to create each dataset.

For example:

In .yaml

train_ds:
  dataset_type: "vocoder_tarred"
  dataset_args:
    batch_size: ${batch_size}
    ... 

In vocoder_dataset.py

def create_dataset(dataset_type, dataset_args, global_rank, world_size, ...):
    if dataset_type == "vocoder":
        return VocoderDataset(**dataset_args)
    elif dataset_type == "vocoder_tarred":
        return TarredVocoderDataset(**dataset_args, global_rank=global_rank, world_size=world_size)
    ...

@nithinraok (Collaborator, Author) replied:

Good catch, missed it. Updated to use those as well.

@nithinraok (Collaborator, Author) replied:

For logging, testing, and validation we rely only on non-tarred datasets.

Comment on lines +52 to +54
self.world_size = 1
if trainer is not None:
    self.world_size = trainer.num_nodes * trainer.num_devices
Collaborator commented:

In general we need to figure out how to make the model class agnostic to the data loader. Especially given that the logic will need to be kept consistent across all vocoder recipes (audio codec, hifigan, and univnet at least).

If we think upgrading webdataset version is a blocker to streamline the code, then we can document that with some TODOs and leave these workarounds. Otherwise, we should redesign the contract between the dataset and the model classes so they work with both.
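
A minimal sketch of the rank bookkeeping being discussed, assuming PyTorch Lightning Trainer attributes; this is illustrative, not the exact code in the PR:

# Derive distributed info from the trainer before constructing a sharded dataset.
world_size = 1
global_rank = 0
if trainer is not None:
    world_size = trainer.num_nodes * trainer.num_devices
    global_rank = trainer.global_rank

# These values can then be forwarded to a dataset factory such as the
# create_dataset() sketch proposed earlier in this thread.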

Comment on lines +457 to +352
self._dataset.rename(audio=VALID_FILE_FORMATS, key='__key__')
.to_tuple('audio', 'key')
Collaborator commented:

Nitpick: it is not very important for an audio-only dataset, but the code would be more readable and extensible using the original dictionary format instead of converting to a tuple.
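
A hedged sketch of what keeping the dictionary format might look like (VALID_FILE_FORMATS and audio_tar_filepaths are assumed from the surrounding PR code; this is not the code actually merged):

import webdataset as wd

dataset = (
    wd.WebDataset(audio_tar_filepaths, nodesplitter=None)
    .rename(audio=VALID_FILE_FORMATS, key='__key__')
    # Keep each sample as a dict so extra fields (e.g. speaker id) can be added
    # later without changing downstream unpacking code.
    .map(lambda sample: {'audio': sample['audio'], 'key': sample['key']})
)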

@github-actions github-actions bot removed the ASR label Nov 8, 2023
@nithinraok (Collaborator, Author): jenkins

@nithinraok (Collaborator, Author): jenkins

(1 similar comment)


@nithinraok force-pushed the tar_codec branch 2 times, most recently from a7a4dab to 9c52b7a on November 15, 2023 at 18:42
@nithinraok (Collaborator, Author): jenkins

Comment on lines +266 to +264
The benefit of replication is that it allows each node to sample data points from the entire
dataset independently of other nodes, and reduces dependence on value of `shuffle_n`.
Collaborator commented:

If I am not mistaken, shuffle_n is the size of the shuffle buffer as the gpu reads in data from its shards. Allocating more shards to the gpu would not really reduce its dependence on shuffle_n. Intuitively, I would expect it to become more important (larger dataset requires more shuffling).

Somewhat related, wd.WebDataset() has a shardshuffle=True parameter that ensures that the shard list itself is fully shuffled.

@nithinraok (Collaborator, Author) replied:

From webdataset Docs:
https://webdataset.github.io/webdataset/gettingstarted/
shuffle(n): shuffle the dataset with a buffer of size n; also shuffles shards (see below)

Collaborator replied:

Based on the thread linked below, it sounds like the comment is effectively saying "larger datasets are less sensitive to how they are shuffled".

@nithinraok (Collaborator, Author) replied:

Updated to use both buffer and initial arguments to have more control over shuffling:

self._dataset = wd.WebDataset(audio_tar_filepaths, nodesplitter=None)

if shuffle_n > 0:
    self._dataset = self._dataset.shuffle(shuffle_n)
Collaborator commented:

Should we do something like dataset.shuffle(size=shuffle_n, initial=shuffle_n)? The initial buffer size of 100 is really small, which slowed convergence a lot on my small test datasets (but might be irrelevant on very large datasets).

@nithinraok (Collaborator, Author) replied:

If the dataset is small, then yes, the number of samples per worker becomes small and shuffling might not have much effect; it wouldn't be a problem for larger sets, where the number of samples per worker is large. As per this discussion, it looks like if we want a true shuffle over all shards, it's better to call unbatch() on the loader and shuffle with a larger buffer.
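
A hedged sketch combining the options discussed above (argument names follow the reviewer's suggestion and webdataset's shuffle filter; the exact signature may vary by webdataset version):

import webdataset as wd

# Shuffle the shard list itself, then apply a sample-level shuffle buffer with an
# explicit initial fill so early batches are already well mixed.
dataset = wd.WebDataset(audio_tar_filepaths, nodesplitter=None, shardshuffle=True)
if shuffle_n > 0:
    dataset = dataset.shuffle(shuffle_n, initial=shuffle_n)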

Comment on lines 74 to 135
sample_rate: int,
sample_rate: int = 24000,
Collaborator commented:

Nitpick: other classes in the TTS collection either provide no default sample_rate or default to 22050.

@nithinraok (Collaborator, Author) replied:

Default for our audio codec models is 24000. Did we have a 22050 Hz model?

@rlangman (Collaborator) commented Nov 17, 2023:

No, but this is a data loader for all TTS vocoders. We have never used 24 kHz in TTS before (and when I brought up the question before of whether we should do so to synchronize with SpeechLM models, the consensus was to stick to 22.05 kHz). So I suppose I would favor not having a default sample rate.

@nithinraok (Collaborator, Author) replied:

updated

@nithinraok (Collaborator, Author): jenkins

Nithin Rao Koluguri added 9 commits on November 17, 2023 at 14:43, all signed off by Nithin Rao Koluguri.
@nithinraok (Collaborator, Author): jenkins

@rlangman (Collaborator) left a review comment:

LGTM. Thanks a lot for the feature and changes.

@nithinraok merged commit d81beac into main on Nov 18, 2023
15 checks passed
@nithinraok deleted the tar_codec branch on November 18, 2023 at 02:47
erhoo82 pushed a commit to erhoo82/NeMo that referenced this pull request Dec 2, 2023
erhoo82 pushed a commit to erhoo82/NeMo that referenced this pull request Dec 2, 2023
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024