Changes from all commits
117 commits
e3c9234
add repeat index to help saving pred audio files for each repeat. (#50)
XuesongYang Feb 14, 2025
d174786
fix: make confidence level configurable. (#51)
XuesongYang Feb 18, 2025
0949d57
Inference prior; updated by Jason
paarthneekhara Feb 19, 2025
6f475c7
Bugfix in DPO Pareto ranking (#53)
rfejgin Feb 25, 2025
59bb9cf
Local Transformer and Binarized attention prior from alignment module…
paarthneekhara Mar 6, 2025
fb30e38
Koel onlinepo, GRPO (#54)
shehzeen Mar 13, 2025
c94d356
bug fixes after merge
shehzeen Mar 14, 2025
8f6a22e
Cleanup 2503 to include dev files and fixes naming (#45)
blisc Mar 18, 2025
3791bcd
[bugfix][magpietts] replace pytorch_lightning with lightning.pytorch …
XuesongYang Mar 25, 2025
6ebde25
[magpietts] enable pin_memory to enable fast data transfer to GPU. (#48)
XuesongYang Mar 25, 2025
bc1bcfb
[bugfix][tts_dataset] feature_dir is not a required key for magpietts…
XuesongYang Mar 25, 2025
ed60df4
[magpietts] added WandbLogger support and restructure TensorBoardLogg…
XuesongYang Mar 29, 2025
38ff49a
[magpietts] adhere CapWords convention for model class names. (#50)
XuesongYang Apr 1, 2025
21bf9c9
undo unintended change in #45 (#53)
blisc Apr 8, 2025
790e8f9
Autocasting disabled for codec model and Making prior window strict (…
paarthneekhara Apr 9, 2025
e63b61d
[magpie][lhotse] Add lhotse shar dataset prep recipe and lhotse datal…
XuesongYang Apr 11, 2025
7f54c6d
Merge in Jason's dev changes (#57)
blisc Apr 11, 2025
718173a
Bug fix in context text embedding initialization (#58)
paarthneekhara Apr 15, 2025
7e2cdca
update default params in config (#59)
blisc Apr 16, 2025
aae4e60
magpie top k bug fix (#60)
paarthneekhara Apr 17, 2025
23e299a
Bugfix: num_audio_tokens_per_codebook (#62)
rfejgin Apr 22, 2025
3e1e7fc
Add num_codebooks and codebook_size to codec interface (#65)
rlangman Apr 24, 2025
4b4914e
bug fix in _setup_test_dataloader when num workers=0 (#67)
shehzeen Apr 28, 2025
902bce6
preference optimization updates, trainer updates remove redundant dat…
shehzeen Apr 28, 2025
454cca4
[magpie][wandb] add loggings of pad ratios for text tokens and audio …
XuesongYang Apr 28, 2025
e4e5a75
[magpie][wandb] add loggings of pad ratios for text tokens and audio …
XuesongYang Apr 28, 2025
8ba5506
Magpietts 2503 refactor codebook config (#64)
rfejgin Apr 28, 2025
cae27aa
[magpie][wandb][bugfix] ensure consistent validation step for audio a…
XuesongYang Apr 29, 2025
a100ad1
Codebook layout update: bugfix and README refinements (#68)
rfejgin Apr 29, 2025
db801cb
add update config to infer script (#70)
blisc May 2, 2025
789f040
Fix typo introduced during codec refactor (#72)
blisc May 2, 2025
d5cfd04
Add MaskGit support for iteratively predicting codebooks (#69)
rfejgin May 7, 2025
ef5efde
Add spectral codec modules (#74)
rlangman May 7, 2025
3823459
Update Checkpoint Loading (#73)
blisc May 7, 2025
e7723c5
Bugfix in load_state_dict() (#13555)
rfejgin May 13, 2025
f8b70be
Magpie yaml updates for LT and gradient clipping (#13614)
paarthneekhara May 21, 2025
7e510a6
Add CI/CD to Magpie dev branch (#13682)
blisc May 28, 2025
00be498
BPE char tokenizer (#13594)
shehzeen May 28, 2025
a6c73f9
Python 3.10 compatibility (#13753)
rfejgin May 28, 2025
de1fe39
Add Fréchet codec distance metric (#13553)
rfejgin May 30, 2025
98d35f2
refactor 1: typos, yaml updates, code changes (#13677)
blisc Jun 3, 2025
a445ab3
Re-enable CI (#13857)
blisc Jun 9, 2025
f116e9e
Fix num_codebooks in Group RVQ (#13841)
rlangman Jun 9, 2025
6f87a69
[transformer_core][magpietts_config] added support to override xattn h…
XuesongYang Jun 9, 2025
024a168
[magpietts][wandb] fixed unexpected panel displaying the val_loss in …
XuesongYang Jun 10, 2025
926f1c3
Add more CI tests for Magpie branch (#13831)
blisc Jun 11, 2025
2bba976
FCD Metric bugfix: handle empty codes update (#13906)
rfejgin Jun 12, 2025
7be8621
[magpietts][lhotse_v2] make model training recipe adapt to the latest…
XuesongYang Jun 16, 2025
58b809f
[magpietts][lhotse_v2] Adding scripts of converting NeMo manifests to…
XuesongYang Jun 16, 2025
77753ed
Bugfix: handle inference of models that don't have sample_rate in the…
rfejgin Jun 18, 2025
1bccf0c
use learnable position embeddings in cas encoder (#13909)
shehzeen Jun 18, 2025
0914185
Magpietts Multilingual IPA GRPO (#13595)
shehzeen Jun 18, 2025
c63b63a
inference bugfix: run all datasets (#13967)
rfejgin Jun 24, 2025
d2f730a
Inference fix: clean up old files (#14039)
rfejgin Jun 27, 2025
3d670b7
inference: add option to include the experiment name in the output fo…
rfejgin Jun 28, 2025
4110728
Bugfix in saving predicted codes for FCD calculation (#14051)
rfejgin Jun 28, 2025
ff72456
[magpietts][eval][bugfix] fixed infer and eval scripts and supported …
XuesongYang Jun 28, 2025
dbc6e78
Inference: pad short signals before embedding them (#14055)
rfejgin Jul 1, 2025
e2797ea
[magpietts] decoder CE model type (#13727)
paarthneekhara Jul 1, 2025
5ee5698
EOS detection: check all codebooks (#14038)
rfejgin Jul 1, 2025
4ff2e0e
[eval] added the support of evaluating multiple .nemo checkpoints jus…
XuesongYang Jul 4, 2025
41379ee
make feature_dir default to None to avoid adding feature_dir in (#14080)
XuesongYang Jul 4, 2025
113e0e2
[eval] make compute_fcd optional for now. (#14081)
XuesongYang Jul 7, 2025
db68d32
FCD Metric: if provided with codes of unexpected shape, log warning b…
rfejgin Jul 8, 2025
efa9c13
[eval] Suppress TitaNet messages during initialization (#14101)
rfejgin Jul 8, 2025
04078a9
[lhotse][sampler] added lhotse sampler that filters out records that …
XuesongYang Jul 18, 2025
85a4191
Infer updates: G2P_Prob=1, Add Violin Plots, ASR Default, Cleanups (#…
blisc Jul 24, 2025
f5510e9
Quick PR to disable broken tests on magpie dev branch for now (#14429)
blisc Aug 7, 2025
3df09e8
Magpietts 2503 po july2025 (#14393)
shehzeen Aug 12, 2025
af12c35
[magpietts] Magpietts small and Attention changes, Evaluation (#14418)
subhankar-ghosh Aug 14, 2025
a307a3d
Fix path in config_evalset.py (#14509)
rfejgin Aug 19, 2025
2aa2c69
Add an option to seed the random number generators for debugging / re…
rfejgin Aug 19, 2025
bdb41df
Avoid speaker embedding crash on short signals (#14525)
rfejgin Aug 20, 2025
b3c253f
Inference: Fix for speaker embedding of short files (#14533)
rfejgin Aug 20, 2025
f78d0bf
[magpietts] custom tokenizer for text context (#14408)
paarthneekhara Aug 21, 2025
6fa8921
Frame Stacking (#14455)
rfejgin Aug 21, 2025
bce14ce
Forbid sampling of special tokens except AUDIO_EOS (#14555)
rfejgin Aug 21, 2025
4ab661c
merge with main
blisc Aug 22, 2025
368f6bf
Apply isort and black reformatting
blisc Aug 22, 2025
932da5e
Add attended text token to attention plots (#14560)
blisc Aug 25, 2025
9037131
[eval][bugfix] avoid overridden cross attention maps for multiple repe…
XuesongYang Aug 27, 2025
6d40bd4
Fix Magpie Tests; Update default params for Decoder Context Yamls (#1…
blisc Aug 28, 2025
86be9f2
[lhotse][step3] added shuffling option to nemo manifests before shard…
XuesongYang Aug 28, 2025
799aaad
Always run speech CI on magpie branch regardless of changed files #14…
blisc Sep 6, 2025
a74a31a
Whisper output normalization (#14535) Retargeted for latest Dev Bran…
blisc Sep 8, 2025
81d15b1
Remove multilingual sentence piece tokenizer (#14688)
shehzeen Sep 10, 2025
87b946f
[unittests] fixed bugs for sampling in cer, speaker similarity and va…
XuesongYang Sep 16, 2025
7194fe3
[magpietts][lhotse] add a script for creating text context manifest f…
XuesongYang Sep 19, 2025
d0cfb27
End detection updates and Text context remapping during training (#14…
paarthneekhara Sep 22, 2025
d7fee58
Checkout magpie branch instead of main (#14813)
blisc Sep 27, 2025
f3878d7
Don't allow EOS until 4 frames have been generated (#14761)
rfejgin Sep 27, 2025
55d5a01
Magpietts 2508 Attention mask bug fix (#14836)
paarthneekhara Sep 30, 2025
6b5d25b
[magpie][context audio] add speaker items limit to compute similarity…
XuesongYang Oct 1, 2025
0b48537
Add new spectral codec definition (#14794)
rlangman Oct 1, 2025
0ad7bcb
Docstrings and comments on EOS handling. (#14847)
rfejgin Oct 3, 2025
066d622
Inference metrics improvements (#14923)
rfejgin Oct 20, 2025
22be3f4
New ML yaml + changes to allow for Spectral Codec training with text …
blisc Oct 21, 2025
3f8e70d
attempt to remove coverage calls (#14963)
blisc Oct 22, 2025
df85e6d
[lhotse][aistore] added support input_cfg.yaml directly from aistore …
XuesongYang Oct 22, 2025
239adf4
Update checkpoint saving logic: Use step instead of epoch (#14974)
blisc Oct 29, 2025
d291efd
merge with main
blisc Oct 30, 2025
fe77a4c
undo changes to ci
blisc Oct 30, 2025
40f5847
undo changes to core
blisc Oct 30, 2025
b84d71e
address copilot comments
blisc Oct 31, 2025
f7230c1
remove notebooks; address flake8 comments; add import guard in magpie…
blisc Nov 3, 2025
2338adb
Apply isort and black reformatting
blisc Nov 3, 2025
c78bc75
remove experimental imports
blisc Nov 3, 2025
03949ea
Merge branch 'magpietts_main' of github.com:NVIDIA/NeMo into magpiett…
blisc Nov 3, 2025
8e581a3
more flake8
blisc Nov 3, 2025
4920950
Add in fix from #15028 by @matteolippi
blisc Nov 4, 2025
9b99d2d
attempt to address some codeQL issues
blisc Nov 4, 2025
0f4045a
fix typo
blisc Nov 4, 2025
311a6fc
Apply isort and black reformatting
blisc Nov 4, 2025
79b0eca
Remove single_encoder_sv_tts and decoder_pretrain_synthesizer model t…
blisc Nov 4, 2025
09c3900
Merge branch 'magpietts_main' of github.com:NVIDIA/NeMo into magpiett…
blisc Nov 4, 2025
8c732c5
CodeQL and Lint fixes
blisc Nov 4, 2025
1acbc8f
Update confs and readmes
blisc Nov 4, 2025
10 changes: 10 additions & 0 deletions .github/workflows/cicd-main-speech.yml
@@ -187,6 +187,16 @@ jobs:
- runner: self-hosted-azure
script: SPEECHLM_HF_Training_SALM
timeout: 20
- runner: self-hosted-azure
script: L2_TTS_Fast_dev_runs_Magpietts_DecoderContext
- runner: self-hosted-azure
script: L2_TTS_Fast_dev_runs_Magpietts_MultiEncoder
- runner: self-hosted-azure
script: L2_TTS_Fast_dev_runs_Magpietts_OnlinePO
- runner: self-hosted-azure
script: L2_TTS_InferEvaluate_Magpietts_ZeroShot
- runner: self-hosted-azure
script: L2_TTS_InferEvaluate_Magpietts_SeenSpeakers
needs: [unit-tests]
runs-on: ${{ matrix.runner }}
name: ${{ matrix.is-optional && 'PLEASEFIXME_' || '' }}${{ matrix.script }}
26 changes: 26 additions & 0 deletions examples/tts/README_frame_stacking.md
@@ -0,0 +1,26 @@
# Overview
This PR introduces a frame-stacking implementation in Magpie-TTS. Frame-stacking is disabled by default; it can be enabled by setting `frame_stacking_factor` > 1 in the YAML config, as in the snippet below.
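
A minimal example of such an override (the nesting under `model` is an assumption for illustration; place the key wherever the rest of your Magpie-TTS model options live):

```yaml
model:
  frame_stacking_factor: 2  # default 1 disables frame-stacking; values up to 4 have been tested
```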

# Frame-stacking

## Overview
Frame-stacking is a technique that allows the Magpie-TTS **base decoder** (also known as the "main" or "first-stage" decoder) to **process multiple consecutive audio frames in a single forward pass**, leaving the job of generating individual frames and codebooks to a second, smaller, "Local Transformer" ("LT") decoder. The goal is to accelerate inference by reducing the number of generation steps of the base decoder. In this two-stage approach:

1. The base decoder processes multiple frames at once, producing a single latent representation for each group (stack) of frames.
2. The Local Transformer then generates the individual `frames * codebooks` tokens.

The Local Transformer is much faster than the base decoder, making this two-stage approach significantly faster than generating each frame with the base decoder. The speed improvement comes from two factors:
* **Fewer parameters**: The LT decoder is lightweight compared to the base decoder
* **Shorter sequences**: The LT decoder only attends to the current frame stack and the latent, not the entire frame sequence

The base decoder can also generate audio codes directly without an LT, but when frame-stacking is enabled, using the LT decoder is typically necessary to achieve high-quality synthesis.
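
To make the two-stage flow concrete, below is a minimal, hypothetical sketch of the generation loop. All names (`base_decoder`, `local_transformer`, `embed_stack`, `is_eos`) are illustrative and do not correspond to the actual Magpie-TTS API.

```python
# Hypothetical sketch: one base-decoder step per stack of S frames, with the
# Local Transformer (LT) filling in the S * C tokens of each stack.
def generate_stacked(base_decoder, local_transformer, embed_stack, is_eos, cond, max_stacks, S, C):
    stacks = []  # each entry is an [S, C] array of audio-code tokens
    for _ in range(max_stacks):
        # The base decoder advances by a whole stack (S frames) per step.
        latent = base_decoder.step(embed_stack(stacks), cond=cond)
        # The LT autoregressively predicts the S * C tokens of the current stack,
        # attending only to this latent and the tokens already generated in the stack.
        stack = local_transformer.generate(latent, num_tokens=S * C).reshape(S, C)
        stacks.append(stack)
        if is_eos(stack):  # stop once an AUDIO_EOS token is produced
            break
    return stacks
```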

## Design and Implementation
* The `frame_stacking_factor` parameter controls the number of frames to stack. The default is 1, which means no frame-stacking. We have tested values up to `4`.
* For each codebook, we keep a separate embedding table for each frame within the stack. At the input to the decoder, the embeddings are averaged across codebooks (as usual) and also across the frames within the stack (see the sketch below). The embedding tables are shared between the base and LT decoders.
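
The averaging can be pictured with the following self-contained sketch (shapes and table layout are assumptions for illustration, not the exact Magpie-TTS implementation):

```python
import torch
import torch.nn as nn

S, C, V, D = 2, 8, 2024, 768  # stack size, codebooks, tokens per codebook, model dim
# One embedding table per (frame-in-stack, codebook) pair, shared by base and LT decoders.
emb_tables = nn.ModuleList([nn.Embedding(V, D) for _ in range(S * C)])

def embed_frame_stack(codes: torch.Tensor) -> torch.Tensor:
    """codes: [batch, S, C] integer tokens -> one [batch, D] vector per stack."""
    embs = [
        emb_tables[s * C + c](codes[:, s, c])  # [batch, D] for each (frame, codebook)
        for s in range(S) for c in range(C)
    ]
    # Average across codebooks (as usual) and across the frames within the stack.
    return torch.stack(embs, dim=0).mean(dim=0)
```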

## Limitations
This is still a work in progress. Specifically, the following are not yet implemented or tested:
* Online code extraction combined with frame-stacking.
* Alignment encoder with frame-stacking.
* CTC loss with frame-stacking.
95 changes: 95 additions & 0 deletions examples/tts/README_magpietts_legacy_checkpoints.md
@@ -0,0 +1,95 @@
# Background
Magpie-TTS uses special tokens like AUDIO_BOS and AUDIO_EOS for its operation. The indices of these tokens are after the audio codec tokens, at the end of the embedding table.

In April 2025 we changed the layout of the embedding table in a non-backwards compatible way:

## Old Layout (until April 16)
With the most common codec configuration (2016 codes), the layout used to look like this:
```
| Index | Token Description | Comments |
|---------|----------------------|-----------------------------------------------------------------------------------------------------------|
| [0] | Codec Token 0 | |
| [1] | Codec Token 1 | |
| [2] | Codec Token 2 | |
| ... | ... | |
| [2015] | Codec Token 2015 | |
| [2016] | <Unused> | |
| [2017] | <Unused> | |
| [2018] | <Unused> | |
| ... | | |
| [2044] | Context Audio BOS | if model_type == `decoder_context_tts` |
| [2045] | Context Audio EOS | if model_type == `decoder_context_tts` |
| [2046] | Audio BOS | also used for Context Audio BOS if model_type == `multi_encoder_context_tts` or `single_encoder_sv_tts` |
| [2047] | Audio EOS | also used for Context Audio EOS if model_type == `multi_encoder_context_tts` or `single_encoder_sv_tts` |
```

## New Layout
The new layout for the same codec configuration is:
```
| Index   | Token Description    | Comments |
|---------|----------------------|----------|
| [0] | Codec Token 0 | |
| [1] | Codec Token 1 | |
| [2] | Codec Token 2 | |
| ... | ... | |
| [2015] | Codec Token 2015 | |
| [2016] | Audio BOS | |
| [2017] | Audio EOS | |
| [2018] | Context Audio BOS | |
| [2019] | Context Audio EOS | |
| [2020] | MASK token (MaskGit) | |
| [2021] | RESERVED_1 | |
| [2022] | RESERVED_2 | |
| [2023] | RESERVED_3 | |
```

# How to Train and Load a New Checkpoint
For new trainings and inference, all configuration is automatic:
* The number of codebooks, codec codebook size, and codec downsampling rate are all read from the codec checkpoint rather than configured in Magpie.
* The embedding table size is automatically set to codec_codebook_size + number_of_special_tokens (currently 2016+8=2024). There is no risk of accidentally stepping on codec tokens since the table is automatically sized with enough room for the special tokens (see the sketch below).
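
As a purely illustrative sketch of that arithmetic (the real code reads these values from the codec checkpoint; the names below are not the actual NeMo identifiers):

```python
CODEC_CODEBOOK_SIZE = 2016
SPECIAL_TOKENS = [
    "AUDIO_BOS", "AUDIO_EOS", "CONTEXT_AUDIO_BOS", "CONTEXT_AUDIO_EOS",
    "MASK", "RESERVED_1", "RESERVED_2", "RESERVED_3",
]

# In the new layout, special tokens sit immediately after the codec tokens.
special_ids = {name: CODEC_CODEBOOK_SIZE + i for i, name in enumerate(SPECIAL_TOKENS)}
num_all_tokens_per_codebook = CODEC_CODEBOOK_SIZE + len(SPECIAL_TOKENS)  # 2016 + 8 = 2024

assert special_ids["AUDIO_BOS"] == 2016
assert special_ids["CONTEXT_AUDIO_EOS"] == 2019
assert num_all_tokens_per_codebook == 2024
```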

# How to Load Old Checkpoints
For checkpoints created before the change, you can force the legacy codebook layout in one of these ways:

## If using `infer_and_evaluate.py`
Just set the `--legacy_codebooks` command line option. No need to update your YAML file; the script will automatically add the overrides.
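
For example (the script path and the remaining arguments are placeholders, not a literal command):

```
python infer_and_evaluate.py <your usual arguments> --legacy_codebooks
```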

## If using a Hydra command line
This scenario would happen when either finetuning with an old checkpoint or doing data generation with an old checkpoint.

You have two options:
### Add these to your command line
```
# decoder context model
+model.forced_num_all_tokens_per_codebook=2048 +model.forced_audio_eos_id=2047 +model.forced_audio_bos_id=2046 +model.forced_context_audio_eos_id=2045 +model.forced_context_audio_bos_id=2044

# multi encoder context and any other model type
+model.forced_num_all_tokens_per_codebook=2048 +model.forced_audio_eos_id=2047 +model.forced_audio_bos_id=2046 +model.forced_context_audio_eos_id=2047 +model.forced_context_audio_bos_id=2046
```
### Or, add these overrides to your YAML file
```
forced_num_all_tokens_per_codebook: 2048
forced_audio_eos_id: ${sum:${model.forced_num_all_tokens_per_codebook}, -1} # 2047
forced_audio_bos_id: ${sum:${model.forced_num_all_tokens_per_codebook}, -2} # 2046

# Depending on the old model type, the context_audio_bos_id and context_audio_eos_id will be different (choose one of the pairs below)

# For `multi_encoder_context_tts`, `single_encoder_sv_tts`:
#forced_context_audio_eos_id: ${sum:${model.forced_num_all_tokens_per_codebook}, -1} # 2047
#forced_context_audio_bos_id: ${sum:${model.forced_num_all_tokens_per_codebook}, -2} # 2046

# For `decoder_context_tts` models:
#forced_context_audio_eos_id: ${sum:${model.forced_num_all_tokens_per_codebook}, -3} # 2045
#forced_context_audio_bos_id: ${sum:${model.forced_num_all_tokens_per_codebook}, -4} # 2044
```

# Additional Details
Over the last few weeks we have gone through a few embedding table layouts. When using an old checkpoint it's important to know which layout your checkpoint was trained with and to configure the system accordingly.

* Layout 1: used until April 16 (described in the Old Layout table above). Add `--legacy_codebooks` to the `infer_and_evaluate.py` command line to run inference with this layout.

* Layout 2: after the [config changes](https://github.com/blisc/NeMo/commit/7e2cdca74a866ecefdbe01c0076ad9b5d140ac61): 2018 tokens with the special tokens at the end (2017, 2016, 2015, 2014; the last two overwrite codec tokens). This is an invalid layout and these checkpoints should not be used.

* Layout 3: after the [bugfix](https://github.com/blisc/NeMo/commit/23e299a0bd14b666543b4bbcc7783f783acb0bd3) but before the [refactoring](https://github.com/blisc/NeMo/commit/8ba55061a0ebb161abff4b329e402d5307f4af98): 2024 tokens with special tokens at the end (2023, 2022, 2021, 2020). There are no automatic options provided for using this layout, but it can be manually configured by updating the `hparams.yaml` file with the `forced_*` options. Set `forced_num_all_tokens_per_codebook` to `2024` and set the rest of the overrides as defined under the section "Or, add these overrides to your YAML file" above.

* Layout 4: The new layout, [from this commit onwards](https://github.com/blisc/NeMo/commit/8ba55061a0ebb161abff4b329e402d5307f4af98): 2024 tokens but with special tokens immediately after codec tokens (2016, 2017, 2018, 2019). Training and inference with the latest version of the code automatically use this layout.
199 changes: 199 additions & 0 deletions examples/tts/conf/magpietts/magpietts.yaml
@@ -0,0 +1,199 @@
name: Magpie-TTS

max_epochs: ???
# Adjust batch size based on GPU memory
batch_size: 16
# When doing weighted sampling with multiple manifests, this defines how many training steps are in an epoch.
# If null, then weighted sampling is disabled.
weighted_sampling_steps_per_epoch: null

# Dataset metadata for each manifest
# See DatasetMeta in https://github.com/NVIDIA-NeMo/NeMo/blob/main/nemo/collections/tts/data/text_to_speech_dataset.py
train_ds_meta: ???
val_ds_meta: ???

model:
  model_type: "decoder_ce" # decoder_context_tts or decoder_ce
  use_text_conditioning_encoder: true # Enable or disable text context. Audio context is always enabled
  text_conditioning_tokenizer_name: text_ce_tokenizer # The tokenizer to be used for text contexts
  context_duration_min: 5.0
  context_duration_max: 5.0
  load_cached_codes_if_available: true
  prior_scaling_factor: 0.5
  prior_end_step: 12000
  prior_scaledown_start_step: 8000
  indefinite_prior_prob: 0.0 # If > 0, then prior will be applied after prior_end_step with this probability.
  alignment_loss_scale: 0.002
  embedding_dim: 768
  codecmodel_path: ???
  max_epochs: ${max_epochs}
  steps_per_epoch: ${weighted_sampling_steps_per_epoch}
  cfg_unconditional_prob: 0.1

  # Alignment encoder parameters, to binarize the prior
  # This is used for attention-constrained training and inference
  use_alignment_encoder: false
  # Below args are only relevant if use_alignment_encoder is true
  use_prior_for_aligner: true # Whether to use the beta-binomial prior to train the alignment encoder
  alignment_encoder_loss_scale: 1.0
  binarize_prior_after_step: 10000 # Switch from beta-binomial prior to binarized prior after this step.
  binarize_attn_method: "nemo_binarize" # nemo_binarize or argmax.
  prior_future_context: 2 # Future window of the binarized prior.
  prior_past_context: 2 # Past window of the binarized prior.
  prior_future_decay: 0.8 # Decay factor for future context
  prior_past_decay: 0.5 # Decay factor for past context
  binarize_repeat_audio_factor: 2 # Temporally increase audio timesteps, for nemo_binarize to work better. Increase this for low frame rate codecs
  binarized_prior_epsilon: 0.0
  aligner_encoder_train_steps: 50000

  # Local transformer parameters for autoregressive codebook prediction within a frame
  local_transformer_type: "autoregressive" # "none", "autoregressive", "maskgit"
  # Below args are only relevant if use_local_transformer is true
  local_transformer_loss_scale: 1.0
  local_transformer_n_layers: 1
  local_transformer_n_heads: 1
  local_transformer_hidden_dim: 256

  text_context_remapping_json: null # JSON file defining mapping of multiple text contexts to a single text context. Does not need to cover all text contexts.
  text_context_remapping_prob: 0.0 # Probability of remapping the original text context to a remapped text context.

  text_tokenizers:
    english_phoneme:
      _target_: nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer
      punct: true
      apostrophe: true
      pad_with_space: false
      g2p:
        _target_: nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p
        phoneme_dict: "scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt"
        heteronyms: "scripts/tts_dataset_files/heteronyms-052722"
        phoneme_probability: 0.8
        ignore_ambiguous_words: false
        use_chars: true
        use_stresses: true
    text_ce_tokenizer: # Used for text context
      _target_: AutoTokenizer
      pretrained_model: "google/byt5-small"
    ### For additional languages, consider adding a generic byt5 tokenizer like the one below
    # french_chartokenizer: # Used for text context
    #   _target_: AutoTokenizer
    #   pretrained_model: "google/byt5-small"

  train_ds:
    dataset:
      _target_: nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDataset
      dataset_meta: ${train_ds_meta}
      weighted_sampling_steps_per_epoch: ${weighted_sampling_steps_per_epoch}
      min_duration: 0.2
      max_duration: 20.0

    dataloader_params:
      batch_size: ${batch_size}
      num_workers: 4
      drop_last: true
      pin_memory: true

  validation_ds:
    dataset:
      _target_: nemo.collections.tts.data.text_to_speech_dataset.MagpieTTSDataset
      dataset_meta: ${val_ds_meta}
      min_duration: 0.2
      max_duration: 20.0

    dataloader_params:
      batch_size: ${batch_size}
      num_workers: 4
      pin_memory: true

  encoder:
    n_layers: 6
    d_model: 768
    d_ffn: 3072
    sa_n_heads: 12
    kernel_size: 3
    p_dropout: 0.1
    p_dropout_out: 0.0
    has_xattn: false
    is_causal: true
    apply_norm_out: true
    max_length_causal_mask: 2048
    use_learnable_pos_emb: true

  context_encoder: # Only used for decoder_ce (and multi_encoder_context_tts), ignored otherwise
    n_layers: 1
    d_model: 768
    d_ffn: 3072
    sa_n_heads: 12
    kernel_size: 3
    p_dropout: 0.1
    p_dropout_out: 0.0
    has_xattn: false
    is_causal: false
    apply_norm_out: true
    max_length_causal_mask: 2048
    use_learnable_pos_emb: true

  decoder:
    n_layers: 12
    d_model: 768
    d_ffn: 3072
    sa_n_heads: 12
    kernel_size: 1
    p_dropout: 0.1
    p_dropout_out: 0.0
    has_xattn: true
    xa_d_head: 128
    xa_d_memory: 768
    xa_n_heads: 1
    is_causal: true
    apply_norm_to_cond: true
    apply_norm_out: true
    max_length_causal_mask: 2048
    use_learnable_pos_emb: true
    make_prior_window_strict: true

  optim:
    _target_: torch.optim.AdamW
    lr: 2e-4

    sched:
      name: ExponentialLR
      gamma: 0.998

trainer:
  num_nodes: 1
  devices: -1
  accelerator: gpu
  strategy: ddp_find_unused_parameters_true
  precision: 32
  max_epochs: ${max_epochs}
  accumulate_grad_batches: 1
  enable_checkpointing: False # Provided by exp_manager
  logger: false # Provided by exp_manager
  log_every_n_steps: 100
  check_val_every_n_epoch: 1
  num_sanity_val_steps: 0
  benchmark: false
  gradient_clip_val: 2.5

exp_manager:
  exp_dir: null
  name: ${name}
  create_tensorboard_logger: true
  create_wandb_logger: false
  wandb_logger_kwargs:
    entity: null
    project: null
    group: null
    name: ${name}
    resume: true # enable resume to ensure continuous training log metrics merged on the previous run id.
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: val_loss
    mode: min
    save_top_k: 5
    save_best_model: true
    always_save_nemo: true
    filename: '${name}--{${exp_manager.checkpoint_callback_params.monitor}:.4f}-{step}'
  resume_if_exists: true
  resume_ignore_no_checkpoint: true