update docs and replace speakernet with titanet in tutorials #3405
Conversation
@@ -35,7 +35,8 @@ First we prepare scp file(s) containing absolute paths to all the wav files requ
Since we created the scp file for the train, we use `scp_to_manifest.py` to convert this scp file to a manifest file and then optionally split the files to train & dev for evaluating the models while training by using the --split flag.
-We wont be needing the --split option for the test folder. Accordingly please mention the id number, which is the field num separated by / to be considered as the speaker label.
+We wont be needing the --split option for the test folder. Accordingly please mention the id number, which is the field num separated by / to be considered as the speaker label. If audio file path has name `path/to/folder/speaker_1234/01.wav` for which speaker id is
+`speaker_1234` we pick this by setting `id=3` or `id=-2`. For robust training it is recommended to create utterances of varied time lengths by creating chunks of 1.5, 2 and 3 sec. This can be set optionally by using `-create_chunks` option.
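For readers following along, a hypothetical invocation matching the passage above might look like this (flag names are illustrative; check the script's `--help` for the exact interface):

```bash
# Convert the train scp file to a manifest, split it into train/dev,
# pick the 3rd '/'-separated path field as the speaker label, and
# optionally create 1.5/2/3 s chunks for robust training.
python scp_to_manifest.py \
    --scp data/train_all.scp \
    --id 3 \
    --out data/train_manifest.json \
    --split \
    --create_chunks
```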
A little bit confusing
which sentence is confusing?
> If audio file path has name `path/to/folder/speaker_1234/01.wav` for which speaker id is `speaker_1234` we pick this by setting `id=3` or `id=-2`.
That is an explanation for `scp_to_manifest.py` to automatically pick the label (speaker id) based on the audio file path. Not sure how to put it better.
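In code terms, the field picking being described is roughly (a minimal sketch, not the script itself):

```python
# Pick the speaker label as the id-th '/'-separated field of the audio path.
path = "path/to/folder/speaker_1234/01.wav"
fields = path.split("/")      # ['path', 'to', 'folder', 'speaker_1234', '01.wav']
print(fields[3], fields[-2])  # both print 'speaker_1234', hence id=3 or id=-2
```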
I'd rewrite the whole thing to be clearer and more methodical. This reads like a random set of sentences describing different parts of an idea.
It's not an idea, just an explanation of an argument to the script. Let me see if I can write it in a different manner.
TitaNet
-----------

The model is based on the ContextNet architecture :cite:`sr-models-koluguri2021titanet` for extracting speaker representations.
Is the cite here supposed to be ContextNet?
We explain ContextNet in the TitaNet paper too.
Okay, it's minor. I thought it would be: The model (cite TitaNet) is based on the ContextNet architecture (cite ContextNet).
You should cite ContextNet itself if you are referring to it; it doesn't matter if the TitaNet paper explains it.
@@ -11,6 +11,20 @@ The Checkpoints page also contains benchmark results for the available speaker r

.. _SpeakerNet_model:
Does TitaNet belong to the SpeakerNet collection?
This line is not actually necessary. Removed.
@@ -60,7 +59,7 @@ For details on how to write this section, refer to `Preprocessor Configuration <
Augmentation Configurations
---------------------------

-For SpeakerNet training we use on-the fly augmentations with MUSAN and RIR impulses using ``noise`` augmentor section
+For TitaNet training we use on-the fly augmentations with MUSAN and RIR impulses using ``noise`` augmentor section
MINOR: missing "-": "on-the fly augmentations with ..." -> "on-the-fly augmentations with ...".
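For context, a ``noise`` augmentor section of the kind referred to here typically looks roughly like this in the training config (paths and values are illustrative, not the exact ones from the docs):

```yaml
augmentor:
  noise:
    manifest_path: /path/to/musan_rir_noise_manifest.json  # MUSAN/RIR noise manifest (illustrative path)
    prob: 0.5        # probability of applying the perturbation to a sample
    min_snr_db: 0    # noise is mixed in at an SNR drawn from this range
    max_snr_db: 15
```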
Decoder Configurations
------------------------

-After features have been computed from speakernet encoder, we pass these features to the decoder to compute embeddings and then to compute log probabilities
+After features have been computed from titanet encoder, we pass these features to the decoder to compute embeddings and then to compute log probabilities
MINOR: from titanet encoder -> from TitaNet encoder (maintaining some consistency)
emb_sizes: 256 # number of intermediate emb layers. can be comma separated for additional layers like 512,512
feat_in: *enc_feat_out
num_classes: 7205 # Total number of classes in voxceleb1,2 training manifest file
pool_mode: attention # xvector, attention
Could you state what other pool_mode options exist? It will help users understand or utilize this argument.
Yeah, those two: xvector and attention. Mentioned next to it.
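For readers comparing the two modes, a simplified sketch of what they compute (hypothetical code, not NeMo's actual implementation):

```python
import torch

def xvector_pool(feats):
    # feats: (batch, channels, time) -> concat of per-channel mean and std
    return torch.cat([feats.mean(dim=-1), feats.std(dim=-1)], dim=-1)

def attention_pool(feats, attn_net):
    # attn_net: small network producing per-frame scores over time
    w = torch.softmax(attn_net(feats), dim=-1)  # (batch, channels, time)
    mean = (feats * w).sum(dim=-1)
    var = ((feats - mean.unsqueeze(-1)) ** 2 * w).sum(dim=-1)
    return torch.cat([mean, var.clamp(min=1e-9).sqrt()], dim=-1)
```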
"## Building the SpeakerNet Model\n", | ||
"SpeakerNet is an ASR model with a classification task - it generates one label for the entire provided audio stream. Therefore we encapsulate it inside the EncDecSpeakerLabelModel as follows." | ||
"## Building the TitaNet Model\n", | ||
"TitaNet is a speaker model with an identification task - it generates one label for the entire provided audio stream. Therefore we encapsulate it inside the EncDecSpeakerLabelModel as follows." |
To me, "TitaNet is a speaker model with an identification task" seems bit unclear or misleading. I suggest it to be "TitaNet is a speaker model that is also compatible with (speaker) identification tasks" or "TitaNet is a speaker model can be used for (speaker) identification tasks".
yes
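For reference, the encapsulation the notebook text refers to looks roughly like this (a sketch using the public NeMo API; `config` and `trainer` are assumed to be set up earlier in the tutorial):

```python
import nemo.collections.asr as nemo_asr

# Build the TitaNet speaker model from a config, as in the tutorial ...
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel(cfg=config.model, trainer=trainer)

# ... or load a pretrained checkpoint from NGC.
pretrained = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
```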
LGTM. Thanks!
@@ -34,8 +34,13 @@ First we prepare scp file(s) containing absolute paths to all the wav files requ
!head -n 3 data/train_all.scp

-Since we created the scp file for the train, we use `scp_to_manifest.py` to convert this scp file to a manifest file and then optionally split the files to train & dev for evaluating the models while training by using the --split flag.
-We wont be needing the --split option for the test folder. Accordingly please mention the id number, which is the field num separated by / to be considered as the speaker label.
+Based on the created scp file, we use `scp_to_manifest.py` script to convert it to a manifest file. This scipt takes three optional arguments:
scipt -> script
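For context, each line of the manifest produced by the script is a JSON object along these lines (illustrative values):

```json
{"audio_filepath": "/data/train/speaker_1234/01.wav", "duration": 3.0, "label": "speaker_1234"}
```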
Approve again
* cache_hf (#3406) Signed-off-by: ekmb <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Learning annealing scheduler fix (#3400) Signed-off-by: MaximumEntropy <[email protected]> Signed-off-by: Bonham79 <[email protected]> * T5 Pre-training in NeMo using Megatron (#3036) * add vocab_file and merge_file to megatron init Signed-off-by: ericharper <[email protected]> * add forward Signed-off-by: ericharper <[email protected]> * add train loss Signed-off-by: ericharper <[email protected]> * add optimizer Signed-off-by: ericharper <[email protected]> * add exp_manager Signed-off-by: ericharper <[email protected]> * multi-gpu is working Signed-off-by: ericharper <[email protected]> * adding val loop Signed-off-by: ericharper <[email protected]> * style Signed-off-by: ericharper <[email protected]> * adding val loop Signed-off-by: ericharper <[email protected]> * fix ranks Signed-off-by: ericharper <[email protected]> * fix model parallel checkpoint saving Signed-off-by: ericharper <[email protected]> * fix _del_model Signed-off-by: ericharper <[email protected]> * Initial megatron dataset port Signed-off-by: MaximumEntropy <[email protected]> * added megatron batch sampler Signed-off-by: ericharper <[email protected]> * try to fix num steps Signed-off-by: ericharper <[email protected]> * add wandb to config Signed-off-by: ericharper <[email protected]> * log lr Signed-off-by: ericharper <[email protected]> * add warmup ratio to config Signed-off-by: ericharper <[email protected]> * update configs Signed-off-by: ericharper <[email protected]> * update configs Signed-off-by: ericharper <[email protected]> * Fix merge conflicts Signed-off-by: MaximumEntropy <[email protected]> * add cpu init to args Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * License fixes and megatron model porting Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * More fixes to import from nemo rather than megatron Signed-off-by: MaximumEntropy <[email protected]> * Fix circular imports Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Revert config file Signed-off-by: MaximumEntropy <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * Restructure further to avoid circular imports Signed-off-by: MaximumEntropy <[email protected]> * add Makefile Signed-off-by: ericharper <[email protected]> * Add megatron modules Signed-off-by: MaximumEntropy <[email protected]> * Add data makefile Signed-off-by: MaximumEntropy <[email protected]> * add license Signed-off-by: ericharper <[email protected]> * Port from latest megatron Signed-off-by: MaximumEntropy <[email protected]> * update cfg Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * add _del_model_without_trainer Signed-off-by: ericharper <[email protected]> * add data preprocessing script Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * use apex mpu Signed-off-by: ericharper <[email protected]> * replace print_rank_0 with nemo utils logging Signed-off-by: ericharper <[email protected]> * use apex mpu Signed-off-by: ericharper <[email protected]> * use apex mpu Signed-off-by: ericharper <[email protected]> * add use_cpu_initialization Signed-off-by: ericharper <[email protected]> * fixing autoresume in progress Signed-off-by: 
ericharper <[email protected]> * properly removing last checkpoint Signed-off-by: ericharper <[email protected]> * log consumed samples Signed-off-by: ericharper <[email protected]> * fix mp autoresume Signed-off-by: ericharper <[email protected]> * Megatron GPT training with NeMo tokenizers (#2818) * Update files from megatron repo Signed-off-by: MaximumEntropy <[email protected]> * Remove non NLP data related files from megatron Signed-off-by: MaximumEntropy <[email protected]> * Merge megatron and nemo tokenizers Signed-off-by: MaximumEntropy <[email protected]> * Remove get_tokenizer() calls from gpt model Signed-off-by: MaximumEntropy <[email protected]> * Update tokenizer yaml config Signed-off-by: MaximumEntropy <[email protected]> * add NLPSaveRestoreConnector Signed-off-by: ericharper <[email protected]> * add todo Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * make init_method_std configurable Signed-off-by: ericharper <[email protected]> * make gpu init work by setting random seed earlier Signed-off-by: ericharper <[email protected]> * fix gpu init after removing debug print in mpu Signed-off-by: ericharper <[email protected]> * add fused_adam Signed-off-by: ericharper <[email protected]> * check ds is not none before logging len Signed-off-by: ericharper <[email protected]> * set fp16 arg to true and fix enum conflict Signed-off-by: ericharper <[email protected]> * make fp16 arg configurable Signed-off-by: ericharper <[email protected]> * add grad clip from megatron Signed-off-by: ericharper <[email protected]> * Linear warmup with cosine annealing and constant holding (#2846) * Testing cosine schedule Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Fixes Signed-off-by: MaximumEntropy <[email protected]> * More fixes Signed-off-by: MaximumEntropy <[email protected]> * update config for constant steps in schedule Signed-off-by: ericharper <[email protected]> * temporarily import enum from megatron Signed-off-by: ericharper <[email protected]> * add grad clip for fp32 Signed-off-by: ericharper <[email protected]> * update check for _del_model_without_trainer Signed-off-by: ericharper <[email protected]> * updating restore for model parallel Signed-off-by: ericharper <[email protected]> * add predict script Signed-off-by: ericharper <[email protected]> * update test iters Signed-off-by: ericharper <[email protected]> * add barrier Signed-off-by: ericharper <[email protected]> * return if clip_val is 0 or None Signed-off-by: ericharper <[email protected]> * when using amp clip grads after they are unscaled Signed-off-by: ericharper <[email protected]> * make native amp scaler hyperparams configurable Signed-off-by: ericharper <[email protected]> * (1) nvfuser, (2) amp-casting decoration (#2894) * (1) nvfuser, (2) amp-casting decoration Signed-off-by: Sangkug Lym <[email protected]> * support bf16 Signed-off-by: Sangkug Lym <[email protected]> * update package info Signed-off-by: ericharper <[email protected]> * add set device to constructor Signed-off-by: ericharper <[email protected]> * set_device in constructor Signed-off-by: ericharper <[email protected]> * [BigNLP] Remove megatron-lm dependency. 
(#2910) * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * update megatron_init Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * update process batch Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * add megatron clip_grad Signed-off-by: ericharper <[email protected]> * trying to resolve circular import error Signed-off-by: ericharper <[email protected]> * rename file Signed-off-by: ericharper <[email protected]> * remove non-gpt models and datasets from __init__ files Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set_device in constructor Signed-off-by: ericharper <[email protected]> * clean config Signed-off-by: ericharper <[email protected]> * update MegatronDataset Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16 Signed-off-by: ericharper <[email protected]> * rename to fused_fp16 Signed-off-by: ericharper <[email protected]> * add fused_fp16 arg to LayerNorm calls Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix import Signed-off-by: ericharper <[email protected]> * update arg Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * Adding complete method to MegatronGPTModel (#2935) Signed-off-by: Oleksii Kuchaiev <[email protected]> * make ffn_hidden_size mandatory Signed-off-by: ericharper <[email protected]> * Manually migrating timing of step into branch (#2937) * 1. Manually migrating timing of step into branch. Signed-off-by: Micha Livne <[email protected]> * 1. Updated file name and content. Signed-off-by: Micha Livne <[email protected]> * 1. Updated to latest code. 
Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> * remove unused imports Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * check fused_fp16 and fused_bf16 are not both True Signed-off-by: ericharper <[email protected]> * update predict script for model parallel .nemo Signed-off-by: ericharper <[email protected]> * typo Signed-off-by: ericharper <[email protected]> * typo Signed-off-by: ericharper <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> * NVfuser (#2943) * activation checkpoint recompute Signed-off-by: Sangkug Lym <[email protected]> * selective nvfuser setup * Megatron gpt bfloat support (#2926) * Save/restore fix Signed-off-by: MaximumEntropy <[email protected]> * Another merge Signed-off-by: MaximumEntropy <[email protected]> * Bf16 args in init Signed-off-by: MaximumEntropy <[email protected]> * Set precision Signed-off-by: MaximumEntropy <[email protected]> * Remove debug stuff Signed-off-by: MaximumEntropy <[email protected]> * add bf16 casting decorator Signed-off-by: Sangkug Lym <[email protected]> * Bfloat layernorm propagation Signed-off-by: MaximumEntropy <[email protected]> * activation checkpoint recompute Signed-off-by: Sangkug Lym <[email protected]> * selective nvfuser setup * More arg removal Signed-off-by: MaximumEntropy <[email protected]> * Remove BERTDataset Signed-off-by: MaximumEntropy <[email protected]> * update to latest apex and patch transformer autocast Signed-off-by: ericharper <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: ericharper <[email protected]> * don't set jit for bf16 Signed-off-by: ericharper <[email protected]> * replace apex.mpu Signed-off-by: ericharper <[email protected]> * fix grad clip Signed-off-by: ericharper <[email protected]> * NVFuser fixes (#2951) * Fuser fixes Signed-off-by: MaximumEntropy <[email protected]> * Remove dummy handler Signed-off-by: MaximumEntropy <[email protected]> * Remove PTL plugin based logic for fusion Signed-off-by: MaximumEntropy <[email protected]> * remove duplicated file Signed-off-by: ericharper <[email protected]> * T5 model initial changes Signed-off-by: MaximumEntropy <[email protected]> * typo (#2960) Signed-off-by: ericharper <[email protected]> * [BigNLP] Script to convert GPT checkpoint to .nemo (#2958) * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * update megatron_init Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * update process batch 
Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * add megatron clip_grad Signed-off-by: ericharper <[email protected]> * trying to resolve circular import error Signed-off-by: ericharper <[email protected]> * rename file Signed-off-by: ericharper <[email protected]> * remove non-gpt models and datasets from __init__ files Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set_device in constructor Signed-off-by: ericharper <[email protected]> * clean config Signed-off-by: ericharper <[email protected]> * update MegatronDataset Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16 Signed-off-by: ericharper <[email protected]> * rename to fused_fp16 Signed-off-by: ericharper <[email protected]> * add fused_fp16 arg to LayerNorm calls Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix import Signed-off-by: ericharper <[email protected]> * update arg Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * Adding complete method to MegatronGPTModel (#2935) Signed-off-by: Oleksii Kuchaiev <[email protected]> * make ffn_hidden_size mandatory Signed-off-by: ericharper <[email protected]> * Manually migrating timing of step into branch (#2937) * 1. Manually migrating timing of step into branch. Signed-off-by: Micha Livne <[email protected]> * 1. Updated file name and content. Signed-off-by: Micha Livne <[email protected]> * 1. Updated to latest code. 
Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> * remove unused imports Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * check fused_fp16 and fused_bf16 are not both True Signed-off-by: ericharper <[email protected]> * update predict script for model parallel .nemo Signed-off-by: ericharper <[email protected]> * typo Signed-off-by: ericharper <[email protected]> * add script to convert .ckpt to .nemo Signed-off-by: ericharper <[email protected]> * in progress Signed-off-by: ericharper <[email protected]> * update Signed-off-by: ericharper <[email protected]> * convert mp checkpoints to nemo Signed-off-by: ericharper <[email protected]> * update help Signed-off-by: ericharper <[email protected]> * add safeguard for model parallel save_to Signed-off-by: ericharper <[email protected]> * adjust NLPModel save_to to be safer for model parallel Signed-off-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> * [BigNLP] Update GPT evaluation to work with tensor model parallel (#2959) * in progress Signed-off-by: ericharper <[email protected]> * update args Signed-off-by: ericharper <[email protected]> * add request dataset Signed-off-by: ericharper <[email protected]> * tokenize request Signed-off-by: ericharper <[email protected]> * in progress Signed-off-by: ericharper <[email protected]> * able to run Signed-off-by: ericharper <[email protected]> * reduce logits Signed-off-by: ericharper <[email protected]> * capture response Signed-off-by: ericharper <[email protected]> * squeeze and unsqueeze Signed-off-by: ericharper <[email protected]> * handle non model parallel case Signed-off-by: ericharper <[email protected]> * clean imports Signed-off-by: ericharper <[email protected]> * add file Signed-off-by: ericharper <[email protected]> * convert logits to log_probs Signed-off-by: Oleksii Kuchaiev <[email protected]> * rename logits to log_probs Signed-off-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> * More changes Signed-off-by: MaximumEntropy <[email protected]> * Missing import Signed-off-by: MaximumEntropy <[email protected]> * Tokenizer fixes and adafactor Signed-off-by: MaximumEntropy <[email protected]> * Add adafactor Signed-off-by: MaximumEntropy <[email protected]> * Add training and conf scripts Signed-off-by: MaximumEntropy <[email protected]> * Add megatron t5 model Signed-off-by: MaximumEntropy <[email protected]> * t5 config to fp32 Signed-off-by: MaximumEntropy <[email protected]> * [BigNLP] Remove fused kernel code instead use Apex (#2984) * remove fused_kernels Signed-off-by: ericharper <[email protected]> * remove fused_kernels Signed-off-by: ericharper <[email protected]> * remove fused layer norm and fused softmax and use apex instead Signed-off-by: ericharper <[email protected]> * update imports Signed-off-by: ericharper <[email protected]> * remove comment Signed-off-by: ericharper <[email protected]> * use apex enums Signed-off-by: ericharper <[email protected]> * use apex enums Signed-off-by: ericharper <[email protected]> * Timer with sliding window (#3002) Co-authored-by: Micha Livne 
<[email protected]> * check for rank zero Signed-off-by: ericharper <[email protected]> * Remove ict dataset import Signed-off-by: MaximumEntropy <[email protected]> * Remove fused kernels Signed-off-by: MaximumEntropy <[email protected]> * style fix Signed-off-by: ericharper <[email protected]> * fix consumed_samples when resuming Signed-off-by: ericharper <[email protected]> * T5 consumed samples fix Signed-off-by: MaximumEntropy <[email protected]> * Remove megatron dep Signed-off-by: MaximumEntropy <[email protected]> * Change checkpoint filename format Signed-off-by: MaximumEntropy <[email protected]> * Log consumed samples in T5 Signed-off-by: MaximumEntropy <[email protected]> * T5 lr scheduler Signed-off-by: MaximumEntropy <[email protected]> * Checkpoint conversion and data preproc updates for t5 Signed-off-by: MaximumEntropy <[email protected]> * Denoising eval Signed-off-by: MaximumEntropy <[email protected]> * Clean up denoising example to explicitly provide mask positions Signed-off-by: MaximumEntropy <[email protected]> * Better logging of results Signed-off-by: MaximumEntropy <[email protected]> * Better printing of results Signed-off-by: MaximumEntropy <[email protected]> * Minor changes Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * properly removing last checkpoint Signed-off-by: ericharper <[email protected]> * add todo Signed-off-by: ericharper <[email protected]> * add predict script Signed-off-by: ericharper <[email protected]> * T5 model initial changes Signed-off-by: MaximumEntropy <[email protected]> * Add adafactor Signed-off-by: MaximumEntropy <[email protected]> * Add training and conf scripts Signed-off-by: MaximumEntropy <[email protected]> * Add megatron t5 model Signed-off-by: MaximumEntropy <[email protected]> * t5 config to fp32 Signed-off-by: MaximumEntropy <[email protected]> * Remove fused kernels Signed-off-by: MaximumEntropy <[email protected]> * fix consumed_samples when resuming Signed-off-by: ericharper <[email protected]> * T5 consumed samples fix Signed-off-by: MaximumEntropy <[email protected]> * Remove megatron dep Signed-off-by: MaximumEntropy <[email protected]> * Change checkpoint filename format Signed-off-by: MaximumEntropy <[email protected]> * Log consumed samples in T5 Signed-off-by: MaximumEntropy <[email protected]> * T5 lr scheduler Signed-off-by: MaximumEntropy <[email protected]> * Checkpoint conversion and data preproc updates for t5 Signed-off-by: MaximumEntropy <[email protected]> * Denoising eval Signed-off-by: MaximumEntropy <[email protected]> * Clean up denoising example to explicitly provide mask positions Signed-off-by: MaximumEntropy <[email protected]> * Better logging of results Signed-off-by: MaximumEntropy <[email protected]> * Minor changes Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Merge main into megatron_t5 Signed-off-by: MaximumEntropy <[email protected]> * Dataset prerproc script Signed-off-by: MaximumEntropy <[email protected]> * Remove biencoder file Signed-off-by: MaximumEntropy <[email protected]> * Remove another unused file Signed-off-by: MaximumEntropy <[email protected]> * Remove preprocess script since it has moved Signed-off-by: MaximumEntropy <[email protected]> * Remove ICT dataset Signed-off-by: MaximumEntropy <[email 
protected]> * Remove orqa dataset Signed-off-by: MaximumEntropy <[email protected]> * Remove realm datase Signed-off-by: MaximumEntropy <[email protected]> * More file removing Signed-off-by: MaximumEntropy <[email protected]> * Fix 2 files Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Rename checkpoint fname Signed-off-by: MaximumEntropy <[email protected]> * Loss averaging fixes in t5 Signed-off-by: MaximumEntropy <[email protected]> * Minor changes Signed-off-by: MaximumEntropy <[email protected]> * add megatron gpt pretraining Signed-off-by: ericharper <[email protected]> Signed-off-by: MaximumEntropy <[email protected]> * Remove weight decay stuff Signed-off-by: MaximumEntropy <[email protected]> * Training script update for PTL 1.5 Signed-off-by: MaximumEntropy <[email protected]> * Update grad clip Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Add barrier Signed-off-by: MaximumEntropy <[email protected]> * Style fixes and adding more stuff Signed-off-by: MaximumEntropy <[email protected]> * Missed merge conflict fix Signed-off-by: MaximumEntropy <[email protected]> * Unittest fixes Signed-off-by: MaximumEntropy <[email protected]> * Style fix Signed-off-by: MaximumEntropy <[email protected]> * Inference changes Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Fix reinstall script Signed-off-by: MaximumEntropy <[email protected]> * T5 CI tests Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Minor fixes Signed-off-by: MaximumEntropy <[email protected]> * Minor fixes Signed-off-by: MaximumEntropy <[email protected]> * Tokenizer arg fix Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Helpers fix Signed-off-by: MaximumEntropy <[email protected]> * Style fix Signed-off-by: MaximumEntropy <[email protected]> * PR review changes Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Refactor bert dataset stuff Signed-off-by: MaximumEntropy <[email protected]> * Fix typo Signed-off-by: MaximumEntropy <[email protected]> * Fix request dataset variable Signed-off-by: MaximumEntropy <[email protected]> * Fix sched params in CI test Signed-off-by: MaximumEntropy <[email protected]> * Change to kwargs and Jenkins test for inference Signed-off-by: MaximumEntropy <[email protected]> * PR review related changes Signed-off-by: MaximumEntropy <[email protected]> * More fixes Signed-off-by: MaximumEntropy <[email protected]> * Test helper building Signed-off-by: MaximumEntropy <[email protected]> * Restore helper compilation everywhere Signed-off-by: MaximumEntropy <[email protected]> * Fix PR comments Signed-off-by: MaximumEntropy <[email protected]> * PR comments Signed-off-by: MaximumEntropy <[email protected]> * Add docstring to additional_special_tokens Signed-off-by: MaximumEntropy <[email protected]> * Improve docstring Signed-off-by: MaximumEntropy <[email protected]> * Fix resume from checkpoint path Signed-off-by: MaximumEntropy <[email protected]> * Fix for TP>1 Signed-off-by: MaximumEntropy <[email protected]> * Remove fused fp16 and bf16 args Signed-off-by: MaximumEntropy <[email protected]> * Add missed file Signed-off-by: 
MaximumEntropy <[email protected]> * Learning annealing scheduler fix Signed-off-by: MaximumEntropy <[email protected]> * Change default optim and scheduler to adam Signed-off-by: MaximumEntropy <[email protected]> * dummy for CI restart Signed-off-by: MaximumEntropy <[email protected]> * Remove constant steps after switch to adam Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: ericharper <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Updates on ASR with diarization util files (#3359) * Initial commit Signed-off-by: Taejin Park <[email protected]> * Update LM part and multiscale part in README. Signed-off-by: Taejin Park <[email protected]> * Removed redundant parts Signed-off-by: Taejin Park <[email protected]> * modified example script Signed-off-by: Taejin Park <[email protected]> * Revised doc strings Signed-off-by: Taejin Park <[email protected]> * Changed paths_to_manifest.py script Signed-off-by: Taejin Park <[email protected]> * Reflected PR comments and revised tutorials Signed-off-by: Taejin Park <[email protected]> * Added ASR models and kenlm installation Signed-off-by: [email protected] * Added ASR models and kenlm installation Signed-off-by: [email protected] Signed-off-by: Taejin Park <[email protected]> * Changed docstrings and style fix Signed-off-by: Taejin Park <[email protected]> * Fixed unused import and vars Signed-off-by: Taejin Park <[email protected]> * Added LM part in ASR_diar tutorial. Signed-off-by: Taejin Park <[email protected]> Co-authored-by: fayejf <[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Bonham79 <[email protected]> * update docs and replace speakernet with titanet in tutorials (#3405) * update docs and replace speakernet with titanet in tutorials Signed-off-by: nithinraok <[email protected]> * update dataset usage description Signed-off-by: nithinraok <[email protected]> * updated based on comments Signed-off-by: nithinraok <[email protected]> * spell fix Signed-off-by: nithinraok <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update Mixer-TTS, FastPitch and TTSDataset (#3366) * update tts dataset, fastpitch and mixer tts Signed-off-by: Oktai Tatanov <[email protected]> * fix style and notebooks Signed-off-by: Oktai Tatanov <[email protected]> * update notebooks Signed-off-by: Oktai Tatanov <[email protected]> * update mixer-tts, mixer-tts-x and fastpitch configs Signed-off-by: Oktai Tatanov <[email protected]> * update notebooks and configs Signed-off-by: Oktai Tatanov <[email protected]> * update configs Signed-off-by: Oktai Tatanov <[email protected]> * add links, update README, fix tutorials Signed-off-by: Oktai Tatanov <[email protected]> * fix style Signed-off-by: Oktai Tatanov <[email protected]> * remove unnecessary code from fastpitch model Signed-off-by: Oktai Tatanov <[email protected]> * update jenkinsfile and fastpitch typo fix Signed-off-by: Oktai Tatanov <[email protected]> * fix configs Signed-off-by: Oktai Tatanov <[email protected]> * revert jenkinsfile Signed-off-by: Oktai Tatanov <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Asr fr (#3404) * Pushing WFST_tutorial for open draft. (Still need to review collab code. 
Signed-off-by: tbartley94 <[email protected]> * Checked tutorial code for WFST_Tutorial is properly functioning. Also included some formatting edits. Signed-off-by: tbartley94 <[email protected]> * Responding to editorial comments for WFST_tutorial Signed-off-by: tbartley94 <[email protected]> * Added images to folder and wrote README for tutorials Signed-off-by: tbartley94 <[email protected]> * Few more editorial changes to explain permutations in classification. Signed-off-by: tbartley94 <[email protected]> * Updated tutorials documentation page. Signed-off-by: tbartley94 <[email protected]> * Forgot links for README Signed-off-by: tbartley94 <[email protected]> * TOC links were dead Signed-off-by: tbartley94 <[email protected]> * More dead links to fix. Signed-off-by: tbartley94 <[email protected]> * removing collab install and appending a warning instead. Signed-off-by: tbartley94 <[email protected]> * Update WFST_Tutorial.ipynb Signed-off-by: tbartley94 <[email protected]> * Adding pretrained French models to ctc_bpe_models and rnnt_bpe_models available models listing Signed-off-by: tbartley94 <[email protected]> * Updating ctc_bpe_models import for updated Fr Conformer Ctc version. Signed-off-by: tbartley94 <[email protected]> * Added new French ASR models to documentation and imports: conformer transducer and conformer ctc trained without hyphenization. Signed-off-by: tbartley94 <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Yang Zhang <[email protected]> Signed-off-by: Bonham79 <[email protected]> * [fix] for resume training on SLURM multi-node multi-gpu (#3374) * [fix] for resume training on SLURM multi-node multi-gpu On SLURM resuming training in a multi-node multi-gpu settings fails, as when `LOCAL_RANK` is undefined `is_globa_rank_zero()` returns true on all processes that run on node 0. In this case `exp_manager.py` https://github.com/NVIDIA/NeMo/blob/f83b2c5524a787be21ffea170850c4b5486eac2b/nemo/utils/exp_manager.py#L446, creates multiple `run_*` folders, and eventually leads to failure (missing files because other processes have moved them already). Checking also for `SLURM_PROCID` solves this issue, as the environment variable contains the global rank id. Signed-off-by: Iztok Lebar Bajec <[email protected]> * Update get_rank.py In SLURM environment return SLURM global_rank (SLURM_PROCID), fallback to previous behaviour otherwise. 
Signed-off-by: Iztok Lebar Bajec <[email protected]> * style Signed-off-by: Jason <[email protected]> * Sloved bug when either RANK or SLURM_PROC reurn 0, and conditionals return False Signed-off-by: Iztok Lebar Bajec <[email protected]> Co-authored-by: Jason <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Fix running token classification in multinode setting (#3413) * fix: master device check Signed-off-by: PeganovAnton <[email protected]> * Fix bug with use_cache parameter Signed-off-by: PeganovAnton <[email protected]> * create pickled features file regardless of value of use_cache Signed-off-by: PeganovAnton <[email protected]> * Improve docs Signed-off-by: PeganovAnton <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Fix order of lang checking to ignore input langs (#3417) * Fix order of lang checking Signed-off-by: MaximumEntropy <[email protected]> * Fix == error Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: PeganovAnton <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Refactor ASR Examples Directory (#3392) * Begin refactor of ASR files Signed-off-by: smajumdar <[email protected]> * Update jenkins paths for ASR Signed-off-by: smajumdar <[email protected]> * Update speech_to_text_ctc Signed-off-by: smajumdar <[email protected]> * Update speech_to_text_ctc_bpe Signed-off-by: smajumdar <[email protected]> * Lowercase all directories Signed-off-by: smajumdar <[email protected]> * Fix RNNT num_workers Signed-off-by: smajumdar <[email protected]> * Fix RNNT num_workers Signed-off-by: smajumdar <[email protected]> Signed-off-by: Bonham79 <[email protected]> * NMT MIM mean variance fix (#3385) * 1. Updated default NMT bottleneck encoder to be non-autoregressive Signed-off-by: Micha Livne <[email protected]> * 1. Fixed mena/variance being tied when latent and hidden dimensions are the same. Signed-off-by: Micha Livne <[email protected]> * 1. Debugging. Signed-off-by: Micha Livne <[email protected]> * 1. Fixed style. 
Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * update to 21.12 (#3424) Signed-off-by: ericharper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Working around Pytorch exporter issue with expand() (#3422) Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * update copyright (#3426) Signed-off-by: ericharper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * remove apex (#3428) Signed-off-by: ekmb <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * vad infer refactor (#3394) * vad infer refactor Signed-off-by: fayejf <[email protected]> * remove duplicate in write_long_audio_manifest Signed-off-by: fayejf <[email protected]> * remove duplicate in script vad_overlap_posterior Signed-off-by: fayejf <[email protected]> * style fix Signed-off-by: fayejf <[email protected]> * fix nb Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * fix Signed-off-by: fayejf <[email protected]> * small fixes Signed-off-by: fayejf <[email protected]> * reflect taejin's review Signed-off-by: fayejf <[email protected]> * update tutorial about rename Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * merge main and fix Signed-off-by: fayejf <[email protected]> * tiny path fix Signed-off-by: fayejf <[email protected]> Signed-off-by: Bonham79 <[email protected]> * doc update for refactory (#3430) Signed-off-by: fayejf <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update LJSpeech preprocessing (#3423) * update lj speech preprocessing Signed-off-by: Oktai Tatanov <[email protected]> * update lj speech preprocessing 2 Signed-off-by: Oktai Tatanov <[email protected]> Signed-off-by: Bonham79 <[email protected]> * NMT Shared Embeddings Weights (#3340) * 1. Debugging. Signed-off-by: Micha Livne <[email protected]> * 1. Implemented encoder deocder embedding weights tie. Signed-off-by: Micha Livne <[email protected]> * 1. Fixed style. Signed-off-by: Micha Livne <[email protected]> * 1. Debugging. Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * [BigNLP] Make saving .nemo during on_train_end configurable (#3427) * make save nemo configurable on train end Signed-off-by: ericharper <[email protected]> * add warning when save_best_model is True Signed-off-by: ericharper <[email protected]> Co-authored-by: Jason <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Preprocess an entire folder of .json or .json.gz files into a single .bin and .idx file. 
(#3425) * Folder preproc Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Fix usless enumerate Signed-off-by: MaximumEntropy <[email protected]> * Address PR comments Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update speaker diarization docs (#3419) * Initial commit Signed-off-by: Taejin Park <[email protected]> * Fixed minor mistakes Signed-off-by: Taejin Park <[email protected]> * Some changes regarding diarization utils Signed-off-by: Taejin Park <[email protected]> * Fixed minor typos Signed-off-by: Taejin Park <[email protected]> * Reflected PR comments Signed-off-by: Taejin Park <[email protected]> * Reflected PR comments Signed-off-by: Taejin Park <[email protected]> * Reflected addtional comments Signed-off-by: Taejin Park <[email protected]> * Changed pics and refined text Signed-off-by: Taejin Park <[email protected]> * Minor typos Signed-off-by: Taejin Park <[email protected]> * Minor change on dataset Signed-off-by: Taejin Park <[email protected]> * Minor change on dataset 2 Signed-off-by: Taejin Park <[email protected]> * Changed manifest input to yaml format Signed-off-by: Taejin Park <[email protected]> * Capitalization of titles Signed-off-by: Taejin Park <[email protected]> * Last commit Signed-off-by: Taejin Park <[email protected]> Co-authored-by: fayejf <[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update ContextNet models trained on more datasets (#3440) * Update ContextNet models trained on more datasets Signed-off-by: smajumdar <[email protected]> * Update ContextNet models trained on more datasets Signed-off-by: smajumdar <[email protected]> Signed-off-by: Bonham79 <[email protected]> * 1. Updated default buffer_size for TimingCallback to 1. 
(#3439) Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Fix bug for missing variable (#3437) Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Extending input_example() to take max batch and dimension arguments (#3429) * Extending input_example() to take max batch and dimension arguments Signed-off-by: Boris Fomitchev <[email protected]> * Fixing conformer size reconfig, extending export script, some refactoring Signed-off-by: Boris Fomitchev <[email protected]> * Addressing comments Signed-off-by: Boris Fomitchev <[email protected]> * Fixing test issue Signed-off-by: Boris Fomitchev <[email protected]> * Fixing DecoderJoint input example Signed-off-by: Boris Fomitchev <[email protected]> * Removing soon-deprecated external format option addition Signed-off-by: Boris Fomitchev <[email protected]> * Fixing indentation typo Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Byte-level Multilingual NMT (#3368) * init Signed-off-by: Abhinav Khattar <[email protected]> * style Signed-off-by: Abhinav Khattar <[email protected]> * rm debug stuff Signed-off-by: Abhinav Khattar <[email protected]> * changes Signed-off-by: Abhinav Khattar <[email protected]> * fix Signed-off-by: Abhinav Khattar <[email protected]> * fix Signed-off-by: Abhinav Khattar <[email protected]> * error fix Signed-off-by: Abhinav Khattar <[email protected]> * make spl tokens optional Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Asr patches (#3443) * Fix issues with num_workers for transcribe Signed-off-by: smajumdar <[email protected]> * During inference use full context of chunk Signed-off-by: smajumdar <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Updated NumPy SDE requirement (#3442) Signed-off-by: Vitaly Lavrukhin <[email protected]> Signed-off-by: Bonham79 <[email protected]> * refactor data preprocessing script (#3444) Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Prompt tuning loss mask fix (#3438) * Switched to calcualte loss on answer only Signed-off-by: Virginia Adams <[email protected]> * Added CI tests and unit tests for prompt tuning dataset Signed-off-by: Virginia Adams <[email protected]> * Fixed Jenkinsfile typo Signed-off-by: Virginia Adams <[email protected]> * fixed Jenkinsfile typo Signed-off-by: Virginia Adams <[email protected]> * Fixed more typos so CI tests run all the way through Signed-off-by: Virginia Adams <[email protected]> * Fixed code formatting Signed-off-by: Virginia Adams <[email protected]> * Needed to add save nemo file on train end flag to CI test Signed-off-by: Virginia Adams <[email protected]> * Added save .nemo on train end flag to example script Signed-off-by: Virginia Adams <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * BioMegatron token classification tutorial fix to be compatible with current Megatron BERT (#3435) * fixed the tokenizer Signed-off-by: Yi Dong <[email protected]> * training is working Signed-off-by: Yi Dong <[email protected]> * fixed text Signed-off-by: Yi Dong <[email protected]> * fixed text Signed-off-by: Yi Dong <[email protected]> * working notebook Signed-off-by: Yi Dong <[email protected]> * 
style fix Signed-off-by: Yi Dong <[email protected]> * fixed text Signed-off-by: Yi Dong <[email protected]> * handles the different megatron-lm checkpoint versions Signed-off-by: Yi Dong <[email protected]> * fixed the text classification notebook Signed-off-by: Yi Dong <[email protected]> * fixed key error Signed-off-by: Yi Dong <[email protected]> * more key error Signed-off-by: Yi Dong <[email protected]> * replace the old notebooks Signed-off-by: Yi Dong <[email protected]> * register vocab to nemo file Signed-off-by: Yi Dong <[email protected]> * added the missing notebook Signed-off-by: Yi Dong <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * (1) O2-style mixed precision recipe, (2) Persistent layer-norm, (3) Grade scale hysteresis, (4) gradient_as_bucket_view (#3259) * half precision training w/o autocast using master param stage fp16 working version fix: fp32 grad accumulation bf16 support Signed-off-by: Sangkug Lym <[email protected]> add closure fn at bf16 * change autocast compatible with latest pytorch version Signed-off-by: Sangkug Lym <[email protected]> * add module to the state_dict naming Signed-off-by: Sangkug Lym <[email protected]> * cleanup arguments Signed-off-by: Sangkug Lym <[email protected]> * fix module state matching upon checkpoint resume Signed-off-by: Sangkug Lym <[email protected]> * persistent layer norm and dependency check Signed-off-by: Sangkug Lym <[email protected]> check container version instead of pytorch version Signed-off-by: Sangkug Lym <[email protected]> update config * dependency check Signed-off-by: Sangkug Lym <[email protected]> * add graadient_as_bucket_view arg to config Signed-off-by: Sangkug Lym <[email protected]> * (1) add hysteresis to grad scaler, and (2) add grad_scaler to TB Signed-off-by: Sangkug Lym <[email protected]> * doc link fixes (#3264) Signed-off-by: nithinraok <[email protected]> * escape chars fix (#3253) * escape chars fix Signed-off-by: ekmb <[email protected]> * bug fixes Signed-off-by: ekmb <[email protected]> * review Signed-off-by: ekmb <[email protected]> Co-authored-by: Yang Zhang <[email protected]> * Improve data pipeline for punctuation capitalization model and make other useful changes (#3159) * Fix: inference on short sequences problem Signed-off-by: PeganovAnton <[email protected]> * Add draft of new punctuation and capitalization model Signed-off-by: PeganovAnton <[email protected]> * Fix debug config Signed-off-by: PeganovAnton <[email protected]> * Add parameter check Signed-off-by: PeganovAnton <[email protected]> * Update punctuation training script Signed-off-by: PeganovAnton <[email protected]> * Fix head config parameter names Signed-off-by: PeganovAnton <[email protected]> * Fix ds_item and class_label parameters in config Signed-off-by: PeganovAnton <[email protected]> * Fix dataloader shuffling for tarred dataset Signed-off-by: PeganovAnton <[email protected]> * Reduce validation batch Signed-off-by: PeganovAnton <[email protected]> * Add debug print Signed-off-by: PeganovAnton <[email protected]> * Fix metrics initialization Signed-off-by: PeganovAnton <[email protected]> * Fix minor bug Signed-off-by: PeganovAnton <[email protected]> * Fix device problem Signed-off-by: PeganovAnton <[email protected]> * Add debug print Signed-off-by: PeganovAnton <[email protected]> * Register metrics properly Signed-off-by: PeganovAnton <[email protected]> * Put metrics setup after module init Signed-off-by: PeganovAnton <[email protected]> * 
Reduce model size Signed-off-by: PeganovAnton <[email protected]> * Add wandb logging Signed-off-by: PeganovAnton <[email protected]> * Change wandb name Signed-off-by: PeganovAnton <[email protected]> * Fix logging names for metrics Signed-off-by: PeganovAnton <[email protected]> * Add debug print Signed-off-by: PeganovAnton <[email protected]> * Add returning from eval steps Signed-off-by: PeganovAnton <[email protected]> * Add second dev dataset Signed-off-by: PeganovAnton <[email protected]> * Move config Signed-off-by: PeganovAnton <[email protected]> * Fix path to dataset" Signed-off-by: PeganovAnton <[email protected]> * Add more tokenizer parameters Signed-off-by: PeganovAnton <[email protected]> * Add debug script for more tokenizer in creating tarred dataset Signed-off-by: PeganovAnton <[email protected]> * Update output path in debug script Signed-off-by: PeganovAnton <[email protected]> * Fix minor bug in typing Signed-off-by: PeganovAnton <[email protected]> * Fix bug in parsing arguments Signed-off-by: PeganovAnton <[email protected]> * Do not pass tokenizer through queue Signed-off-by: PeganovAnton <[email protected]> * Set hf tokenizer in debug script Signed-off-by: PeganovAnton <[email protected]> * Try char vocabulary Signed-off-by: PeganovAnton <[email protected]> * Fix typo Signed-off-by: PeganovAnton <[email protected]> * Improve error message Signed-off-by: PeganovAnton <[email protected]> * Fix OOV problem Signed-off-by: PeganovAnton <[email protected]> * Add label ids creation and getting Signed-off-by: PeganovAnton <[email protected]> * fix: add missing parameter Signed-off-by: PeganovAnton <[email protected]> * Improve error message for label ids building Signed-off-by: PeganovAnton <[email protected]> * Add short tar files repacking Signed-off-by: PeganovAnton <[email protected]> * Fix minor bug and add more security Signed-off-by: PeganovAnton <[email protected]> * Fix minor bug Signed-off-by: PeganovAnton <[email protected]> * fix: replace Path with str Signed-off-by: PeganovAnton <[email protected]> * fix: iter datasets Signed-off-by: PeganovAnton <[email protected]> * Improve logging Signed-off-by: PeganovAnton <[email protected]> * Turn off repacking Signed-off-by: PeganovAnton <[email protected]> * Turn off repacking Signed-off-by: PeganovAnton <[email protected]> * Turn on repacking Signed-off-by: PeganovAnton <[email protected]> * Turn off repacking Signed-off-by: PeganovAnton <[email protected]> * Add debug print Signed-off-by: PeganovAnton <[email protected]> * Improve unexpected removal Signed-off-by: PeganovAnton <[email protected]> * Turn on repacking Signed-off-by: PeganovAnton <[email protected]> * fix: remove repacked files Signed-off-by: PeganovAnton <[email protected]> * Add default config for testing Signed-off-by: PeganovAnton <[email protected]> * Improve code style in evaluate script Signed-off-by: PeganovAnton <[email protected]> * Add docstrings Signed-off-by: PeganovAnton <[email protected]> * Remove debug config Signed-off-by: PeganovAnton <[email protected]> * Remove commented code Signed-off-by: PeganovAnton <[email protected]> * Fix code style in doc string Signed-off-by: PeganovAnton <[email protected]> * Fix usage of parser.error function Signed-off-by: PeganovAnton <[email protected]> * Improve working with config and fix restoring of old checkpoints Signed-off-by: PeganovAnton <[email protected]> * Do not demand cfg as dataclass Signed-off-by: PeganovAnton <[email protected]> * Add backward compatibility for absense of 
use_tarred_dataset Signed-off-by: PeganovAnton <[email protected]> * Fight for backwards compatibility Signed-off-by: PeganovAnton <[email protected]> * Add tokens_in_batch backward compatibility Signed-off-by: PeganovAnton <[email protected]> * Undo unintentional changes in tutorial Signed-off-by: PeganovAnton <[email protected]> * Do not allow more workers than queries Signed-off-by: PeganovAnton <[email protected]> * Fix metric names in tests Signed-off-by: PeganovAnton <[email protected]> * Fix metric location Signed-off-by: PeganovAnton <[email protected]> * Fix metric location Signed-off-by: PeganovAnton <[email protected]> * Require ds_item or data_dir Signed-off-by: PeganovAnton <[email protected]> * Disable multiprocessing data preparation by default Signed-off-by: PeganovAnton <[email protected]> * Disable multiprocessing data preparation by default Signed-off-by: PeganovAnton <[email protected]> * Disable multiprocessing data preparation by default Signed-off-by: PeganovAnton <[email protected]> * Make minor improvements in docstrings and typing Signed-off-by: PeganovAnton <[email protected]> * Fix finetuning code Signed-off-by: PeganovAnton <[email protected]> * Fix shuffle train dataset config parameter Signed-off-by: PeganovAnton <[email protected]> * Fix evaluation script Signed-off-by: PeganovAnton <[email protected]> * Update test Signed-off-by: PeganovAnton <[email protected]> * Add new test and make minor changes Signed-off-by: PeganovAnton <[email protected]> * Fix repacked file names Signed-off-by: PeganovAnton <[email protected]> * Add assertion error Signed-off-by: PeganovAnton <[email protected]> * Fix minor bug in regex Signed-off-by: PeganovAnton <[email protected]> * Improve Jenkins command Signed-off-by: PeganovAnton <[email protected]> * Fix code style Signed-off-by: PeganovAnton <[email protected]> * fix: add name to Jenkins stage Signed-off-by: PeganovAnton <[email protected]> * fix: add steps block to Jenkins stage Signed-off-by: PeganovAnton <[email protected]> * fix: move nemo_experiments removal to post section Previously I encountered a weird error + rm -rf nemo_experiments rm: cannot remove 'nemo_experiments': Directory not empty script returned exit code 1 And suspect that this could be because two parallel stages try to remove same directory simultaneously.
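The `nemo_experiments` failure just described is a classic cleanup race: two parallel CI stages delete the same directory at once, and whichever `rm -rf` loses sees entries vanish or reappear mid-walk. A minimal Python sketch of the two usual remedies, best-effort removal or per-stage scratch directories; the helper names are illustrative and not part of NeMo or its Jenkins pipeline:

```python
import shutil
from pathlib import Path

def cleanup_scratch(path: str) -> None:
    """Remove a scratch directory without failing if a concurrent job
    is deleting (or has already deleted) parts of it.

    A plain `rm -rf` fails with "Directory not empty" when another
    process removes or recreates entries mid-deletion; ignore_errors
    makes the cleanup best-effort instead of fatal.
    """
    shutil.rmtree(path, ignore_errors=True)

def stage_scratch_dir(base: str, stage_name: str) -> Path:
    """Safer still: give each parallel stage its own directory so
    stages never race on the same path (the naming scheme is an
    assumption for illustration)."""
    d = Path(base) / f"nemo_experiments_{stage_name}"
    d.mkdir(parents=True, exist_ok=True)
    return d
```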
Signed-off-by: PeganovAnton <[email protected]> * Turn off cache usage in Jenkins for token classification models Signed-off-by: PeganovAnton <[email protected]> * Stop pickling features Signed-off-by: PeganovAnton <[email protected]> * Reference webdataset in docs Signed-off-by: PeganovAnton <[email protected]> * Make multiple minor improvements Signed-off-by: PeganovAnton <[email protected]> * Add parameters tokens_in_batch, repack to documentation Signed-off-by: PeganovAnton <[email protected]> * Refactoring and improving readability Signed-off-by: PeganovAnton <[email protected]> * Make tar_shuffle_n optional parameter Signed-off-by: PeganovAnton <[email protected]> * Fix path to label vocab files Signed-off-by: PeganovAnton <[email protected]> * Fix metadata label vocab key Signed-off-by: PeganovAnton <[email protected]> * Create for_nemo directory Signed-off-by: PeganovAnton <[email protected]> * Fix tar_shuffle_n default value Signed-off-by: PeganovAnton <[email protected]> * First round of review fixes Signed-off-by: PeganovAnton <[email protected]> * Return tokens_in_batch default value Signed-off-by: PeganovAnton <[email protected]> * Remove duplicate parameters in `CommonDatasetParameters` Signed-off-by: PeganovAnton <[email protected]> * Remove duplicate parameters in config Signed-off-by: PeganovAnton <[email protected]> * Refactor user interface Signed-off-by: PeganovAnton <[email protected]> * fix: add missing parameter in calling setting dataloader up Signed-off-by: PeganovAnton <[email protected]> * fix: replace data config with model config Signed-off-by: PeganovAnton <[email protected]> * fix: typo in config parameter name Signed-off-by: PeganovAnton <[email protected]> * fix: location of label ids parameters in config Signed-off-by: PeganovAnton <[email protected]> * fix: transforming not first legacy data config Signed-off-by: PeganovAnton <[email protected]> * fix: num_samples can be negative Signed-off-by: PeganovAnton <[email protected]> * fix: create directory for nemo ids files Signed-off-by: PeganovAnton <[email protected]> * fix: remove unremoved with_label Signed-off-by: PeganovAnton <[email protected]> * fix: features contain ids if loaded from pickle Signed-off-by: PeganovAnton <[email protected]> * Fix kwargs parameters Signed-off-by: PeganovAnton <[email protected]> * Add label setting for testing case Signed-off-by: PeganovAnton <[email protected]> * Fix: change parameter location in config Signed-off-by: PeganovAnton <[email protected]> * Fix: transform legacy config in init Signed-off-by: PeganovAnton <[email protected]> * Fix: make minor improvement in checking config Signed-off-by: PeganovAnton <[email protected]> * fix: check label ids for None before checking pad label id Signed-off-by: PeganovAnton <[email protected]> * fix: set labels when restoring Signed-off-by: PeganovAnton <[email protected]> * fix: place where label ids are taken Signed-off-by: PeganovAnton <[email protected]> * Fix minor bug Signed-off-by: PeganovAnton <[email protected]> * fix: register artifacts in set_label_ids Signed-off-by: PeganovAnton <[email protected]> * fix: perform checking only if label ids are not set Signed-off-by: PeganovAnton <[email protected]> * fix: set label_ids_are_set Signed-off-by: PeganovAnton <[email protected]> * Fix using of dataset in create tarred dataset Signed-off-by: PeganovAnton <[email protected]> * fix: manipulate label ids if fragment_idx is zero Signed-off-by: PeganovAnton <[email protected]> * fix: remove directory correctly 
Signed-off-by: PeganovAnton <[email protected]> * fix: vocab file names Signed-off-by: PeganovAnton <[email protected]> * fix: vocab file names Signed-off-by: PeganovAnton <[email protected]> * Add debug print Signed-off-by: PeganovAnton <[email protected]> * Add directories for cache and label info Signed-off-by: PeganovAnton <[email protected]> * Minor fixes Signed-off-by: PeganovAnton <[email protected]> * Minor fix Signed-off-by: PeganovAnton <[email protected]> * Minor fix Signed-off-by: PeganovAnton <[email protected]> * Improve debug config Signed-off-by: PeganovAnton <[email protected]> * Create missing directories Signed-off-by: PeganovAnton <[email protected]> * Improve feature pkl file name Signed-off-by: PeganovAnton <[email protected]> * WORKING VERSION OF VOCAB CONFIG Signed-off-by: PeganovAnton <[email protected]> * Improve vocab file extraction Signed-off-by: PeganovAnton <[email protected]> * Fix config Signed-off-by: PeganovAnton <[email protected]> * Improve vocab file extraction Signed-off-by: PeganovAnton <[email protected]> * fix register artifact calls Signed-off-by: PeganovAnton <[email protected]> * fix: add class_labels to legacy fixing Signed-off-by: PeganovAnton <[email protected]> * fix: add missing method Signed-off-by: PeganovAnton <[email protected]> * Add support for checkpoints without class labels artifact Signed-off-by: PeganovAnton <[email protected]> * fix: add missing return values to function Signed-off-by: PeganovAnton <[email protected]> * fix saving label ids in creation of tarred dataset Signed-off-by: PeganovAnton <[email protected]> * fix: adjust tarred dataset consistency check Signed-off-by: PeganovAnton <[email protected]> * fix: consistency check call Signed-off-by: PeganovAnton <[email protected]> * Try checking labels every time dataloader is set Signed-off-by: PeganovAnton <[email protected]> * fi…
* cache_hf (#3406) Signed-off-by: ekmb <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Learning annealing scheduler fix (#3400) Signed-off-by: MaximumEntropy <[email protected]> Signed-off-by: Bonham79 <[email protected]> * T5 Pre-training in NeMo using Megatron (#3036) * add vocab_file and merge_file to megatron init Signed-off-by: ericharper <[email protected]> * add forward Signed-off-by: ericharper <[email protected]> * add train loss Signed-off-by: ericharper <[email protected]> * add optimizer Signed-off-by: ericharper <[email protected]> * add exp_manager Signed-off-by: ericharper <[email protected]> * multi-gpu is working Signed-off-by: ericharper <[email protected]> * adding val loop Signed-off-by: ericharper <[email protected]> * style Signed-off-by: ericharper <[email protected]> * adding val loop Signed-off-by: ericharper <[email protected]> * fix ranks Signed-off-by: ericharper <[email protected]> * fix model parallel checkpoint saving Signed-off-by: ericharper <[email protected]> * fix _del_model Signed-off-by: ericharper <[email protected]> * Initial megatron dataset port Signed-off-by: MaximumEntropy <[email protected]> * added megatron batch sampler Signed-off-by: ericharper <[email protected]> * try to fix num steps Signed-off-by: ericharper <[email protected]> * add wandb to config Signed-off-by: ericharper <[email protected]> * log lr Signed-off-by: ericharper <[email protected]> * add warmup ratio to config Signed-off-by: ericharper <[email protected]> * update configs Signed-off-by: ericharper <[email protected]> * update configs Signed-off-by: ericharper <[email protected]> * Fix merge conflicts Signed-off-by: MaximumEntropy <[email protected]> * add cpu init to args Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * License fixes and megatron model porting Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * More fixes to import from nemo rather than megatron Signed-off-by: MaximumEntropy <[email protected]> * Fix circular imports Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Revert config file Signed-off-by: MaximumEntropy <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * Restructure further to avoid circular imports Signed-off-by: MaximumEntropy <[email protected]> * add Makefile Signed-off-by: ericharper <[email protected]> * Add megatron modules Signed-off-by: MaximumEntropy <[email protected]> * Add data makefile Signed-off-by: MaximumEntropy <[email protected]> * add license Signed-off-by: ericharper <[email protected]> * Port from latest megatron Signed-off-by: MaximumEntropy <[email protected]> * update cfg Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * add _del_model_without_trainer Signed-off-by: ericharper <[email protected]> * add data preprocessing script Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * use apex mpu Signed-off-by: ericharper <[email protected]> * replace print_rank_0 with nemo utils logging Signed-off-by: ericharper <[email protected]> * use apex mpu Signed-off-by: ericharper <[email protected]> * use apex mpu Signed-off-by: ericharper <[email protected]> * add use_cpu_initialization Signed-off-by: ericharper <[email protected]> * fixing autoresume in progress Signed-off-by: 
ericharper <[email protected]> * properly removing last checkpoint Signed-off-by: ericharper <[email protected]> * log consumed samples Signed-off-by: ericharper <[email protected]> * fix mp autoresume Signed-off-by: ericharper <[email protected]> * Megatron GPT training with NeMo tokenizers (#2818) * Update files from megatron repo Signed-off-by: MaximumEntropy <[email protected]> * Remove non NLP data related files from megatron Signed-off-by: MaximumEntropy <[email protected]> * Merge megatron and nemo tokenizers Signed-off-by: MaximumEntropy <[email protected]> * Remove get_tokenizer() calls from gpt model Signed-off-by: MaximumEntropy <[email protected]> * Update tokenizer yaml config Signed-off-by: MaximumEntropy <[email protected]> * add NLPSaveRestoreConnector Signed-off-by: ericharper <[email protected]> * add todo Signed-off-by: ericharper <[email protected]> * update config Signed-off-by: ericharper <[email protected]> * make init_method_std configurable Signed-off-by: ericharper <[email protected]> * make gpu init work by setting random seed earlier Signed-off-by: ericharper <[email protected]> * fix gpu init after removing debug print in mpu Signed-off-by: ericharper <[email protected]> * add fused_adam Signed-off-by: ericharper <[email protected]> * check ds is not none before logging len Signed-off-by: ericharper <[email protected]> * set fp16 arg to true and fix enum conflict Signed-off-by: ericharper <[email protected]> * make fp16 arg configurable Signed-off-by: ericharper <[email protected]> * add grad clip from megatron Signed-off-by: ericharper <[email protected]> * Linear warmup with cosine annealing and constant holding (#2846) * Testing cosine schedule Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Fixes Signed-off-by: MaximumEntropy <[email protected]> * More fixes Signed-off-by: MaximumEntropy <[email protected]> * update config for constant steps in schedule Signed-off-by: ericharper <[email protected]> * temporarily import enum from megatron Signed-off-by: ericharper <[email protected]> * add grad clip for fp32 Signed-off-by: ericharper <[email protected]> * update check for _del_model_without_trainer Signed-off-by: ericharper <[email protected]> * updating restore for model parallel Signed-off-by: ericharper <[email protected]> * add predict script Signed-off-by: ericharper <[email protected]> * update test iters Signed-off-by: ericharper <[email protected]> * add barrier Signed-off-by: ericharper <[email protected]> * return if clip_val is 0 or None Signed-off-by: ericharper <[email protected]> * when using amp clip grads after they are unscaled Signed-off-by: ericharper <[email protected]> * make native amp scaler hyperparams configurable Signed-off-by: ericharper <[email protected]> * (1) nvfuser, (2) amp-casting decoration (#2894) * (1) nvfuser, (2) amp-casting decoration Signed-off-by: Sangkug Lym <[email protected]> * support bf16 Signed-off-by: Sangkug Lym <[email protected]> * update package info Signed-off-by: ericharper <[email protected]> * add set device to constructor Signed-off-by: ericharper <[email protected]> * set_device in constructor Signed-off-by: ericharper <[email protected]> * [BigNLP] Remove megatron-lm dependency. 
(#2910) * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * update megatron_init Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * update process batch Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * add megatron clip_grad Signed-off-by: ericharper <[email protected]> * trying to resolve circular import error Signed-off-by: ericharper <[email protected]> * rename file Signed-off-by: ericharper <[email protected]> * remove non-gpt models and datasets from __init__ files Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set_device in constructor Signed-off-by: ericharper <[email protected]> * clean config Signed-off-by: ericharper <[email protected]> * update MegatronDataset Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16 Signed-off-by: ericharper <[email protected]> * rename to fused_fp16 Signed-off-by: ericharper <[email protected]> * add fused_fp16 arg to LayerNorm calls Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix import Signed-off-by: ericharper <[email protected]> * update arg Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * Adding complete method to MegatronGPTModel (#2935) Signed-off-by: Oleksii Kuchaiev <[email protected]> * make ffn_hidden_size mandatory Signed-off-by: ericharper <[email protected]> * Manually migrating timing of step into branch (#2937) * 1. Manually migrating timing of step into branch. Signed-off-by: Micha Livne <[email protected]> * 1. Updated file name and content. Signed-off-by: Micha Livne <[email protected]> * 1. Updated to latest code. 
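For reference, the "Linear warmup with cosine annealing and constant holding" schedule from #2846 above can be written as a standalone function. This is an illustrative re-implementation of the schedule as described; the argument names are assumptions, not NeMo's actual scheduler config keys:

```python
import math

def warmup_cosine_hold_lr(step, max_steps, warmup_steps, constant_steps,
                          max_lr, min_lr=0.0):
    """Linear warmup -> cosine annealing -> constant hold at min_lr."""
    decay_steps = max_steps - warmup_steps - constant_steps
    if step < warmup_steps:
        # Linear ramp from 0 up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < warmup_steps + decay_steps:
        # Cosine decay from max_lr down to min_lr.
        progress = (step - warmup_steps) / decay_steps
        return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
    # Hold constant at min_lr for the remaining constant_steps.
    return min_lr
```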
Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> * remove unused imports Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * check fused_fp16 and fused_bf16 are not both True Signed-off-by: ericharper <[email protected]> * update predict script for model parallel .nemo Signed-off-by: ericharper <[email protected]> * typo Signed-off-by: ericharper <[email protected]> * typo Signed-off-by: ericharper <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> * NVfuser (#2943) * activation checkpoint recompute Signed-off-by: Sangkug Lym <[email protected]> * selective nvfuser setup * Megatron gpt bfloat support (#2926) * Save/restore fix Signed-off-by: MaximumEntropy <[email protected]> * Another merge Signed-off-by: MaximumEntropy <[email protected]> * Bf16 args in init Signed-off-by: MaximumEntropy <[email protected]> * Set precision Signed-off-by: MaximumEntropy <[email protected]> * Remove debug stuff Signed-off-by: MaximumEntropy <[email protected]> * add bf16 casting decorator Signed-off-by: Sangkug Lym <[email protected]> * Bfloat layernorm propagation Signed-off-by: MaximumEntropy <[email protected]> * activation checkpoint recompute Signed-off-by: Sangkug Lym <[email protected]> * selective nvfuser setup * More arg removal Signed-off-by: MaximumEntropy <[email protected]> * Remove BERTDataset Signed-off-by: MaximumEntropy <[email protected]> * update to latest apex and patch transformer autocast Signed-off-by: ericharper <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: ericharper <[email protected]> * don't set jit for bf16 Signed-off-by: ericharper <[email protected]> * replace apex.mpu Signed-off-by: ericharper <[email protected]> * fix grad clip Signed-off-by: ericharper <[email protected]> * NVFuser fixes (#2951) * Fuser fixes Signed-off-by: MaximumEntropy <[email protected]> * Remove dummy handler Signed-off-by: MaximumEntropy <[email protected]> * Remove PTL plugin based logic for fusion Signed-off-by: MaximumEntropy <[email protected]> * remove duplicated file Signed-off-by: ericharper <[email protected]> * T5 model initial changes Signed-off-by: MaximumEntropy <[email protected]> * typo (#2960) Signed-off-by: ericharper <[email protected]> * [BigNLP] Script to convert GPT checkpoint to .nemo (#2958) * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * remove args in progress Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * add load_fused_kernels Signed-off-by: ericharper <[email protected]> * update megatron_init Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * add fused kernels Signed-off-by: ericharper <[email protected]> * update process batch 
Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * remove erroneous import Signed-off-by: ericharper <[email protected]> * add megatron clip_grad Signed-off-by: ericharper <[email protected]> * trying to resolve circular import error Signed-off-by: ericharper <[email protected]> * rename file Signed-off-by: ericharper <[email protected]> * remove non-gpt models and datasets from __init__ files Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set device in constructorfor gpu init Signed-off-by: ericharper <[email protected]> * set_device in constructor Signed-off-by: ericharper <[email protected]> * clean config Signed-off-by: ericharper <[email protected]> * update MegatronDataset Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * clean up MegatronModule Signed-off-by: ericharper <[email protected]> * rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16 Signed-off-by: ericharper <[email protected]> * rename to fused_fp16 Signed-off-by: ericharper <[email protected]> * add fused_fp16 arg to LayerNorm calls Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix arg name Signed-off-by: ericharper <[email protected]> * fix import Signed-off-by: ericharper <[email protected]> * update arg Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * skip warmup default to True Signed-off-by: ericharper <[email protected]> * Adding complete method to MegatronGPTModel (#2935) Signed-off-by: Oleksii Kuchaiev <[email protected]> * make ffn_hidden_size mandatory Signed-off-by: ericharper <[email protected]> * Manually migrating timing of step into branch (#2937) * 1. Manually migrating timing of step into branch. Signed-off-by: Micha Livne <[email protected]> * 1. Updated file name and content. Signed-off-by: Micha Livne <[email protected]> * 1. Updated to latest code. 
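The grad-clipping commits above ("add grad clip from megatron", "when using amp clip grads after they are unscaled") follow the standard PyTorch AMP recipe: with a `GradScaler`, gradients must be unscaled before the norm is computed, otherwise the clip threshold is applied to scaled values. A minimal sketch of that pattern; the training-step wrapper itself is illustrative, not NeMo code:

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def training_step(model, batch, optimizer, max_norm=1.0):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)            # assumes the model returns a scalar loss
    scaler.scale(loss).backward()      # gradients are scaled at this point
    scaler.unscale_(optimizer)         # unscale BEFORE clipping, as the fix above does
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    scaler.step(optimizer)             # skips the step if inf/nan gradients were found
    scaler.update()
```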
Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> * remove unused imports Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * remove unused import Signed-off-by: ericharper <[email protected]> * check fused_fp16 and fused_bf16 are not both True Signed-off-by: ericharper <[email protected]> * update predict script for model parallel .nemo Signed-off-by: ericharper <[email protected]> * typo Signed-off-by: ericharper <[email protected]> * add script to convert .ckpt to .nemo Signed-off-by: ericharper <[email protected]> * in progress Signed-off-by: ericharper <[email protected]> * update Signed-off-by: ericharper <[email protected]> * convert mp checkpoints to nemo Signed-off-by: ericharper <[email protected]> * update help Signed-off-by: ericharper <[email protected]> * add safeguard for model parallel save_to Signed-off-by: ericharper <[email protected]> * adjust NLPModel save_to to be safer for model parallel Signed-off-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> * [BigNLP] Update GPT evaluation to work with tensor model parallel (#2959) * in progress Signed-off-by: ericharper <[email protected]> * update args Signed-off-by: ericharper <[email protected]> * add request dataset Signed-off-by: ericharper <[email protected]> * tokenize request Signed-off-by: ericharper <[email protected]> * in progress Signed-off-by: ericharper <[email protected]> * able to run Signed-off-by: ericharper <[email protected]> * reduce logits Signed-off-by: ericharper <[email protected]> * capture response Signed-off-by: ericharper <[email protected]> * squeeze and unsqueeze Signed-off-by: ericharper <[email protected]> * handle non model parallel case Signed-off-by: ericharper <[email protected]> * clean imports Signed-off-by: ericharper <[email protected]> * add file Signed-off-by: ericharper <[email protected]> * convert logits to log_probs Signed-off-by: Oleksii Kuchaiev <[email protected]> * rename logits to log_probs Signed-off-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> * More changes Signed-off-by: MaximumEntropy <[email protected]> * Missing import Signed-off-by: MaximumEntropy <[email protected]> * Tokenizer fixes and adafactor Signed-off-by: MaximumEntropy <[email protected]> * Add adafactor Signed-off-by: MaximumEntropy <[email protected]> * Add training and conf scripts Signed-off-by: MaximumEntropy <[email protected]> * Add megatron t5 model Signed-off-by: MaximumEntropy <[email protected]> * t5 config to fp32 Signed-off-by: MaximumEntropy <[email protected]> * [BigNLP] Remove fused kernel code instead use Apex (#2984) * remove fused_kernels Signed-off-by: ericharper <[email protected]> * remove fused_kernels Signed-off-by: ericharper <[email protected]> * remove fused layer norm and fused softmax and use apex instead Signed-off-by: ericharper <[email protected]> * update imports Signed-off-by: ericharper <[email protected]> * remove comment Signed-off-by: ericharper <[email protected]> * use apex enums Signed-off-by: ericharper <[email protected]> * use apex enums Signed-off-by: ericharper <[email protected]> * Timer with sliding window (#3002) Co-authored-by: Micha Livne 
<[email protected]> * check for rank zero Signed-off-by: ericharper <[email protected]> * Remove ict dataset import Signed-off-by: MaximumEntropy <[email protected]> * Remove fused kernels Signed-off-by: MaximumEntropy <[email protected]> * style fix Signed-off-by: ericharper <[email protected]> * fix consumed_samples when resuming Signed-off-by: ericharper <[email protected]> * T5 consumed samples fix Signed-off-by: MaximumEntropy <[email protected]> * Remove megatron dep Signed-off-by: MaximumEntropy <[email protected]> * Change checkpoint filename format Signed-off-by: MaximumEntropy <[email protected]> * Log consumed samples in T5 Signed-off-by: MaximumEntropy <[email protected]> * T5 lr scheduler Signed-off-by: MaximumEntropy <[email protected]> * Checkpoint conversion and data preproc updates for t5 Signed-off-by: MaximumEntropy <[email protected]> * Denoising eval Signed-off-by: MaximumEntropy <[email protected]> * Clean up denoising example to explicitly provide mask positions Signed-off-by: MaximumEntropy <[email protected]> * Better logging of results Signed-off-by: MaximumEntropy <[email protected]> * Better printing of results Signed-off-by: MaximumEntropy <[email protected]> * Minor changes Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * properly removing last checkpoint Signed-off-by: ericharper <[email protected]> * add todo Signed-off-by: ericharper <[email protected]> * add predict script Signed-off-by: ericharper <[email protected]> * T5 model initial changes Signed-off-by: MaximumEntropy <[email protected]> * Add adafactor Signed-off-by: MaximumEntropy <[email protected]> * Add training and conf scripts Signed-off-by: MaximumEntropy <[email protected]> * Add megatron t5 model Signed-off-by: MaximumEntropy <[email protected]> * t5 config to fp32 Signed-off-by: MaximumEntropy <[email protected]> * Remove fused kernels Signed-off-by: MaximumEntropy <[email protected]> * fix consumed_samples when resuming Signed-off-by: ericharper <[email protected]> * T5 consumed samples fix Signed-off-by: MaximumEntropy <[email protected]> * Remove megatron dep Signed-off-by: MaximumEntropy <[email protected]> * Change checkpoint filename format Signed-off-by: MaximumEntropy <[email protected]> * Log consumed samples in T5 Signed-off-by: MaximumEntropy <[email protected]> * T5 lr scheduler Signed-off-by: MaximumEntropy <[email protected]> * Checkpoint conversion and data preproc updates for t5 Signed-off-by: MaximumEntropy <[email protected]> * Denoising eval Signed-off-by: MaximumEntropy <[email protected]> * Clean up denoising example to explicitly provide mask positions Signed-off-by: MaximumEntropy <[email protected]> * Better logging of results Signed-off-by: MaximumEntropy <[email protected]> * Minor changes Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Merge main into megatron_t5 Signed-off-by: MaximumEntropy <[email protected]> * Dataset prerproc script Signed-off-by: MaximumEntropy <[email protected]> * Remove biencoder file Signed-off-by: MaximumEntropy <[email protected]> * Remove another unused file Signed-off-by: MaximumEntropy <[email protected]> * Remove preprocess script since it has moved Signed-off-by: MaximumEntropy <[email protected]> * Remove ICT dataset Signed-off-by: MaximumEntropy <[email 
protected]> * Remove orqa dataset Signed-off-by: MaximumEntropy <[email protected]> * Remove realm datase Signed-off-by: MaximumEntropy <[email protected]> * More file removing Signed-off-by: MaximumEntropy <[email protected]> * Fix 2 files Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Rename checkpoint fname Signed-off-by: MaximumEntropy <[email protected]> * Loss averaging fixes in t5 Signed-off-by: MaximumEntropy <[email protected]> * Minor changes Signed-off-by: MaximumEntropy <[email protected]> * add megatron gpt pretraining Signed-off-by: ericharper <[email protected]> Signed-off-by: MaximumEntropy <[email protected]> * Remove weight decay stuff Signed-off-by: MaximumEntropy <[email protected]> * Training script update for PTL 1.5 Signed-off-by: MaximumEntropy <[email protected]> * Update grad clip Signed-off-by: MaximumEntropy <[email protected]> * Update config Signed-off-by: MaximumEntropy <[email protected]> * Add barrier Signed-off-by: MaximumEntropy <[email protected]> * Style fixes and adding more stuff Signed-off-by: MaximumEntropy <[email protected]> * Missed merge conflict fix Signed-off-by: MaximumEntropy <[email protected]> * Unittest fixes Signed-off-by: MaximumEntropy <[email protected]> * Style fix Signed-off-by: MaximumEntropy <[email protected]> * Inference changes Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Fix reinstall script Signed-off-by: MaximumEntropy <[email protected]> * T5 CI tests Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Minor fixes Signed-off-by: MaximumEntropy <[email protected]> * Minor fixes Signed-off-by: MaximumEntropy <[email protected]> * Tokenizer arg fix Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Helpers fix Signed-off-by: MaximumEntropy <[email protected]> * Style fix Signed-off-by: MaximumEntropy <[email protected]> * PR review changes Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Refactor bert dataset stuff Signed-off-by: MaximumEntropy <[email protected]> * Fix typo Signed-off-by: MaximumEntropy <[email protected]> * Fix request dataset variable Signed-off-by: MaximumEntropy <[email protected]> * Fix sched params in CI test Signed-off-by: MaximumEntropy <[email protected]> * Change to kwargs and Jenkins test for inference Signed-off-by: MaximumEntropy <[email protected]> * PR review related changes Signed-off-by: MaximumEntropy <[email protected]> * More fixes Signed-off-by: MaximumEntropy <[email protected]> * Test helper building Signed-off-by: MaximumEntropy <[email protected]> * Restore helper compilation everywhere Signed-off-by: MaximumEntropy <[email protected]> * Fix PR comments Signed-off-by: MaximumEntropy <[email protected]> * PR comments Signed-off-by: MaximumEntropy <[email protected]> * Add docstring to additional_special_tokens Signed-off-by: MaximumEntropy <[email protected]> * Improve docstring Signed-off-by: MaximumEntropy <[email protected]> * Fix resume from checkpoint path Signed-off-by: MaximumEntropy <[email protected]> * Fix for TP>1 Signed-off-by: MaximumEntropy <[email protected]> * Remove fused fp16 and bf16 args Signed-off-by: MaximumEntropy <[email protected]> * Add missed file Signed-off-by: 
MaximumEntropy <[email protected]> * Learning annealing scheduler fix Signed-off-by: MaximumEntropy <[email protected]> * Change default optim and scheduler to adam Signed-off-by: MaximumEntropy <[email protected]> * dummy for CI restart Signed-off-by: MaximumEntropy <[email protected]> * Remove constant steps after switch to adam Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: ericharper <[email protected]> Co-authored-by: Sangkug Lym <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Updates on ASR with diarization util files (#3359) * Initial commit Signed-off-by: Taejin Park <[email protected]> * Update LM part and multiscale part in README. Signed-off-by: Taejin Park <[email protected]> * Removed redundant parts Signed-off-by: Taejin Park <[email protected]> * modified example script Signed-off-by: Taejin Park <[email protected]> * Revised doc strings Signed-off-by: Taejin Park <[email protected]> * Changed paths_to_manifest.py script Signed-off-by: Taejin Park <[email protected]> * Reflected PR comments and revised tutorials Signed-off-by: Taejin Park <[email protected]> * Added ASR models and kenlm installation Signed-off-by: [email protected] * Added ASR models and kenlm installation Signed-off-by: [email protected] Signed-off-by: Taejin Park <[email protected]> * Changed docstrings and style fix Signed-off-by: Taejin Park <[email protected]> * Fixed unused import and vars Signed-off-by: Taejin Park <[email protected]> * Added LM part in ASR_diar tutorial. Signed-off-by: Taejin Park <[email protected]> Co-authored-by: fayejf <[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Bonham79 <[email protected]> * update docs and replace speakernet with titanet in tutorials (#3405) * update docs and replace speakernet with titanet in tutorials Signed-off-by: nithinraok <[email protected]> * update dataset usage description Signed-off-by: nithinraok <[email protected]> * updated based on comments Signed-off-by: nithinraok <[email protected]> * spell fix Signed-off-by: nithinraok <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update Mixer-TTS, FastPitch and TTSDataset (#3366) * update tts dataset, fastpitch and mixer tts Signed-off-by: Oktai Tatanov <[email protected]> * fix style and notebooks Signed-off-by: Oktai Tatanov <[email protected]> * update notebooks Signed-off-by: Oktai Tatanov <[email protected]> * update mixer-tts, mixer-tts-x and fastpitch configs Signed-off-by: Oktai Tatanov <[email protected]> * update notebooks and configs Signed-off-by: Oktai Tatanov <[email protected]> * update configs Signed-off-by: Oktai Tatanov <[email protected]> * add links, update README, fix tutorials Signed-off-by: Oktai Tatanov <[email protected]> * fix style Signed-off-by: Oktai Tatanov <[email protected]> * remove unnecessary code from fastpitch model Signed-off-by: Oktai Tatanov <[email protected]> * update jenkinsfile and fastpitch typo fix Signed-off-by: Oktai Tatanov <[email protected]> * fix configs Signed-off-by: Oktai Tatanov <[email protected]> * revert jenkinsfile Signed-off-by: Oktai Tatanov <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Asr fr (#3404) * Pushing WFST_tutorial for open draft. (Still need to review collab code. 
Signed-off-by: tbartley94 <[email protected]> * Checked tutorial code for WFST_Tutorial is properly functioning. Also included some formatting edits. Signed-off-by: tbartley94 <[email protected]> * Responding to editorial comments for WFST_tutorial Signed-off-by: tbartley94 <[email protected]> * Added images to folder and wrote README for tutorials Signed-off-by: tbartley94 <[email protected]> * Few more editorial changes to explain permutations in classification. Signed-off-by: tbartley94 <[email protected]> * Updated tutorials documentation page. Signed-off-by: tbartley94 <[email protected]> * Forgot links for README Signed-off-by: tbartley94 <[email protected]> * TOC links were dead Signed-off-by: tbartley94 <[email protected]> * More dead links to fix. Signed-off-by: tbartley94 <[email protected]> * removing Colab install and appending a warning instead. Signed-off-by: tbartley94 <[email protected]> * Update WFST_Tutorial.ipynb Signed-off-by: tbartley94 <[email protected]> * Adding pretrained French models to ctc_bpe_models and rnnt_bpe_models available models listing Signed-off-by: tbartley94 <[email protected]> * Updating ctc_bpe_models import for updated Fr Conformer Ctc version. Signed-off-by: tbartley94 <[email protected]> * Added new French ASR models to documentation and imports: conformer transducer and conformer ctc trained without hyphenization. Signed-off-by: tbartley94 <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Co-authored-by: Yang Zhang <[email protected]> Signed-off-by: Bonham79 <[email protected]> * [fix] for resume training on SLURM multi-node multi-gpu (#3374) * [fix] for resume training on SLURM multi-node multi-gpu On SLURM, resuming training in a multi-node multi-GPU setting fails: when `LOCAL_RANK` is undefined, `is_global_rank_zero()` returns true on all processes that run on node 0. In this case `exp_manager.py` (https://github.com/NVIDIA/NeMo/blob/f83b2c5524a787be21ffea170850c4b5486eac2b/nemo/utils/exp_manager.py#L446) creates multiple `run_*` folders, and eventually leads to failure (missing files because other processes have moved them already). Checking also for `SLURM_PROCID` solves this issue, as that environment variable contains the global rank id. Signed-off-by: Iztok Lebar Bajec <[email protected]> * Update get_rank.py In a SLURM environment, return the SLURM global rank (SLURM_PROCID); fall back to the previous behaviour otherwise.
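A sketch of the rank-resolution logic this fix describes; this is not NeMo's exact `get_rank.py`, and it also reflects the follow-up commit just below, where truthiness-based conditionals broke when a rank variable held the value 0:

```python
import os

def get_global_rank() -> int:
    """Resolve the global rank, preferring SLURM_PROCID under SLURM.

    On SLURM, LOCAL_RANK may be unset, so every process on node 0 would
    otherwise look like global rank 0; SLURM_PROCID always carries the
    true global rank. Note the explicit `is not None` checks: a plain
    truthiness test fails when the rank is 0, which is the bug fixed
    in the follow-up commit below.
    """
    slurm_rank = os.environ.get("SLURM_PROCID")
    if slurm_rank is not None:
        return int(slurm_rank)
    rank = os.environ.get("RANK")
    if rank is not None:
        return int(rank)
    return 0  # single-process fallback
```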
Signed-off-by: Iztok Lebar Bajec <[email protected]> * style Signed-off-by: Jason <[email protected]> * Solved bug when either RANK or SLURM_PROCID return 0, and conditionals return False Signed-off-by: Iztok Lebar Bajec <[email protected]> Co-authored-by: Jason <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Fix running token classification in multinode setting (#3413) * fix: master device check Signed-off-by: PeganovAnton <[email protected]> * Fix bug with use_cache parameter Signed-off-by: PeganovAnton <[email protected]> * create pickled features file regardless of value of use_cache Signed-off-by: PeganovAnton <[email protected]> * Improve docs Signed-off-by: PeganovAnton <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Fix order of lang checking to ignore input langs (#3417) * Fix order of lang checking Signed-off-by: MaximumEntropy <[email protected]> * Fix == error Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: PeganovAnton <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Refactor ASR Examples Directory (#3392) * Begin refactor of ASR files Signed-off-by: smajumdar <[email protected]> * Update jenkins paths for ASR Signed-off-by: smajumdar <[email protected]> * Update speech_to_text_ctc Signed-off-by: smajumdar <[email protected]> * Update speech_to_text_ctc_bpe Signed-off-by: smajumdar <[email protected]> * Lowercase all directories Signed-off-by: smajumdar <[email protected]> * Fix RNNT num_workers Signed-off-by: smajumdar <[email protected]> * Fix RNNT num_workers Signed-off-by: smajumdar <[email protected]> Signed-off-by: Bonham79 <[email protected]> * NMT MIM mean variance fix (#3385) * 1. Updated default NMT bottleneck encoder to be non-autoregressive Signed-off-by: Micha Livne <[email protected]> * 1. Fixed mean/variance being tied when latent and hidden dimensions are the same. Signed-off-by: Micha Livne <[email protected]> * 1. Debugging. Signed-off-by: Micha Livne <[email protected]> * 1. Fixed style.
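On the mean/variance fix in #3385 above: the bug class is accidentally reusing one projection for both statistics when the latent and hidden dimensions happen to match. A guess at the safe pattern, with illustrative module names; the actual NeMo fix may differ in detail:

```python
import torch
import torch.nn as nn

class LatentHeads(nn.Module):
    """Project hidden states to a latent mean and log-variance.

    Two distinct Linear modules are created even when hidden_size ==
    latent_size, so the mean and variance projections can never end up
    sharing ("tied to") the same weights, which is the failure mode
    described in #3385 above.
    """
    def __init__(self, hidden_size: int, latent_size: int):
        super().__init__()
        self.to_mean = nn.Linear(hidden_size, latent_size)
        self.to_logvar = nn.Linear(hidden_size, latent_size)  # separate parameters

    def forward(self, hidden: torch.Tensor):
        return self.to_mean(hidden), self.to_logvar(hidden)
```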
Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * update to 21.12 (#3424) Signed-off-by: ericharper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Working around Pytorch exporter issue with expand() (#3422) Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * update copyright (#3426) Signed-off-by: ericharper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * remove apex (#3428) Signed-off-by: ekmb <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * vad infer refactor (#3394) * vad infer refactor Signed-off-by: fayejf <[email protected]> * remove duplicate in write_long_audio_manifest Signed-off-by: fayejf <[email protected]> * remove duplicate in script vad_overlap_posterior Signed-off-by: fayejf <[email protected]> * style fix Signed-off-by: fayejf <[email protected]> * fix nb Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * fix Signed-off-by: fayejf <[email protected]> * small fixes Signed-off-by: fayejf <[email protected]> * reflect taejin's review Signed-off-by: fayejf <[email protected]> * update tutorial about rename Signed-off-by: fayejf <[email protected]> * small fix Signed-off-by: fayejf <[email protected]> * merge main and fix Signed-off-by: fayejf <[email protected]> * tiny path fix Signed-off-by: fayejf <[email protected]> Signed-off-by: Bonham79 <[email protected]> * doc update for refactory (#3430) Signed-off-by: fayejf <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update LJSpeech preprocessing (#3423) * update lj speech preprocessing Signed-off-by: Oktai Tatanov <[email protected]> * update lj speech preprocessing 2 Signed-off-by: Oktai Tatanov <[email protected]> Signed-off-by: Bonham79 <[email protected]> * NMT Shared Embeddings Weights (#3340) * 1. Debugging. Signed-off-by: Micha Livne <[email protected]> * 1. Implemented encoder deocder embedding weights tie. Signed-off-by: Micha Livne <[email protected]> * 1. Fixed style. Signed-off-by: Micha Livne <[email protected]> * 1. Debugging. Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * [BigNLP] Make saving .nemo during on_train_end configurable (#3427) * make save nemo configurable on train end Signed-off-by: ericharper <[email protected]> * add warning when save_best_model is True Signed-off-by: ericharper <[email protected]> Co-authored-by: Jason <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Preprocess an entire folder of .json or .json.gz files into a single .bin and .idx file. 
(#3425) * Folder preproc Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> * Fix usless enumerate Signed-off-by: MaximumEntropy <[email protected]> * Address PR comments Signed-off-by: MaximumEntropy <[email protected]> * Style fixes Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update speaker diarization docs (#3419) * Initial commit Signed-off-by: Taejin Park <[email protected]> * Fixed minor mistakes Signed-off-by: Taejin Park <[email protected]> * Some changes regarding diarization utils Signed-off-by: Taejin Park <[email protected]> * Fixed minor typos Signed-off-by: Taejin Park <[email protected]> * Reflected PR comments Signed-off-by: Taejin Park <[email protected]> * Reflected PR comments Signed-off-by: Taejin Park <[email protected]> * Reflected addtional comments Signed-off-by: Taejin Park <[email protected]> * Changed pics and refined text Signed-off-by: Taejin Park <[email protected]> * Minor typos Signed-off-by: Taejin Park <[email protected]> * Minor change on dataset Signed-off-by: Taejin Park <[email protected]> * Minor change on dataset 2 Signed-off-by: Taejin Park <[email protected]> * Changed manifest input to yaml format Signed-off-by: Taejin Park <[email protected]> * Capitalization of titles Signed-off-by: Taejin Park <[email protected]> * Last commit Signed-off-by: Taejin Park <[email protected]> Co-authored-by: fayejf <[email protected]> Co-authored-by: Nithin Rao <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Update ContextNet models trained on more datasets (#3440) * Update ContextNet models trained on more datasets Signed-off-by: smajumdar <[email protected]> * Update ContextNet models trained on more datasets Signed-off-by: smajumdar <[email protected]> Signed-off-by: Bonham79 <[email protected]> * 1. Updated default buffer_size for TimingCallback to 1. 
(#3439) Signed-off-by: Micha Livne <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Fix bug for missing variable (#3437) Signed-off-by: MaximumEntropy <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Extending input_example() to take max batch and dimension arguments (#3429) * Extending input_example() to take max batch and dimension arguments Signed-off-by: Boris Fomitchev <[email protected]> * Fixing conformer size reconfig, extending export script, some refactoring Signed-off-by: Boris Fomitchev <[email protected]> * Addressing comments Signed-off-by: Boris Fomitchev <[email protected]> * Fixing test issue Signed-off-by: Boris Fomitchev <[email protected]> * Fixing DecoderJoint input example Signed-off-by: Boris Fomitchev <[email protected]> * Removing soon-deprecated external format option addition Signed-off-by: Boris Fomitchev <[email protected]> * Fixing indentation typo Signed-off-by: Boris Fomitchev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Byte-level Multilingual NMT (#3368) * init Signed-off-by: Abhinav Khattar <[email protected]> * style Signed-off-by: Abhinav Khattar <[email protected]> * rm debug stuff Signed-off-by: Abhinav Khattar <[email protected]> * changes Signed-off-by: Abhinav Khattar <[email protected]> * fix Signed-off-by: Abhinav Khattar <[email protected]> * fix Signed-off-by: Abhinav Khattar <[email protected]> * error fix Signed-off-by: Abhinav Khattar <[email protected]> * make spl tokens optional Signed-off-by: Abhinav Khattar <[email protected]> Co-authored-by: Micha Livne <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Asr patches (#3443) * Fix issues with num_workers for transcribe Signed-off-by: smajumdar <[email protected]> * During inference use full context of chunk Signed-off-by: smajumdar <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Updated NumPy SDE requirement (#3442) Signed-off-by: Vitaly Lavrukhin <[email protected]> Signed-off-by: Bonham79 <[email protected]> * refactor data preprocessing script (#3444) Signed-off-by: Yang Zhang <[email protected]> Signed-off-by: Bonham79 <[email protected]> * Prompt tuning loss mask fix (#3438) * Switched to calcualte loss on answer only Signed-off-by: Virginia Adams <[email protected]> * Added CI tests and unit tests for prompt tuning dataset Signed-off-by: Virginia Adams <[email protected]> * Fixed Jenkinsfile typo Signed-off-by: Virginia Adams <[email protected]> * fixed Jenkinsfile typo Signed-off-by: Virginia Adams <[email protected]> * Fixed more typos so CI tests run all the way through Signed-off-by: Virginia Adams <[email protected]> * Fixed code formatting Signed-off-by: Virginia Adams <[email protected]> * Needed to add save nemo file on train end flag to CI test Signed-off-by: Virginia Adams <[email protected]> * Added save .nemo on train end flag to example script Signed-off-by: Virginia Adams <[email protected]> Co-authored-by: Oleksii Kuchaiev <[email protected]> Signed-off-by: Bonham79 <[email protected]> * BioMegatron token classification tutorial fix to be compatible with current Megatron BERT (#3435) * fixed the tokenizer Signed-off-by: Yi Dong <[email protected]> * training is working Signed-off-by: Yi Dong <[email protected]> * fixed text Signed-off-by: Yi Dong <[email protected]> * fixed text Signed-off-by: Yi Dong <[email protected]> * working notebook Signed-off-by: Yi Dong <[email protected]> * 
style fix Signed-off-by: Yi Dong <[email protected]> * fixed text Signed-off-by: Yi Dong <[email protected]> * handles the different megatron-lm checkpoint versions Signed-off-by: Yi Dong <[email protected]> * fixed the text classification notebook Signed-off-by: Yi Dong <[email protected]> * fixed key error Signed-off-by: Yi Dong <[email protected]> * more key error Signed-off-by: Yi Dong <[email protected]> * replace the old notebooks Signed-off-by: Yi Dong <[email protected]> * register vocab to nemo file Signed-off-by: Yi Dong <[email protected]> * added the missing notebook Signed-off-by: Yi Dong <[email protected]> Co-authored-by: Eric Harper <[email protected]> Signed-off-by: Bonham79 <[email protected]>
Signed-off-by: PeganovAnton <[email protected]> * fix: vocab file names Signed-off-by: PeganovAnton <[email protected]> * fix: vocab file names Signed-off-by: PeganovAnton <[email protected]> * Add debug print Signed-off-by: PeganovAnton <[email protected]> * Add directories for cache and label info Signed-off-by: PeganovAnton <[email protected]> * Minor fixes Signed-off-by: PeganovAnton <[email protected]> * Minor fix Signed-off-by: PeganovAnton <[email protected]> * Minor fix Signed-off-by: PeganovAnton <[email protected]> * Improve debug config Signed-off-by: PeganovAnton <[email protected]> * Create missing directories Signed-off-by: PeganovAnton <[email protected]> * Improve feature pkl file name Signed-off-by: PeganovAnton <[email protected]> * WORKING VERSION OF VOCAB CONFIG Signed-off-by: PeganovAnton <[email protected]> * Improve vocab file extraction Signed-off-by: PeganovAnton <[email protected]> * Fix config Signed-off-by: PeganovAnton <[email protected]> * Improve vocab file extraction Signed-off-by: PeganovAnton <[email protected]> * fix register artifact calls Signed-off-by: PeganovAnton <[email protected]> * fix: add class_labels to legacy fixing Signed-off-by: PeganovAnton <[email protected]> * fix: add missing method Signed-off-by: PeganovAnton <[email protected]> * Add support for checkpoints without class labels artifact Signed-off-by: PeganovAnton <[email protected]> * fix: add missing return values to function Signed-off-by: PeganovAnton <[email protected]> * fix saving label ids in creation of tarred dataset Signed-off-by: PeganovAnton <[email protected]> * fix: adjust tarred dataset consistency check Signed-off-by: PeganovAnton <[email protected]> * fix: consistency check call Signed-off-by: PeganovAnton <[email protected]> * Try checking labels every time dataloader is set Signed-off-by: PeganovAnton <[email protected]> * fi…
Signed-off-by: nithinraok <[email protected]>
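
For reference, a minimal sketch of the grad-scale hysteresis idea named in #3259 above. This is not NeMo's or Megatron-LM's implementation; the class name, defaults, and `update()` interface are all illustrative. The point is only that the loss scale is lowered after `hysteresis` consecutive overflow steps rather than on the first one, so isolated fp16/bf16 spikes don't shrink the scale:

```python
# Illustrative only: hysteresis for a dynamic loss scaler (names are hypothetical).
class GradScalerWithHysteresis:
    def __init__(self, init_scale=2.0**16, growth_interval=1000,
                 backoff_factor=0.5, hysteresis=2):
        self.scale = init_scale
        self.growth_interval = growth_interval  # stable steps before growing the scale
        self.backoff_factor = backoff_factor    # multiplier applied when backing off
        self.hysteresis = hysteresis            # consecutive overflows tolerated
        self._overflows_left = hysteresis
        self._stable_steps = 0

    def update(self, found_overflow: bool) -> None:
        if found_overflow:
            self._stable_steps = 0
            self._overflows_left -= 1
            # Back off only after `hysteresis` consecutive overflow steps.
            if self._overflows_left <= 0:
                self.scale *= self.backoff_factor
                self._overflows_left = self.hysteresis
        else:
            self._overflows_left = self.hysteresis  # a clean step resets the counter
            self._stable_steps += 1
            if self._stable_steps % self.growth_interval == 0:
                self.scale *= 2.0  # grow the scale after a long stable stretch


scaler = GradScalerWithHysteresis(hysteresis=2)
scaler.update(found_overflow=True)   # 1st overflow: scale stays 65536.0
scaler.update(found_overflow=True)   # 2nd in a row: scale halved
print(scaler.scale)                  # 32768.0
```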
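
Similarly, the `gradient_as_bucket_view` config arg added in #3259 surfaces PyTorch's `DistributedDataParallel` flag of the same name: when enabled, parameter `.grad` tensors become views into DDP's flattened communication buckets, saving one gradient-sized copy of memory. A self-contained single-process sketch (the gloo/CPU setup exists only so the snippet runs anywhere; a real job would be launched with torchrun, typically on NCCL):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so the example is runnable standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(1024, 1024)
# gradient_as_bucket_view=True makes param.grad alias DDP's gradient buckets,
# avoiding a separate gradient-sized allocation per step.
ddp_model = DDP(model, gradient_as_bucket_view=True)

out = ddp_model(torch.randn(8, 1024))
out.sum().backward()

dist.destroy_process_group()
```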