Updated NumPy SDE requirement #3442

Merged 5 commits into main from numpy-fix on Jan 14, 2022
Conversation

@vsl9 (Collaborator) commented Jan 14, 2022

Signed-off-by: Vitaly Lavrukhin <[email protected]>

@vsl9 vsl9 merged commit e05c43c into main Jan 14, 2022
@vsl9 vsl9 deleted the numpy-fix branch January 14, 2022 20:24
yzhang123 added a commit that referenced this pull request Feb 23, 2022
* cache_hf (#3406)

Signed-off-by: ekmb <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Learning annealing scheduler fix (#3400)

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* T5 Pre-training in NeMo using Megatron (#3036)

* add vocab_file and merge_file to megatron init

Signed-off-by: ericharper <[email protected]>

* add forward

Signed-off-by: ericharper <[email protected]>

* add train loss

Signed-off-by: ericharper <[email protected]>

* add optimizer

Signed-off-by: ericharper <[email protected]>

* add exp_manager

Signed-off-by: ericharper <[email protected]>

* multi-gpu is working

Signed-off-by: ericharper <[email protected]>

* adding val loop

Signed-off-by: ericharper <[email protected]>

* style

Signed-off-by: ericharper <[email protected]>

* adding val loop

Signed-off-by: ericharper <[email protected]>

* fix ranks

Signed-off-by: ericharper <[email protected]>

* fix model parallel checkpoint saving

Signed-off-by: ericharper <[email protected]>

* fix _del_model

Signed-off-by: ericharper <[email protected]>

* Initial megatron dataset port

Signed-off-by: MaximumEntropy <[email protected]>

* added megatron batch sampler

Signed-off-by: ericharper <[email protected]>

* try to fix num steps

Signed-off-by: ericharper <[email protected]>

* add wandb to config

Signed-off-by: ericharper <[email protected]>

* log lr

Signed-off-by: ericharper <[email protected]>

* add warmup ratio to config

Signed-off-by: ericharper <[email protected]>

* update configs

Signed-off-by: ericharper <[email protected]>

* update configs

Signed-off-by: ericharper <[email protected]>

* Fix merge conflicts

Signed-off-by: MaximumEntropy <[email protected]>

* add cpu init to args

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* License fixes and megatron model porting

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* More fixes to import from nemo rather than megatron

Signed-off-by: MaximumEntropy <[email protected]>

* Fix circular imports

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Revert config file

Signed-off-by: MaximumEntropy <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* Restructure further to avoid circular imports

Signed-off-by: MaximumEntropy <[email protected]>

* add Makefile

Signed-off-by: ericharper <[email protected]>

* Add megatron modules

Signed-off-by: MaximumEntropy <[email protected]>

* Add data makefile

Signed-off-by: MaximumEntropy <[email protected]>

* add license

Signed-off-by: ericharper <[email protected]>

* Port from latest megatron

Signed-off-by: MaximumEntropy <[email protected]>

* update cfg

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* add _del_model_without_trainer

Signed-off-by: ericharper <[email protected]>

* add data preprocessing script

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* use apex mpu

Signed-off-by: ericharper <[email protected]>

* replace print_rank_0 with nemo utils logging

Signed-off-by: ericharper <[email protected]>

* use apex mpu

Signed-off-by: ericharper <[email protected]>

* use apex mpu

Signed-off-by: ericharper <[email protected]>

* add use_cpu_initialization

Signed-off-by: ericharper <[email protected]>

* fixing autoresume in progress

Signed-off-by: ericharper <[email protected]>

* properly removing last checkpoint

Signed-off-by: ericharper <[email protected]>

* log consumed samples

Signed-off-by: ericharper <[email protected]>

* fix mp autoresume

Signed-off-by: ericharper <[email protected]>

* Megatron GPT training with NeMo tokenizers (#2818)

* Update files from megatron repo

Signed-off-by: MaximumEntropy <[email protected]>

* Remove non NLP data related files from megatron

Signed-off-by: MaximumEntropy <[email protected]>

* Merge megatron and nemo tokenizers

Signed-off-by: MaximumEntropy <[email protected]>

* Remove get_tokenizer() calls from gpt model

Signed-off-by: MaximumEntropy <[email protected]>

* Update tokenizer yaml config

Signed-off-by: MaximumEntropy <[email protected]>

* add NLPSaveRestoreConnector

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* make init_method_std configurable

Signed-off-by: ericharper <[email protected]>

* make gpu init work by setting random seed earlier

Signed-off-by: ericharper <[email protected]>

* fix gpu init after removing debug print in mpu

Signed-off-by: ericharper <[email protected]>

* add fused_adam

Signed-off-by: ericharper <[email protected]>

* check ds is not none before logging len

Signed-off-by: ericharper <[email protected]>

* set fp16 arg to true and fix enum conflict

Signed-off-by: ericharper <[email protected]>

* make fp16 arg configurable

Signed-off-by: ericharper <[email protected]>

* add grad clip from megatron

Signed-off-by: ericharper <[email protected]>

* Linear warmup with cosine annealing and constant holding (#2846)

* Testing cosine schedule

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* More fixes

Signed-off-by: MaximumEntropy <[email protected]>

* update config for constant steps in schedule

Signed-off-by: ericharper <[email protected]>

* temporarily import enum from megatron

Signed-off-by: ericharper <[email protected]>

* add grad clip for fp32

Signed-off-by: ericharper <[email protected]>

* update check for _del_model_without_trainer

Signed-off-by: ericharper <[email protected]>

* updating restore for model parallel

Signed-off-by: ericharper <[email protected]>

* add predict script

Signed-off-by: ericharper <[email protected]>

* update test iters

Signed-off-by: ericharper <[email protected]>

* add barrier

Signed-off-by: ericharper <[email protected]>

* return if clip_val is 0 or None

Signed-off-by: ericharper <[email protected]>

* when using amp clip grads after they are unscaled

Signed-off-by: ericharper <[email protected]>

* make native amp scaler hyperparams configurable

Signed-off-by: ericharper <[email protected]>

* (1) nvfuser, (2) amp-casting decoration (#2894)

* (1) nvfuser, (2) amp-casting decoration

Signed-off-by: Sangkug Lym <[email protected]>

* support bf16

Signed-off-by: Sangkug Lym <[email protected]>

* update package info

Signed-off-by: ericharper <[email protected]>

* add set device to constructor

Signed-off-by: ericharper <[email protected]>

* set_device in constructor

Signed-off-by: ericharper <[email protected]>

* [BigNLP] Remove megatron-lm dependency. (#2910)

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* update megatron_init

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* update process batch

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* add megatron clip_grad

Signed-off-by: ericharper <[email protected]>

* trying to resolve circular import error

Signed-off-by: ericharper <[email protected]>

* rename file

Signed-off-by: ericharper <[email protected]>

* remove non-gpt models and datasets from __init__ files

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set_device in constructor

Signed-off-by: ericharper <[email protected]>

* clean config

Signed-off-by: ericharper <[email protected]>

* update MegatronDataset

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16

Signed-off-by: ericharper <[email protected]>

* rename to fused_fp16

Signed-off-by: ericharper <[email protected]>

* add fused_fp16 arg to LayerNorm calls

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix import

Signed-off-by: ericharper <[email protected]>

* update arg

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* Adding complete method to MegatronGPTModel (#2935)

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* make ffn_hidden_size mandatory

Signed-off-by: ericharper <[email protected]>

* Manually migrating timing of step into branch (#2937)

* 1. Manually migrating timing of step into branch.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated file name and content.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated to latest code.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* remove unused imports

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* check fused_fp16 and fused_bf16 are not both True

Signed-off-by: ericharper <[email protected]>

* update predict script for model parallel .nemo

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>

* NVfuser (#2943)

* activation checkpoint recompute

Signed-off-by: Sangkug Lym <[email protected]>

* selective nvfuser setup

* Megatron gpt bfloat support (#2926)

* Save/restore fix

Signed-off-by: MaximumEntropy <[email protected]>

* Another merge

Signed-off-by: MaximumEntropy <[email protected]>

* Bf16 args in init

Signed-off-by: MaximumEntropy <[email protected]>

* Set precision

Signed-off-by: MaximumEntropy <[email protected]>

* Remove debug stuff

Signed-off-by: MaximumEntropy <[email protected]>

* add bf16 casting decorator

Signed-off-by: Sangkug Lym <[email protected]>

* Bfloat layernorm propagation

Signed-off-by: MaximumEntropy <[email protected]>

* activation checkpoint recompute

Signed-off-by: Sangkug Lym <[email protected]>

* selective nvfuser setup

* More arg removal

Signed-off-by: MaximumEntropy <[email protected]>

* Remove BERTDataset

Signed-off-by: MaximumEntropy <[email protected]>

* update to latest apex and patch transformer autocast

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: ericharper <[email protected]>

* don't set jit for bf16

Signed-off-by: ericharper <[email protected]>

* replace apex.mpu

Signed-off-by: ericharper <[email protected]>

* fix grad clip

Signed-off-by: ericharper <[email protected]>

* NVFuser fixes (#2951)

* Fuser fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Remove dummy handler

Signed-off-by: MaximumEntropy <[email protected]>

* Remove PTL plugin based logic for fusion

Signed-off-by: MaximumEntropy <[email protected]>

* remove duplicated file

Signed-off-by: ericharper <[email protected]>

* T5 model initial changes

Signed-off-by: MaximumEntropy <[email protected]>

* typo (#2960)

Signed-off-by: ericharper <[email protected]>

* [BigNLP] Script to convert GPT checkpoint to .nemo (#2958)

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* update megatron_init

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* update process batch

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* add megatron clip_grad

Signed-off-by: ericharper <[email protected]>

* trying to resolve circular import error

Signed-off-by: ericharper <[email protected]>

* rename file

Signed-off-by: ericharper <[email protected]>

* remove non-gpt models and datasets from __init__ files

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set_device in constructor

Signed-off-by: ericharper <[email protected]>

* clean config

Signed-off-by: ericharper <[email protected]>

* update MegatronDataset

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16

Signed-off-by: ericharper <[email protected]>

* rename to fused_fp16

Signed-off-by: ericharper <[email protected]>

* add fused_fp16 arg to LayerNorm calls

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix import

Signed-off-by: ericharper <[email protected]>

* update arg

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* Adding complete method to MegatronGPTModel (#2935)

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* make ffn_hidden_size mandatory

Signed-off-by: ericharper <[email protected]>

* Manually migrating timing of step into branch (#2937)

* 1. Manually migrating timing of step into branch.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated file name and content.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated to latest code.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* remove unused imports

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* check fused_fp16 and fused_bf16 are not both True

Signed-off-by: ericharper <[email protected]>

* update predict script for model parallel .nemo

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

* add script to convert .ckpt to .nemo

Signed-off-by: ericharper <[email protected]>

* in progress

Signed-off-by: ericharper <[email protected]>

* update

Signed-off-by: ericharper <[email protected]>

* convert mp checkpoints to nemo

Signed-off-by: ericharper <[email protected]>

* update help

Signed-off-by: ericharper <[email protected]>

* add safeguard for model parallel save_to

Signed-off-by: ericharper <[email protected]>

* adjust NLPModel save_to to be safer for model parallel

Signed-off-by: Oleksii Kuchaiev <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>

* [BigNLP] Update GPT evaluation to work with tensor model parallel  (#2959)

* in progress

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add request dataset

Signed-off-by: ericharper <[email protected]>

* tokenize request

Signed-off-by: ericharper <[email protected]>

* in progress

Signed-off-by: ericharper <[email protected]>

* able to run

Signed-off-by: ericharper <[email protected]>

* reduce logits

Signed-off-by: ericharper <[email protected]>

* capture response

Signed-off-by: ericharper <[email protected]>

* squeeze and unsqueeze

Signed-off-by: ericharper <[email protected]>

* handle non model parallel case

Signed-off-by: ericharper <[email protected]>

* clean imports

Signed-off-by: ericharper <[email protected]>

* add file

Signed-off-by: ericharper <[email protected]>

* convert logits to log_probs

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* rename logits to log_probs

Signed-off-by: Oleksii Kuchaiev <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>

* More changes

Signed-off-by: MaximumEntropy <[email protected]>

* Missing import

Signed-off-by: MaximumEntropy <[email protected]>

* Tokenizer fixes and adafactor

Signed-off-by: MaximumEntropy <[email protected]>

* Add adafactor

Signed-off-by: MaximumEntropy <[email protected]>

* Add training and conf scripts

Signed-off-by: MaximumEntropy <[email protected]>

* Add megatron t5 model

Signed-off-by: MaximumEntropy <[email protected]>

* t5 config to fp32

Signed-off-by: MaximumEntropy <[email protected]>

* [BigNLP] Remove fused kernel code instead use Apex (#2984)

* remove fused_kernels

Signed-off-by: ericharper <[email protected]>

* remove fused_kernels

Signed-off-by: ericharper <[email protected]>

* remove fused layer norm and fused softmax and use apex instead

Signed-off-by: ericharper <[email protected]>

* update imports

Signed-off-by: ericharper <[email protected]>

* remove comment

Signed-off-by: ericharper <[email protected]>

* use apex enums

Signed-off-by: ericharper <[email protected]>

* use apex enums

Signed-off-by: ericharper <[email protected]>

* Timer with sliding window (#3002)

Co-authored-by: Micha Livne <[email protected]>

* check for rank zero

Signed-off-by: ericharper <[email protected]>

* Remove ict dataset import

Signed-off-by: MaximumEntropy <[email protected]>

* Remove fused kernels

Signed-off-by: MaximumEntropy <[email protected]>

* style fix

Signed-off-by: ericharper <[email protected]>

* fix consumed_samples when resuming

Signed-off-by: ericharper <[email protected]>

* T5 consumed samples fix

Signed-off-by: MaximumEntropy <[email protected]>

* Remove megatron dep

Signed-off-by: MaximumEntropy <[email protected]>

* Change checkpoint filename format

Signed-off-by: MaximumEntropy <[email protected]>

* Log consumed samples in T5

Signed-off-by: MaximumEntropy <[email protected]>

* T5 lr scheduler

Signed-off-by: MaximumEntropy <[email protected]>

* Checkpoint conversion and data preproc updates for t5

Signed-off-by: MaximumEntropy <[email protected]>

* Denoising eval

Signed-off-by: MaximumEntropy <[email protected]>

* Clean up denoising example to explicitly provide mask positions

Signed-off-by: MaximumEntropy <[email protected]>

* Better logging of results

Signed-off-by: MaximumEntropy <[email protected]>

* Better printing of results

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* properly removing last checkpoint

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* add predict script

Signed-off-by: ericharper <[email protected]>

* T5 model initial changes

Signed-off-by: MaximumEntropy <[email protected]>

* Add adafactor

Signed-off-by: MaximumEntropy <[email protected]>

* Add training and conf scripts

Signed-off-by: MaximumEntropy <[email protected]>

* Add megatron t5 model

Signed-off-by: MaximumEntropy <[email protected]>

* t5 config to fp32

Signed-off-by: MaximumEntropy <[email protected]>

* Remove fused kernels

Signed-off-by: MaximumEntropy <[email protected]>

* fix consumed_samples when resuming

Signed-off-by: ericharper <[email protected]>

* T5 consumed samples fix

Signed-off-by: MaximumEntropy <[email protected]>

* Remove megatron dep

Signed-off-by: MaximumEntropy <[email protected]>

* Change checkpoint filename format

Signed-off-by: MaximumEntropy <[email protected]>

* Log consumed samples in T5

Signed-off-by: MaximumEntropy <[email protected]>

* T5 lr scheduler

Signed-off-by: MaximumEntropy <[email protected]>

* Checkpoint conversion and data preproc updates for t5

Signed-off-by: MaximumEntropy <[email protected]>

* Denoising eval

Signed-off-by: MaximumEntropy <[email protected]>

* Clean up denoising example to explicitly provide mask positions

Signed-off-by: MaximumEntropy <[email protected]>

* Better logging of results

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Merge main into megatron_t5

Signed-off-by: MaximumEntropy <[email protected]>

* Dataset preproc script

Signed-off-by: MaximumEntropy <[email protected]>

* Remove biencoder file

Signed-off-by: MaximumEntropy <[email protected]>

* Remove another unused file

Signed-off-by: MaximumEntropy <[email protected]>

* Remove preprocess script since it has moved

Signed-off-by: MaximumEntropy <[email protected]>

* Remove ICT dataset

Signed-off-by: MaximumEntropy <[email protected]>

* Remove orqa dataset

Signed-off-by: MaximumEntropy <[email protected]>

* Remove realm dataset

Signed-off-by: MaximumEntropy <[email protected]>

* More file removing

Signed-off-by: MaximumEntropy <[email protected]>

* Fix 2 files

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Rename checkpoint fname

Signed-off-by: MaximumEntropy <[email protected]>

* Loss averaging fixes in t5

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* add megatron gpt pretraining

Signed-off-by: ericharper <[email protected]>
Signed-off-by: MaximumEntropy <[email protected]>

* Remove weight decay stuff

Signed-off-by: MaximumEntropy <[email protected]>

* Training script update for PTL 1.5

Signed-off-by: MaximumEntropy <[email protected]>

* Update grad clip

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Add barrier

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes and adding more stuff

Signed-off-by: MaximumEntropy <[email protected]>

* Missed merge conflict fix

Signed-off-by: MaximumEntropy <[email protected]>

* Unittest fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fix

Signed-off-by: MaximumEntropy <[email protected]>

* Inference changes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Fix reinstall script

Signed-off-by: MaximumEntropy <[email protected]>

* T5 CI tests

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Minor fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Minor fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Tokenizer arg fix

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Helpers fix

Signed-off-by: MaximumEntropy <[email protected]>

* Style fix

Signed-off-by: MaximumEntropy <[email protected]>

* PR review changes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Refactor bert dataset stuff

Signed-off-by: MaximumEntropy <[email protected]>

* Fix typo

Signed-off-by: MaximumEntropy <[email protected]>

* Fix request dataset variable

Signed-off-by: MaximumEntropy <[email protected]>

* Fix sched params in CI test

Signed-off-by: MaximumEntropy <[email protected]>

* Change to kwargs and Jenkins test for inference

Signed-off-by: MaximumEntropy <[email protected]>

* PR review related changes

Signed-off-by: MaximumEntropy <[email protected]>

* More fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Test helper building

Signed-off-by: MaximumEntropy <[email protected]>

* Restore helper compilation everywhere

Signed-off-by: MaximumEntropy <[email protected]>

* Fix PR comments

Signed-off-by: MaximumEntropy <[email protected]>

* PR comments

Signed-off-by: MaximumEntropy <[email protected]>

* Add docstring to additional_special_tokens

Signed-off-by: MaximumEntropy <[email protected]>

* Improve docstring

Signed-off-by: MaximumEntropy <[email protected]>

* Fix resume from checkpoint path

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for TP>1

Signed-off-by: MaximumEntropy <[email protected]>

* Remove fused fp16 and bf16 args

Signed-off-by: MaximumEntropy <[email protected]>

* Add missed file

Signed-off-by: MaximumEntropy <[email protected]>

* Learning annealing scheduler fix

Signed-off-by: MaximumEntropy <[email protected]>

* Change default optim and scheduler to adam

Signed-off-by: MaximumEntropy <[email protected]>

* dummy for CI restart

Signed-off-by: MaximumEntropy <[email protected]>

* Remove constant steps after switch to adam

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: ericharper <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Updates on ASR with diarization util files (#3359)

* Initial commit

Signed-off-by: Taejin Park <[email protected]>

* Update LM part and multiscale part in README.

Signed-off-by: Taejin Park <[email protected]>

* Removed redundant parts

Signed-off-by: Taejin Park <[email protected]>

* modified example script

Signed-off-by: Taejin Park <[email protected]>

* Revised doc strings

Signed-off-by: Taejin Park <[email protected]>

* Changed paths_to_manifest.py script

Signed-off-by: Taejin Park <[email protected]>

* Reflected PR comments and revised tutorials

Signed-off-by: Taejin Park <[email protected]>

* Added ASR models and kenlm installation 

Signed-off-by: [email protected]

* Added ASR models and kenlm installation 

Signed-off-by: [email protected]
Signed-off-by: Taejin Park <[email protected]>

* Changed docstrings and style fix

Signed-off-by: Taejin Park <[email protected]>

* Fixed unused import and vars

Signed-off-by: Taejin Park <[email protected]>

* Added LM part in ASR_diar tutorial.

Signed-off-by: Taejin Park <[email protected]>

Co-authored-by: fayejf <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* update docs and replace speakernet with titanet in tutorials (#3405)

* update docs and replace speakernet with titanet in tutorials

Signed-off-by: nithinraok <[email protected]>

* update dataset usage description

Signed-off-by: nithinraok <[email protected]>

* updated based on comments

Signed-off-by: nithinraok <[email protected]>

* spell fix

Signed-off-by: nithinraok <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update Mixer-TTS, FastPitch and TTSDataset (#3366)

* update tts dataset, fastpitch and mixer tts

Signed-off-by: Oktai Tatanov <[email protected]>

* fix style and notebooks

Signed-off-by: Oktai Tatanov <[email protected]>

* update notebooks

Signed-off-by: Oktai Tatanov <[email protected]>

* update mixer-tts, mixer-tts-x and fastpitch configs

Signed-off-by: Oktai Tatanov <[email protected]>

* update notebooks and configs

Signed-off-by: Oktai Tatanov <[email protected]>

* update configs

Signed-off-by: Oktai Tatanov <[email protected]>

* add links, update README, fix tutorials

Signed-off-by: Oktai Tatanov <[email protected]>

* fix style

Signed-off-by: Oktai Tatanov <[email protected]>

* remove unnecessary code from fastpitch model

Signed-off-by: Oktai Tatanov <[email protected]>

* update jenkinsfile and fastpitch typo fix

Signed-off-by: Oktai Tatanov <[email protected]>

* fix configs

Signed-off-by: Oktai Tatanov <[email protected]>

* revert jenkinsfile

Signed-off-by: Oktai Tatanov <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Asr fr (#3404)

* Pushing WFST_tutorial for open draft. (Still need to review Colab code.)

Signed-off-by: tbartley94 <[email protected]>

* Checked tutorial code for WFST_Tutorial is properly functioning. Also included some formatting edits.

Signed-off-by: tbartley94 <[email protected]>

* Responding to editorial comments for WFST_tutorial

Signed-off-by: tbartley94 <[email protected]>

* Added images to folder and wrote README for tutorials

Signed-off-by: tbartley94 <[email protected]>

* Few more editorial changes to explain permutations in classification.

Signed-off-by: tbartley94 <[email protected]>

* Updated tutorials documentation page.

Signed-off-by: tbartley94 <[email protected]>

* Forgot links for README

Signed-off-by: tbartley94 <[email protected]>

* TOC links were dead

Signed-off-by: tbartley94 <[email protected]>

* More dead links to fix.

Signed-off-by: tbartley94 <[email protected]>

* removing Colab install and appending a warning instead.

Signed-off-by: tbartley94 <[email protected]>

* Update WFST_Tutorial.ipynb

Signed-off-by: tbartley94 <[email protected]>

* Adding pretrained French models to ctc_bpe_models and rnnt_bpe_models available models listing

Signed-off-by: tbartley94 <[email protected]>

* Updating ctc_bpe_models import for updated Fr Conformer Ctc version.

Signed-off-by: tbartley94 <[email protected]>

* Added new French ASR models to documentation and imports: conformer transducer and conformer ctc trained without hyphenization.

Signed-off-by: tbartley94 <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Yang Zhang <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* [fix] for resume training on SLURM multi-node multi-gpu (#3374)

* [fix] for resume training on SLURM multi-node multi-gpu

On SLURM, resuming training in a multi-node multi-gpu setting fails: when `LOCAL_RANK` is undefined, `is_global_rank_zero()` returns true on all processes running on node 0. In that case `exp_manager.py` (https://github.com/NVIDIA/NeMo/blob/f83b2c5524a787be21ffea170850c4b5486eac2b/nemo/utils/exp_manager.py#L446) creates multiple `run_*` folders, which eventually leads to failure (missing files, because other processes have already moved them).

Also checking `SLURM_PROCID` solves this issue, as that environment variable contains the global rank id.

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* Update get_rank.py

In a SLURM environment, return the SLURM global rank (SLURM_PROCID); otherwise fall back to the previous behaviour.

Signed-off-by: Iztok Lebar Bajec <[email protected]>
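A minimal sketch of the idea behind the two commits above, assuming hypothetical helper names (an illustration only, not the actual nemo/utils/get_rank.py code): consult SLURM_PROCID first, and only fall back to the launcher-provided rank variables when it is absent.

```python
import os


def get_global_rank_sketch() -> int:
    """Sketch of the rank lookup described above (assumed helper names;
    not the actual nemo/utils/get_rank.py implementation)."""
    # On SLURM, SLURM_PROCID already holds the global rank across all nodes.
    slurm_rank = os.environ.get("SLURM_PROCID")
    if slurm_rank is not None:
        return int(slurm_rank)
    # Otherwise fall back to the launcher-provided variables.
    for var in ("RANK", "LOCAL_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    # Nothing set: assume a single-process run.
    return 0


def is_global_rank_zero_sketch() -> bool:
    # Without the SLURM_PROCID check, every process on node 0 with an unset
    # LOCAL_RANK would report itself as global rank zero, and exp_manager
    # would create multiple run_* folders.
    return get_global_rank_sketch() == 0
```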

* style

Signed-off-by: Jason <[email protected]>

* Solved bug where either RANK or SLURM_PROCID returns 0 and the conditionals return False

Signed-off-by: Iztok Lebar Bajec <[email protected]>

Co-authored-by: Jason <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Fix running token classification in multinode setting (#3413)

* fix: master device check

Signed-off-by: PeganovAnton <[email protected]>

* Fix bug with use_cache parameter

Signed-off-by: PeganovAnton <[email protected]>

* create pickled features file regardless of value of use_cache

Signed-off-by: PeganovAnton <[email protected]>

* Improve docs

Signed-off-by: PeganovAnton <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Fix order of lang checking to ignore input langs (#3417)

* Fix order of lang checking

Signed-off-by: MaximumEntropy <[email protected]>

* Fix == error

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: PeganovAnton <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Refactor ASR Examples Directory (#3392)

* Begin refactor of ASR files

Signed-off-by: smajumdar <[email protected]>

* Update jenkins paths for ASR

Signed-off-by: smajumdar <[email protected]>

* Update speech_to_text_ctc

Signed-off-by: smajumdar <[email protected]>

* Update speech_to_text_ctc_bpe

Signed-off-by: smajumdar <[email protected]>

* Lowercase all directories

Signed-off-by: smajumdar <[email protected]>

* Fix RNNT num_workers

Signed-off-by: smajumdar <[email protected]>

* Fix RNNT num_workers

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* NMT MIM mean variance fix (#3385)

* 1. Updated default NMT bottleneck encoder to be non-autoregressive

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed mean/variance being tied when latent and hidden dimensions are the same.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* update to 21.12 (#3424)

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Working around Pytorch exporter issue with expand() (#3422)

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* update copyright (#3426)

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* remove apex (#3428)

Signed-off-by: ekmb <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* vad infer refactor (#3394)

* vad infer refactor

Signed-off-by: fayejf <[email protected]>

* remove duplicate in write_long_audio_manifest

Signed-off-by: fayejf <[email protected]>

* remove duplicate in script vad_overlap_posterior

Signed-off-by: fayejf <[email protected]>

* style fix

Signed-off-by: fayejf <[email protected]>

* fix nb

Signed-off-by: fayejf <[email protected]>

* small fix

Signed-off-by: fayejf <[email protected]>

* fix

Signed-off-by: fayejf <[email protected]>

* small fixes

Signed-off-by: fayejf <[email protected]>

* reflect taejin's review

Signed-off-by: fayejf <[email protected]>

* update tutorial about rename

Signed-off-by: fayejf <[email protected]>

* small fix

Signed-off-by: fayejf <[email protected]>

* merge main and fix

Signed-off-by: fayejf <[email protected]>

* tiny path fix

Signed-off-by: fayejf <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* doc update for refactory (#3430)

Signed-off-by: fayejf <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update LJSpeech preprocessing (#3423)

* update lj speech preprocessing

Signed-off-by: Oktai Tatanov <[email protected]>

* update lj speech preprocessing 2

Signed-off-by: Oktai Tatanov <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* NMT Shared Embeddings Weights (#3340)

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Implemented encoder deocder embedding weights tie.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* [BigNLP] Make saving .nemo during on_train_end configurable (#3427)

* make save nemo configurable on train end

Signed-off-by: ericharper <[email protected]>

* add warning when save_best_model is True

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Jason <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Preprocess an entire folder of .json or .json.gz files into a single .bin and .idx file. (#3425)

* Folder preproc

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Fix useless enumerate

Signed-off-by: MaximumEntropy <[email protected]>

* Address PR comments

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update speaker diarization docs (#3419)

* Initial commit

Signed-off-by: Taejin Park <[email protected]>

* Fixed minor mistakes

Signed-off-by: Taejin Park <[email protected]>

* Some changes regarding diarization utils

Signed-off-by: Taejin Park <[email protected]>

* Fixed minor typos

Signed-off-by: Taejin Park <[email protected]>

* Reflected PR comments

Signed-off-by: Taejin Park <[email protected]>

* Reflected PR comments

Signed-off-by: Taejin Park <[email protected]>

* Reflected addtional comments

Signed-off-by: Taejin Park <[email protected]>

* Changed pics and refined text

Signed-off-by: Taejin Park <[email protected]>

* Minor typos

Signed-off-by: Taejin Park <[email protected]>

* Minor change on dataset

Signed-off-by: Taejin Park <[email protected]>

* Minor change on dataset 2

Signed-off-by: Taejin Park <[email protected]>

* Changed manifest input to yaml format

Signed-off-by: Taejin Park <[email protected]>

* Capitalization of titles

Signed-off-by: Taejin Park <[email protected]>

* Last commit

Signed-off-by: Taejin Park <[email protected]>

Co-authored-by: fayejf <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update ContextNet models trained on more datasets (#3440)

* Update ContextNet models trained on more datasets

Signed-off-by: smajumdar <[email protected]>

* Update ContextNet models trained on more datasets

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* 1. Updated default buffer_size for TimingCallback to 1. (#3439)

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Fix bug for missing variable (#3437)

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Extending input_example() to take max batch and dimension arguments (#3429)

* Extending input_example() to take max batch and dimension arguments

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing conformer size reconfig, extending export script, some refactoring

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing comments

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing test issue

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing DecoderJoint input example

Signed-off-by: Boris Fomitchev <[email protected]>

* Removing soon-deprecated external format option addition

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing indentation typo

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Byte-level Multilingual NMT (#3368)

* init

Signed-off-by: Abhinav Khattar <[email protected]>

* style

Signed-off-by: Abhinav Khattar <[email protected]>

* rm debug stuff

Signed-off-by: Abhinav Khattar <[email protected]>

* changes

Signed-off-by: Abhinav Khattar <[email protected]>

* fix

Signed-off-by: Abhinav Khattar <[email protected]>

* fix

Signed-off-by: Abhinav Khattar <[email protected]>

* error fix

Signed-off-by: Abhinav Khattar <[email protected]>

* make spl tokens optional

Signed-off-by: Abhinav Khattar <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Asr patches (#3443)

* Fix issues with num_workers for transcribe

Signed-off-by: smajumdar <[email protected]>

* During inference use full context of chunk

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Updated NumPy SDE requirement (#3442)

Signed-off-by: Vitaly Lavrukhin <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* refactor data preprocessing script (#3444)

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Prompt tuning loss mask fix (#3438)

* Switched to calculate loss on answer only

Signed-off-by: Virginia Adams <[email protected]>

* Added CI tests and unit tests for prompt tuning dataset

Signed-off-by: Virginia Adams <[email protected]>

* Fixed Jenkinsfile typo

Signed-off-by: Virginia Adams <[email protected]>

* fixed Jenkinsfile typo

Signed-off-by: Virginia Adams <[email protected]>

* Fixed more typos so CI tests run all the way through

Signed-off-by: Virginia Adams <[email protected]>

* Fixed code formatting

Signed-off-by: Virginia Adams <[email protected]>

* Needed to add save nemo file on train end flag to CI test

Signed-off-by: Virginia Adams <[email protected]>

* Added save .nemo on train end flag to example script

Signed-off-by: Virginia Adams <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* BioMegatron token classification tutorial fix to be compatible with current Megatron BERT (#3435)

* fixed the tokenizer

Signed-off-by: Yi Dong <[email protected]>

* training is working

Signed-off-by: Yi Dong <[email protected]>

* fixed text

Signed-off-by: Yi Dong <[email protected]>

* fixed text

Signed-off-by: Yi Dong <[email protected]>

* working notebook

Signed-off-by: Yi Dong <[email protected]>

* style fix

Signed-off-by: Yi Dong <[email protected]>

* fixed text

Signed-off-by: Yi Dong <[email protected]>

* handles the different megatron-lm checkpoint versions

Signed-off-by: Yi Dong <[email protected]>

* fixed the text classification notebook

Signed-off-by: Yi Dong <[email protected]>

* fixed key error

Signed-off-by: Yi Dong <[email protected]>

* more key error

Signed-off-by: Yi Dong <[email protected]>

* replace the old notebooks

Signed-off-by: Yi Dong <[email protected]>

* register vocab to nemo file

Signed-off-by: Yi Dong <[email protected]>

* added the missing notebook

Signed-off-by: Yi Dong <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* (1) O2-style mixed precision recipe, (2) Persistent layer-norm, (3) Grad scale hysteresis, (4) gradient_as_bucket_view (#3259)

* half precision training w/o autocast using master param

stage fp16 working version

fix: fp32 grad accumulation

bf16 support

Signed-off-by: Sangkug Lym <[email protected]>

add closure fn at bf16

* change autocast compatible with latest pytorch version

Signed-off-by: Sangkug Lym <[email protected]>

* add module to the state_dict naming

Signed-off-by: Sangkug Lym <[email protected]>

* cleanup arguments

Signed-off-by: Sangkug Lym <[email protected]>

* fix module state matching upon checkpoint resume

Signed-off-by: Sangkug Lym <[email protected]>

* persistent layer norm and dependency check

Signed-off-by: Sangkug Lym <[email protected]>

check container version instead of pytorch version

Signed-off-by: Sangkug Lym <[email protected]>

update config

* dependency check

Signed-off-by: Sangkug Lym <[email protected]>

* add gradient_as_bucket_view arg to config

Signed-off-by: Sangkug Lym <[email protected]>

* (1) add hysteresis to grad scaler, and (2) add grad_scaler to TB

Signed-off-by: Sangkug Lym <[email protected]>

* doc link fixes (#3264)

Signed-off-by: nithinraok <[email protected]>

* escape chars fix (#3253)

* escape chars fix

Signed-off-by: ekmb <[email protected]>

* bug fixes

Signed-off-by: ekmb <[email protected]>

* review

Signed-off-by: ekmb <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>

* Improve data pipeline for punctuation capitalization model and make other useful changes (#3159)

* Fix: inference on short sequences problem

Signed-off-by: PeganovAnton <[email protected]>

* Add draft of new punctuation and capitalization model

Signed-off-by: PeganovAnton <[email protected]>

* Fix debug config

Signed-off-by: PeganovAnton <[email protected]>

* Add parameter check

Signed-off-by: PeganovAnton <[email protected]>

* Update punctuation training script

Signed-off-by: PeganovAnton <[email protected]>

* Fix head config parameter names

Signed-off-by: PeganovAnton <[email protected]>

* Fix ds_item and class_label parameters in config

Signed-off-by: PeganovAnton <[email protected]>

* Fix dataloader shuffling for tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* Reduce validation batch

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Fix metrics initialization

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug

Signed-off-by: PeganovAnton <[email protected]>

* Fix device problem

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Register metrics properly

Signed-off-by: PeganovAnton <[email protected]>

* Put metrics setup after module init

Signed-off-by: PeganovAnton <[email protected]>

* Reduce model size

Signed-off-by: PeganovAnton <[email protected]>

* Add wandb logging

Signed-off-by: PeganovAnton <[email protected]>

* Change wandb name

Signed-off-by: PeganovAnton <[email protected]>

* Fix logging names for metrics

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Add returning from eval steps

Signed-off-by: PeganovAnton <[email protected]>

* Add second dev dataset

Signed-off-by: PeganovAnton <[email protected]>

* Move config

Signed-off-by: PeganovAnton <[email protected]>

* Fix path to dataset

Signed-off-by: PeganovAnton <[email protected]>

* Add more tokenizer parameters

Signed-off-by: PeganovAnton <[email protected]>

* Add debug script for more tokenizer in creating tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* Update output path in debug script

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug in typing

Signed-off-by: PeganovAnton <[email protected]>

* Fix bug in parsing arguments

Signed-off-by: PeganovAnton <[email protected]>

* Do not pass tokenizer through queue

Signed-off-by: PeganovAnton <[email protected]>

* Set hf tokenizer in debug script

Signed-off-by: PeganovAnton <[email protected]>

* Try char vocabulary

Signed-off-by: PeganovAnton <[email protected]>

* Fix typo

Signed-off-by: PeganovAnton <[email protected]>

* Improve error message

Signed-off-by: PeganovAnton <[email protected]>

* Fix OOV problem

Signed-off-by: PeganovAnton <[email protected]>

* Add label ids creation and getting

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing parameter

Signed-off-by: PeganovAnton <[email protected]>

* Improve error message for label ids building

Signed-off-by: PeganovAnton <[email protected]>

* Add short tar files repacking

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug and add more security

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug

Signed-off-by: PeganovAnton <[email protected]>

* fix: replace Path with str

Signed-off-by: PeganovAnton <[email protected]>

* fix: iter datasets

Signed-off-by: PeganovAnton <[email protected]>

* Improve logging

Signed-off-by: PeganovAnton <[email protected]>

* Turn off repacking

Signed-off-by: PeganovAnton <[email protected]>

* Turn off repacking

Signed-off-by: PeganovAnton <[email protected]>

* Turn on repacking

Signed-off-by: PeganovAnton <[email protected]>

* Turn off repacking

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Improve unexpected removal

Signed-off-by: PeganovAnton <[email protected]>

* Turn on repacking

Signed-off-by: PeganovAnton <[email protected]>

* fix: remove repacked files

Signed-off-by: PeganovAnton <[email protected]>

* Add default config for testing

Signed-off-by: PeganovAnton <[email protected]>

* Improve code style in evaluate script

Signed-off-by: PeganovAnton <[email protected]>

* Add docstrings

Signed-off-by: PeganovAnton <[email protected]>

* Remove debug config

Signed-off-by: PeganovAnton <[email protected]>

* Remove commented code

Signed-off-by: PeganovAnton <[email protected]>

* Fix code style in doc string

Signed-off-by: PeganovAnton <[email protected]>

* Fix usage of parser.error function

Signed-off-by: PeganovAnton <[email protected]>

* Improve working with config and fix restoring of old checkpoints

Signed-off-by: PeganovAnton <[email protected]>

* Do not demand cfg as dataclass

Signed-off-by: PeganovAnton <[email protected]>

* Add backward compatibility for absence of use_tarred_dataset

Signed-off-by: PeganovAnton <[email protected]>

* Fight for backwards compatibility

Signed-off-by: PeganovAnton <[email protected]>

* Add tokens_in_batch backward compatibility

Signed-off-by: PeganovAnton <[email protected]>

* Undo unintentional changes in tutorial

Signed-off-by: PeganovAnton <[email protected]>

* Do not allow more workers than queries

Signed-off-by: PeganovAnton <[email protected]>

* Fix metric names in tests

Signed-off-by: PeganovAnton <[email protected]>

* Fix metric location

Signed-off-by: PeganovAnton <[email protected]>

* Fix metric location

Signed-off-by: PeganovAnton <[email protected]>

* Require ds_item or data_dir

Signed-off-by: PeganovAnton <[email protected]>

* Disable multiprocessing data preparation by default

Signed-off-by: PeganovAnton <[email protected]>

* Disable multiprocessing data preparation by default

Signed-off-by: PeganovAnton <[email protected]>

* Disable multiprocessing data preparation by default

Signed-off-by: PeganovAnton <[email protected]>

* Make minor improvements in docstrings and typing

Signed-off-by: PeganovAnton <[email protected]>

* Fix finetuning code

Signed-off-by: PeganovAnton <[email protected]>

* Fix shuffle train dataset config parameter

Signed-off-by: PeganovAnton <[email protected]>

* Fix evaluation script

Signed-off-by: PeganovAnton <[email protected]>

* Update test

Signed-off-by: PeganovAnton <[email protected]>

* Add new test and make minor changes

Signed-off-by: PeganovAnton <[email protected]>

* Fix repacked file names

Signed-off-by: PeganovAnton <[email protected]>

* Add assertion error

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug in regex

Signed-off-by: PeganovAnton <[email protected]>

* Improve Jenkins command

Signed-off-by: PeganovAnton <[email protected]>

* Fix code style

Signed-off-by: PeganovAnton <[email protected]>

* fix: add name to Jenkins stage

Signed-off-by: PeganovAnton <[email protected]>

* fix: add steps block to Jenkins stage

Signed-off-by: PeganovAnton <[email protected]>

* fix: move nemo_experiments removal to post section

Previously I encountered a weird error:

+ rm -rf nemo_experiments
rm: cannot remove 'nemo_experiments': Directory not empty
script returned exit code 1

I suspect that this could be because two parallel stages try to
remove the same directory simultaneously.

Signed-off-by: PeganovAnton <[email protected]>

* Turn off cache usage in Jenkins for token classification models

Signed-off-by: PeganovAnton <[email protected]>

* Stop pickling features

Signed-off-by: PeganovAnton <[email protected]>

* Reference webdataset in docs

Signed-off-by: PeganovAnton <[email protected]>

* Make multiple minor improvements

Signed-off-by: PeganovAnton <[email protected]>

* Add parameters tokens_in_batch, repack to documentation

Signed-off-by: PeganovAnton <[email protected]>

* Refactoring and improving readability

Signed-off-by: PeganovAnton <[email protected]>

* Make tar_shuffle_n optional parameter

Signed-off-by: PeganovAnton <[email protected]>

* Fix path to label vocab files

Signed-off-by: PeganovAnton <[email protected]>

* Fix metadata label vocab key

Signed-off-by: PeganovAnton <[email protected]>

* Create for_nemo directory

Signed-off-by: PeganovAnton <[email protected]>

* Fix tar_shuffle_n default value

Signed-off-by: PeganovAnton <[email protected]>

* First round of review fixes

Signed-off-by: PeganovAnton <[email protected]>

* Return tokens_in_batch default value

Signed-off-by: PeganovAnton <[email protected]>

* Remove duplicate parameters in `CommonDatasetParameters`

Signed-off-by: PeganovAnton <[email protected]>

* Remove duplicate parameters in config

Signed-off-by: PeganovAnton <[email protected]>

* Refactor user interface

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing parameter in calling setting dataloader up

Signed-off-by: PeganovAnton <[email protected]>

* fix: replace data config with model config

Signed-off-by: PeganovAnton <[email protected]>

* fix: typo in config parameter name

Signed-off-by: PeganovAnton <[email protected]>

* fix: location of label ids parameters in config

Signed-off-by: PeganovAnton <[email protected]>

* fix: transforming not first legacy data config

Signed-off-by: PeganovAnton <[email protected]>

* fix: num_samples can be negative

Signed-off-by: PeganovAnton <[email protected]>

* fix: create directory for nemo ids files

Signed-off-by: PeganovAnton <[email protected]>

* fix: remove unremoved with_label

Signed-off-by: PeganovAnton <[email protected]>

* fix: features contain ids if loaded from pickle

Signed-off-by: PeganovAnton <[email protected]>

* Fix kwargs parameters

Signed-off-by: PeganovAnton <[email protected]>

* Add label setting for testing case

Signed-off-by: PeganovAnton <[email protected]>

* Fix: change parameter location in config

Signed-off-by: PeganovAnton <[email protected]>

* Fix: transform legacy config in init

Signed-off-by: PeganovAnton <[email protected]>

* Fix: make minor improvement in checking config

Signed-off-by: PeganovAnton <[email protected]>

* fix: check label ids for None before checking pad label id

Signed-off-by: PeganovAnton <[email protected]>

* fix: set labels when restoring

Signed-off-by: PeganovAnton <[email protected]>

* fix: place where label ids are taken

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug

Signed-off-by: PeganovAnton <[email protected]>

* fix: register artifacts in set_label_ids

Signed-off-by: PeganovAnton <[email protected]>

* fix: perform checking only if label ids are not set

Signed-off-by: PeganovAnton <[email protected]>

* fix: set label_ids_are_set

Signed-off-by: PeganovAnton <[email protected]>

* Fix using of dataset in create tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* fix: manipulate label ids if fragment_idx is zero

Signed-off-by: PeganovAnton <[email protected]>

* fix: remove directory correctly

Signed-off-by: PeganovAnton <[email protected]>

* fix: vocab file names

Signed-off-by: PeganovAnton <[email protected]>

* fix: vocab file names

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Add directories for cache and label info

Signed-off-by: PeganovAnton <[email protected]>

* Minor fixes

Signed-off-by: PeganovAnton <[email protected]>

* Minor fix

Signed-off-by: PeganovAnton <[email protected]>

* Minor fix

Signed-off-by: PeganovAnton <[email protected]>

* Improve debug config

Signed-off-by: PeganovAnton <[email protected]>

* Create missing directories

Signed-off-by: PeganovAnton <[email protected]>

* Improve feature pkl file name

Signed-off-by: PeganovAnton <[email protected]>

* WORKING VERSION OF VOCAB CONFIG

Signed-off-by: PeganovAnton <[email protected]>

* Improve vocab file extraction

Signed-off-by: PeganovAnton <[email protected]>

* Fix config

Signed-off-by: PeganovAnton <[email protected]>

* Improve vocab file extraction

Signed-off-by: PeganovAnton <[email protected]>

* fix register artifact calls

Signed-off-by: PeganovAnton <[email protected]>

* fix: add class_labels to legacy fixing

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing method

Signed-off-by: PeganovAnton <[email protected]>

* Add support for checkpoints without class labels artifact

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing return values to function

Signed-off-by: PeganovAnton <[email protected]>

* fix saving label ids in creation of tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* fix: adjust tarred dataset consistency check

Signed-off-by: PeganovAnton <[email protected]>

* fix: consistency check call

Signed-off-by: PeganovAnton <[email protected]>

* Try checking labels every time dataloader is set

Signed-off-by: PeganovAnton <[email protected]>

* fi…
fayejf added a commit that referenced this pull request Mar 2, 2022
* cache_hf (#3406)

Signed-off-by: ekmb <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Learning annealing scheduler fix (#3400)

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* T5 Pre-training in NeMo using Megatron (#3036)

* add vocab_file and merge_file to megatron init

Signed-off-by: ericharper <[email protected]>

* add forward

Signed-off-by: ericharper <[email protected]>

* add train loss

Signed-off-by: ericharper <[email protected]>

* add optimizer

Signed-off-by: ericharper <[email protected]>

* add exp_manager

Signed-off-by: ericharper <[email protected]>

* multi-gpu is working

Signed-off-by: ericharper <[email protected]>

* adding val loop

Signed-off-by: ericharper <[email protected]>

* style

Signed-off-by: ericharper <[email protected]>

* adding val loop

Signed-off-by: ericharper <[email protected]>

* fix ranks

Signed-off-by: ericharper <[email protected]>

* fix model parallel checkpoint saving

Signed-off-by: ericharper <[email protected]>

* fix _del_model

Signed-off-by: ericharper <[email protected]>

* Initial megatron dataset port

Signed-off-by: MaximumEntropy <[email protected]>

* added megatron batch sampler

Signed-off-by: ericharper <[email protected]>

* try to fix num steps

Signed-off-by: ericharper <[email protected]>

* add wandb to config

Signed-off-by: ericharper <[email protected]>

* log lr

Signed-off-by: ericharper <[email protected]>

* add warmup ratio to config

Signed-off-by: ericharper <[email protected]>

* update configs

Signed-off-by: ericharper <[email protected]>

* update configs

Signed-off-by: ericharper <[email protected]>

* Fix merge conflicts

Signed-off-by: MaximumEntropy <[email protected]>

* add cpu init to args

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* License fixes and megatron model porting

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* More fixes to import from nemo rather than megatron

Signed-off-by: MaximumEntropy <[email protected]>

* Fix circular imports

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Revert config file

Signed-off-by: MaximumEntropy <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* Restructure further to avoid circular imports

Signed-off-by: MaximumEntropy <[email protected]>

* add Makefile

Signed-off-by: ericharper <[email protected]>

* Add megatron modules

Signed-off-by: MaximumEntropy <[email protected]>

* Add data makefile

Signed-off-by: MaximumEntropy <[email protected]>

* add license

Signed-off-by: ericharper <[email protected]>

* Port from latest megatron

Signed-off-by: MaximumEntropy <[email protected]>

* update cfg

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* add _del_model_without_trainer

Signed-off-by: ericharper <[email protected]>

* add data preprocessing script

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* use apex mpu

Signed-off-by: ericharper <[email protected]>

* replace print_rank_0 with nemo utils logging

Signed-off-by: ericharper <[email protected]>

* use apex mpu

Signed-off-by: ericharper <[email protected]>

* use apex mpu

Signed-off-by: ericharper <[email protected]>

* add use_cpu_initialization

Signed-off-by: ericharper <[email protected]>

* fixing autoresume in progress

Signed-off-by: ericharper <[email protected]>

* properly removing last checkpoint

Signed-off-by: ericharper <[email protected]>

* log consumed samples

Signed-off-by: ericharper <[email protected]>

* fix mp autoresume

Signed-off-by: ericharper <[email protected]>

* Megatron GPT training with NeMo tokenizers (#2818)

* Update files from megatron repo

Signed-off-by: MaximumEntropy <[email protected]>

* Remove non NLP data related files from megatron

Signed-off-by: MaximumEntropy <[email protected]>

* Merge megatron and nemo tokenizers

Signed-off-by: MaximumEntropy <[email protected]>

* Remove get_tokenizer() calls from gpt model

Signed-off-by: MaximumEntropy <[email protected]>

* Update tokenizer yaml config

Signed-off-by: MaximumEntropy <[email protected]>

* add NLPSaveRestoreConnector

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* update config

Signed-off-by: ericharper <[email protected]>

* make init_method_std configurable

Signed-off-by: ericharper <[email protected]>

* make gpu init work by setting random seed earlier

Signed-off-by: ericharper <[email protected]>

* fix gpu init after removing debug print in mpu

Signed-off-by: ericharper <[email protected]>

* add fused_adam

Signed-off-by: ericharper <[email protected]>

* check ds is not none before logging len

Signed-off-by: ericharper <[email protected]>

* set fp16 arg to true and fix enum conflict

Signed-off-by: ericharper <[email protected]>

* make fp16 arg configurable

Signed-off-by: ericharper <[email protected]>

* add grad clip from megatron

Signed-off-by: ericharper <[email protected]>

* Linear warmup with cosine annealing and constant holding (#2846)

* Testing cosine schedule

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Fixes

Signed-off-by: MaximumEntropy <[email protected]>

* More fixes

Signed-off-by: MaximumEntropy <[email protected]>

* update config for constant steps in schedule

Signed-off-by: ericharper <[email protected]>
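
For readers unfamiliar with the schedule named in this commit series, here is a minimal sketch of its shape (linear warmup, cosine annealing, then a constant hold at the minimum learning rate), assuming step-based scheduling; it is an illustration only, not the actual NeMo scheduler class:

```python
import math

def lr_at_step(step, max_lr, min_lr, warmup_steps, decay_steps):
    """Linear warmup to max_lr, cosine decay to min_lr, then hold min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / max(1, warmup_steps)
    progress = min((step - warmup_steps) / max(1, decay_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + (max_lr - min_lr) * cosine           # stays at min_lr once progress == 1.0

# e.g. any step past warmup_steps + decay_steps keeps returning min_lr (the "constant" phase)
```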

* temporarily import enum from megatron

Signed-off-by: ericharper <[email protected]>

* add grad clip for fp32

Signed-off-by: ericharper <[email protected]>

* update check for _del_model_without_trainer

Signed-off-by: ericharper <[email protected]>

* updating restore for model parallel

Signed-off-by: ericharper <[email protected]>

* add predict script

Signed-off-by: ericharper <[email protected]>

* update test iters

Signed-off-by: ericharper <[email protected]>

* add barrier

Signed-off-by: ericharper <[email protected]>

* return if clip_val is 0 or None

Signed-off-by: ericharper <[email protected]>

* when using amp clip grads after they are unscaled

Signed-off-by: ericharper <[email protected]>

* make native amp scaler hyperparams configurable

Signed-off-by: ericharper <[email protected]>

* (1) nvfuser, (2) amp-casting decoration (#2894)

* (1) nvfuser, (2) amp-casting decoration

Signed-off-by: Sangkug Lym <[email protected]>

* support bf16

Signed-off-by: Sangkug Lym <[email protected]>

* update package info

Signed-off-by: ericharper <[email protected]>

* add set device to constructor

Signed-off-by: ericharper <[email protected]>

* set_device in constructor

Signed-off-by: ericharper <[email protected]>

* [BigNLP] Remove megatron-lm dependency. (#2910)

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* update megatron_init

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* update process batch

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* add megatron clip_grad

Signed-off-by: ericharper <[email protected]>

* trying to resolve circular import error

Signed-off-by: ericharper <[email protected]>

* rename file

Signed-off-by: ericharper <[email protected]>

* remove non-gpt models and datasets from __init__ files

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set_device in constructor

Signed-off-by: ericharper <[email protected]>

* clean config

Signed-off-by: ericharper <[email protected]>

* update MegatronDataset

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16

Signed-off-by: ericharper <[email protected]>

* rename to fused_fp16

Signed-off-by: ericharper <[email protected]>

* add fused_fp16 arg to LayerNorm calls

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix import

Signed-off-by: ericharper <[email protected]>

* update arg

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* Adding complete method to MegatronGPTModel (#2935)

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* make ffn_hidden_size mandatory

Signed-off-by: ericharper <[email protected]>

* Manually migrating timing of step into branch (#2937)

* 1. Manually migrating timing of step into branch.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated file name and content.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated to latest code.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* remove unused imports

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* check fused_fp16 and fused_bf16 are not both True

Signed-off-by: ericharper <[email protected]>

* update predict script for model parallel .nemo

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>

* NVfuser (#2943)

* activation checkpoint recompute

Signed-off-by: Sangkug Lym <[email protected]>

* selective nvfuser setup

* Megatron gpt bfloat support (#2926)

* Save/restore fix

Signed-off-by: MaximumEntropy <[email protected]>

* Another merge

Signed-off-by: MaximumEntropy <[email protected]>

* Bf16 args in init

Signed-off-by: MaximumEntropy <[email protected]>

* Set precision

Signed-off-by: MaximumEntropy <[email protected]>

* Remove debug stuff

Signed-off-by: MaximumEntropy <[email protected]>

* add bf16 casting decorator

Signed-off-by: Sangkug Lym <[email protected]>

* Bfloat layernorm propagation

Signed-off-by: MaximumEntropy <[email protected]>

* activation checkpoint recompute

Signed-off-by: Sangkug Lym <[email protected]>

* selective nvfuser setup

* More arg removal

Signed-off-by: MaximumEntropy <[email protected]>

* Remove BERTDataset

Signed-off-by: MaximumEntropy <[email protected]>

* update to latest apex and patch transformer autocast

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: ericharper <[email protected]>

* don't set jit for bf16

Signed-off-by: ericharper <[email protected]>

* replace apex.mpu

Signed-off-by: ericharper <[email protected]>

* fix grad clip

Signed-off-by: ericharper <[email protected]>

* NVFuser fixes (#2951)

* Fuser fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Remove dummy handler

Signed-off-by: MaximumEntropy <[email protected]>

* Remove PTL plugin based logic for fusion

Signed-off-by: MaximumEntropy <[email protected]>

* remove duplicated file

Signed-off-by: ericharper <[email protected]>

* T5 model initial changes

Signed-off-by: MaximumEntropy <[email protected]>

* typo (#2960)

Signed-off-by: ericharper <[email protected]>

* [BigNLP] Script to convert GPT checkpoint to .nemo (#2958)

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* remove args in progress

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* add load_fused_kernels

Signed-off-by: ericharper <[email protected]>

* update megatron_init

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* add fused kernels

Signed-off-by: ericharper <[email protected]>

* update process batch

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* remove erroneous import

Signed-off-by: ericharper <[email protected]>

* add megatron clip_grad

Signed-off-by: ericharper <[email protected]>

* trying to resolve circular import error

Signed-off-by: ericharper <[email protected]>

* rename file

Signed-off-by: ericharper <[email protected]>

* remove non-gpt models and datasets from __init__ files

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set device in constructor for gpu init

Signed-off-by: ericharper <[email protected]>

* set_device in constructor

Signed-off-by: ericharper <[email protected]>

* clean config

Signed-off-by: ericharper <[email protected]>

* update MegatronDataset

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* clean up MegatronModule

Signed-off-by: ericharper <[email protected]>

* rename fp16 and bf16 flags to fused_softmax_input_in_fp16/bf16

Signed-off-by: ericharper <[email protected]>

* rename to fused_fp16

Signed-off-by: ericharper <[email protected]>

* add fused_fp16 arg to LayerNorm calls

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix arg name

Signed-off-by: ericharper <[email protected]>

* fix import

Signed-off-by: ericharper <[email protected]>

* update arg

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* skip warmup default to True

Signed-off-by: ericharper <[email protected]>

* Adding complete method to MegatronGPTModel (#2935)

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* make ffn_hidden_size mandatory

Signed-off-by: ericharper <[email protected]>

* Manually migrating timing of step into branch (#2937)

* 1. Manually migrating timing of step into branch.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated file name and content.

Signed-off-by: Micha Livne <[email protected]>

* 1. Updated to latest code.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>

* remove unused imports

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* remove unused import

Signed-off-by: ericharper <[email protected]>

* check fused_fp16 and fused_bf16 are not both True

Signed-off-by: ericharper <[email protected]>

* update predict script for model parallel .nemo

Signed-off-by: ericharper <[email protected]>

* typo

Signed-off-by: ericharper <[email protected]>

* add script to convert .ckpt to .nemo

Signed-off-by: ericharper <[email protected]>

* in progress

Signed-off-by: ericharper <[email protected]>

* update

Signed-off-by: ericharper <[email protected]>

* convert mp checkpoints to nemo

Signed-off-by: ericharper <[email protected]>

* update help

Signed-off-by: ericharper <[email protected]>

* add safeguard for model parallel save_to

Signed-off-by: ericharper <[email protected]>

* adjust NLPModel save_to to be safer for model parallel

Signed-off-by: Oleksii Kuchaiev <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>

* [BigNLP] Update GPT evaluation to work with tensor model parallel  (#2959)

* in progress

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add request dataset

Signed-off-by: ericharper <[email protected]>

* tokenize request

Signed-off-by: ericharper <[email protected]>

* in progress

Signed-off-by: ericharper <[email protected]>

* able to run

Signed-off-by: ericharper <[email protected]>

* reduce logits

Signed-off-by: ericharper <[email protected]>

* capture response

Signed-off-by: ericharper <[email protected]>

* squeeze and unsqueeze

Signed-off-by: ericharper <[email protected]>

* handle non model parallel case

Signed-off-by: ericharper <[email protected]>

* clean imports

Signed-off-by: ericharper <[email protected]>

* add file

Signed-off-by: ericharper <[email protected]>

* convert logits to log_probs

Signed-off-by: Oleksii Kuchaiev <[email protected]>

* rename logits to log_probs

Signed-off-by: Oleksii Kuchaiev <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>

* More changes

Signed-off-by: MaximumEntropy <[email protected]>

* Missing import

Signed-off-by: MaximumEntropy <[email protected]>

* Tokenizer fixes and adafactor

Signed-off-by: MaximumEntropy <[email protected]>

* Add adafactor

Signed-off-by: MaximumEntropy <[email protected]>

* Add training and conf scripts

Signed-off-by: MaximumEntropy <[email protected]>

* Add megatron t5 model

Signed-off-by: MaximumEntropy <[email protected]>

* t5 config to fp32

Signed-off-by: MaximumEntropy <[email protected]>

* [BigNLP] Remove fused kernel code instead use Apex (#2984)

* remove fused_kernels

Signed-off-by: ericharper <[email protected]>

* remove fused_kernels

Signed-off-by: ericharper <[email protected]>

* remove fused layer norm and fused softmax and use apex instead

Signed-off-by: ericharper <[email protected]>

* update imports

Signed-off-by: ericharper <[email protected]>

* remove comment

Signed-off-by: ericharper <[email protected]>

* use apex enums

Signed-off-by: ericharper <[email protected]>

* use apex enums

Signed-off-by: ericharper <[email protected]>

* Timer with sliding window (#3002)

Co-authored-by: Micha Livne <[email protected]>

* check for rank zero

Signed-off-by: ericharper <[email protected]>

* Remove ict dataset import

Signed-off-by: MaximumEntropy <[email protected]>

* Remove fused kernels

Signed-off-by: MaximumEntropy <[email protected]>

* style fix

Signed-off-by: ericharper <[email protected]>

* fix consumed_samples when resuming

Signed-off-by: ericharper <[email protected]>

* T5 consumed samples fix

Signed-off-by: MaximumEntropy <[email protected]>

* Remove megatron dep

Signed-off-by: MaximumEntropy <[email protected]>

* Change checkpoint filename format

Signed-off-by: MaximumEntropy <[email protected]>

* Log consumed samples in T5

Signed-off-by: MaximumEntropy <[email protected]>

* T5 lr scheduler

Signed-off-by: MaximumEntropy <[email protected]>

* Checkpoint conversion and data preproc updates for t5

Signed-off-by: MaximumEntropy <[email protected]>

* Denoising eval

Signed-off-by: MaximumEntropy <[email protected]>

* Clean up denoising example to explicitly provide mask positions

Signed-off-by: MaximumEntropy <[email protected]>

* Better logging of results

Signed-off-by: MaximumEntropy <[email protected]>

* Better printing of results

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* properly removing last checkpoint

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* add predict script

Signed-off-by: ericharper <[email protected]>

* T5 model initial changes

Signed-off-by: MaximumEntropy <[email protected]>

* Add adafactor

Signed-off-by: MaximumEntropy <[email protected]>

* Add training and conf scripts

Signed-off-by: MaximumEntropy <[email protected]>

* Add megatron t5 model

Signed-off-by: MaximumEntropy <[email protected]>

* t5 config to fp32

Signed-off-by: MaximumEntropy <[email protected]>

* Remove fused kernels

Signed-off-by: MaximumEntropy <[email protected]>

* fix consumed_samples when resuming

Signed-off-by: ericharper <[email protected]>

* T5 consumed samples fix

Signed-off-by: MaximumEntropy <[email protected]>

* Remove megatron dep

Signed-off-by: MaximumEntropy <[email protected]>

* Change checkpoint filename format

Signed-off-by: MaximumEntropy <[email protected]>

* Log consumed samples in T5

Signed-off-by: MaximumEntropy <[email protected]>

* T5 lr scheduler

Signed-off-by: MaximumEntropy <[email protected]>

* Checkpoint conversion and data preproc updates for t5

Signed-off-by: MaximumEntropy <[email protected]>

* Denoising eval

Signed-off-by: MaximumEntropy <[email protected]>

* Clean up denoising example to explicitly provide mask positions

Signed-off-by: MaximumEntropy <[email protected]>

* Better logging of results

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Merge main into megatron_t5

Signed-off-by: MaximumEntropy <[email protected]>

* Dataset preproc script

Signed-off-by: MaximumEntropy <[email protected]>

* Remove biencoder file

Signed-off-by: MaximumEntropy <[email protected]>

* Remove another unused file

Signed-off-by: MaximumEntropy <[email protected]>

* Remove preprocess script since it has moved

Signed-off-by: MaximumEntropy <[email protected]>

* Remove ICT dataset

Signed-off-by: MaximumEntropy <[email protected]>

* Remove orqa dataset

Signed-off-by: MaximumEntropy <[email protected]>

* Remove realm dataset

Signed-off-by: MaximumEntropy <[email protected]>

* More file removing

Signed-off-by: MaximumEntropy <[email protected]>

* Fix 2 files

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Rename checkpoint fname

Signed-off-by: MaximumEntropy <[email protected]>

* Loss averaging fixes in t5

Signed-off-by: MaximumEntropy <[email protected]>

* Minor changes

Signed-off-by: MaximumEntropy <[email protected]>

* add megatron gpt pretraining

Signed-off-by: ericharper <[email protected]>
Signed-off-by: MaximumEntropy <[email protected]>

* Remove weight decay stuff

Signed-off-by: MaximumEntropy <[email protected]>

* Training script update for PTL 1.5

Signed-off-by: MaximumEntropy <[email protected]>

* Update grad clip

Signed-off-by: MaximumEntropy <[email protected]>

* Update config

Signed-off-by: MaximumEntropy <[email protected]>

* Add barrier

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes and adding more stuff

Signed-off-by: MaximumEntropy <[email protected]>

* Missed merge conflict fix

Signed-off-by: MaximumEntropy <[email protected]>

* Unittest fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fix

Signed-off-by: MaximumEntropy <[email protected]>

* Inference changes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Fix reinstall script

Signed-off-by: MaximumEntropy <[email protected]>

* T5 CI tests

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Minor fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Minor fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Tokenizer arg fix

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Helpers fix

Signed-off-by: MaximumEntropy <[email protected]>

* Style fix

Signed-off-by: MaximumEntropy <[email protected]>

* PR review changes

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Refactor bert dataset stuff

Signed-off-by: MaximumEntropy <[email protected]>

* Fix typo

Signed-off-by: MaximumEntropy <[email protected]>

* Fix request dataset variable

Signed-off-by: MaximumEntropy <[email protected]>

* Fix sched params in CI test

Signed-off-by: MaximumEntropy <[email protected]>

* Change to kwargs and Jenkins test for inference

Signed-off-by: MaximumEntropy <[email protected]>

* PR review related changes

Signed-off-by: MaximumEntropy <[email protected]>

* More fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Test helper building

Signed-off-by: MaximumEntropy <[email protected]>

* Restore helper compilation everywhere

Signed-off-by: MaximumEntropy <[email protected]>

* Fix PR comments

Signed-off-by: MaximumEntropy <[email protected]>

* PR comments

Signed-off-by: MaximumEntropy <[email protected]>

* Add docstring to additional_special_tokens

Signed-off-by: MaximumEntropy <[email protected]>

* Improve docstring

Signed-off-by: MaximumEntropy <[email protected]>

* Fix resume from checkpoint path

Signed-off-by: MaximumEntropy <[email protected]>

* Fix for TP>1

Signed-off-by: MaximumEntropy <[email protected]>

* Remove fused fp16 and bf16 args

Signed-off-by: MaximumEntropy <[email protected]>

* Add missed file

Signed-off-by: MaximumEntropy <[email protected]>

* Learning annealing scheduler fix

Signed-off-by: MaximumEntropy <[email protected]>

* Change default optim and scheduler to adam

Signed-off-by: MaximumEntropy <[email protected]>

* dummy for CI restart

Signed-off-by: MaximumEntropy <[email protected]>

* Remove constant steps after switch to adam

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: ericharper <[email protected]>
Co-authored-by: Sangkug Lym <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Micha Livne <[email protected]>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Updates on ASR with diarization util files (#3359)

* Initial commit

Signed-off-by: Taejin Park <[email protected]>

* Update LM part and multiscale part in README.

Signed-off-by: Taejin Park <[email protected]>

* Removed redundant parts

Signed-off-by: Taejin Park <[email protected]>

* modified example script

Signed-off-by: Taejin Park <[email protected]>

* Revised doc strings

Signed-off-by: Taejin Park <[email protected]>

* Changed paths_to_manifest.py script

Signed-off-by: Taejin Park <[email protected]>

* Reflected PR comments and revised tutorials

Signed-off-by: Taejin Park <[email protected]>

* Added ASR models and kenlm installation 

Signed-off-by: [email protected]

* Added ASR models and kenlm installation 

Signed-off-by: [email protected]
Signed-off-by: Taejin Park <[email protected]>

* Changed docstrings and style fix

Signed-off-by: Taejin Park <[email protected]>

* Fixed unused import and vars

Signed-off-by: Taejin Park <[email protected]>

* Added LM part in ASR_diar tutorial.

Signed-off-by: Taejin Park <[email protected]>

Co-authored-by: fayejf <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* update docs and replace speakernet with titanet in tutorials (#3405)

* update docs and replace speakernet with titanet in tutorials

Signed-off-by: nithinraok <[email protected]>

* update dataset usage description

Signed-off-by: nithinraok <[email protected]>

* updated based on comments

Signed-off-by: nithinraok <[email protected]>

* spell fix

Signed-off-by: nithinraok <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update Mixer-TTS, FastPitch and TTSDataset (#3366)

* update tts dataset, fastpitch and mixer tts

Signed-off-by: Oktai Tatanov <[email protected]>

* fix style and notebooks

Signed-off-by: Oktai Tatanov <[email protected]>

* update notebooks

Signed-off-by: Oktai Tatanov <[email protected]>

* update mixer-tts, mixer-tts-x and fastpitch configs

Signed-off-by: Oktai Tatanov <[email protected]>

* update notebooks and configs

Signed-off-by: Oktai Tatanov <[email protected]>

* update configs

Signed-off-by: Oktai Tatanov <[email protected]>

* add links, update README, fix tutorials

Signed-off-by: Oktai Tatanov <[email protected]>

* fix style

Signed-off-by: Oktai Tatanov <[email protected]>

* remove unnecessary code from fastpitch model

Signed-off-by: Oktai Tatanov <[email protected]>

* update jenkinsfile and fastpitch typo fix

Signed-off-by: Oktai Tatanov <[email protected]>

* fix configs

Signed-off-by: Oktai Tatanov <[email protected]>

* revert jenkinsfile

Signed-off-by: Oktai Tatanov <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Asr fr (#3404)

* Pushing WFST_tutorial for open draft. (Still need to review Colab code.)

Signed-off-by: tbartley94 <[email protected]>

* Checked that the tutorial code for WFST_Tutorial is properly functioning. Also included some formatting edits.

Signed-off-by: tbartley94 <[email protected]>

* Responding to editorial comments for WFST_tutorial

Signed-off-by: tbartley94 <[email protected]>

* Added images to folder and wrote README for tutorials

Signed-off-by: tbartley94 <[email protected]>

* Few more editorial changes to explain permutations in classification.

Signed-off-by: tbartley94 <[email protected]>

* Updated tutorials documentation page.

Signed-off-by: tbartley94 <[email protected]>

* Forgot links for README

Signed-off-by: tbartley94 <[email protected]>

* TOC links were dead

Signed-off-by: tbartley94 <[email protected]>

* More dead links to fix.

Signed-off-by: tbartley94 <[email protected]>

* removing Colab install and appending a warning instead.

Signed-off-by: tbartley94 <[email protected]>

* Update WFST_Tutorial.ipynb

Signed-off-by: tbartley94 <[email protected]>

* Adding pretrained French models to ctc_bpe_models and rnnt_bpe_models available models listing

Signed-off-by: tbartley94 <[email protected]>

* Updating ctc_bpe_models import for updated Fr Conformer Ctc version.

Signed-off-by: tbartley94 <[email protected]>

* Added new French ASR models to documentation and imports: conformer transducer and conformer ctc trained without hyphenization.

Signed-off-by: tbartley94 <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Yang Zhang <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* [fix] for resume training on SLURM multi-node multi-gpu (#3374)

* [fix] for resume training on SLURM multi-node multi-gpu

On SLURM, resuming training in a multi-node multi-GPU setting fails: when `LOCAL_RANK` is undefined, `is_global_rank_zero()` returns true on all processes that run on node 0. In this case `exp_manager.py` https://github.com/NVIDIA/NeMo/blob/f83b2c5524a787be21ffea170850c4b5486eac2b/nemo/utils/exp_manager.py#L446 creates multiple `run_*` folders, which eventually leads to failure (missing files, because other processes have moved them already).

Checking also for `SLURM_PROCID` solves this issue, as that environment variable contains the global rank id.

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* Update get_rank.py

In a SLURM environment, return the SLURM global rank (SLURM_PROCID); fall back to the previous behaviour otherwise.

Signed-off-by: Iztok Lebar Bajec <[email protected]>

* style

Signed-off-by: Jason <[email protected]>

* Solved bug when either RANK or SLURM_PROCID returns 0 and the conditionals evaluate to False

Signed-off-by: Iztok Lebar Bajec <[email protected]>

Co-authored-by: Jason <[email protected]>
Signed-off-by: Bonham79 <[email protected]>
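
A minimal sketch of the rank-resolution idea described above, assuming only environment variables are available: prefer `SLURM_PROCID`, and use explicit `is not None` checks so that a legitimate rank of 0 is not treated as a missing value. This illustrates the approach, not the exact code in `nemo/utils/get_rank.py`:

```python
import os

def resolve_global_rank():
    """Hypothetical helper: SLURM global rank first, launcher variables otherwise."""
    slurm_rank = os.environ.get("SLURM_PROCID")
    if slurm_rank is not None:        # do not use truthiness: "0" is a valid rank
        return int(slurm_rank)
    rank = os.environ.get("RANK")
    if rank is not None:
        return int(rank)
    return int(os.environ.get("LOCAL_RANK", 0))
```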

* Fix running token classification in multinode setting (#3413)

* fix: master device check

Signed-off-by: PeganovAnton <[email protected]>

* Fix bug with use_cache parameter

Signed-off-by: PeganovAnton <[email protected]>

* create pickled features file regardless of value of use_cache

Signed-off-by: PeganovAnton <[email protected]>

* Improve docs

Signed-off-by: PeganovAnton <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Fix order of lang checking to ignore input langs (#3417)

* Fix order of lang checking

Signed-off-by: MaximumEntropy <[email protected]>

* Fix == error

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: PeganovAnton <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Refactor ASR Examples Directory (#3392)

* Begin refactor of ASR files

Signed-off-by: smajumdar <[email protected]>

* Update jenkins paths for ASR

Signed-off-by: smajumdar <[email protected]>

* Update speech_to_text_ctc

Signed-off-by: smajumdar <[email protected]>

* Update speech_to_text_ctc_bpe

Signed-off-by: smajumdar <[email protected]>

* Lowercase all directories

Signed-off-by: smajumdar <[email protected]>

* Fix RNNT num_workers

Signed-off-by: smajumdar <[email protected]>

* Fix RNNT num_workers

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* NMT MIM mean variance fix (#3385)

* 1. Updated default NMT bottleneck encoder to be non-autoregressive

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed mean/variance being tied when latent and hidden dimensions are the same.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* update to 21.12 (#3424)

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Working around Pytorch exporter issue with expand() (#3422)

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* update copyright (#3426)

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* remove apex (#3428)

Signed-off-by: ekmb <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* vad infer refactor (#3394)

* vad infer refactor

Signed-off-by: fayejf <[email protected]>

* remove duplicate in write_long_audio_manifest

Signed-off-by: fayejf <[email protected]>

* remove duplicate in script vad_overlap_posterior

Signed-off-by: fayejf <[email protected]>

* style fix

Signed-off-by: fayejf <[email protected]>

* fix nb

Signed-off-by: fayejf <[email protected]>

* small fix

Signed-off-by: fayejf <[email protected]>

* fix

Signed-off-by: fayejf <[email protected]>

* small fixes

Signed-off-by: fayejf <[email protected]>

* reflect taejin's review

Signed-off-by: fayejf <[email protected]>

* update tutorial about rename

Signed-off-by: fayejf <[email protected]>

* small fix

Signed-off-by: fayejf <[email protected]>

* merge main and fix

Signed-off-by: fayejf <[email protected]>

* tiny path fix

Signed-off-by: fayejf <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* doc update for refactory (#3430)

Signed-off-by: fayejf <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update LJSpeech preprocessing (#3423)

* update lj speech preprocessing

Signed-off-by: Oktai Tatanov <[email protected]>

* update lj speech preprocessing 2

Signed-off-by: Oktai Tatanov <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* NMT Shared Embeddings Weights (#3340)

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Implemented encoder decoder embedding weights tie.

Signed-off-by: Micha Livne <[email protected]>
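
As a reference for what "embedding weights tie" means here, a minimal PyTorch sketch with hypothetical vocabulary and hidden sizes; it only shows the generic technique of sharing one embedding matrix between encoder and decoder, not the NMT model's actual wiring:

```python
import torch.nn as nn

vocab_size, hidden_size = 32000, 512   # hypothetical sizes; tying requires a shared vocabulary
encoder_embedding = nn.Embedding(vocab_size, hidden_size)
decoder_embedding = nn.Embedding(vocab_size, hidden_size)
decoder_embedding.weight = encoder_embedding.weight   # one shared parameter, updated jointly
assert decoder_embedding.weight is encoder_embedding.weight
```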

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* [BigNLP] Make saving .nemo during on_train_end configurable (#3427)

* make save nemo configurable on train end

Signed-off-by: ericharper <[email protected]>

* add warning when save_best_model is True

Signed-off-by: ericharper <[email protected]>

Co-authored-by: Jason <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Preprocess an entire folder of .json or .json.gz files into a single .bin and .idx file. (#3425)

* Folder preproc

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

* Fix useless enumerate

Signed-off-by: MaximumEntropy <[email protected]>

* Address PR comments

Signed-off-by: MaximumEntropy <[email protected]>

* Style fixes

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>
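
A rough sketch of the folder-level input handling added here, assuming newline-delimited JSON and using only the standard library; writing the Megatron `.bin`/`.idx` output is deliberately omitted, so this is not the preprocessing script itself:

```python
import glob
import gzip
import json
import os

def iter_documents(folder):
    """Yield one JSON document per non-empty line from every .json / .json.gz file in `folder`."""
    paths = sorted(glob.glob(os.path.join(folder, "*.json"))) + \
            sorted(glob.glob(os.path.join(folder, "*.json.gz")))
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)
```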

* Update speaker diarization docs (#3419)

* Initial commit

Signed-off-by: Taejin Park <[email protected]>

* Fixed minor mistakes

Signed-off-by: Taejin Park <[email protected]>

* Some changes regarding diarization utils

Signed-off-by: Taejin Park <[email protected]>

* Fixed minor typos

Signed-off-by: Taejin Park <[email protected]>

* Reflected PR comments

Signed-off-by: Taejin Park <[email protected]>

* Reflected PR comments

Signed-off-by: Taejin Park <[email protected]>

* Reflected additional comments

Signed-off-by: Taejin Park <[email protected]>

* Changed pics and refined text

Signed-off-by: Taejin Park <[email protected]>

* Minor typos

Signed-off-by: Taejin Park <[email protected]>

* Minor change on dataset

Signed-off-by: Taejin Park <[email protected]>

* Minor change on dataset 2

Signed-off-by: Taejin Park <[email protected]>

* Changed manifest input to yaml format

Signed-off-by: Taejin Park <[email protected]>

* Capitalization of titles

Signed-off-by: Taejin Park <[email protected]>

* Last commit

Signed-off-by: Taejin Park <[email protected]>

Co-authored-by: fayejf <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Update ContextNet models trained on more datasets (#3440)

* Update ContextNet models trained on more datasets

Signed-off-by: smajumdar <[email protected]>

* Update ContextNet models trained on more datasets

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* 1. Updated default buffer_size for TimingCallback to 1. (#3439)

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Fix bug for missing variable (#3437)

Signed-off-by: MaximumEntropy <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Extending input_example() to take max batch and dimension arguments (#3429)

* Extending input_example() to take max batch and dimension arguments

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing conformer size reconfig, extending export script, some refactoring

Signed-off-by: Boris Fomitchev <[email protected]>

* Addressing comments

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing test issue

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing DecoderJoint input example

Signed-off-by: Boris Fomitchev <[email protected]>

* Removing soon-deprecated external format option addition

Signed-off-by: Boris Fomitchev <[email protected]>

* Fixing indentation typo

Signed-off-by: Boris Fomitchev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Byte-level Multilingual NMT (#3368)

* init

Signed-off-by: Abhinav Khattar <[email protected]>

* style

Signed-off-by: Abhinav Khattar <[email protected]>

* rm debug stuff

Signed-off-by: Abhinav Khattar <[email protected]>

* changes

Signed-off-by: Abhinav Khattar <[email protected]>

* fix

Signed-off-by: Abhinav Khattar <[email protected]>

* fix

Signed-off-by: Abhinav Khattar <[email protected]>

* error fix

Signed-off-by: Abhinav Khattar <[email protected]>

* make spl tokens optional

Signed-off-by: Abhinav Khattar <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Asr patches (#3443)

* Fix issues with num_workers for transcribe

Signed-off-by: smajumdar <[email protected]>

* During inference use full context of chunk

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Updated NumPy SDE requirement (#3442)

Signed-off-by: Vitaly Lavrukhin <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* refactor data preprocessing script (#3444)

Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* Prompt tuning loss mask fix (#3438)

* Switched to calculating loss on answer only

Signed-off-by: Virginia Adams <[email protected]>
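
A small sketch of what computing the loss on the answer only looks like, with hypothetical tensor shapes: prompt positions are masked out of the token-level cross-entropy before averaging. It illustrates the idea rather than the prompt-tuning model's actual loss code:

```python
import torch
import torch.nn.functional as F

def answer_only_loss(logits, labels, answer_start):
    """logits: [batch, seq, vocab]; labels: [batch, seq]; answer_start[i]: first answer token index."""
    batch, seq, vocab = logits.shape
    positions = torch.arange(seq).unsqueeze(0).expand(batch, seq)
    loss_mask = (positions >= answer_start.unsqueeze(1)).float()  # 0 over the prompt, 1 over the answer
    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab), labels.reshape(-1), reduction="none"
    ).reshape(batch, seq)
    return (token_loss * loss_mask).sum() / loss_mask.sum()

logits = torch.randn(2, 6, 10)
labels = torch.randint(0, 10, (2, 6))
print(answer_only_loss(logits, labels, torch.tensor([3, 4])))
```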

* Added CI tests and unit tests for prompt tuning dataset

Signed-off-by: Virginia Adams <[email protected]>

* Fixed Jenkinsfile typo

Signed-off-by: Virginia Adams <[email protected]>

* fixed Jenkinsfile typo

Signed-off-by: Virginia Adams <[email protected]>

* Fixed more typos so CI tests run all the way through

Signed-off-by: Virginia Adams <[email protected]>

* Fixed code formatting

Signed-off-by: Virginia Adams <[email protected]>

* Needed to add save nemo file on train end flag to CI test

Signed-off-by: Virginia Adams <[email protected]>

* Added save .nemo on train end flag to example script

Signed-off-by: Virginia Adams <[email protected]>

Co-authored-by: Oleksii Kuchaiev <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* BioMegatron token classification tutorial fix to be compatible with current Megatron BERT (#3435)

* fixed the tokenizer

Signed-off-by: Yi Dong <[email protected]>

* training is working

Signed-off-by: Yi Dong <[email protected]>

* fixed text

Signed-off-by: Yi Dong <[email protected]>

* fixed text

Signed-off-by: Yi Dong <[email protected]>

* working notebook

Signed-off-by: Yi Dong <[email protected]>

* style fix

Signed-off-by: Yi Dong <[email protected]>

* fixed text

Signed-off-by: Yi Dong <[email protected]>

* handles the different megatron-lm checkpoint versions

Signed-off-by: Yi Dong <[email protected]>

* fixed the text classification notebook

Signed-off-by: Yi Dong <[email protected]>

* fixed key error

Signed-off-by: Yi Dong <[email protected]>

* more key error

Signed-off-by: Yi Dong <[email protected]>

* replace the old notebooks

Signed-off-by: Yi Dong <[email protected]>

* register vocab to nemo file

Signed-off-by: Yi Dong <[email protected]>

* added the missing notebook

Signed-off-by: Yi Dong <[email protected]>

Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Bonham79 <[email protected]>

* (1) O2-style mixed precision recipe, (2) Persistent layer-norm, (3) Grad scale hysteresis, (4) gradient_as_bucket_view (#3259)

* half precision training w/o autocast using master param

stage fp16 working version

fix: fp32 grad accumulation

bf16 support

Signed-off-by: Sangkug Lym <[email protected]>

add closure fn at bf16
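
For context, a compact sketch of the master-parameter recipe listed above (half-precision forward/backward without autocast, with fp32 gradient accumulation and updates), assuming a toy linear model; loss scaling and the bf16 path are omitted, so this illustrates only the update flow, not the NeMo/Apex implementation:

```python
import torch

use_half = torch.cuda.is_available()                 # run the fp16 path only on GPU
device = "cuda" if use_half else "cpu"
dtype = torch.float16 if use_half else torch.float32

model = torch.nn.Linear(16, 4).to(device=device, dtype=dtype)       # low-precision working copy
master = [p.detach().clone().float() for p in model.parameters()]   # fp32 master parameters
opt = torch.optim.SGD(master, lr=1e-2)

x = torch.randn(8, 16, device=device, dtype=dtype)
y = torch.randn(8, 4, device=device, dtype=dtype)

loss = torch.nn.functional.mse_loss(model(x).float(), y.float())
loss.backward()                                      # gradients land on the working copy

for p, m in zip(model.parameters(), master):
    m.grad = p.grad.detach().float()                 # accumulate and update in fp32
opt.step()
opt.zero_grad(set_to_none=True)

with torch.no_grad():
    for p, m in zip(model.parameters(), master):
        p.copy_(m.to(dtype))                         # copy updated masters back to the working copy
model.zero_grad(set_to_none=True)
```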

* change autocast compatible with latest pytorch version

Signed-off-by: Sangkug Lym <[email protected]>

* add module to the state_dict naming

Signed-off-by: Sangkug Lym <[email protected]>

* cleanup arguments

Signed-off-by: Sangkug Lym <[email protected]>

* fix module state matching upon checkpoint resume

Signed-off-by: Sangkug Lym <[email protected]>

* persistent layer norm and dependency check

Signed-off-by: Sangkug Lym <[email protected]>

check container version instead of pytorch version

Signed-off-by: Sangkug Lym <[email protected]>

update config

* dependency check

Signed-off-by: Sangkug Lym <[email protected]>

* add gradient_as_bucket_view arg to config

Signed-off-by: Sangkug Lym <[email protected]>

* (1) add hysteresis to grad scaler, and (2) add grad_scaler to TB

Signed-off-by: Sangkug Lym <[email protected]>

* doc link fixes (#3264)

Signed-off-by: nithinraok <[email protected]>

* escape chars fix (#3253)

* escape chars fix

Signed-off-by: ekmb <[email protected]>

* bug fixes

Signed-off-by: ekmb <[email protected]>

* review

Signed-off-by: ekmb <[email protected]>

Co-authored-by: Yang Zhang <[email protected]>

* Improve data pipeline for punctuation capitalization model and make other useful changes (#3159)

* Fix: inference on short sequences problem

Signed-off-by: PeganovAnton <[email protected]>

* Add draft of new punctuation and capitalization model

Signed-off-by: PeganovAnton <[email protected]>

* Fix debug config

Signed-off-by: PeganovAnton <[email protected]>

* Add parameter check

Signed-off-by: PeganovAnton <[email protected]>

* Update punctuation training script

Signed-off-by: PeganovAnton <[email protected]>

* Fix head config parameter names

Signed-off-by: PeganovAnton <[email protected]>

* Fix ds_item and class_label parameters in config

Signed-off-by: PeganovAnton <[email protected]>

* Fix dataloader shuffling for tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* Reduce validation batch

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Fix metrics initialization

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug

Signed-off-by: PeganovAnton <[email protected]>

* Fix device problem

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Register metrics properly

Signed-off-by: PeganovAnton <[email protected]>

* Put metrics setup after module init

Signed-off-by: PeganovAnton <[email protected]>

* Reduce model size

Signed-off-by: PeganovAnton <[email protected]>

* Add wandb logging

Signed-off-by: PeganovAnton <[email protected]>

* Change wandb name

Signed-off-by: PeganovAnton <[email protected]>

* Fix logging names for metrics

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Add returning from eval steps

Signed-off-by: PeganovAnton <[email protected]>

* Add second dev dataset

Signed-off-by: PeganovAnton <[email protected]>

* Move config

Signed-off-by: PeganovAnton <[email protected]>

* Fix path to dataset

Signed-off-by: PeganovAnton <[email protected]>

* Add more tokenizer parameters

Signed-off-by: PeganovAnton <[email protected]>

* Add debug script for more tokenizer in creating tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* Update output path in debug script

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug in typing

Signed-off-by: PeganovAnton <[email protected]>

* Fix bug in parsing arguments

Signed-off-by: PeganovAnton <[email protected]>

* Do not pass tokenizer through queue

Signed-off-by: PeganovAnton <[email protected]>

* Set hf tokenizer in debug script

Signed-off-by: PeganovAnton <[email protected]>

* Try char vocabulary

Signed-off-by: PeganovAnton <[email protected]>

* Fix typo

Signed-off-by: PeganovAnton <[email protected]>

* Improve error message

Signed-off-by: PeganovAnton <[email protected]>

* Fix OOV problem

Signed-off-by: PeganovAnton <[email protected]>

* Add label ids creation and getting

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing parameter

Signed-off-by: PeganovAnton <[email protected]>

* Improve error message for label ids building

Signed-off-by: PeganovAnton <[email protected]>

* Add short tar files repacking

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug and add more security

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug

Signed-off-by: PeganovAnton <[email protected]>

* fix: replace Path with str

Signed-off-by: PeganovAnton <[email protected]>

* fix: iter datasets

Signed-off-by: PeganovAnton <[email protected]>

* Improve logging

Signed-off-by: PeganovAnton <[email protected]>

* Turn off repacking

Signed-off-by: PeganovAnton <[email protected]>

* Turn off repacking

Signed-off-by: PeganovAnton <[email protected]>

* Turn on repacking

Signed-off-by: PeganovAnton <[email protected]>

* Turn off repacking

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Improve unexpected removal

Signed-off-by: PeganovAnton <[email protected]>

* Turn on repacking

Signed-off-by: PeganovAnton <[email protected]>

* fix: remove repacked files

Signed-off-by: PeganovAnton <[email protected]>

* Add default config for testing

Signed-off-by: PeganovAnton <[email protected]>

* Improve code style in evaluate script

Signed-off-by: PeganovAnton <[email protected]>

* Add docstrings

Signed-off-by: PeganovAnton <[email protected]>

* Remove debug config

Signed-off-by: PeganovAnton <[email protected]>

* Remove commented code

Signed-off-by: PeganovAnton <[email protected]>

* Fix code style in doc string

Signed-off-by: PeganovAnton <[email protected]>

* Fix usage of parser.error function

Signed-off-by: PeganovAnton <[email protected]>

* Improve working with config and fix restoring of old checkpoints

Signed-off-by: PeganovAnton <[email protected]>

* Do not demand cfg as dataclass

Signed-off-by: PeganovAnton <[email protected]>

* Add backward compatibility for absence of use_tarred_dataset

Signed-off-by: PeganovAnton <[email protected]>

* Fight for backwards compatibility

Signed-off-by: PeganovAnton <[email protected]>

* Add tokens_in_batch backward compatibility

Signed-off-by: PeganovAnton <[email protected]>

* Undo unintentional changes in tutorial

Signed-off-by: PeganovAnton <[email protected]>

* Do not allow more workers than queries

Signed-off-by: PeganovAnton <[email protected]>

* Fix metric names in tests

Signed-off-by: PeganovAnton <[email protected]>

* Fix metric location

Signed-off-by: PeganovAnton <[email protected]>

* Fix metric location

Signed-off-by: PeganovAnton <[email protected]>

* Require ds_item or data_dir

Signed-off-by: PeganovAnton <[email protected]>

* Disable multiprocessing data preparation by default

Signed-off-by: PeganovAnton <[email protected]>

* Disable multiprocessing data preparation by default

Signed-off-by: PeganovAnton <[email protected]>

* Disable multiprocessing data preparation by default

Signed-off-by: PeganovAnton <[email protected]>

* Make minor improvements in docstrings and typing

Signed-off-by: PeganovAnton <[email protected]>

* Fix finetuning code

Signed-off-by: PeganovAnton <[email protected]>

* Fix shuffle train dataset config parameter

Signed-off-by: PeganovAnton <[email protected]>

* Fix evaluation script

Signed-off-by: PeganovAnton <[email protected]>

* Update test

Signed-off-by: PeganovAnton <[email protected]>

* Add new test and make minor changes

Signed-off-by: PeganovAnton <[email protected]>

* Fix repacked file names

Signed-off-by: PeganovAnton <[email protected]>

* Add assertion error

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug in regex

Signed-off-by: PeganovAnton <[email protected]>

* Improve Jenkins command

Signed-off-by: PeganovAnton <[email protected]>

* Fix code style

Signed-off-by: PeganovAnton <[email protected]>

* fix: add name to Jenkins stage

Signed-off-by: PeganovAnton <[email protected]>

* fix: add steps block to Jenkins stage

Signed-off-by: PeganovAnton <[email protected]>

* fix: move nemo_experiments removal to post section

Previously I encountered a weird error

+ rm -rf nemo_experiments
rm: cannot remove 'nemo_experiments': Directory not empty
script returned exit code 1

And I suspect that this could be because two parallel stages try to
remove the same directory simultaneously.

Signed-off-by: PeganovAnton <[email protected]>

* Turn off cache usage in Jenkins for token classification models

Signed-off-by: PeganovAnton <[email protected]>

* Stop pickling features

Signed-off-by: PeganovAnton <[email protected]>

* Reference webdataset in docs

Signed-off-by: PeganovAnton <[email protected]>

* Make multiple minor improvements

Signed-off-by: PeganovAnton <[email protected]>

* Add parameters tokens_in_batch, repack to documentation

Signed-off-by: PeganovAnton <[email protected]>

* Refactoring and improving readability

Signed-off-by: PeganovAnton <[email protected]>

* Make tar_shuffle_n optional parameter

Signed-off-by: PeganovAnton <[email protected]>

* Fix path to label vocab files

Signed-off-by: PeganovAnton <[email protected]>

* Fix metadata label vocab key

Signed-off-by: PeganovAnton <[email protected]>

* Create for_nemo directory

Signed-off-by: PeganovAnton <[email protected]>

* Fix tar_shuffle_n default value

Signed-off-by: PeganovAnton <[email protected]>

* First round of review fixes

Signed-off-by: PeganovAnton <[email protected]>

* Return tokens_in_batch default value

Signed-off-by: PeganovAnton <[email protected]>

* Remove duplicate parameters in `CommonDatasetParameters`

Signed-off-by: PeganovAnton <[email protected]>

* Remove duplicate parameters in config

Signed-off-by: PeganovAnton <[email protected]>

* Refactor user interface

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing parameter when setting up the dataloader

Signed-off-by: PeganovAnton <[email protected]>

* fix: replace data config with model config

Signed-off-by: PeganovAnton <[email protected]>

* fix: typo in config parameter name

Signed-off-by: PeganovAnton <[email protected]>

* fix: location of label ids parameters in config

Signed-off-by: PeganovAnton <[email protected]>

* fix: transforming not first legacy data config

Signed-off-by: PeganovAnton <[email protected]>

* fix: num_samples can be negative

Signed-off-by: PeganovAnton <[email protected]>

* fix: create directory for nemo ids files

Signed-off-by: PeganovAnton <[email protected]>

* fix: remove unremoved with_label

Signed-off-by: PeganovAnton <[email protected]>

* fix: features contain ids if loaded from pickle

Signed-off-by: PeganovAnton <[email protected]>

* Fix kwargs parameters

Signed-off-by: PeganovAnton <[email protected]>

* Add label setting for testing case

Signed-off-by: PeganovAnton <[email protected]>

* Fix: change parameter location in config

Signed-off-by: PeganovAnton <[email protected]>

* Fix: transform legacy config in init

Signed-off-by: PeganovAnton <[email protected]>

* Fix: make minor improvement in checking config

Signed-off-by: PeganovAnton <[email protected]>

* fix: check label ids for None before checking pad label id

Signed-off-by: PeganovAnton <[email protected]>

* fix: set labels when restoring

Signed-off-by: PeganovAnton <[email protected]>

* fix: place where label ids are taken

Signed-off-by: PeganovAnton <[email protected]>

* Fix minor bug

Signed-off-by: PeganovAnton <[email protected]>

* fix: register artifacts in set_label_ids

Signed-off-by: PeganovAnton <[email protected]>

* fix: perform checking only if label ids are not set

Signed-off-by: PeganovAnton <[email protected]>

* fix: set label_ids_are_set

Signed-off-by: PeganovAnton <[email protected]>

* Fix using of dataset in create tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* fix: manipulate label ids if fragment_idx is zero

Signed-off-by: PeganovAnton <[email protected]>

* fix: remove directory correctly

Signed-off-by: PeganovAnton <[email protected]>

* fix: vocab file names

Signed-off-by: PeganovAnton <[email protected]>

* fix: vocab file names

Signed-off-by: PeganovAnton <[email protected]>

* Add debug print

Signed-off-by: PeganovAnton <[email protected]>

* Add directories for cache and label info

Signed-off-by: PeganovAnton <[email protected]>

* Minor fixes

Signed-off-by: PeganovAnton <[email protected]>

* Minor fix

Signed-off-by: PeganovAnton <[email protected]>

* Minor fix

Signed-off-by: PeganovAnton <[email protected]>

* Improve debug config

Signed-off-by: PeganovAnton <[email protected]>

* Create missing directories

Signed-off-by: PeganovAnton <[email protected]>

* Improve feature pkl file name

Signed-off-by: PeganovAnton <[email protected]>

* WORKING VERSION OF VOCAB CONFIG

Signed-off-by: PeganovAnton <[email protected]>

* Improve vocab file extraction

Signed-off-by: PeganovAnton <[email protected]>

* Fix config

Signed-off-by: PeganovAnton <[email protected]>

* Improve vocab file extraction

Signed-off-by: PeganovAnton <[email protected]>

* fix register artifact calls

Signed-off-by: PeganovAnton <[email protected]>

* fix: add class_labels to legacy fixing

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing method

Signed-off-by: PeganovAnton <[email protected]>

* Add support for checkpoints without class labels artifact

Signed-off-by: PeganovAnton <[email protected]>

* fix: add missing return values to function

Signed-off-by: PeganovAnton <[email protected]>

* fix saving label ids in creation of tarred dataset

Signed-off-by: PeganovAnton <[email protected]>

* fix: adjust tarred dataset consistency check

Signed-off-by: PeganovAnton <[email protected]>

* fix: consistency check call

Signed-off-by: PeganovAnton <[email protected]>

* Try checking labels every time dataloader is set

Signed-off-by: PeganovAnton <[email protected]>

* fi…