
fixed trainer.strategy=auto from None. #7369

Merged
merged 1 commit into main from bugfix-ptl-strategy on Sep 6, 2023
Conversation

@XuesongYang (Collaborator) commented Sep 5, 2023

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
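
A minimal sketch of the change in the title, assuming PyTorch Lightning 2.x semantics (the script path and override below are placeholders, not taken from this PR):

    # Lightning 2.x defaults to strategy="auto"; passing strategy=None is no longer accepted.
    import pytorch_lightning as pl

    trainer = pl.Trainer(devices=1, accelerator="gpu", strategy="auto")  # previously strategy=None

    # Equivalent Hydra-style override when launching a NeMo training script:
    #   python examples/tts/<train_script>.py trainer.devices=1 trainer.strategy=auto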

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

@titu1994 (Collaborator) left a comment

Thanks !

@titu1994 (Collaborator) commented Sep 5, 2023

I don't think this needs more approvals; it's restricted to TTS. Merge when tests pass.

@XuesongYang XuesongYang merged commit a850ec5 into main Sep 6, 2023
15 checks passed
@XuesongYang XuesongYang deleted the bugfix-ptl-strategy branch September 6, 2023 16:32
yaoyu-33 pushed a commit that referenced this pull request Oct 16, 2023
ericharper added a commit that referenced this pull request Nov 3, 2023
* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mixins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkensfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progress bar to reflect total microbatch cnt

Signed-off-by: Abhishree <[email protected]>

* Modify CustomProgressBar class

1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch
2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder

Signed-off-by: Abhishree <[email protected]>

* Add CustomProgressBar callback to tuning files

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
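
As a rough illustration of the per-global-step behavior described in the commit message above (a sketch only, not the actual NeMo CustomProgressBar; the class name and details are assumed):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import TQDMProgressBar

    class GlobalStepProgressBar(TQDMProgressBar):
        # Advance the bar once per completed optimizer (global) step instead of
        # once per microbatch, so gradient accumulation does not inflate the count.
        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            self.train_progress_bar.n = trainer.global_step
            self.train_progress_bar.refresh()

    # Usage: Trainer(max_steps=1000, callbacks=[GlobalStepProgressBar()])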

* Set Activation Checkpointing Defaults (#7404)

* Set Activation Checkpointing Defaults

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for None

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make loss mask default to false (#7407)

Signed-off-by: eharper <[email protected]>

* Add dummy userbuffer config files (#7408)

Signed-off-by: Sangkug Lym <[email protected]>

* add missing ubconf files (#7412)

Signed-off-by: Abhinav Khattar <[email protected]>

* New tutorial on Speech Data Explorer (#7405)

* Added Google Colab-based tutorial on Speech Data Explorer

Signed-off-by: George Zelenfroynd <[email protected]>

* Update ptl training ckpt conversion script to work with dist ckpt (#7416)

* update ptl convert script

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break legacy

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)

* Allow disabling sanity checking when num_sanity_val_steps=0

Signed-off-by: Abhishree <[email protected]>

* Update num_sanity_val_steps to be a multiple of num_microbatches

Signed-off-by: Abhishree Thittenamane <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add comprehensive error messages (#7261)

Signed-off-by: Anton Peganov <[email protected]>

* check NEMO_PATH (#7418)

Signed-off-by: Nikolay Karpov <[email protected]>

* layer selection for ia3 (#7417)

* layer selection for ia3

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix missing pip package 'einops' (#7397)

Signed-off-by: Robin Dong <[email protected]>

* Fix failure of pyaudio in Google Colab (#7396)

Signed-off-by: Robin Dong <[email protected]>

* Update README.md: output_path --> output_manifest_filepath (#7442)

Signed-off-by: Samuele Cornell <[email protected]>

* Updating FlashAttention API to match FlashAttentionV2

* Multiple fixes for mm

* Fix CI inductor issue and update to torch compile

* Remove suppress error

* Fix when conversion config uses fp16 and it complains about precision plugin

* Fixing FAv2 API usage

* Initial release of content filtering model

* Added synthetic dataloader for precached and online mode

* Mingyuanm/dreambooth opt

* Add llama2 support in neva training

* Fix sampler length

* Fix all precision issues in nemo multimodal

* Add rope dynamic linear scaling (#7437)

* Add dynamic linear scaling

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Fix None dataloader issue in PTL2.0 (#7455)

* Fix None dataloader issue in PTL2.0

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [ASR] Confidence measure -> method renames (#7434)

* measure -> method

Signed-off-by: Aleksandr Laptev <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aleksandr Laptev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)

* Add steps for document of getting dataset 'SF Bilingual Speech'

Signed-off-by: Robin Dong <[email protected]>

* Update datasets.rst

added a link from a tutorial demonstrating detailed data prep steps.

Signed-off-by: Xuesong Yang <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* RNN-T confidence and alignment bugfix (#7381)

* new frame_confidence and alignments lists are now always created after the while loop

Signed-off-by: Aleksandr Laptev <[email protected]>

* tests added

Signed-off-by: Aleksandr Laptev <[email protected]>

---------

Signed-off-by: Aleksandr Laptev <[email protected]>

* Fix resume from checkpoint in exp_manager (#7424) (#7426)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix checking of cuda/cpu device for inputs of Decoder (#7444)

* Fix checking of cuda/cpu device for inputs of Decoder

Signed-off-by: Robin Dong <[email protected]>

* Update tacotron2.py

Signed-off-by: Jason <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Jason <[email protected]>

* Fix failure of ljspeech's get_data.py (#7430)

* Fix failure of ljspeech's get_data.py

Signed-off-by: Robin Dong <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Fix audio codec type checks (#7373)

* [TTS] Fix audio codec type checks

Signed-off-by: Ryan <[email protected]>

* [TTS] Fix audio codec tests

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* [TTS] Add dataset to path of logged artifacts (#7462)

* [TTS] Add dataset to path of logged artifacts

Signed-off-by: Ryan <[email protected]>

* [TTS] Revert axis name back to Audio Frames

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Fix sft dataset truncation (#7464)

* Add fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)

* striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling

Signed-off-by: mburchi <[email protected]>

* transpose conv1d inputs

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: mburchi <[email protected]>

* Update subsampling.py

change striding_conv1d_k5 to striding_conv1d

Signed-off-by: Maxime Burchi <[email protected]>

* cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* video manifest

Signed-off-by: mburchi <[email protected]>

* add collection classes

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add test_step_outputs

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* clean references

Signed-off-by: mburchi <[email protected]>

* freeze unfreeze transcribe cv models

Signed-off-by: mburchi <[email protected]>

* correct manifest get_full_path bug

Signed-off-by: mburchi <[email protected]>

* update for PR

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* guard torchvision

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>

* _video_speech_collate_fn in cv/data/video_to_text.py

Signed-off-by: mburchi <[email protected]>

* add self.out = None to asr subsampling

Signed-off-by: mburchi <[email protected]>

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>

* cv -> multimodal/speech_cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: mburchi <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Igor Gitman <[email protected]>

* HF StarCoder to NeMo conversion script (#7421)

* Script to convert HF StarCoder checkpoint to NeMo

Signed-off-by: Jan Lasek <[email protected]>

* StarCoder conversion test

Signed-off-by: Jan Lasek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Lasek <[email protected]>

* Fix test

Signed-off-by: Jan Lasek <[email protected]>

* Catch up with save_to changes

Signed-off-by: Jan Lasek <[email protected]>

* Don't abbreviate args for clarity

Signed-off-by: Jan Lasek <[email protected]>

* Configurable precision: BF16 vs FP32

Signed-off-by: Jan Lasek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Lasek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix bug when loading dist ckpt in peft (#7452)

Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* Fix adding positional embeddings in-place in transformer module (#7440)

Signed-off-by: Tamerlan Tabolov <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Fix (#7478)

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* add sleep (#7498) (#7499)

* add sleep



* add sleep onto config instead



* add comment



---------

Signed-off-by: Gerald Shen <[email protected]>
Co-authored-by: Gerald Shen <[email protected]>

* Fix exp manager check for sleep (#7503) (#7504)

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* bugfix: trainer.accelerator=auto from None. (#7492) (#7493)

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* [doc] fix broken link (#7481)

Signed-off-by: Stas Bekman <[email protected]>

* [TTS] Read audio as int32 to avoid flac read errors (#7477)

* [TTS] Read audio as int32 to avoid flac read errors

Signed-off-by: Ryan <[email protected]>

* [TTS] Add comment about read failures

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS (#7409)

* Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS
* Train 'AISHELL-3' dataset with multi-speakers

Signed-off-by: Robin Dong <[email protected]>

* Update get_data.py

update copyright header

Signed-off-by: Xuesong Yang <[email protected]>

* Update get_data.py

added a disclaimer

Signed-off-by: Xuesong Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add new configuration file for AISHELL3 with multispeaker of fastpitch

Signed-off-by: Robin Dong <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>

* dllogger - log on rank 0 only (#7513)

Signed-off-by: Stas Bekman <[email protected]>

* Fix TTS FastPitch tutorial (#7494) (#7516)

* Fix

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Fix get_dist() tensor dimension (#7506) (#7515)

Signed-off-by: Jocelyn Huang <[email protected]>
Co-authored-by: Jocelyn <[email protected]>

* bugfix: specify trainer.strategy=auto when devices=1 (#7509) (#7512)

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* fix (#7511)

Signed-off-by: Abhinav Khattar <[email protected]>

* [TTS] Fix FastPitch data prep tutorial (#7524)

Signed-off-by: Ryan <[email protected]>

* add italian tokenization (#7486)

* add italian tokenization

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more ipa lexicon it

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix error deletion

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* add test

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: GiacomoLeoneMaria <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Replace None strategy with auto in tutorial notebooks (#7521) (#7527)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>

* unpin setuptools (#7534) (#7535)

Signed-off-by: fayejf <[email protected]>
Co-authored-by: fayejf <[email protected]>

* remove auto generated examples (#7510)

* explicitly remove autogenerated examples for data parallel evaluation

Signed-off-by: arendu <[email protected]>

* mark autogenerated and remove it for test

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add the `strategy` argument to `MegatronGPTModel.generate()` (#7264)

It is passed as an explicit argument rather than through
`**strategy_args` so as to ensure someone cannot accidentally pass other
arguments that would end up being ignored.

It is a keyword-only argument to ensure that if in the future we want to
update the signature to `**strategy_args`, we can do it without breaking
code.

Signed-off-by: Olivier Delalleau <[email protected]>
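
A minimal sketch of the signature shape described above (illustrative names and defaults, not the exact NeMo signature):

```python
# The bare `*` is the point: it makes `strategy` keyword-only, so callers
# must write generate(..., strategy=my_strategy) and cannot pass extra
# strategy-related arguments positionally by accident.
def generate(self, inputs, length_params, sampling_params=None, *, strategy=None):
    ...
```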

* Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dataloader issue (#7531) (#7533)

* fix none dataloader issue ptl2



* ptl2.0 logging fixes for rnnt_models



---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: Kunal Dhawan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>

* gpus -> devices (#7542) (#7545)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>

* Update FFMPEG version to fix issue with torchaudio (#7551) (#7553)

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* PEFT GPT & T5 Refactor (#7308)

* initial implementation of add_adapters API

* correct type hint

* Add config in add_adapters for save and load (@author bobchen)

* Remove AdapterConfig to avoid import error

* Add AdapterConfig back and move adapter mixin to SFT model

* Add NLPSaveRestoreConnector as default in NLPModel.restore_from

* Add restore_from_nemo_with_adapter and test script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rename t5 file and classes to be consistent with GPT

* add t5 sft dataset

* add support for single-file format with T5SFTDataset

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Various small changes to make T5 SFT work like GPT SFT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add adapter evaluation test script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add MultiAdapterConfig for IA3 and fix builder issue

* Make ptuning for T5SFTModel work using mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add IA3_Adapter for AdapterName

* Add adapter name for ptuning and attention adapter

* Make test script GPT/T5 agnostic

* Add layer selection feature

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Integrate adapter name and config

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt peft tuning script to new API

* add t5 peft tuning script with new API

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix IA3 layer selection issue

* Override state_dict on SFT model instead of mixin

* Add load adapter by adapter config

* move peft config map away from example script

* auto get config from nemo adapter

* Move PEFTConfig to new file

* fix ckpt save/load for t5

* name change: add_adapters -> add_adapter

* variable name change

* update t5 script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix t5 issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add weight tying

* update gpt tuning script

* PEFT-API proposal

* Fix according to comments

* update tuning scripts

* move merge_cfg_with to mixin class since it applies to both gpt and t5 and requires the model class for restore

* Add mcore_gpt support for NLPAdapterMixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* variable name change to distinguish "peft" and "adapter"

* override `load_adapters` to support `add_adapter` name change

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update tuning and eval script for adapter save/load

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add Ptuning on first stage only

* add lora tutorial for review

* Fix layer selection for mcore

* add landing page

* fix resume training

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add mcore condition in sharded_state_dict to make sft work

* Update lora_tutorial.md

First edit of this file for PEFT documentation for NeMo

Signed-off-by: hkelly33 <[email protected]>

* rename Adapter to AttentionAdapter to avoid confusion in doc

* Change load_adapters to load .nemo

* add quick start guide

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add load_adapters with .ckpt

* Remove setup_complete changes in load_adapters

* update landing page

* remove typo

* Updated quick_start.md per Chen Cui

Signed-off-by: hkelly33 <[email protected]>

* Add inference config merger and tutorial

* Add docstring for NLPAdapterModelMixin and deprecation warning on MegatronGPTPEFTModel

* add suppor…
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mxins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
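
As a rough illustration of the middle-truncation idea above, a hypothetical helper (not the actual dataset code) might look like:

```python
def middle_truncate(tokens, max_len):
    """Keep the head and tail of a token sequence, dropping the middle."""
    if max_len <= 0:
        return []
    if len(tokens) <= max_len:
        return tokens
    head = max_len // 2
    tail = max_len - head
    return tokens[:head] + tokens[-tail:]
```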

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible with old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
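
For context, a minimal sketch of what the fix amounts to on the PyTorch Lightning side (assumed usage, not NeMo's trainer-builder code): Lightning 2.x no longer accepts `strategy=None`, so single-device runs pass `"auto"` instead.

```python
import pytorch_lightning as pl

# "auto" lets Lightning pick a suitable strategy for the available devices.
trainer = pl.Trainer(devices=1, accelerator="gpu", strategy="auto")
```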

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkensfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progress bar to reflect total microbatch cnt

Signed-off-by: Abhishree <[email protected]>

* Modify CustomProgressBar class

1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch
2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder

Signed-off-by: Abhishree <[email protected]>
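
A minimal sketch of the per-global-step idea (illustrative only, not the NeMo CustomProgressBar):

```python
from pytorch_lightning.callbacks import Callback

class GlobalStepProgress(Callback):
    """Report progress once per optimizer step instead of once per microbatch."""

    def __init__(self):
        self._last_step = -1

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # trainer.global_step advances only after a full gradient-accumulation
        # cycle, so this fires once per global step.
        if trainer.global_step != self._last_step:
            self._last_step = trainer.global_step
            print(f"global step {trainer.global_step}/{trainer.max_steps}")
```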

* Add CustomProgressBar callback to tuning files

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Set Activation Checkpointing Defaults (#7404)

* Set Activation Checkpointing Defaults

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for None

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make loss mask default to false (#7407)

Signed-off-by: eharper <[email protected]>

* Add dummy userbuffer config files (#7408)

Signed-off-by: Sangkug Lym <[email protected]>

* add missing ubconf files (#7412)

Signed-off-by: Abhinav Khattar <[email protected]>

* New tutorial on Speech Data Explorer (#7405)

* Added Google Colab based tutorial on Speech Data Explorer

Signed-off-by: George Zelenfroynd <[email protected]>

* Update ptl training ckpt conversion script to work with dist ckpt (#7416)

* update ptl convert script

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break legacy

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)

* Allow disabling sanity checking when num_sanity_val_steps=0

Signed-off-by: Abhishree <[email protected]>

* Update num_sanity_val_steps to be a multiple of num_microbatches

Signed-off-by: Abhishree Thittenamane <[email protected]>
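
The rounding described above amounts to something like the following (a hypothetical helper, shown only to illustrate the arithmetic):

```python
import math

def effective_sanity_steps(num_sanity_val_steps: int, num_microbatches: int) -> int:
    if num_sanity_val_steps <= 0:
        return 0  # sanity checking disabled
    # Round up to the nearest multiple of num_microbatches.
    return math.ceil(num_sanity_val_steps / num_microbatches) * num_microbatches
```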

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add comprehensive error messages (#7261)

Signed-off-by: Anton Peganov <[email protected]>

* check NEMO_PATH (#7418)

Signed-off-by: Nikolay Karpov <[email protected]>

* layer selection for ia3 (#7417)

* layer selection for ia3

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix missing pip package 'einops' (#7397)

Signed-off-by: Robin Dong <[email protected]>

* Fix failure of pyaudio in Google Colab (#7396)

Signed-off-by: Robin Dong <[email protected]>

* Update README.md: output_path --> output_manifest_filepath (#7442)

Signed-off-by: Samuele Cornell <[email protected]>

* Updating FlashAttention API to match FlashAttentionV2

* Multiple fixes for mm

* Fix CI inductor issue and update to torch compile

* Remove suppress error

* Fix when conversion config uses fp16 and it complains about precision plugin

* Fixing FAv2 API usage

* Initial release of content filtering model

* Added synthetic dataloader for precached and online mode

* Mingyuanm/dreambooth opt

* Add llama2 support in neva training

* Fix sampler length

* Fix all precision issues in nemo multimodal

* Add rope dynamic linear scaling (#7437)

* Add dynamic linear scaling

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
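
A rough sketch of the linear-scaling idea for rotary position embeddings (names and the trigger condition are assumptions, not the NeMo implementation): positions beyond the pretrained context are compressed back into the trained range.

```python
import torch

def scaled_positions(seq_len: int, pretrained_max_len: int) -> torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32)
    if seq_len > pretrained_max_len:
        # Dynamic linear scaling: squeeze positions into the pretrained range
        # so rotary frequencies stay in-distribution for long inputs.
        positions = positions * (pretrained_max_len / seq_len)
    return positions
```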

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Fix None dataloader issue in PTL2.0 (#7455)

* Fix None dataloader issue in PTL2.0

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [ASR] Confidence measure -> method renames (#7434)

* measure -> method

Signed-off-by: Aleksandr Laptev <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aleksandr Laptev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)

* Add steps for document of getting dataset 'SF Bilingual Speech'

Signed-off-by: Robin Dong <[email protected]>

* Update datasets.rst

added a link from a tutorial demonstrating detailed data prep steps.

Signed-off-by: Xuesong Yang <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* RNN-T confidence and alignment bugfix (#7381)

* new frame_confidence and alignments lists are now always created after the while loop

Signed-off-by: Aleksandr Laptev <[email protected]>

* tests added

Signed-off-by: Aleksandr Laptev <[email protected]>

---------

Signed-off-by: Aleksandr Laptev <[email protected]>

* Fix resume from checkpoint in exp_manager (#7424) (#7426)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix checking of cuda/cpu device for inputs of Decoder (#7444)

* Fix checking of cuda/cpu device for inputs of Decoder

Signed-off-by: Robin Dong <[email protected]>

* Update tacotron2.py

Signed-off-by: Jason <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Jason <[email protected]>

* Fix failure of ljspeech's get_data.py (#7430)

* Fix failure of ljspeech's get_data.py

Signed-off-by: Robin Dong <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Fix audio codec type checks (#7373)

* [TTS] Fix audio codec type checks

Signed-off-by: Ryan <[email protected]>

* [TTS] Fix audio codec tests

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* [TTS] Add dataset to path of logged artifacts (#7462)

* [TTS] Add dataset to path of logged artifacts

Signed-off-by: Ryan <[email protected]>

* [TTS] Revert axis name back to Audio Frames

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Fix sft dataset truncation (#7464)

* Add fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)

* striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling

Signed-off-by: mburchi <[email protected]>

* transpose conv1d inputs

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: mburchi <[email protected]>

* Update subsampling.py

change striding_conv1d_k5 to striding_conv1d

Signed-off-by: Maxime Burchi <[email protected]>

* cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* video manifest

Signed-off-by: mburchi <[email protected]>

* add collection classes

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add test_step_outputs

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* clean references

Signed-off-by: mburchi <[email protected]>

* freeze unfreeze transcribe cv models

Signed-off-by: mburchi <[email protected]>

* correct manifest get_full_path bug

Signed-off-by: mburchi <[email protected]>

* update for PR

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* guard torchvision

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>

* _video_speech_collate_fn in cv/data/video_to_text.py

Signed-off-by: mburchi <[email protected]>

* add self.out = None to asr subsampling

Signed-off-by: mburchi <[email protected]>

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>

* cv -> multimodal/speech_cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: mburchi <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Igor Gitman <[email protected]>

* HF StarCoder to NeMo conversion script (#7421)

* Script to convert HF StarCoder checkpoint to NeMo

Signed-off-by: Jan Lasek <[email protected]>

* StarCoder conversion test

Signed-off-by: Jan Lasek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Lasek <[email protected]>

* Fix test

Signed-off-by: Jan Lasek <[email protected]>

* Catch up with save_to changes

Signed-off-by: Jan Lasek <[email protected]>

* Don't abbreviate args for clarity

Signed-off-by: Jan Lasek <[email protected]>

* Configurable precision: BF16 vs FP32

Signed-off-by: Jan Lasek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Lasek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix bug when loading dist ckpt in peft (#7452)

Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* Fix adding positional embeddings in-place in transformer module (#7440)

Signed-off-by: Tamerlan Tabolov <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
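
A rough sketch of what the fix above amounts to (tensor names and shapes here are made up for illustration, not the actual NeMo transformer code): the positional embeddings should be added out of place, so the shared embedding tensor is never mutated.

    import torch

    token_embeddings = torch.randn(2, 8, 16)   # [batch, seq, hidden], illustrative shapes
    pos_embeddings = torch.randn(1, 8, 16)     # positional table slice reused across calls

    # Buggy pattern: token_embeddings.add_(pos_embeddings) (or +=) mutates a tensor that
    # may alias a cached buffer and breaks autograd if that tensor is reused later.
    # Fixed pattern: out-of-place addition leaves both inputs untouched.
    hidden_states = token_embeddings + pos_embeddings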

* Fix (#7478)

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* add sleep (#7498) (#7499)

* add sleep

* add sleep onto config instead

* add comment

---------

Signed-off-by: Gerald Shen <[email protected]>
Co-authored-by: Gerald Shen <[email protected]>

* Fix exp manager check for sleep (#7503) (#7504)

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* bugfix: trainer.accelerator=auto from None. (#7492) (#7493)

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* [doc] fix broken link (#7481)

Signed-off-by: Stas Bekman <[email protected]>

* [TTS] Read audio as int32 to avoid flac read errors (#7477)

* [TTS] Read audio as int32 to avoid flac read errors

Signed-off-by: Ryan <[email protected]>

* [TTS] Add comment about read failures

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>
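
Illustration only (not the exact NeMo loader code): `soundfile.read` accepts a `dtype` argument, so samples can be decoded as int32 and normalized afterwards, which sidesteps the occasional flac float-decode failures mentioned above.

    import numpy as np
    import soundfile as sf

    # Decode as int32 to avoid flac read errors, then scale to float32 in [-1, 1].
    samples, sample_rate = sf.read("example.flac", dtype="int32")
    samples = samples.astype(np.float32) / np.iinfo(np.int32).max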

* Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS (#7409)

* Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS
* Train on the 'AISHELL-3' dataset with multiple speakers

Signed-off-by: Robin Dong <[email protected]>

* Update get_data.py

update copyright header

Signed-off-by: Xuesong Yang <[email protected]>

* Update get_data.py

added a disclaimer

Signed-off-by: Xuesong Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add new configuration file for AISHELL3 with multispeaker of fastpitch

Signed-off-by: Robin Dong <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>

* dllogger - log on rank 0 only (#7513)

Signed-off-by: Stas Bekman <[email protected]>

* Fix TTS FastPitch tutorial (#7494) (#7516)

* Fix

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Fix get_dist() tensor dimension (#7506) (#7515)

Signed-off-by: Jocelyn Huang <[email protected]>
Co-authored-by: Jocelyn <[email protected]>

* bugfix: specify trainer.strategy=auto when devices=1 (#7509) (#7512)

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
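
For context, this is roughly the change in PyTorch Lightning 2.x terms (a minimal sketch with a bare `Trainer`; the NeMo scripts set the same fields through their Hydra configs): `strategy=None` is no longer accepted, so single-device runs pass "auto" instead.

    import pytorch_lightning as pl

    # PTL 2.x rejects strategy=None; "auto" lets Lightning pick the single-device strategy.
    trainer = pl.Trainer(devices=1, accelerator="auto", strategy="auto")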

* fix (#7511)

Signed-off-by: Abhinav Khattar <[email protected]>

* [TTS] Fix FastPitch data prep tutorial (#7524)

Signed-off-by: Ryan <[email protected]>

* add italian tokenization (#7486)

* add italian tokenization

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more ipa lexicon it

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix error deletion

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* add test

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: GiacomoLeoneMaria <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Replace None strategy with auto in tutorial notebooks (#7521) (#7527)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>

* unpin setuptools (#7534) (#7535)

Signed-off-by: fayejf <[email protected]>
Co-authored-by: fayejf <[email protected]>

* remove auto generated examples (#7510)

* explicitly remove autogenerated examples for data parallel evaluation

Signed-off-by: arendu <[email protected]>

* mark autogenerated and remove it for test

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add the `strategy` argument to `MegatronGPTModel.generate()` (#7264)

It is passed as an explicit argument rather than through
`**strategy_args` so as to ensure someone cannot accidentally pass other
arguments that would end up being ignored.

It is a keyword-only argument to ensure that if in the future we want to
update the signature to `**strategy_args`, we can do it without breaking
code.

Signed-off-by: Olivier Delalleau <[email protected]>
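
A simplified sketch of the keyword-only pattern described above (the signature is abbreviated and is not the real `generate()` definition):

    # Simplified; the actual method takes additional parameters.
    def generate(self, inputs, length_params, sampling_params=None, *, strategy=None):
        # `strategy` can only be passed by name, e.g. model.generate(..., strategy=s),
        # so stray positional arguments cannot be silently swallowed and ignored.
        ...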

* Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dataloader issue (#7531) (#7533)

* fix none dataloader issue ptl2

* ptl2.0 logging fixes for rnnt_models

---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: Kunal Dhawan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>

* gpus -> devices (#7542) (#7545)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>

* Update FFMPEG version to fix issue with torchaudio (#7551) (#7553)

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* PEFT GPT & T5 Refactor (#7308)

* initial implementation of add_adapters API

* correct type hint

* Add config in add_adapters for save and load (@author bobchen)

* Remove AdapterConfig to avoid import error

* Add AdapterConfig back and move adaptermixin to sft model

* Add NLPSaveRestoreConnector as default in NLPModel.restore_from

* Add restore_from_nemo_with_adapter and test script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rename t5 file and classes to be consistent with GPT

* add t5 sft dataset

* add support for single-file format with T5SFTDataset

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Various small changes to make T5 SFT work like GPT SFT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add adapter evaluation test script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add MultiAdapterConfig for ia3 and fix builder issue

* Make ptuning for T5SFTModel work using mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add IA3_Adapter for AdapterName

* Add adapter name for ptuning and attention adapter

* Make test script GPT/T5 agnostic

* Add layer selection feature

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Integrate adapter name and config

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt peft tuning script to new API

* add t5 peft tuning script with new API

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix IA3 layer selection issue

* Override state_dict on SFT model instead of mixin

* Add load adapter by adapter config

* move peft config map away from example script

* auto get config from nemo adapter

* Move PEFTConfig to new file

* fix ckpt save/load for t5

* name change: add_adapters -> add_adapter

* variable name change

* update t5 script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix t5 issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add weight tying

* update gpt tuning script

* PEFT-API proposal

* Fix according to comments

* update tuning scripts

* move merge_cfg_with to mixin class since it applies to both gpt and t5 and requires the model class for restore

* Add mcore_gpt support for NLPAdapterMixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* variable name change to distinguish "peft" and "adapter"

* override `load_adapters` to support `add_adapter` name change

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update tuning and eval script for adapter save/load

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add Ptuning on first stage only

* add lora tutorial for review

* Fix layer selection for mcore

* add landing page

* fix resume training

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add mcore condition in sharded_state_dict to make sft work

* Update lora_tutorial.md

First edit of this file for PEFT documentation for NeMo

Signed-off-by: hkelly33 <[email protected]>

* rename Adapter to AttentionAdapter to avoid confusion in doc

* Change load_adapters to load .nemo

* add quick start guide

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add load_adapters with .ckpt

* Remove setup_complete changes in load_adapters

* update landing page

* remove typo

* Updated quick_start.md per Chen Cui

Signed-off-by: hkelly33 <[email protected]>

* Add inference config merger and tutorial

* Add doc string for NLPAdapterModelMixin and deprecated warning on MegatronGPTPEFTModel

* add supported_methods.md and update other documentations

* Update supported_methods.md

minor updates.

Signed-off-by: Adi Renduchintala <[email protected]>

* Update landing_page.md

minor update.

Signed-off-by: Adi Renduchintala <[email protected]>

* Modify doc string for NLPAdapterModelMixin

* Add doc string add_adapters in NLPAdapterModelMixin

* rename canonical adapters

* remove mcore hard dependency

* [PATCH] move microbatch calculator to nemo from apex

* remove apex dependency in gpt and t5 sft models

* remove apex dependency in gpt model

* render doc strings

* fix

* Add missing virtual_tokens on ptuning

* fix docstrings

* update gpt-style model coverage in docs

* update docstring

* Remove pdb

* add lightning_fabric to make docstring rendering work

* Add Ptuning missing key

* try docstring rendering

* Fix ptuning issue

* update gpt t5 peft tuning and eval scripts

* typos

* update eval config

* fix bug relating to apex dependency removal

* typo

* make predict step behave the same as test step

* make lora tutorial work in notebook

* cosmetics

* update yaml scripts

* mcore_gpt attribute optional

* typo

* update eval scripts and fix T5 eval bugs

* add NLPDDPStrategyNotebook and trainer builder logic to use it

* update lora notebook to use new trainer builder

* fix microbatch calculator bug for inference after training

* Convert markdown files to RST and incorporate with doc

* typo

* revise language

* remove extra cell

* remove unnecessary inheritance

* remove old tests

* move layer selection default so logging messages make sense

* remove `save_adapters` as adapter weights are saved automatically during training

* initialize weights from a checkpoint instead of randomly

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fix…
stevehuang52 added a commit that referenced this pull request Feb 21, 2024
* add pleasefixme marker for potential failed nightly tests. (#7678)

Signed-off-by: Xuesong Yang <[email protected]>

* Add new text segmentation library for better TTS quality (#7645)

* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py

added detailed instruction on how to install pkuseg.

Signed-off-by: Xuesong Yang <[email protected]>

* Update requirements_tts.txt

remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need.

Signed-off-by: Xuesong Yang <[email protected]>


---------

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>
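
For readers who do want the optional segmenter, a minimal sketch of the manual route described above (based on pkuseg's public API; pkuseg is not installed with NeMo by default):

    # After manually running: pip install pkuseg
    import pkuseg

    seg = pkuseg.pkuseg()            # default Chinese word-segmentation model
    tokens = seg.cut("你好,世界")     # returns a list of segmented tokens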

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer



* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add '32-true' for precision values



---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there is no unexpected keys in state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>
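
A toy example of the distinction the fix relies on (made-up tensor names, not the actual ConditionalInput code): conditioning features belong on the last (feature) dimension, not the batch dimension.

    import torch

    hidden = torch.randn(4, 100, 256)        # [batch, time, features]
    conditioning = torch.randn(4, 100, 64)   # [batch, time, cond_features]

    # Wrong: torch.cat([...], dim=0) would stack along the batch dimension.
    # Right: concatenate along the feature dimension.
    combined = torch.cat([hidden, conditioning], dim=-1)   # -> [4, 100, 320]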

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
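
Roughly what setting the context to fork means in plain Python (illustrative only; the NeMo text-memmap code wires this into its own dataset builder): a fork context lets worker processes inherit the parent's memory-mapped state instead of rebuilding it under spawn.

    import multiprocessing as mp

    # Request the fork start method explicitly so children share the parent's
    # already-built memory-mapped index rather than re-importing everything.
    ctx = mp.get_context("fork")
    with ctx.Pool(processes=4) as pool:
        lengths = pool.map(len, ["some", "memmapped", "lines"])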

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)



* update apex install in dockerfile



* use fetch head



---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if it's a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extract hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mxins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkensfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progress bar to reflect total microbatch cnt

Signed-off-by: Abhishree <[email protected]>

* Modify CustomProgressBar class

1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch
2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder

Signed-off-by: Abhishree <[email protected]>

* Add CustomProgressBar callback to tuning files

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Set Activation Checkpointing Defaults (#7404)

* Set Activation Checkpointing Defaults

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for None

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make loss mask default to false (#7407)

Signed-off-by: eharper <[email protected]>

* Add dummy userbuffer config files (#7408)

Signed-off-by: Sangkug Lym <[email protected]>

* add missing ubconf files (#7412)

Signed-off-by: Abhinav Khattar <[email protected]>

* New tutorial on Speech Data Explorer (#7405)

* Added Google Colab based tutorial on Speech Data Explorer 

Signed-off-by: George Zelenfroynd <[email protected]>

* Update ptl training ckpt conversion script to work with dist ckpt (#7416)

* update ptl convert script

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break legacy

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)

* Allow disabling sanity checking when num_sanity_val_steps=0

Signed-off-by: Abhishree <[email protected]>

* Update num_sanity_val_steps to be a multiple of num_microbatches

Signed-off-by: Abhishree Thittenamane <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add comprehensive error messages (#7261)

Signed-off-by: Anton Peganov <[email protected]>

* check NEMO_PATH (#7418)

Signed-off-by: Nikolay Karpov <[email protected]>

* layer selection for ia3 (#7417)

* layer selection for ia3

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix missing pip package 'einops' (#7397)

Signed-off-by: Robin Dong <[email protected]>

* Fix failure of pyaudio in Google Colab (#7396)

Signed-off-by: Robin Dong <[email protected]>

* Update README.md: output_path --> output_manifest_filepath (#7442)

Signed-off-by: Samuele Cornell <[email protected]>

* Updating FlashAttention API to match FlashAttentionV2

* Multiple fixes for mm

* Fix CI inductor issue and update to torch compile

* Remove suppress error

* Fix when conversion config uses fp16 and it complains about precision plugin

* Fixing FAv2 API usage

* Initial release of content filtering model

* Added synthetic dataloader for precached and online mode

* Mingyuanm/dreambooth opt

* Add llama2 support in neva training

* Fix sampler length

* Fix all precision issues in nemo multimodal

* Add rope dynamic linear scaling (#7437)

* Add dynamic linear scaling

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Fix None dataloader issue in PTL2.0 (#7455)

* Fix None dataloader issue in PTL2.0

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [ASR] Confidence measure -> method renames (#7434)

* measure -> method

Signed-off-by: Aleksandr Laptev <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aleksandr Laptev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)

* Add steps for document of getting dataset 'SF Bilingual Speech'

Signed-off-by: Robin Dong <[email protected]>

* Update datasets.rst

added a link to a tutorial demonstrating detailed data prep steps.

Signed-off-by: Xuesong Yang <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* RNN-T confidence and alignment bugfix (#7381)

* new frame_confidence and alignments lists are now always created after the while loop

Signed-off-by: Aleksandr Laptev <[email protected]>

* tests added

Signed-off-by: Aleksandr Laptev <[email protected]>

---------

Signed-off-by: Aleksandr Laptev <[email protected]>

* Fix resume from checkpoint in exp_manager (#7424) (#7426)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix checking of cuda/cpu device for inputs of Decoder (#7444)

* Fix checking of cuda/cpu device for inputs of Decoder

Signed-off-by: Robin Dong <[email protected]>

* Update tacotron2.py

Signed-off-by: Jason <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Jason <[email protected]>

* Fix failure of ljspeech's get_data.py (#7430)

* Fix failure of ljspeech's get_data.py

Signed-off-by: Robin Dong <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Fix audio codec type checks (#7373)

* [TTS] Fix audio codec type checks

Signed-off-by: Ryan <[email protected]>

* [TTS] Fix audio codec tests

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* [TTS] Add dataset to path of logged artifacts (#7462)

* [TTS] Add dataset to path of logged artifacts

Signed-off-by: Ryan <[email protected]>

* [TTS] Revert axis name back to Audio Frames

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Fix sft dataset truncation (#7464)

* Add fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)

* striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling

Signed-off-by: mburchi <[email protected]>

* transpose conv1d inputs

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, s…

stevehuang52 added a commit that referenced this pull request Feb 22, 2024
* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there are no unexpected keys in the state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)



* update apex install in dockerfile



* use fetch head



---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if its a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extract hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mixins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
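
For reference, a minimal sketch of the usage this fix refers to, assuming a standard PyTorch Lightning 2.x Trainer (illustrative values, not the exact NeMo TTS config): the `strategy` argument no longer accepts `None`, so "auto" is used as the default string instead.

    from pytorch_lightning import Trainer

    # Sketch only: in Lightning 2.x `strategy` must be a string or a Strategy
    # object, so "auto" replaces the old `None` default.
    trainer = Trainer(devices=1, accelerator="auto", strategy="auto", max_epochs=1)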

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
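
As a rough illustration of the O2 toggle mentioned above (a sketch, not the exact NeMo eval script): the flag is the config field `megatron_amp_O2`, typically read from the loaded config or supplied as a Hydra override at inference time.

    from omegaconf import OmegaConf

    # Hypothetical inference-config fragment; `megatron_amp_O2` enables the
    # O2-style mixed-precision path during evaluation.
    cfg = OmegaConf.create({"megatron_amp_O2": True, "precision": "bf16"})
    use_o2 = cfg.get("megatron_amp_O2", False)
    print(f"Running eval with megatron_amp_O2={use_o2}")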

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progress bar to reflect total microbatch cnt

Signed-off-by: Abhishree <[email protected]>

* Modify CustomProgressBar class

1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch
2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder

Signed-off-by: Abhishree <[email protected]>

* Add CustomProgressBar callback to tuning files

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
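
A rough sketch of the idea behind the progress-bar change, assuming the standard pytorch_lightning TQDMProgressBar API (class and method bodies here are illustrative, not the exact NeMo implementation):

    from pytorch_lightning.callbacks import TQDMProgressBar

    class GlobalStepProgressBar(TQDMProgressBar):
        """Illustrative only: advance the bar once per global (optimizer) step
        instead of once per microbatch, so the timing reflects full train steps."""

        def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
            # trainer.global_step increments only after a complete optimizer step
            self.train_progress_bar.n = trainer.global_step
            self.train_progress_bar.refresh()
            self.train_progress_bar.set_postfix(self.get_metrics(trainer, pl_module))

The callback would then be passed to the trainer alongside the other callbacks, e.g. Trainer(callbacks=[GlobalStepProgressBar()]).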

* Set Activation Checkpointing Defaults (#7404)

* Set Activation Checkpointing Defaults

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for None

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
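
A minimal sketch of what "set defaults and check for None" typically means here; the key names follow the usual NeMo GPT config, but the fallback behavior shown is an assumption:

    from omegaconf import OmegaConf

    # Illustrative only: read activation-checkpointing options with explicit
    # None defaults so downstream code can test them safely (values made up).
    cfg = OmegaConf.create({"activations_checkpoint_granularity": "selective"})
    granularity = cfg.get("activations_checkpoint_granularity", None)
    method = cfg.get("activations_checkpoint_method", None)
    num_layers = cfg.get("activations_checkpoint_num_layers", None)
    if granularity is None:
        # nothing to checkpoint: clear the related knobs as well
        method, num_layers = None, None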

* make loss mask default to false (#7407)

Signed-off-by: eharper <[email protected]>

* Add dummy userbuffer config files (#7408)

Signed-off-by: Sangkug Lym <[email protected]>

* add missing ubconf files (#7412)

Signed-off-by: Abhinav Khattar <[email protected]>

* New tutorial on Speech Data Explorer (#7405)

* Added a Google Colab-based tutorial on Speech Data Explorer

Signed-off-by: George Zelenfroynd <[email protected]>

* Update ptl training ckpt conversion script to work with dist ckpt (#7416)

* update ptl convert script

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break legacy

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)

* Allow disabling sanity checking when num_sanity_val_steps=0

Signed-off-by: Abhishree <[email protected]>

* Update num_sanity_val_steps to be a multiple of num_microbatches

Signed-off-by: Abhishree Thittenamane <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
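
The rounding described above amounts to simple arithmetic; a sketch (the helper name is made up):

    def round_up_to_microbatches(num_sanity_val_steps: int, num_microbatches: int) -> int:
        """Illustrative helper: 0 disables sanity checking entirely, any other
        request is rounded up to a whole number of global batches."""
        if num_sanity_val_steps <= 0:
            return 0
        return ((num_sanity_val_steps + num_microbatches - 1) // num_microbatches) * num_microbatches

    assert round_up_to_microbatches(2, 4) == 4
    assert round_up_to_microbatches(0, 4) == 0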

* Add comprehensive error messages (#7261)

Signed-off-by: Anton Peganov <[email protected]>

* check NEMO_PATH (#7418)

Signed-off-by: Nikolay Karpov <[email protected]>

* layer selection for ia3 (#7417)

* layer selection for ia3

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix missing pip package 'einops' (#7397)

Signed-off-by: Robin Dong <[email protected]>

* Fix failure of pyaudio in Google Colab (#7396)

Signed-off-by: Robin Dong <[email protected]>

* Update README.md: output_path --> output_manifest_filepath (#7442)

Signed-off-by: Samuele Cornell <[email protected]>

* Updating FlashAttention API to match FlashAttentionV2

* Multiple fixes for mm

* Fix CI inductor issue and update to torch compile

* Remove suppress error

* Fix when conversion config uses fp16 and it complains about precision plugin

* Fixing FAv2 API usage

* Initial release of content filtering model

* Added synthetic dataloader for precached and online mode

* Mingyuanm/dreambooth opt

* Add llama2 support in neva training

* Fix sampler length

* Fix all precision issues in nemo multimodal

* Add rope dynamic linear scaling (#7437)

* Add dynamic linear scaling

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>
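
For reference, linear RoPE scaling is usually implemented by dividing the position indices fed to the rotary embedding by a scale factor; the snippet below is one common formulation, not necessarily the exact NeMo code.

    import torch

    def linear_scaled_rope_angles(seq_len: int, dim: int, scale: float, base: float = 10000.0):
        # Standard rotary frequencies ...
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        # ... applied to positions shrunk by `scale`, which stretches the usable context.
        positions = torch.arange(seq_len).float() / scale
        return torch.outer(positions, inv_freq)  # (seq_len, dim // 2) rotation angles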

* Fix None dataloader issue in PTL2.0 (#7455)

* Fix None dataloader issue in PTL2.0

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [ASR] Confidence measure -> method renames (#7434)

* measure -> method

Signed-off-by: Aleksandr Laptev <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aleksandr Laptev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)

* Add steps for document of getting dataset 'SF Bilingual Speech'

Signed-off-by: Robin Dong <[email protected]>

* Update datasets.rst

added a link from a tutorial demonstrating detailed data prep steps.

Signed-off-by: Xuesong Yang <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* RNN-T confidence and alignment bugfix (#7381)

* new frame_confidence and alignments lists are now always created after the while loop

Signed-off-by: Aleksandr Laptev <[email protected]>

* tests added

Signed-off-by: Aleksandr Laptev <[email protected]>

---------

Signed-off-by: Aleksandr Laptev <[email protected]>

* Fix resume from checkpoint in exp_manager (#7424) (#7426)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix checking of cuda/cpu device for inputs of Decoder (#7444)

* Fix checking of cuda/cpu device for inputs of Decoder

Signed-off-by: Robin Dong <[email protected]>

* Update tacotron2.py

Signed-off-by: Jason <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Jason <[email protected]>

* Fix failure of ljspeech's get_data.py (#7430)

* Fix failure of ljspeech's get_data.py

Signed-off-by: Robin Dong <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Fix audio codec type checks (#7373)

* [TTS] Fix audio codec type checks

Signed-off-by: Ryan <[email protected]>

* [TTS] Fix audio codec tests

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* [TTS] Add dataset to path of logged artifacts (#7462)

* [TTS] Add dataset to path of logged artifacts

Signed-off-by: Ryan <[email protected]>

* [TTS] Revert axis name back to Audio Frames

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Fix sft dataset truncation (#7464)

* Add fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)

* striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling

Signed-off-by: mburchi <[email protected]>

* transpose conv1d inputs

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: mburchi <[email protected]>

* Update subsampling.py

change striding_conv1d_k5 to striding_conv1d

Signed-off-by: Maxime Burchi <[email protected]>

* cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* video manifest

Signed-off-by: mburchi <[email protected]>

* add collection classes

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add test_step_outputs

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* clean references

Signed-off-by: mburchi <[email protected]>

* freeze unfreeze transcribe cv models

Signed-off-by: mburchi <[email protected]>

* correct manifest get_full_path bug

Signed-off-by: mburchi <[email protected]>

* update for PR

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* guard torchvision

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-aut…
titu1994 added a commit that referenced this pull request Jun 7, 2024
* Fixes

* Docs fix

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* Add support for sharded NeMo manifest files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support megatron_amp_O2

Signed-off-by: zhehuaichen <[email protected]>

* Support heterogeneous sampling rates in non tarred NeMo manifests

* migrate to PTL2.0

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* update manifest util

Signed-off-by: stevehuang52 <[email protected]>

* Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* agg and normal tokenizers actually work

* Support weights for NeMo tarred manifests

* Temporarily hardcoded pnc stripping/lowercasing

* fix

* make pnc hack configurable from the config and disabled by default

* fix the hack

* migrate to ptl2.1 to support multiple dataloaders

Signed-off-by: stevehuang52 <[email protected]>

* support encoder overwrite

Signed-off-by: zhehuaichen <[email protected]>

* update misc

Signed-off-by: stevehuang52 <[email protected]>

* fix eval and clean up

Signed-off-by: stevehuang52 <[email protected]>

* support add_sep for perception model

Signed-off-by: zhehuaichen <[email protected]>

* fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803

Signed-off-by: zhehuaichen <[email protected]>

* add_bos

Signed-off-by: zhehuaichen <[email protected]>

* Transformer decoder with conditioning for canary (#8091)

* initial commit for multi-task conf-enc transf-dec for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing decoder states caching during training

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Option to limit the number of open streams (#8095)

* audio signal support in multi

Signed-off-by: zhehuaichen <[email protected]>

* update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* fix from
https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397
and
https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa

Signed-off-by: zhehuaichen <[email protected]>

* transcribe fn for Canary models (#8110)

* improve readability

Signed-off-by: Krishna Puvvada <[email protected]>

* adding context in transcribe function for ConfTransfModels

Signed-off-by: Krishna Puvvada <[email protected]>

* supporting relative paths in transcribe function for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* update for eval

Signed-off-by: stevehuang52 <[email protected]>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* fix bleu

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Add missing audio_filepath validation for Canary (#8119)

* Add missing audio_filepath validation for Canary

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
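
A sketch of the kind of manifest validation this refers to; the field name follows the usual NeMo manifest format, but the helper itself is illustrative:

    import json

    def validate_manifest(path: str) -> None:
        # Illustrative check: every manifest line must carry an 'audio_filepath'
        # entry before it is handed to the dataloader.
        with open(path, "r", encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                entry = json.loads(line)
                if "audio_filepath" not in entry:
                    raise KeyError(f"line {line_no}: missing 'audio_filepath' in {entry}")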

* add default concat_sampling_probabilities

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse dataset in speechllm

Signed-off-by: zhehuaichen <[email protected]>

* bypass get_iterator_k_split

Signed-off-by: zhehuaichen <[email protected]>

* tmp fix

Signed-off-by: zhehuaichen <[email protected]>

* try to use fixed batch with megatron

Signed-off-by: zhehuaichen <[email protected]>

* add batch logging

Signed-off-by: zhehuaichen <[email protected]>

* support unfrozen llm

Signed-off-by: zhehuaichen <[email protected]>

* Create README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* rename

Signed-off-by: stevehuang52 <[email protected]>

* add llama prompt template

Signed-off-by: zhehuaichen <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* support sample alpha

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse validation set and canary pretrained ckpt with pseudo label

Signed-off-by: zhehuaichen <[email protected]>

* make sure backward compatibility

Signed-off-by: zhehuaichen <[email protected]>

* remove pad

Signed-off-by: zhehuaichen <[email protected]>

* make sure asr_model is frozen

Signed-off-by: zhehuaichen <[email protected]>

* support greedy decoding

Signed-off-by: zhehuaichen <[email protected]>

* valid on lhotse

Signed-off-by: zhehuaichen <[email protected]>

* fix multi dataloader in val case for lhotse SALM; add default data
names; keep asr model tokenizer by default to enable adding canary
dataset

Signed-off-by: zhehuaichen <[email protected]>

* remove the bruteforce _keep_special_tokens implementation

Signed-off-by: zhehuaichen <[email protected]>

* decoding_ratio and convert_canary_prompt_to_text support

Signed-off-by: zhehuaichen <[email protected]>

* canary_tokens_augment_ratio

Signed-off-by: zhehuaichen <[email protected]>

* debug

Signed-off-by: zhehuaichen <[email protected]>

* bug fix

Signed-off-by: zhehuaichen <[email protected]>

* fix lhotse based eval of llama canary model

Signed-off-by: zhehuaichen <[email protected]>

* support some overwrite for eval

Signed-off-by: zhehuaichen <[email protected]>

* support zero shot prompt in training

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* fix for batch train/valid of cross

Signed-off-by: zhehuaichen <[email protected]>

* support learnable gate and plotting

Signed-off-by: zhehuaichen <[email protected]>

* support using pseudo label in prompt rather than cross att

Signed-off-by: zhehuaichen <[email protected]>

* bug fix for perception cfg and context tokens shift

Signed-off-by: zhehuaichen <[email protected]>

* DentityConnectorsAdd

Signed-off-by: zhehuaichen <[email protected]>

* fix ckpt saving

Signed-off-by: zhehuaichen <[email protected]>

* Support RnnGatedCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* add include_ffw and fix _optimizer_param_groups for all unfrozen run

Signed-off-by: zhehuaichen <[email protected]>

* support grad acc when using bucket

Signed-off-by: zhehuaichen <[email protected]>

* support TransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ProjectTransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size

Signed-off-by: zhehuaichen <[email protected]>

* support question set on val without canary

Signed-off-by: zhehuaichen <[email protected]>

* support load_audio_encoder and wip in optim_param_groups

Signed-off-by: zhehuaichen <[email protected]>

* minor fix for audio pretrain model init

Signed-off-by: zhehuaichen <[email protected]>

* simplify canary_tokens_augment

Signed-off-by: zhehuaichen <[email protected]>

* use question in the manifest if it exists

Signed-off-by: zhehuaichen <[email protected]>

* support dataset weighting for non tar

Signed-off-by: zhehuaichen <[email protected]>

* Update SpeechLLM code (#8475)

* add pleasefixme marker for potential failed nightly tests. (#7678)

Signed-off-by: Xuesong Yang <[email protected]>

* Add new text segmentation library for better TTS quality (#7645)

* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py

added detailed instruction on how to install pkuseg.

Signed-off-by: Xuesong Yang <[email protected]>

* Update requirements_tts.txt

remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need.

Signed-off-by: Xuesong Yang <[email protected]>


---------

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>
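
Since pkuseg is no longer a default dependency, the usual pattern is an optional import guard; a sketch assuming the standard pkuseg API and a manual `pip install pkuseg` (the fallback shown is an assumption):

    try:
        import pkuseg
        _segmenter = pkuseg.pkuseg()      # standard pkuseg word segmenter
    except ImportError:
        _segmenter = None                 # pkuseg not installed

    def segment_zh(text: str):
        # Illustrative fallback: character-level split when pkuseg is unavailable.
        return _segmenter.cut(text) if _segmenter is not None else list(text)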

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer



* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add '32-true' for precision values



---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there is no unexpected keys in state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
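
The "fork" context change boils down to requesting an explicit multiprocessing start method, so memory-mapped index files opened in the parent stay visible in the workers regardless of the globally configured start method; a minimal sketch (Linux only, names illustrative):

    import multiprocessing

    # Illustrative only: build the worker pool from a "fork" context so children
    # inherit already-opened memory maps instead of re-building them under "spawn".
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(processes=2) as pool:
        lengths = pool.map(len, ["memmapped", "text", "lines"])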

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)



* update apex install in dockerfile



* use fetch head



---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if its a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extrac hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
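
As a rough usage illustration, the dependency being pulled in is the Hugging Face `datasets` library; the dataset name and fields below are examples, not the exact ones wired into the NeMo fine-tuning scripts.

    from datasets import load_dataset

    # Illustrative: stream a small ASR dataset and pull out the (audio, text)
    # pairs a fine-tuning data pipeline would consume.
    ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
    for example in ds.take(2):
        waveform = example["audio"]["array"]   # decoded samples
        transcript = example["text"]
        print(len(waveform), transcript[:40])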

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
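
To make the GQA option above concrete, here is a small, self-contained sketch of grouped-query attention shapes; the tensor layout and the `num_query_groups` name follow the usual Megatron-style convention and are illustrative, not the exact NeMo code.

```python
import torch

# Grouped-query attention: several query heads share one key/value group.
batch, seq, head_dim = 2, 16, 64
num_heads, num_query_groups = 8, 2          # 4 query heads per KV group

q = torch.randn(batch, seq, num_heads, head_dim)
kv_k = torch.randn(batch, seq, num_query_groups, head_dim)

# Broadcast each KV group to the query heads that share it.
k = kv_k.repeat_interleave(num_heads // num_query_groups, dim=2)
scores = torch.einsum("bshd,bthd->bhst", q, k) / head_dim ** 0.5
print(scores.shape)  # torch.Size([2, 8, 16, 16])
```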

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mxins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
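
As a reference for the flash attention 2.0 support above, a minimal sketch of the flash-attn 2.x functional API that the integration targets (assuming the `flash-attn` package and a CUDA device are available; this is not the NeMo wrapper itself):

```python
import torch
from flash_attn import flash_attn_func  # flash-attn >= 2.0

# Shapes are (batch, seq_len, num_heads, head_dim); FA2 expects fp16/bf16 on GPU.
q = torch.randn(2, 128, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 128, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 128, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # fused kernel, no materialized attention matrix
print(out.shape)  # torch.Size([2, 128, 8, 64])
```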

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
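
One representative change that a Lightning 2.0 migration guide covers (a sketch in plain PyTorch Lightning 2.x terms, not text taken from the guide itself): the `*_epoch_end` hooks no longer receive the step outputs, so modules accumulate them explicitly.

```python
import torch
import pytorch_lightning as pl

class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        self.validation_step_outputs = []

    def validation_step(self, batch, batch_idx):
        loss = self.layer(batch).pow(2).mean()
        self.validation_step_outputs.append(loss)
        return loss

    # PTL 1.x used `validation_epoch_end(self, outputs)`; in 2.x the hook takes no
    # outputs, so the per-step results stored above are reduced here instead.
    def on_validation_epoch_end(self):
        avg = torch.stack(self.validation_step_outputs).mean()
        self.log("val_loss", avg)
        self.validation_step_outputs.clear()
```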

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
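
The one-liner above is the PR this thread belongs to; for context, a minimal sketch of what the fix amounts to under PyTorch Lightning 2.x, where passing `strategy=None` is no longer accepted (the Trainer arguments here are illustrative, not the exact TTS config):

```python
import pytorch_lightning as pl

# PTL 2.x expects the string "auto" rather than None for strategy/accelerator.
trainer = pl.Trainer(
    devices=1,
    accelerator="auto",
    strategy="auto",   # previously None in the configs, which newer PTL rejects
    max_epochs=1,
)
```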

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
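
For reference, the likely shape of the `librosa.get_duration()` fix (assuming the notebook hit the librosa 0.10 keyword rename; the argument names below belong to the upstream librosa API, not NeMo):

```python
import librosa

# librosa >= 0.10 renamed the `filename` keyword to `path`, so old calls fail.
duration = librosa.get_duration(path="sample.wav")   # was: get_duration(filename="sample.wav")
print(f"{duration:.2f} s")
```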

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progres…
marcromeyn added a commit that referenced this pull request Jun 7, 2024
* Fixes

* Docs fix

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* Add support for sharded NeMo manifest files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support megatron_amp_O2

Signed-off-by: zhehuaichen <[email protected]>

* Support heterogeneous sampling rates in non tarred NeMo manifests

* migrate to PTL2.0

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* update manifest util

Signed-off-by: stevehuang52 <[email protected]>

* Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* agg and normal tokenizers actually work

* Support weights for NeMo tarred manifests

* Temporarily hardcoded pnc stripping/lowercasing

* fix

* make pnc hack configurable from the config and disabled by default

* fix the hack

* migrate to ptl2.1 to support multiple dataloaders

Signed-off-by: stevehuang52 <[email protected]>

* support encoder overwrite

Signed-off-by: zhehuaichen <[email protected]>

* update misc

Signed-off-by: stevehuang52 <[email protected]>

* fix eval and clean up

Signed-off-by: stevehuang52 <[email protected]>

* support add_sep for perception model

Signed-off-by: zhehuaichen <[email protected]>

* fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803

Signed-off-by: zhehuaichen <[email protected]>

* add_bos

Signed-off-by: zhehuaichen <[email protected]>

* Transformer decoder with conditioning for canary (#8091)

* initial commit for multi-task conf-enc transf-dec for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing decoder states caching during training

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Option to limit the number of open streams (#8095)

* audio signal support in multi

Signed-off-by: zhehuaichen <[email protected]>

* update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* fix from
https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397
and
https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa

Signed-off-by: zhehuaichen <[email protected]>

* transcribe fn for Canary models (#8110)

* improve readability

Signed-off-by: Krishna Puvvada <[email protected]>

* adding context in transcribe function for ConfTransfModels

Signed-off-by: Krishna Puvvada <[email protected]>

* supporting relative paths in transcribe function for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
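
A hedged usage sketch of what a `transcribe()` entry point for these models typically looks like in NeMo (the model identifier and exact class are placeholders; only the standard `from_pretrained`/`transcribe` pattern is assumed):

```python
import nemo.collections.asr as nemo_asr

# Placeholder model identifier; substitute the actual Canary checkpoint name.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/canary-model-name")
hypotheses = model.transcribe(["audio_1.wav", "audio_2.wav"], batch_size=2)
print(hypotheses[0])
```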

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* update for eval

Signed-off-by: stevehuang52 <[email protected]>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* fix bleu

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Add missing audio_filepath validation for Canary (#8119)

* Add missing audio_filepath validation for Canary

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add default concat_sampling_probabilities

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse dataset in speechllm

Signed-off-by: zhehuaichen <[email protected]>

* bypass get_iterator_k_split

Signed-off-by: zhehuaichen <[email protected]>

* tmp fix

Signed-off-by: zhehuaichen <[email protected]>

* try to use fixed batch with megatron

Signed-off-by: zhehuaichen <[email protected]>

* add batch logging

Signed-off-by: zhehuaichen <[email protected]>

* support unfrozen llm

Signed-off-by: zhehuaichen <[email protected]>

* Create README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* rename

Signed-off-by: stevehuang52 <[email protected]>

* add llama prompt template

Signed-off-by: zhehuaichen <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* support sample alpha

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse validation set and canary pretrained ckpt with pseudo label

Signed-off-by: zhehuaichen <[email protected]>

* make sure backward compatibility

Signed-off-by: zhehuaichen <[email protected]>

* remove pad

Signed-off-by: zhehuaichen <[email protected]>

* make sure asr_model is frozen

Signed-off-by: zhehuaichen <[email protected]>

* support greedy decoding

Signed-off-by: zhehuaichen <[email protected]>

* valid on lhotse

Signed-off-by: zhehuaichen <[email protected]>

* fix multi dataloader in val case for lhotse SALM; add default data names; keep asr model tokenizer by default to enable adding canary dataset

Signed-off-by: zhehuaichen <[email protected]>

* remove the bruteforce _keep_special_tokens implementation

Signed-off-by: zhehuaichen <[email protected]>

* decoding_ratio and convert_canary_prompt_to_text support

Signed-off-by: zhehuaichen <[email protected]>

* canary_tokens_augment_ratio

Signed-off-by: zhehuaichen <[email protected]>

* debug

Signed-off-by: zhehuaichen <[email protected]>

* bug fix

Signed-off-by: zhehuaichen <[email protected]>

* fix lhotse based eval of llama canary model

Signed-off-by: zhehuaichen <[email protected]>

* support some overwrite for eval

Signed-off-by: zhehuaichen <[email protected]>

* support zero shot prompt in training

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* fix for batch train/valid of cross

Signed-off-by: zhehuaichen <[email protected]>

* support learnable gate and plotting

Signed-off-by: zhehuaichen <[email protected]>

* support using pseudo label in prompt rather than cross att

Signed-off-by: zhehuaichen <[email protected]>

* bug fix for perception cfg and context tokens shift

Signed-off-by: zhehuaichen <[email protected]>

* DentityConnectorsAdd

Signed-off-by: zhehuaichen <[email protected]>

* fix ckpt saving

Signed-off-by: zhehuaichen <[email protected]>

* Support RnnGatedCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* add include_ffw and fix _optimizer_param_groups for all unfrozen run

Signed-off-by: zhehuaichen <[email protected]>

* support grad acc when using bucket

Signed-off-by: zhehuaichen <[email protected]>

* support TransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ProjectTransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size

Signed-off-by: zhehuaichen <[email protected]>

* support question set on val without canary

Signed-off-by: zhehuaichen <[email protected]>

* support load_audio_encoder and wip in optim_param_groups

Signed-off-by: zhehuaichen <[email protected]>

* minor fix for audio pretrain model init

Signed-off-by: zhehuaichen <[email protected]>

* simplify canary_tokens_augment

Signed-off-by: zhehuaichen <[email protected]>

* use question in the manifest if it exists

Signed-off-by: zhehuaichen <[email protected]>

* support dataset weighting for non tar

Signed-off-by: zhehuaichen <[email protected]>

* Update SpeechLLM code (#8475)

* add pleasefixme marker for potential failed nightly tests. (#7678)

Signed-off-by: Xuesong Yang <[email protected]>

* Add new text segmentation library for better TTS quality (#7645)

* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py

added detailed instructions on how to install pkuseg.

Signed-off-by: Xuesong Yang <[email protected]>

* Update requirements_tts.txt

remove pkuseg as the default dependency of NeMo TTS, and instead direct users to manually install pkuseg if they really need it.

Signed-off-by: Xuesong Yang <[email protected]>

---------

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer

* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add '32-true' for precision values

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there are no unexpected keys in state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
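
For readers unfamiliar with the change above, this is the underlying Python mechanism being selected (a generic sketch, not the NeMo text-memmap code): an explicit `fork` start method so memory-mapped structures stay shared with worker processes.

```python
import multiprocessing as mp

# "fork" keeps the parent's memory mappings available in the children (POSIX only),
# unlike "spawn", which re-imports and re-opens everything in each worker.
ctx = mp.get_context("fork")
with ctx.Pool(processes=4) as pool:
    print(pool.map(len, ["a", "bb", "ccc"]))
```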

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
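
Mechanically, "compute logprobs" boils down to the following (an illustration with random tensors, not the text-generation server code): a log-softmax over the vocabulary, then gathering the log-probabilities of the tokens that were actually generated.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 32000)             # (batch, seq, vocab) from the LM head
tokens = torch.randint(0, 32000, (1, 5))      # token ids actually generated

logprobs = F.log_softmax(logits, dim=-1).gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
print(logprobs.shape)  # torch.Size([1, 5])
```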

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)

* update apex install in dockerfile

* use fetch head

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
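
After the NeMo-to-HF conversion above, the exported directory should load with the stock Hugging Face APIs; a quick smoke test might look like this (the path and dtype are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("./llama_hf_export", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("./llama_hf_export")

inputs = tokenizer("Hello, world", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```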

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if its a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extract hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
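
A rough sketch of the ingredient being added above, i.e. streaming an ASR corpus straight from the Hugging Face Hub (the dataset name and field names are an example, not the exact keys the NeMo loader expects):

```python
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
for sample in ds.take(2):
    waveform = sample["audio"]["array"]   # raw audio samples
    text = sample["text"]                 # reference transcript
    print(len(waveform), text[:40])
```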

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mxins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
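
For context on the adapter entries above ("lora inject to attention", "support lora weight tying"), this is the standard LoRA formulation those mixins implement; the class below is a generic sketch, not NeMo's mcore adapter code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA: y = W x + (alpha / r) * B(A(x)), with the base weight W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # only the low-rank adapter is trained
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024))
```

Weight tying, as in the "support lora weight tying" entry, amounts to sharing one (lora_a, lora_b) pair across layers instead of giving each layer its own.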

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
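
`seq_len_interpolation_factor` above refers to linear position interpolation for rotary embeddings, which lets a long-context checkpoint reuse the original rotary table by compressing positions. A sketch of the idea only; the NeMo/mcore wiring differs:

```python
from typing import Optional

import torch

def rotary_angles(seq_len: int, dim: int, base: float = 10000.0,
                  seq_len_interpolation_factor: Optional[float] = None) -> torch.Tensor:
    """Rotary embedding angles with optional linear position interpolation."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    if seq_len_interpolation_factor is not None:
        # factor=4 squeezes 16k token positions into the range the model saw for 4k
        positions = positions / seq_len_interpolation_factor
    return torch.outer(positions, inv_freq)   # (seq_len, dim // 2)

angles = rotary_angles(16384, 128, seq_len_interpolation_factor=4.0)
```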

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
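
For the FA2 entry above, a minimal direct-call sketch. It assumes the `flash-attn` 2.x package and its `flash_attn_func` entry point with half-precision CUDA tensors of shape (batch, seqlen, heads, head_dim); inside NeMo this path is reached through the attention modules rather than called by hand:

```python
import torch
from flash_attn import flash_attn_func   # assumption: flash-attn >= 2.0 is installed

q = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 1024, 16, 64, device="cuda", dtype=torch.bfloat16)

# Causal self-attention over the packed (batch, seqlen, heads, head_dim) layout.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
```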

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
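
The "adjust key names based on O2" fix above deals with the extra wrapper segment that megatron_amp_O2 adds to state-dict keys when the model is wrapped for fp16/bf16. A hedged sketch of that kind of remapping; the exact prefixes (`model.` vs `model.module.`) are an assumption here:

```python
def remap_o2_keys(state_dict: dict, use_o2: bool) -> dict:
    """Insert or drop the O2 wrapper segment ('module.') after the leading 'model.' prefix."""
    out = {}
    for key, value in state_dict.items():
        if use_o2 and key.startswith("model.") and not key.startswith("model.module."):
            key = "model.module." + key[len("model."):]
        elif not use_o2 and key.startswith("model.module."):
            key = "model." + key[len("model.module."):]
        out[key] = value
    return out

lora_sd = {"model.decoder.layers.0.adapter.lora_a.weight": 0}
print(remap_o2_keys(lora_sd, use_o2=True))
```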

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
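
The #7147 entries above let several manifest fields form the context and let more than one field be truncated (including middle truncation). A hedged config sketch of what that enables; the key names follow the usual NeMo SFT data config layout but should be checked against the merged YAML:

```python
from omegaconf import OmegaConf

# Key names below are assumptions made for illustration only.
data_cfg = OmegaConf.create(
    {
        # several manifest fields are spliced into one prompt
        "prompt_template": "{system} {context} {question} answer: {answer}",
        "label_key": "answer",
        # more than one field may be shortened until the sample fits max_seq_length
        "truncation_field": "context,question",
        "max_seq_length": 4096,
    }
)
print(OmegaConf.to_yaml(data_cfg))
```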

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>
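
The bias_dropout_add fusion toggled above is the usual Megatron pattern of folding the linear bias add, dropout, and residual add into one fused (optionally jitted) op. An unfused reference of the same computation, for orientation:

```python
import torch
import torch.nn.functional as F

def bias_dropout_add(x, bias, residual, prob: float, training: bool):
    """Reference semantics of the fused op: residual + dropout(x + bias)."""
    return residual + F.dropout(x + bias, p=prob, training=training)

x = torch.randn(4, 8, 1024)
out = bias_dropout_add(x, torch.zeros(1024), torch.randn(4, 8, 1024), prob=0.1, training=True)
```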

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
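
The #7369 fix above replaces a `strategy: null` default that Lightning 2.x no longer accepts when building the Trainer. A small sketch of the distinction, assuming an OmegaConf-style trainer section like the TTS configs use:

```python
from omegaconf import OmegaConf
from pytorch_lightning import Trainer

cfg = OmegaConf.create({"trainer": {"accelerator": "auto", "devices": 1, "strategy": "auto"}})

# Lightning 2.x rejects strategy=None; "auto" lets it pick (e.g. DDP when several devices
# are visible), which is what switching the config default from null to auto restores.
trainer = Trainer(**cfg.trainer)
```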

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>
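
The #7336 refactor above centralizes the precision-to-dtype conditionals behind one helper on the Megatron base model. A hedged sketch of such a mapping function (the real NeMo name and the full set of accepted values may differ; PTL 2.x uses strings such as '16-mixed', 'bf16-mixed', '32-true'):

```python
import torch

def torch_dtype_from_precision(precision) -> torch.dtype:
    """Illustrative precision-setting to torch dtype lookup."""
    if precision in (16, "16", "16-mixed"):
        return torch.float16
    if precision in ("bf16", "bf16-mixed"):
        return torch.bfloat16
    if precision in (32, "32", "32-true"):
        return torch.float32
    raise ValueError(f"Unsupported precision: {precision!r}")

assert torch_dtype_from_precision("bf16-mixed") is torch.bfloat16
```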

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkensfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progress bar to reflect total microbatch cnt

Signed-off-by: Abhishree <[email protected]>

* Modify CustomProgressBar class

1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch
2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder

Signed-off-by: Abhishree <[email protected]>

* Add CustomProgressBar callback to tuning files

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
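
The CustomProgressBar work above makes the bar tick once per global step instead of once per microbatch, so the displayed rate reflects real train_step_timing rather than 's/it' over microbatches. A hedged sketch of such a callback on top of PTL's TQDMProgressBar; the actual class lives in NeMo's exp_manager and differs in detail:

```python
from pytorch_lightning.callbacks import TQDMProgressBar

class GlobalStepProgressBar(TQDMProgressBar):
    """Advance the train bar per optimizer (global) step instead of per microbatch."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        bar = self.train_progress_bar
        bar.total = trainer.max_steps          # assumes max_steps is set, as in Megatron configs
        bar.n = trainer.global_step
        bar.refresh()
        bar.set_postfix(self.get_metrics(trainer, pl_module))
```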

* Set Activation Checkpointing Defaults (#7404)

* Set Activation Checkpointing Defaults

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for None

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make loss mask default to false (#7407)

Signed-off-by: eharper <[email protected]>

* Add dummy userbuffer config files (#7408)

Signed-off-by: Sangkug Lym <[email protected]>

* add missing ubconf files (#7412)

Signed-off-by: Abhinav Khattar <[email protected]>

* New tutorial on Speech Data Explorer (#7405)

* Added Google Colab based tutorial on Speech Data Explorer

Signed-off-by: George Zelenfroynd <[email protected]>

* Update ptl training ckpt conversion script to work with dist ckpt (#7416)

* update ptl convert script

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break legacy

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)

* Allow disabling sanity checking when num_sanity_val_steps=0

Signed-off-by: Abhishree <[email protected]>

* Update num_sanity_val_steps to be a multiple of num_microbatches

Signed-off-by: Abhishree Thittenamane <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more informa…
janekl added a commit that referenced this pull request Jun 12, 2024
* Fixes

* Docs fix

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* Add support for sharded NeMo manifest files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support megatron_amp_O2

Signed-off-by: zhehuaichen <[email protected]>

* Support heterogeneous sampling rates in non tarred NeMo manifests

* migrate to PTL2.0

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* update manifest util

Signed-off-by: stevehuang52 <[email protected]>

* Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* agg and normal tokenizers actually work

* Support weights for NeMo tarred manifests

* Temporarily hardcoded pnc stripping/lowercasing

* fix

* make pnc hack configurable from the config and disabled by default

* fix the hack

* migrate to ptl2.1 to support multiple dataloaders

Signed-off-by: stevehuang52 <[email protected]>

* support encoder overwrite

Signed-off-by: zhehuaichen <[email protected]>

* update misc

Signed-off-by: stevehuang52 <[email protected]>

* fix eval and clean up

Signed-off-by: stevehuang52 <[email protected]>

* support add_sep for perception model

Signed-off-by: zhehuaichen <[email protected]>

* fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803

Signed-off-by: zhehuaichen <[email protected]>

* add_bos

Signed-off-by: zhehuaichen <[email protected]>

* Transformer decoder with conditioning for canary (#8091)

* initial commit for multi-task conf-enc transf-dec for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing decoder states caching during training

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Option to limit the number of open streams (#8095)

* audio signal support in multi

Signed-off-by: zhehuaichen <[email protected]>

* update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* fix from
https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397
and
https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa

Signed-off-by: zhehuaichen <[email protected]>

* transcribe fn for Canary models (#8110)

* improve readability

Signed-off-by: Krishna Puvvada <[email protected]>

* adding context in transcribe function for ConfTransfModels

Signed-off-by: Krishna Puvvada <[email protected]>

* supporting relative paths in transcribe function for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* update for eval

Signed-off-by: stevehuang52 <[email protected]>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* fix bleu

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Add missing audio_filepath validation for Canary (#8119)

* Add missing audio_filepath validation for Canary

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add default concat_sampling_probabilities

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse dataset in speechllm

Signed-off-by: zhehuaichen <[email protected]>

* bypass get_iterator_k_split

Signed-off-by: zhehuaichen <[email protected]>

* tmp fix

Signed-off-by: zhehuaichen <[email protected]>

* try to use fixed batch with megatron

Signed-off-by: zhehuaichen <[email protected]>

* add batch logging

Signed-off-by: zhehuaichen <[email protected]>

* support unfrozen llm

Signed-off-by: zhehuaichen <[email protected]>

* Create README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* rename

Signed-off-by: stevehuang52 <[email protected]>

* add llama prompt template

Signed-off-by: zhehuaichen <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* support sample alpha

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse validation set and canary pretrained ckpt with pseudo label

Signed-off-by: zhehuaichen <[email protected]>

* make sure backward compatibility

Signed-off-by: zhehuaichen <[email protected]>

* remove pad

Signed-off-by: zhehuaichen <[email protected]>

* make sure asr_model is frozen

Signed-off-by: zhehuaichen <[email protected]>

* support greedy decoding

Signed-off-by: zhehuaichen <[email protected]>

* valid on lhotse

Signed-off-by: zhehuaichen <[email protected]>

* fix multi dataloader in val case for lhotse SALM; add default data
names; keep asr model tokenizer by default to enable adding canary
dataset

Signed-off-by: zhehuaichen <[email protected]>

* remove the bruteforce _keep_special_tokens implementation

Signed-off-by: zhehuaichen <[email protected]>

* decoding_ratio and convert_canary_prompt_to_text support

Signed-off-by: zhehuaichen <[email protected]>

* canary_tokens_augment_ratio

Signed-off-by: zhehuaichen <[email protected]>

* debug

Signed-off-by: zhehuaichen <[email protected]>

* bug fix

Signed-off-by: zhehuaichen <[email protected]>

* fix lhotse based eval of llama canary model

Signed-off-by: zhehuaichen <[email protected]>

* support some overwrite for eval

Signed-off-by: zhehuaichen <[email protected]>

* support zero shot prompt in training

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* fix for batch train/valid of cross

Signed-off-by: zhehuaichen <[email protected]>

* support learnable gate and plotting

Signed-off-by: zhehuaichen <[email protected]>

* support using pseudo label in prompt rather than cross att

Signed-off-by: zhehuaichen <[email protected]>

* bug fix for perception cfg and context tokens shift

Signed-off-by: zhehuaichen <[email protected]>

* DentityConnectorsAdd

Signed-off-by: zhehuaichen <[email protected]>

* fix ckpt saving

Signed-off-by: zhehuaichen <[email protected]>

* Support RnnGatedCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* add include_ffw and fix _optimizer_param_groups for all unfrozen run

Signed-off-by: zhehuaichen <[email protected]>

* support grad acc when using bucket

Signed-off-by: zhehuaichen <[email protected]>

* support TransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ProjectTransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>
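
The cross-attention variants above (learnable gate, RnnGatedCrossAttention, TransformerCrossAttention, ProjectTransformerCrossAttention) share one idea: the LLM hidden states attend over the audio encoding, and a learnable gate controls how much of the result is mixed back in, so training can start from the unmodified LLM. A generic sketch of that gating, not the NeMo modules themselves:

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Text states attend over audio features; a tanh gate (initialized to 0) mixes the result in."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # gate starts closed -> identity at init

    def forward(self, text_h, audio_h):
        attended, _ = self.attn(query=text_h, key=audio_h, value=audio_h)
        return text_h + torch.tanh(self.gate) * attended

layer = GatedCrossAttention(512)
out = layer(torch.randn(2, 32, 512), torch.randn(2, 200, 512))
```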

* support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size

Signed-off-by: zhehuaichen <[email protected]>

* support question set on val without canary

Signed-off-by: zhehuaichen <[email protected]>

* support load_audio_encoder and wip in optim_param_groups

Signed-off-by: zhehuaichen <[email protected]>

* minor fix for audio pretrain model init

Signed-off-by: zhehuaichen <[email protected]>

* simplify canary_tokens_augment

Signed-off-by: zhehuaichen <[email protected]>

* use question in the manifest if it exists

Signed-off-by: zhehuaichen <[email protected]>

* support dataset weighting for non tar

Signed-off-by: zhehuaichen <[email protected]>

* Update SpeechLLM code (#8475)

* add pleasefixme marker for potential failed nightly tests. (#7678)

Signed-off-by: Xuesong Yang <[email protected]>

* Add new text segmentation library for better TTS quality (#7645)

* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py

added detailed instruction on how to install pkuseg.

Signed-off-by: Xuesong Yang <[email protected]>

* Update requirements_tts.txt

remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need it.

Signed-off-by: Xuesong Yang <[email protected]>


---------

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer



* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add '32-true' for precision values



---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there is no unexpected keys in state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
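
The "no unexpected keys" sanity check added in the conversion script above boils down to comparing the converted state dict against the target model's. A minimal sketch of that comparison:

```python
def check_state_dict_keys(converted: dict, reference: dict) -> None:
    """Fail loudly on keys the target model does not expect; report any that are missing."""
    unexpected = sorted(set(converted) - set(reference))
    missing = sorted(set(reference) - set(converted))
    if unexpected:
        raise KeyError(f"Unexpected keys in converted checkpoint: {unexpected[:5]} ...")
    if missing:
        print(f"Warning: {len(missing)} keys missing (often buffers or re-initialized params)")

check_state_dict_keys({"a.weight": 0}, {"a.weight": 1, "a.bias": 2})
```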

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>
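
The #7785 fix is the classic concatenation-axis mistake: the conditioning vector has to be joined along the feature dimension of each sample, not stacked as extra batch items. A tiny illustration:

```python
import torch

x = torch.randn(4, 80)      # (batch, features)
cond = torch.randn(4, 80)   # conditioning, projected to the same width for this illustration

batch_cat = torch.cat([x, cond], dim=0)     # (8, 80): conditioning becomes extra samples (the bug)
feature_cat = torch.cat([x, cond], dim=-1)  # (4, 160): per-sample conditioning (the fix)
```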

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
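
"set context for text memmap to fork" pins the worker start method so that already opened memory-mapped index state is inherited by workers instead of being re-created under 'spawn'. A minimal stdlib sketch of requesting that context (POSIX only):

```python
import multiprocessing as mp

ctx = mp.get_context("fork")   # 'fork' is POSIX-only; 'spawn' workers would not inherit open mmaps

def build_index(path: str) -> int:
    return len(path)           # placeholder for the real per-file indexing work

with ctx.Pool(processes=2) as pool:
    lengths = pool.map(build_index, ["a.jsonl", "b.jsonl"])
```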

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)



* update apex install in dockerfile



* use fetch head



---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
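
"support 70b + cleanup qkv slice logic" refers to splitting the fused QKV projection when the model uses grouped-query attention, as Llama-2-70B does (fewer KV heads than query heads). A hedged sketch of that slicing for a Megatron-style per-group [q...q, k, v] interleaved layout; the converter's actual layout handling may differ:

```python
import torch

def split_fused_qkv(qkv: torch.Tensor, n_heads: int, n_kv_heads: int, head_dim: int):
    """Split a fused QKV weight of shape ((n_heads + 2 * n_kv_heads) * head_dim, hidden) into Q, K, V."""
    hidden = qkv.shape[1]
    q_per_group = n_heads // n_kv_heads
    grouped = qkv.view(n_kv_heads, (q_per_group + 2) * head_dim, hidden)
    q = grouped[:, : q_per_group * head_dim, :].reshape(n_heads * head_dim, hidden)
    k = grouped[:, q_per_group * head_dim : (q_per_group + 1) * head_dim, :].reshape(n_kv_heads * head_dim, hidden)
    v = grouped[:, (q_per_group + 1) * head_dim :, :].reshape(n_kv_heads * head_dim, hidden)
    return q, k, v

# Llama-2-70B-like head layout (64 query heads, 8 KV heads, head_dim 128); hidden size shrunk for the demo.
q, k, v = split_fused_qkv(torch.zeros((64 + 2 * 8) * 128, 1024), n_heads=64, n_kv_heads=8, head_dim=128)
```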

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if its a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extrac hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
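
A hedged sketch of the data side of "finetune with a huggingface dataset": the dataset name, config, and column names below are placeholders for illustration; the fields NeMo actually consumes are defined by the updated yaml, not shown here.

```python
from datasets import load_dataset

# Stream a public ASR corpus instead of downloading it in full.
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
for example in ds.take(2):
    # Each example exposes the decoded waveform metadata and its transcript.
    print(example["audio"]["sampling_rate"], example["text"])
```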

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
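
A hedged illustration of the GQA knob added here, assuming it is exposed as num_query_groups in the GPT config and is only honored on the mcore path, which is what "Verify mcore is enabled when using GQA" guards against.

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "mcore_gpt": True,          # grouped-query attention needs the Megatron-core GPT path
        "num_attention_heads": 32,
        "num_query_groups": 8,      # 32 query heads share 8 key/value groups
    }
)
if cfg.num_query_groups != cfg.num_attention_heads:
    # Mirrors the check described in the commit: GQA requires the mcore implementation.
    assert cfg.mcore_gpt, "num_query_groups < num_attention_heads requires mcore_gpt=True"
```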

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mxins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
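
A hedged sketch of what seq_len_interpolation_factor means here, assuming (from the commit title) it is the RoPE position-interpolation scale applied when running a checkpoint beyond the context length it was trained at.

```python
from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {
        "encoder_seq_length": 16384,          # target context length at inference
        "seq_len_interpolation_factor": 4.0,  # e.g. 16384 / 4096 for a model trained at 4k
    }
)
```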

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
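
A hedged illustration of "multiple fields can form a context": several manifest fields are concatenated into the context slot of a flexible prompt template. The field names and template syntax below are purely illustrative, not the exact keys used by the dataset code.

```python
context_keys = ["passage", "evidence"]                      # hypothetical manifest fields
prompt_template = "{context} Question: {question} Answer: {answer}"

record = {
    "passage": "NeMo is a conversational AI toolkit.",
    "evidence": "It covers ASR, NLP and TTS collections.",
    "question": "What is NeMo?",
    "answer": "",
}
context = " ".join(record[key] for key in context_keys)     # multiple fields -> one context
prompt = prompt_template.format(
    context=context, question=record["question"], answer=record["answer"]
)
print(prompt)
```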

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
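
A minimal sketch of the Lightning 2.x behavior behind this fix, assuming the failure mode was a config that passed strategy=None, which the 2.x Trainer no longer accepts.

```python
from pytorch_lightning import Trainer

# trainer = Trainer(strategy=None)  # rejected by Lightning 2.x
trainer = Trainer(accelerator="auto", devices=1, strategy="auto")  # let PTL choose the strategy
```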

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
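
A hedged sketch of the new eval-time switch, assuming it surfaces as megatron_amp_O2 in the eval/inference config so that checkpoints trained with the O2 mixed-precision wrapper load and run the same way at evaluation.

```python
from omegaconf import OmegaConf

eval_cfg = OmegaConf.create(
    {
        "trainer": {"precision": "bf16"},
        "model": {"megatron_amp_O2": True},  # match the precision wrapper used at training time
    }
)
```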

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkensfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
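
A hedged sketch of the call-site fix implied by the title, assuming the notebook hit the librosa 0.10 API change in which get_duration() dropped the filename= keyword in favor of path= (or of passing the loaded signal explicitly).

```python
import librosa

# duration = librosa.get_duration(filename="sample.wav")  # breaks on librosa >= 0.10
duration = librosa.get_duration(path="sample.wav")         # renamed keyword
y, sr = librosa.load("sample.wav", sr=None)
duration = librosa.get_duration(y=y, sr=sr)                # or compute from the loaded signal
```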

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progres…

JesusPaz pushed a commit to JesusPaz/NeMo that referenced this pull request Jun 18, 2024

…DIA#9169)

* Fixes

* Docs fix

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)
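
A hedged sketch of where such fields land: lhotse cuts carry an open-ended custom dict, so adapter-attached NeMo manifest fields can be read back per cut. The "lang" key below is illustrative only.

```python
from lhotse import CutSet

cuts = CutSet.from_file("cuts.jsonl.gz")   # placeholder path to a lhotse cut manifest
for cut in cuts:
    lang = cut.custom.get("lang") if cut.custom else None  # custom NeMo field, if attached
```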

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* Add support for sharded NeMo manifest files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support megatron_amp_O2

Signed-off-by: zhehuaichen <[email protected]>

* Support heterogeneous sampling rates in non tarred NeMo manifests

* migrate to PTL2.0

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* update manifest util

Signed-off-by: stevehuang52 <[email protected]>

* Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* agg and normal tokenizers actually work

* Support weights for NeMo tarred manifests

* Temporarily hardcoded pnc stripping/lowercasing

* fix

* make pnc hack configurable from the config and disabled by default

* fix the hack

* migrate to ptl2.1 to support multiple dataloaders

Signed-off-by: stevehuang52 <[email protected]>

* support encoder overwrite

Signed-off-by: zhehuaichen <[email protected]>

* update misc

Signed-off-by: stevehuang52 <[email protected]>

* fix eval and clean up

Signed-off-by: stevehuang52 <[email protected]>

* support add_sep for perception model

Signed-off-by: zhehuaichen <[email protected]>

* fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803

Signed-off-by: zhehuaichen <[email protected]>

* add_bos

Signed-off-by: zhehuaichen <[email protected]>

* Transformer decoder with conditioning for canary (#8091)

* initial commit for multi-task conf-enc transf-dec for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing decoder states caching during training

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Option to limit the number of open streams (#8095)

* audio signal support in multi

Signed-off-by: zhehuaichen <[email protected]>

* update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* fix from
https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397
and
https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa

Signed-off-by: zhehuaichen <[email protected]>

* transcribe fn for Canary models (#8110)

* improve readability

Signed-off-by: Krishna Puvvada <[email protected]>

* adding context in transcribe function for ConfTransfModels

Signed-off-by: Krishna Puvvada <[email protected]>

* supporting relative paths in transcribe function for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* update for eval

Signed-off-by: stevehuang52 <[email protected]>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* fix bleu

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Add missing audio_filepath validation for Canary (#8119)

* Add missing audio_filepath validation for Canary

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add default concat_sampling_probabilities

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse dataset in speechllm

Signed-off-by: zhehuaichen <[email protected]>

* bypass get_iterator_k_split

Signed-off-by: zhehuaichen <[email protected]>

* tmp fix

Signed-off-by: zhehuaichen <[email protected]>

* try to use fixed batch with megatron

Signed-off-by: zhehuaichen <[email protected]>

* add batch logging

Signed-off-by: zhehuaichen <[email protected]>

* support unfrozen llm

Signed-off-by: zhehuaichen <[email protected]>

* Create README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* rename

Signed-off-by: stevehuang52 <[email protected]>

* add llama prompt template

Signed-off-by: zhehuaichen <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* support sample alpha

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse validation set and canary pretrained ckpt with pseudo label

Signed-off-by: zhehuaichen <[email protected]>

* make sure backward compatibility

Signed-off-by: zhehuaichen <[email protected]>

* remove pad

Signed-off-by: zhehuaichen <[email protected]>

* make sure asr_model is frozen

Signed-off-by: zhehuaichen <[email protected]>

* support greedy decoding

Signed-off-by: zhehuaichen <[email protected]>

* valid on lhotse

Signed-off-by: zhehuaichen <[email protected]>

* fix multi dataloader in val case for lhotse SALM; add default data
names; keep asr model tokenizer by default to enable adding canary
dataset

Signed-off-by: zhehuaichen <[email protected]>

* remove the bruteforce _keep_special_tokens implementation

Signed-off-by: zhehuaichen <[email protected]>

* decoding_ratio and convert_canary_prompt_to_text support

Signed-off-by: zhehuaichen <[email protected]>

* canary_tokens_augment_ratio

Signed-off-by: zhehuaichen <[email protected]>

* debug

Signed-off-by: zhehuaichen <[email protected]>

* bug fix

Signed-off-by: zhehuaichen <[email protected]>

* fix lhotse based eval of llama canary model

Signed-off-by: zhehuaichen <[email protected]>

* support some overwrite for eval

Signed-off-by: zhehuaichen <[email protected]>

* support zero shot prompt in training

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* fix for batch train/valid of cross

Signed-off-by: zhehuaichen <[email protected]>

* support learnable gate and plotting

Signed-off-by: zhehuaichen <[email protected]>

* support using pseudo label in prompt rather than cross att

Signed-off-by: zhehuaichen <[email protected]>

* bug fix for perception cfg and context tokens shift

Signed-off-by: zhehuaichen <[email protected]>

* DentityConnectorsAdd

Signed-off-by: zhehuaichen <[email protected]>

* fix ckpt saving

Signed-off-by: zhehuaichen <[email protected]>

* Support RnnGatedCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* add include_ffw and fix _optimizer_param_groups for all unfrozen run

Signed-off-by: zhehuaichen <[email protected]>

* support grad acc when using bucket

Signed-off-by: zhehuaichen <[email protected]>

* support TransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ProjectTransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size

Signed-off-by: zhehuaichen <[email protected]>

* support question set on val without canary

Signed-off-by: zhehuaichen <[email protected]>

* support load_audio_encoder and wip in optim_param_groups

Signed-off-by: zhehuaichen <[email protected]>

* minor fix for audio pretrain model init

Signed-off-by: zhehuaichen <[email protected]>

* simplify canary_tokens_augment

Signed-off-by: zhehuaichen <[email protected]>

* use question in the manifest if it exists

Signed-off-by: zhehuaichen <[email protected]>

* support dataset weighting for non tar

Signed-off-by: zhehuaichen <[email protected]>

* Update SpeechLLM code (#8475)

* add pleasefixme marker for potential failed nightly tests. (#7678)

Signed-off-by: Xuesong Yang <[email protected]>

* Add new text segmentation library for better TTS quality (#7645)

* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py

added detailed instructions on how to install pkuseg.

Signed-off-by: Xuesong Yang <[email protected]>

* Update requirements_tts.txt

remove pkuseg as the default dependency of NeMo TTS, and instead direct users to manually install pkuseg if they really need it.

Signed-off-by: Xuesong Yang <[email protected]>


---------

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>
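
A hedged usage sketch for the optional dependency discussed above; pkuseg is installed manually (e.g. pip install pkuseg) since it was dropped from requirements_tts.txt.

```python
import pkuseg

seg = pkuseg.pkuseg()             # default Chinese word-segmentation model
words = seg.cut("欢迎使用 NeMo")    # returns a list of word tokens for downstream text processing
```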

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer



* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add '32-true' for precision values



---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there is no unexpected keys in state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit f03dd660bd26d88fd569e76c6f74b83a7c203ff9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)



* update apex install in dockerfile



* use fetch head



---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>
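
For context on what the converted artifact looks like downstream, a hedged sketch of loading the resulting Hugging Face checkpoint; the directory name is a placeholder, not the script's actual output path.

```python
# Hedged sketch: consume the HF-format LLaMA produced by convert_nemo_llama_to_hf.py.
# "converted_llama_hf" is a hypothetical output directory, not the script default.
import torch
from transformers import AutoTokenizer, LlamaForCausalLM

hf_dir = "converted_llama_hf"
tokenizer = AutoTokenizer.from_pretrained(hf_dir)
model = LlamaForCausalLM.from_pretrained(hf_dir, torch_dtype=torch.float16)
print(model.config.num_hidden_layers)
```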

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if its a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extract hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
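
A hedged sketch of the underlying idea (dataset name and column names are illustrative, not the exact keys the NeMo dataloader expects): pull an ASR corpus straight from the Hugging Face Hub instead of a local JSON manifest.

```python
# Hedged sketch: stream an ASR corpus from the Hugging Face Hub.
# Dataset and column names are assumptions for illustration only.
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
for example in ds.take(3):
    audio = example["audio"]            # {'array': ..., 'sampling_rate': ...}
    print(audio["sampling_rate"], example["text"][:40])
```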

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
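
For readers unfamiliar with GQA, a hedged sketch of the bookkeeping the new config controls; num_query_groups is the Megatron-Core field name, and the exact NeMo YAML keys may differ.

```python
# Hedged sketch of grouped-query attention: fewer KV heads than query heads,
# with each KV head shared by a group of query heads.
num_attention_heads = 32
num_query_groups = 8                                   # KV heads; must divide the head count
assert num_attention_heads % num_query_groups == 0
queries_per_kv_head = num_attention_heads // num_query_groups
print(queries_per_kv_head)                             # -> 4
```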

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mixins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
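
A hedged sketch of what the factor does (RoPE position interpolation); the function below is illustrative, not the Megatron-Core implementation.

```python
# Hedged sketch: positions are scaled down by seq_len_interpolation_factor so a
# model trained at a shorter context can be run on longer sequences.
import torch

def interpolated_positions(seq_len: int, factor: float) -> torch.Tensor:
    return torch.arange(seq_len, dtype=torch.float32) / factor

print(interpolated_positions(8, 2.0))   # positions 0..7 squeezed into 0..3.5
```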

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
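
A hedged, stand-alone sketch of the flash-attn 2 kernel call itself, independent of how NeMo wires it into the attention module; it requires a CUDA GPU and fp16/bf16 tensors.

```python
# Hedged sketch: direct call into the flash-attn 2 package. Shapes follow the
# library's (batch, seqlen, nheads, headdim) convention.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.bfloat16)
out = flash_attn_func(q, k, v, causal=True)   # same shape as q
```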

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
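
A hedged toy example of the "multiple fields form a context" idea; the field names and template are made up for illustration only.

```python
# Hedged sketch: several manifest fields are concatenated into one context
# before the prompt template is filled in. Keys and template are illustrative.
record = {"system": "You are a helpful assistant.",
          "passage": "NeMo is a toolkit for conversational AI.",
          "question": "What is NeMo?"}
context_fields = ["system", "passage"]
template = "{context} {question} answer:"
context = " ".join(record[f] for f in context_fields)
print(template.format(context=context, question=record["question"]))
```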

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
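
A minimal hedged sketch of the fix, assuming pytorch_lightning >= 2.0: the config value becomes 'auto' because the Trainer no longer accepts strategy=None.

```python
# Hedged sketch: 'auto' replaces the old None default when the Trainer is
# built from a Hydra/OmegaConf config such as a NeMo TTS YAML.
from pytorch_lightning import Trainer

trainer = Trainer(devices=1, accelerator="auto", strategy="auto", max_epochs=1)
```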

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
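
A hedged sketch of toggling the option programmatically; it mirrors a command-line override such as model.megatron_amp_O2=True, and the surrounding keys are illustrative.

```python
# Hedged sketch: flip megatron_amp_O2 on an eval config with OmegaConf.
from omegaconf import OmegaConf

cfg = OmegaConf.create({"model": {"megatron_amp_O2": False}})
cfg.model.megatron_amp_O2 = True     # use the O2 (master-weight) path at eval
print(OmegaConf.to_yaml(cfg))
```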

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkensfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
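
For context, a hedged sketch of the API difference behind this fix: newer librosa releases renamed the filename keyword of get_duration to path.

```python
# Hedged sketch of the librosa.get_duration keyword change this fix addresses.
import librosa

duration = librosa.get_duration(path="sample.wav")        # librosa >= 0.10
# duration = librosa.get_duration(filename="sample.wav")  # older releases
print(f"{duration:.2f} s")
```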

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>
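
A hedged sketch of the user-facing surface this touches; file names and the trainer object are placeholders, and the distributed-checkpoint layout inside the .nemo archive is handled by save_to/restore_from.

```python
# Hedged sketch: .nemo save/restore around distributed checkpoints. Paths are
# placeholders; 'trainer' must be a configured lightning Trainer.
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel

# model.save_to("mcore_gpt.nemo")                                      # writes dist-ckpt shards
# restored = MegatronGPTModel.restore_from("mcore_gpt.nemo", trainer=trainer)
```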

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progres…
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
…DIA#9169)

* Fixes

* Docs fix

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* Add support for custom NeMo fields in Lhotse-NeMo adapters (attach to cut.custom)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* support distributed_fused_adam

Signed-off-by: zhehuaichen <[email protected]>

* Add support for sharded NeMo manifest files

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support megatron_amp_O2

Signed-off-by: zhehuaichen <[email protected]>

* Support heterogeneous sampling rates in non tarred NeMo manifests

* migrate to PTL2.0

Signed-off-by: stevehuang52 <[email protected]>

* clean up

Signed-off-by: stevehuang52 <[email protected]>

* update manifest util

Signed-off-by: stevehuang52 <[email protected]>

* Support multiple tokenizer/parser types, aggregate tokenizers, and custom language fields

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* agg and normal tokenizers actually work

* Support weights for NeMo tarred manifests

* Temporarily hardcoded pnc stripping/lowercasing

* fix

* make pnc hack configurable from the config and disabled by default

* fix the hack

* migrate to ptl2.1 to support multiple dataloaders

Signed-off-by: stevehuang52 <[email protected]>

* support encoder overwrite

Signed-off-by: zhehuaichen <[email protected]>

* update misc

Signed-off-by: stevehuang52 <[email protected]>

* fix eval and clean up

Signed-off-by: stevehuang52 <[email protected]>

* support add_sep for perception model

Signed-off-by: zhehuaichen <[email protected]>

* fix https://github.com/Lightning-AI/pytorch-lightning/issues/18803

Signed-off-by: zhehuaichen <[email protected]>

* add_bos

Signed-off-by: zhehuaichen <[email protected]>

* Transformer decoder with conditioning for canary (#8091)

* initial commit for multi-task conf-enc transf-dec for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing decoder states caching during training

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Option to limit the number of open streams (#8095)

* audio signal support in multi

Signed-off-by: zhehuaichen <[email protected]>

* update asr evaluator

Signed-off-by: stevehuang52 <[email protected]>

* fix from
https://github.com/NVIDIA/NeMo/commit/fcc0f9f6ff7947c3c7fba3ed17d8ec8af6391397
and
https://github.com/NVIDIA/NeMo/commit/f97c9016e6438ca4174b66bf9c3e248b28197aaa

Signed-off-by: zhehuaichen <[email protected]>

* transcribe fn for Canary models (#8110)

* improve readability

Signed-off-by: Krishna Puvvada <[email protected]>

* adding context in transcribe function for ConfTransfModels

Signed-off-by: Krishna Puvvada <[email protected]>

* supporting relative paths in transcribe function for canary

Signed-off-by: Krishna Puvvada <[email protected]>

* removing cuts.sort_by_duration in __getitem__ to maintain manifest order during inference

Signed-off-by: Krishna Puvvada <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Krishna Puvvada <[email protected]>
Co-authored-by: Krishna Puvvada <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* update for eval

Signed-off-by: stevehuang52 <[email protected]>

* update for evaluation

Signed-off-by: stevehuang52 <[email protected]>

* fix bleu

Signed-off-by: stevehuang52 <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Add missing audio_filepath validation for Canary (#8119)

* Add missing audio_filepath validation for Canary

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add default concat_sampling_probabilities

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse dataset in speechllm

Signed-off-by: zhehuaichen <[email protected]>

* bypass get_iterator_k_split

Signed-off-by: zhehuaichen <[email protected]>

* tmp fix

Signed-off-by: zhehuaichen <[email protected]>

* try to use fixed batch with megatron

Signed-off-by: zhehuaichen <[email protected]>

* add batch logging

Signed-off-by: zhehuaichen <[email protected]>

* support unfrozen llm

Signed-off-by: zhehuaichen <[email protected]>

* Create README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* Update README.md

Signed-off-by: He Huang (Steve) <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* rename

Signed-off-by: stevehuang52 <[email protected]>

* add llama prompt template

Signed-off-by: zhehuaichen <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* support sample alpha

Signed-off-by: zhehuaichen <[email protected]>

* support lhotse validation set and canary pretrained ckpt with pseudo label

Signed-off-by: zhehuaichen <[email protected]>

* make sure backward compatibility

Signed-off-by: zhehuaichen <[email protected]>

* remove pad

Signed-off-by: zhehuaichen <[email protected]>

* make sure asr_model is frozen

Signed-off-by: zhehuaichen <[email protected]>

* support greedy decoding

Signed-off-by: zhehuaichen <[email protected]>

* valid on lhotse

Signed-off-by: zhehuaichen <[email protected]>

* fix multi dataloader in val case for lhotse SALM; add default data
names; keep asr model tokenizer by default to enable adding canary
dataset

Signed-off-by: zhehuaichen <[email protected]>

* remove the bruteforce _keep_special_tokens implementation

Signed-off-by: zhehuaichen <[email protected]>

* decoding_ratio and convert_canary_prompt_to_text support

Signed-off-by: zhehuaichen <[email protected]>

* canary_tokens_augment_ratio

Signed-off-by: zhehuaichen <[email protected]>

* debug

Signed-off-by: zhehuaichen <[email protected]>

* bug fix

Signed-off-by: zhehuaichen <[email protected]>

* fix lhotse based eval of llama canary model

Signed-off-by: zhehuaichen <[email protected]>

* support some overwrite for eval

Signed-off-by: zhehuaichen <[email protected]>

* support zero shot prompt in training

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* support cross attention based SALM

Signed-off-by: zhehuaichen <[email protected]>

* fix for batch train/valid of cross

Signed-off-by: zhehuaichen <[email protected]>

* support learnable gate and plotting

Signed-off-by: zhehuaichen <[email protected]>

* support using pseudo label in prompt rather than cross att

Signed-off-by: zhehuaichen <[email protected]>

* bug fix for perception cfg and context tokens shift

Signed-off-by: zhehuaichen <[email protected]>

* IdentityConnectorsAdd

Signed-off-by: zhehuaichen <[email protected]>

* fix ckpt saving

Signed-off-by: zhehuaichen <[email protected]>

* Support RnnGatedCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* add include_ffw and fix _optimizer_param_groups for all unfrozen run

Signed-off-by: zhehuaichen <[email protected]>

* support grad acc when using bucket

Signed-off-by: zhehuaichen <[email protected]>

* support TransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ProjectTransformerCrossAttention

Signed-off-by: zhehuaichen <[email protected]>

* support ++model.use_am_tokenizer ++model.override_vocab_size ++model.override.hidden_size

Signed-off-by: zhehuaichen <[email protected]>

* support question set on val without canary

Signed-off-by: zhehuaichen <[email protected]>

* support load_audio_encoder and wip in optim_param_groups

Signed-off-by: zhehuaichen <[email protected]>

* minor fix for audio pretrain model init

Signed-off-by: zhehuaichen <[email protected]>

* simplify canary_tokens_augment

Signed-off-by: zhehuaichen <[email protected]>

* use question in the manifest if it exists

Signed-off-by: zhehuaichen <[email protected]>

* support dataset weighting for non tar

Signed-off-by: zhehuaichen <[email protected]>

* Update SpeechLLM code (#8475)

* add pleasefixme marker for potential failed nightly tests. (#7678)

Signed-off-by: Xuesong Yang <[email protected]>

* Add new text segmentation library for better TTS quality (#7645)

* Add new text segmentation library for better TTS quality
* Update zh_cn_pinyin.py

added detailed instruction on how to install pkuseg.

Signed-off-by: Xuesong Yang <[email protected]>

* Update requirements_tts.txt

remove pkuseg as the default dependency of NeMo TTS, and instead, direct users to manually install pkuseg if they really need it.

Signed-off-by: Xuesong Yang <[email protected]>


---------

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer (#7767) (#7774)

* Create PrecisionPlugin for megatron_ckpt_to_nemo.py trainer



* Add ddp_find_unused_parameters_true for punctuation_capitalization_train_evaluate.py



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add '32-true' for precision values



---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(clustering_diarizer.py): fix typo (#7772)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* fix(diarization-README): typo (#7771)

Signed-off-by: Jean-Louis Queguiner <[email protected]>

* Fix bug wrt change decoding strategy for bpe models (#7762) (#7764)

* Fix bug wrt change decoding strategy for bpe models



* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Remove incorrect extra argument for load_from_checkpoint_dir() (#7500)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Add nemo to mcore GPT conversion script  (#7730)

* add conversion script

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove references to 'ckpt'

Signed-off-by: Chen Cui <[email protected]>

* add one more sanity check to make sure there are no unexpected keys in the state dict

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* make cpu loading work

Signed-off-by: Chen Cui <[email protected]>

* make script work for llama2 models

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* address code check

Signed-off-by: Chen Cui <[email protected]>

* remove trainer precision (was for old sanity check)

Signed-off-by: Chen Cui <[email protected]>

* fix script for llama2 model

Signed-off-by: Chen Cui <[email protected]>

* remove commented code

Signed-off-by: Chen Cui <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Fix bug in ConditionalInput: cat along the feature dim, not the batch dim (#7785)

Signed-off-by: anferico <[email protected]>

* Add some docs and update scripts for ASR (#7790)

* Add some docs and update scripts

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Signed-off-by: Somshubra Majumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* set context for text memmap to fork (#7784)

* set context for text memmap to fork

Signed-off-by: arendu <[email protected]>

* typo

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>

* add training with multiple audios

Signed-off-by: stevehuang52 <[email protected]>

* Support flash decoding (#7744)

* Add flash-decoding

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7761)

* Change accelerator to 'auto' in nlp_checkpoint_port.py (#7747)

* Change accelerator to auto

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in nlp_checkpoint_port.py

Signed-off-by: Abhishree <[email protected]>

* Pass omegaconf object to trainer in export.py

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* docs: fix typos (#7758)

Signed-off-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Signed-off-by: Abhishree <[email protected]>

* Snake act (#7736)

Signed-off-by: Abhishree <[email protected]>

* Update gpt_dataset.py (#6963)

Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>
Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: shuoer86 <[email protected]>
Signed-off-by: Xin Yao <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: shuoer86 <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>
Co-authored-by: Xin Yao <[email protected]>
Co-authored-by: Sandeep Subramanian <[email protected]>

* Add selection criteria for reference audios in the `GlobalStyleToken` submodule (#7788)

* add selection criteria for reference audios

Signed-off-by: anferico <[email protected]>

* Update configuration files

Signed-off-by: anferico <[email protected]>

* add informative comment in config files

Signed-off-by: anferico <[email protected]>

* sample random index for reference audio selection

Signed-off-by: anferico <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: anferico <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* update text server to support compute logprobs (#7733)

* update text server to support compute logprobs

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

---------

Signed-off-by: Zhilin Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* add multi-layer feat extract and fix random question insertion

Signed-off-by: stevehuang52 <[email protected]>

* Configure MCore logger (#7781)

Signed-off-by: Mikołaj Błaż <[email protected]>

* Revert "PEFT eval fix (#7626) (#7638)" (#7693)

This reverts commit c24bb454bf1fa6f5820f1805c6387254a73220b9.

* remove TN from ctc_segm tut (#7807)

Signed-off-by: Evelina <[email protected]>

* [TTS] Support audio offsets in TTS data loaders (#7156)

* [TTS] Support audio offsets in TTS data loaders

Signed-off-by: Ryan <[email protected]>

* [TTS] Change docstring mentions of .pt to .npy

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>
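
For #7156 ("Support audio offsets in TTS data loaders"), the snippet below is an illustration of reading audio from a given offset; the file name is a placeholder and NeMo's loaders may implement offsets differently (e.g. by frame slicing).

```python
# Illustration only: load a 3-second segment starting 1.5 s into the file,
# which is the kind of capability the commit adds to the TTS data loaders.
import librosa

samples, sr = librosa.load("sample.wav", sr=None, offset=1.5, duration=3.0)
print(samples.shape, sr)
```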

* Update Apex install command in Dockerfile (#7794) (#7804)

* move core install to /workspace (#7706)



* update apex install in dockerfile



* use fetch head



---------

Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: eharper <[email protected]>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Abhinav Khattar <[email protected]>

* fix typo

Signed-off-by: stevehuang52 <[email protected]>

* Nemo to HF converter for LLaMA model (#7770)

* Create config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Add files via upload

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update config_llama_truncate.yaml

Signed-off-by: Utkarsh <[email protected]>

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update convert_nemo_llama_to_hf.py

Signed-off-by: Utkarsh <[email protected]>

* clean up trainer

* remove dependency on yaml config. load config from nemo file instead.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* enable ckpt saving into other precision formats

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support 70b + cleanup qkv slice logic

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug

* move hf model folder code from comment to function and add instruction to run

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Utkarsh <[email protected]>
Signed-off-by: Chen Cui <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>
Co-authored-by: Chen Cui <[email protected]>

* Save best NeMo model only when necessary (#7836)

Signed-off-by: Ante Jukić <[email protected]>

* add guard if its a distributed checkpoint (#7845)

Signed-off-by: Gerald Shen <[email protected]>

* Fix tn duplex (#7808)

* fix duplex tn infer

Signed-off-by: Evelina <[email protected]>

* fix typo

Signed-off-by: Evelina <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix TN docs

Signed-off-by: Evelina <[email protected]>

---------

Signed-off-by: Evelina <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update transformers cache on Jenkins (#7854)

* update transformers cache

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* add cd

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>

* Update README.rst for container update (#7844)

Signed-off-by: fayejf <[email protected]>

* Add support for finetuning with huggingface datasets (#7834)

* add finetune with huggingface dataset

Signed-off-by: stevehuang52 <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update yaml

Signed-off-by: stevehuang52 <[email protected]>

* update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* add extract hf text and update

Signed-off-by: stevehuang52 <[email protected]>

* update and refactor

Signed-off-by: stevehuang52 <[email protected]>

* move dataset dependency to common

Signed-off-by: stevehuang52 <[email protected]>

* add docstring

Signed-off-by: stevehuang52 <[email protected]>

* Add to Docs

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add ci test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add max steps in jenkins

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* reduce max steps

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* jenkins test

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* add bs=2

Signed-off-by: Nithin Rao Koluguri <nithinraok>

---------

Signed-off-by: stevehuang52 <[email protected]>
Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>
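
As a hypothetical sketch of what #7834 ("finetuning with huggingface datasets") enables: an ASR corpus is pulled straight from the Hugging Face Hub. The dataset name and field names below are assumptions, not the keys wired into NeMo's finetuning config.

```python
# Assumed dataset and fields; real configs may use different names.
from datasets import load_dataset

ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
for example in ds.take(2):
    audio = example["audio"]          # dict with "array" and "sampling_rate"
    print(audio["sampling_rate"], example["text"][:40])
```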

* Multimodal merge (#7728)

* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mixins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow Adi's suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
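
Sketch of the concept behind #7147 ("multiple fields can form a context"): several record fields are stitched into one prompt through a template. Field names and template syntax here are illustrative, not NeMo's exact configuration keys.

```python
# Assumed field names and template; the real config drives this from YAML.
record = {
    "system": "You are a helpful assistant.",
    "document": "NeMo is a toolkit for conversational AI.",
    "question": "What is NeMo?",
}
template = "{system}\n\nContext: {document}\n\nQ: {question}\nA:"
print(template.format(**record))
```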

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
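
For #7326 ("enable selective unfreeze"), the generic PyTorch pattern is shown below as a sketch of the concept, not NeMo's implementation: freeze everything, then re-enable gradients only for parameters whose names match chosen prefixes.

```python
# Toy model and prefix choice are placeholders.
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
trainable_prefixes = ("1.",)  # e.g. unfreeze only the final layer

for name, param in model.named_parameters():
    param.requires_grad = name.startswith(trainable_prefixes)

print([name for name, p in model.named_parameters() if p.requires_grad])
```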

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
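
The fix in #7369 swaps a `strategy=None` Trainer argument for `"auto"`; under PyTorch Lightning 2.x, `None` is not an accepted value. A minimal illustration (exact version behavior may differ):

```python
# "auto" is the accepted strategy value in Lightning 2.x; None is rejected.
import pytorch_lightning as pl

trainer = pl.Trainer(accelerator="auto", devices=1, strategy="auto", max_epochs=1)
print(type(trainer.strategy).__name__)
```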

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>
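
The notebook fix in #7376 concerns how `librosa.get_duration()` is called: recent librosa releases make its arguments keyword-only, so a positional call fails. The keyword forms below are the supported ones; the exact change made in the notebook may differ, and file names are placeholders.

```python
# A positional call like librosa.get_duration("file.wav") raises a TypeError
# on recent librosa versions; use the keyword forms instead.
import librosa

duration_from_file = librosa.get_duration(path="sample.wav")   # from a file on disk
y, sr = librosa.load("sample.wav", sr=None)
duration_from_array = librosa.get_duration(y=y, sr=sr)         # from loaded samples
print(duration_from_file, duration_from_array)
```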

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>
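
For context on #7343: from the caller's side the `.nemo` save/restore API stays the same; what the PR adds is handling of distributed (sharded) checkpoints inside the archive. A hedged sketch of the unchanged user-facing calls, with paths as placeholders:

```python
# Paths are placeholders; Megatron-based models normally also need NeMo's
# NLPDDPStrategy / megatron trainer setup, omitted here for brevity.
import pytorch_lightning as pl
from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel

trainer = pl.Trainer(accelerator="gpu", devices=1, strategy="auto")
model = MegatronGPTModel.restore_from(restore_path="/path/to/model.nemo", trainer=trainer)
model.save_to("/path/to/resaved_model.nemo")
```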

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progres…
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* ControlNet TRT export

* Final MR before release

* SD2 update

* Fixed export issue

* Fix for instruct p2p and reformat

* Fix SD export issue

* Add nemo clip export for DB

* Fix ins pix2pix

* fix sd2 config

* [Mingyuan Ma] BF16 and SD conversion script

* [Imagen] NHWC Feature

* Fix .nemo loading issue for NeMo CLIP in SD

* NeMo r1.20.0 Multimodal Merge

* fix the inductor issue in inference

* Fix inductor loading .nemo issue

* Add Neva Model Support

* Imagen Optimizations

* Neva inference code

* NeMo TOT 1.21 to Internal/main

* Update neva_inference.yaml

* REBASING  for latest code changes

* Update internal/main to main tot

* Parallel DDIM implementation

* 1. Fixing indentation bug. (#7352)

Signed-off-by: Micha Livne <[email protected]>

* NeMo MCore llama2 support + MCore PEFT adapters (#7299)

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove imports

Signed-off-by: ericharper <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* mcore llama2 ckpt conversion & small fix

Signed-off-by: jasonwan <[email protected]>

* Add inference & sft config by Hongbin

Co-authored-by: Hongbin Liu <[email protected]>

Signed-off-by: jasonwan <[email protected]>

* fix config

Signed-off-by: jasonwan <[email protected]>

* add inference param. update TP/PP script to support mcore gpt

Signed-off-by: jasonwan <[email protected]>

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* modify ckpt conversion script (adding model cast)

Signed-off-by: jasonwan <[email protected]>

* ckpt conversion use relative path for config

Signed-off-by: jasonwan <[email protected]>

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* set vp size to none if it is 1

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add TransformerConfig

Signed-off-by: ericharper <[email protected]>

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* add todo

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove import

Signed-off-by: ericharper <[email protected]>

* small clean up

Signed-off-by: ericharper <[email protected]>

* update hidden size in peft base model, add mcore commit to jenkins

Signed-off-by: ericharper <[email protected]>

* update module args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add config obj to flash attention tests

Signed-off-by: ericharper <[email protected]>

* remove args

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove sequence parallel arg

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* add config to test

Signed-off-by: ericharper <[email protected]>

* get hidden_size from config

Signed-off-by: ericharper <[email protected]>

* add try except

Signed-off-by: ericharper <[email protected]>

* use default

Signed-off-by: ericharper <[email protected]>

* update config with hidden size

Signed-off-by: ericharper <[email protected]>

* remove arg

Signed-off-by: ericharper <[email protected]>

* comment out jenkins test

Signed-off-by: ericharper <[email protected]>

* revert import

Signed-off-by: ericharper <[email protected]>

* remove optimizer_idx

Signed-off-by: eharper <[email protected]>

* prefetch num microbatches

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start adding gpt from megatron core path

Signed-off-by: ericharper <[email protected]>

* set model parallel config

Signed-off-by: ericharper <[email protected]>

* use model parallel config object

Signed-off-by: ericharper <[email protected]>

* update args

Signed-off-by: ericharper <[email protected]>

* fix for p-tuning sequence parallel

Signed-off-by: jasonwan <[email protected]>

* support SFT/distOpt mcore (#7207)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* start updating to TransformerConfig

Signed-off-by: ericharper <[email protected]>

* revert to model parallel config

Signed-off-by: ericharper <[email protected]>

* add hidden_size to model_parallel_config

Signed-off-by: ericharper <[email protected]>

* remove imports

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update module args

Signed-off-by: ericharper <[email protected]>

* add config to self

Signed-off-by: ericharper <[email protected]>

* build transformer config

Signed-off-by: ericharper <[email protected]>

* add model to provider func

Signed-off-by: ericharper <[email protected]>

* update forward and float16 wrapper

Signed-off-by: ericharper <[email protected]>

* instantiate model parallel config after init model parallel

Signed-off-by: ericharper <[email protected]>

* set virtual rank

Signed-off-by: ericharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add GQA config to megatron gpt model (#7096)

* Add GQA config in gpt config file

Signed-off-by: jasonwan <[email protected]>

* Verify mcore is enabled when using GQA

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>

* revert

Signed-off-by: ericharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rollback model cast for p-tuning

Signed-off-by: jasonwan <[email protected]>

* update for dist adam

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* use get_gpt_module_list

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* ptl2.0 patch for llama config

Signed-off-by: jasonwan <[email protected]>

* add plugins to trainer in scripts

Signed-off-by: jasonwan <[email protected]>

* fix activation checkpointing mcore

Signed-off-by: jasonwan <[email protected]>

* fix variable names

Signed-off-by: jasonwan <[email protected]>

* overwrite normalization type for mcore/te

Signed-off-by: jasonwan <[email protected]>

* Update megatron_llama_sft.yaml

Signed-off-by: Jason Wang <[email protected]>

* add PEFT adapter support for mcore gpt path (#7276)

* implementation for mcore adapter/mxins

Signed-off-by: jasonwan <[email protected]>

* small fix for lora and ptuning

Signed-off-by: jasonwan <[email protected]>

* support layerwise peft

Signed-off-by: jasonwan <[email protected]>

* support multiple target layers

Signed-off-by: jasonwan <[email protected]>

* support lora GQA

Signed-off-by: jasonwan <[email protected]>

* support amp O2

Signed-off-by: jasonwan <[email protected]>

* revert & more O2 fix

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* lora inject to attention

Signed-off-by: jasonwan <[email protected]>

* support lora weight tying

Signed-off-by: jasonwan <[email protected]>

* add copyright header

Signed-off-by: jasonwan <[email protected]>

* rollback ptuning name change. full string match mcore target

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove comment

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* clean up config

Signed-off-by: jasonwan <[email protected]>

* Sync llama branch (#7297)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* change layer names for SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug in SFT

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: cpu initialization is not really enabled

Signed-off-by: Hongbin Liu <[email protected]>

* add use_cpu_initialization to TransformerConfig

Signed-off-by: Hongbin Liu <[email protected]>

* fix bug: wrong config path when using relative ckpt path

Signed-off-by: Hongbin Liu <[email protected]>

* revert mcore config change

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* clean up ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* rollback git merge errors

Signed-off-by: jasonwan <[email protected]>

* update mcore, add check for mcore+te

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* formatting

Signed-off-by: jasonwan <[email protected]>

* make sft test dataset optional. fix indentation in config

Signed-off-by: jasonwan <[email protected]>

* one more fix for optional test set

Signed-off-by: jasonwan <[email protected]>

* support merging lora weights in mcore

Signed-off-by: jasonwan <[email protected]>

* update mcore for cpu init

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update ckpt conversion for code llama

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add seq_len_interpolation_factor support for long-context llama ckpts (#7312)

* add inference param. update TP/PP script to support mcore gpt

* p-tuning

Signed-off-by: jasonwan <[email protected]>

* add seq_len_interpolation_factor

Signed-off-by: Hongbin Liu <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: jasonwan <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* fix old ptuning model, update mcore to support seq_len_interpolation_factor

Signed-off-by: jasonwan <[email protected]>

* support fused layernorm linear, fix ptuning O2

Signed-off-by: jasonwan <[email protected]>

* drop loss mask for mcore for now

Signed-off-by: jasonwan <[email protected]>

* disable dist ckpt in peft

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix loading non dist ckpt

Signed-off-by: jasonwan <[email protected]>

* add ckpt conversion to CI

Signed-off-by: jasonwan <[email protected]>

* update CI

Signed-off-by: jasonwan <[email protected]>

* mcore_mixin docstring

Signed-off-by: jasonwan <[email protected]>

* minor change in mcore peft error message

Signed-off-by: jasonwan <[email protected]>

* fix amp o2 in lora weight tying

Signed-off-by: jasonwan <[email protected]>

* correct mcore fp8 config

Signed-off-by: jasonwan <[email protected]>

* add TE installation

Signed-off-by: jasonwan <[email protected]>

* support mcore adapter tuning

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* comment out new CI test. rollback docker image

Signed-off-by: jasonwan <[email protected]>

* ignore FA tests, try new CI on 23.08

Signed-off-by: jasonwan <[email protected]>

* mark new CI as L2, put to beginning to test

Signed-off-by: jasonwan <[email protected]>

* minor fix for prompt learning

Signed-off-by: jasonwan <[email protected]>

* rollback to 23.06. comment out CI

Signed-off-by: jasonwan <[email protected]>

* minor fix ckpt conversion script

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor rollback gpt model change

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: ericharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Signed-off-by: eharper <[email protected]>
Signed-off-by: Hongbin Liu <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: eharper <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>
Co-authored-by: Kelvin Liu <[email protected]>

* Hiddens modules documentation (#7303)

* 1. Changed hiddens transformations module from `transformations` to `hiddens`.

Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Finished doc.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging. Signed-off-by: Micha Livne <[email protected]>

---------

Signed-off-by: Micha Livne <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Support for flash attention 2.0 (#7063)

* Add flash attn 2

Signed-off-by: MaximumEntropy <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA2 feature

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove debugging

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: MaximumEntropy <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Oleksii Kuchaiev <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* lora merge fix for O2 names (#7325)

* wip

Signed-off-by: arendu <[email protected]>

* adjust key names based on O2

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* minor

Signed-off-by: arendu <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* multiple fields can form a context (#7147)

* list of context fields and flexible prompt template

Signed-off-by: arendu <[email protected]>

* list of fields for context

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add multiple truncation fields and middle truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Compatible to old ckpt

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix tokenize detokenize issue

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove detokenization, add truncation augmentation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Resolve comments

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove unused import

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* revert eos

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Add tokenizer space_sensitive attribute

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Fix error and use re

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Change assert logic

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Follow adi suggestion

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove merge function

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add example and comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove context_key and add comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove random truncation

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix template none

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: arendu <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Load buffers in checkpoint (#7357)

Signed-off-by: Jason Wang <[email protected]>

* Add migration guide for lightning 2.0 upgrade (#7360)

* Add lightning 2.0 migration guide in NeMo docs

Signed-off-by: Abhishree <[email protected]>

* Add remaining guide for lightning 2.0 upgrade

Signed-off-by: Abhishree <[email protected]>

* Remove line spill over and continue in next line

Signed-off-by: Abhishree <[email protected]>

* Add missing dataloader_iter in the guide

Signed-off-by: Abhishree <[email protected]>

* Fix minor typo

Signed-off-by: Abhishree <[email protected]>

---------

Signed-off-by: Abhishree <[email protected]>

* adding bias_dropout_add_fusion option for BERT (#7332)

Signed-off-by: Alexander Jipa <[email protected]>
Co-authored-by: Alexander Jipa <[email protected]>

* [TTS] Change audio codec token type to TokenIndex (#7356)

Signed-off-by: Ryan <[email protected]>

* enable selective unfreeze (#7326)

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* wip

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* avoid PTL method conflicts

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix typos (#7361)

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typos

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

* fix typo

Signed-off-by: omahs <[email protected]>

---------

Signed-off-by: omahs <[email protected]>

* pin numba=0.57.1 to fix reinstall.sh error (#7366)

Signed-off-by: Xuesong Yang <[email protected]>

* Update new conversion script for converting safetensors.

* Upgrade pytorch container to 23.08 (#7353)

* upgrade pytorch container

Signed-off-by: eharper <[email protected]>

* use mcore

Signed-off-by: eharper <[email protected]>

* revert test change

Signed-off-by: eharper <[email protected]>

* pleasefixme

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for ampere

Signed-off-by: eharper <[email protected]>

* comment test temporarily

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* enable fp32 optimizer for output_layer in mcore (#7355)

Signed-off-by: lhb8125 <[email protected]>

* revert comment (#7368)

Signed-off-by: eharper <[email protected]>

* Update to core 23.08 branch ToT (#7371)

Signed-off-by: Abhinav Khattar <[email protected]>

* upper bounding ptl (#7370)

Signed-off-by: eharper <[email protected]>

* fix pipeline parallel inference (#7367)

* fix pp inference

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix for peft tied weights (#7372)

Signed-off-by: arendu <[email protected]>

* fixed trainer.strategy=auto from None. (#7369)

Signed-off-by: Xuesong Yang <[email protected]>
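
A minimal illustration of the fix above, assuming a PyTorch Lightning 2.x `Trainer`: `strategy=None` is no longer accepted, so single-device runs pass the default `"auto"` instead (in Hydra YAML, `trainer.strategy: auto`).

```python
# Sketch only; the argument values are assumptions, not the exact NeMo config.
from pytorch_lightning import Trainer

trainer = Trainer(devices=1, accelerator="auto", strategy="auto")  # was strategy=None
```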

* add O2 option in gpt eval (#7358)

* add O2 option in eval

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add doc for O2 config

Signed-off-by: jasonwan <[email protected]>

* add to llama inference config

Signed-off-by: jasonwan <[email protected]>

---------

Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* Move model precision copy (#7336)

* move cfg precision set to megatron base model

Signed-off-by: Maanu Grover <[email protected]>

* remove copy from other models

Signed-off-by: Maanu Grover <[email protected]>

* modify attribute not arg

Signed-off-by: Maanu Grover <[email protected]>

* fix gpt model test for ptl 2.0

Signed-off-by: Maanu Grover <[email protected]>

* rename function and add docstring

Signed-off-by: Maanu Grover <[email protected]>

* replace precision to dtype conditionals with func call

Signed-off-by: Maanu Grover <[email protected]>

* unnecessary function and cfg reset

Signed-off-by: Maanu Grover <[email protected]>

* set default value

Signed-off-by: Maanu Grover <[email protected]>

* fix precision lookup in a few more places

Signed-off-by: Maanu Grover <[email protected]>

* rename mapping function

Signed-off-by: Maanu Grover <[email protected]>

* unused import

Signed-off-by: Maanu Grover <[email protected]>

* save torch datatype to model

Signed-off-by: Maanu Grover <[email protected]>

* set weights precision wrt amp o2

Signed-off-by: Maanu Grover <[email protected]>

* Revert "set weights precision wrt amp o2"

This reverts commit 313a4bfe5eb69d771a6d2433898c0685836aef5c.

Signed-off-by: Maanu Grover <[email protected]>

* revert half precision at inference attempt

Signed-off-by: Maanu Grover <[email protected]>

* move autocast dtype to base model

Signed-off-by: Maanu Grover <[email protected]>

* move params dtype to base model, enable fp16 O2 inf

Signed-off-by: Maanu Grover <[email protected]>

* unused imports

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Fix PEFT checkpoint loading (#7388)

* Fix PEFT checkpoint loading

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Use distributed optimizer support for multiple dtypes (#7359)

* Update distopt wrapper with multiple dtype support

Remove manual handling of separate FP32 optimizer.

Signed-off-by: Tim Moon <[email protected]>

* Use distopt support for contiguous buffers with multiple dtypes

Signed-off-by: Tim Moon <[email protected]>

* Fix typo

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Separate distopt buckets for first GPT layer and non-overlapped params

Signed-off-by: Tim Moon <[email protected]>

* Add distopt logic for int dtypes

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit

Signed-off-by: Tim Moon <[email protected]>

* Remove unused variables

Signed-off-by: Tim Moon <[email protected]>

* Update Apex commit in README and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

* Debug Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <[email protected]>

* minor fix for llama ckpt conversion script (#7387)

* minor fix for llama ckpt conversion script

Signed-off-by: Jason Wang <[email protected]>

* Update Jenkinsfile

Signed-off-by: Jason Wang <[email protected]>

* remove fast_swiglu configuration

Signed-off-by: Jason Wang <[email protected]>

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix wrong calling of librosa.get_duration() in notebook (#7376)

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* [PATCH] PEFT import mcore (#7393)

* [PATCH] PEFT import mcore

Signed-off-by: Jason Wang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jason Wang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Added a callback for logging initial data (#7384)

Signed-off-by: Ante Jukić <[email protected]>

* Update Core Commit (#7402)

* Update Core Commit

Signed-off-by: Abhinav Khattar <[email protected]>

* update commit

Signed-off-by: Abhinav Khattar <[email protected]>

---------

Signed-off-by: Abhinav Khattar <[email protected]>

* Use cfg attribute in bert (#7394)

* use cfg attribute instead of arg

Signed-off-by: Maanu Grover <[email protected]>

* use torch_dtype in place of cfg.precision

Signed-off-by: Maanu Grover <[email protected]>

* move precision copy before super constructor

Signed-off-by: Maanu Grover <[email protected]>

* use trainer arg

Signed-off-by: Maanu Grover <[email protected]>

---------

Signed-off-by: Maanu Grover <[email protected]>

* Add support for bias conversion in Swiglu models (#7386)

* Add support for bias conversion in Swiglu models

Signed-off-by: smajumdar <[email protected]>

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add support for auto extracting tokenizer model

Signed-off-by: smajumdar <[email protected]>

* Fix issue with missing tokenizer

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* Refactor

Signed-off-by: smajumdar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Update save_to and restore_from for dist checkpointing (#7343)

* add dist ckpt to save to, in progress

Signed-off-by: eharper <[email protected]>

* move dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* clean up

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update restore from, need to figure out how to initialize distributed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* launch distrib if needed when restoring dist ckpt

Signed-off-by: eharper <[email protected]>

* when using mcore we can change tp pp on the fly

Signed-off-by: eharper <[email protected]>

* add load_from_checkpoint support for dist ckpt

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update llama convert script to save dist .nemo

Signed-off-by: eharper <[email protected]>

* fix load dist ckpt

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup TE TP groups if needed

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* setup te tp groups if needed

Signed-off-by: eharper <[email protected]>

* remove import

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Signed-off-by: jasonwan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: jasonwan <[email protected]>

* fix forward for with mcore=false (#7403)

Signed-off-by: Jimmy Zhang <[email protected]>
Co-authored-by: Jimmy Zhang <[email protected]>

* Fix logging to remove 's/it' from progress bar in Megatron models and add train_step_timing (#7374)

* Add CustomProgressBar class to exp_manager and trainer callbacks

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix the progress bar to reflect total microbatch cnt

Signed-off-by: Abhishree <[email protected]>

* Modify CustomProgressBar class

1) Modify CustomProgressBar class to update progress bar per global_step instead of per microbatch
2) Add the callback to other megatron training/finetuning files that are not using MegatronTrainerBuilder

Signed-off-by: Abhishree <[email protected]>
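
A rough sketch (not the actual NeMo `CustomProgressBar`) of the idea in item 1) above: drive the bar off `trainer.global_step` so gradient-accumulation microbatches do not inflate the displayed count.

```python
from pytorch_lightning.callbacks import TQDMProgressBar


class GlobalStepProgressBar(TQDMProgressBar):
    """Advance the training bar once per optimizer (global) step."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Set the bar position from the global step rather than the microbatch index.
        self.train_progress_bar.n = trainer.global_step
        self.train_progress_bar.refresh()


# Hypothetical usage: attach it like any other callback, e.g.
# pytorch_lightning.Trainer(callbacks=[GlobalStepProgressBar()])
```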

* Add CustomProgressBar callback to tuning files

Signed-off-by: Abhishree <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Set Activation Checkpointing Defaults (#7404)

* Set Activation Checkpointing Defaults

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* check for None

Signed-off-by: Abhinav Khattar <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhinav Khattar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* make loss mask default to false (#7407)

Signed-off-by: eharper <[email protected]>

* Add dummy userbuffer config files (#7408)

Signed-off-by: Sangkug Lym <[email protected]>

* add missing ubconf files (#7412)

Signed-off-by: Abhinav Khattar <[email protected]>

* New tutorial on Speech Data Explorer (#7405)

* Added Google Colab based tutorial on Speech Data Explorer 

Signed-off-by: George Zelenfroynd <[email protected]>

* Update ptl training ckpt conversion script to work with dist ckpt (#7416)

* update ptl convert script

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* don't break legacy

Signed-off-by: eharper <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Allow disabling sanity checking when num_sanity_val_steps=0 (#7413)

* Allow disabling sanity checking when num_sanity_val_steps=0

Signed-off-by: Abhishree <[email protected]>

* Update num_sanity_val_steps to be a multiple of num_microbatches

Signed-off-by: Abhishree Thittenamane <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <[email protected]>
Signed-off-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
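
A hypothetical helper showing the intent of the `num_sanity_val_steps` adjustment above; the rounding rule and names are assumptions, not the exact NeMo logic.

```python
def sanity_val_steps_as_multiple(requested_steps: int, num_microbatches: int) -> int:
    """Round the requested sanity-check steps up to a whole multiple of num_microbatches."""
    if requested_steps == 0:  # 0 keeps sanity checking disabled
        return 0
    groups = -(-requested_steps // num_microbatches)  # ceiling division
    return groups * num_microbatches


assert sanity_val_steps_as_multiple(2, 4) == 4
assert sanity_val_steps_as_multiple(0, 4) == 0
```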

* Add comprehensive error messages (#7261)

Signed-off-by: Anton Peganov <[email protected]>

* check NEMO_PATH (#7418)

Signed-off-by: Nikolay Karpov <[email protected]>

* layer selection for ia3 (#7417)

* layer selection for ia3

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix missing pip package 'einops' (#7397)

Signed-off-by: Robin Dong <[email protected]>

* Fix failure of pyaudio in Google Colab (#7396)

Signed-off-by: Robin Dong <[email protected]>

* Update README.md: output_path --> output_manifest_filepath (#7442)

Signed-off-by: Samuele Cornell <[email protected]>

* Updating FlashAttention API to match FlashAttentionV2

* Multiple fixes for mm

* Fix CI inductor issue and update to torch compile

* Remove suppress error

* Fix when conversion config uses fp16 and it complains about precision plugin

* Fixing FAv2 API usage

* Initial release of content filtering model

* Added synthetic dataloader for precached and online mode

* Mingyuanm/dreambooth opt

* Add llama2 support in neva training

* Fix sampler length

* Fix all precision issues in nemo multimodal

* Add rope dynamic linear scaling (#7437)

* Add dynamic linear scaling

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
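
The commit above only names the feature; as general background, dynamic linear scaling of rotary position embeddings (RoPE) divides position indices by a factor derived from the current sequence length so rotary phases stay within the trained range. A hypothetical sketch (all names and defaults are assumptions, not NeMo's implementation):

```python
import torch


def rope_angles_dynamic_linear(seq_len: int, dim: int, base: float = 10000.0,
                               trained_max_len: int = 4096) -> torch.Tensor:
    # Scale positions down only when the sequence exceeds the trained context length.
    scale = max(1.0, seq_len / trained_max_len)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float() / scale
    return torch.outer(positions, inv_freq)  # [seq_len, dim // 2] rotation angles
```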

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Yang Zhang <[email protected]>

* Fix None dataloader issue in PTL2.0 (#7455)

* Fix None dataloader issue in PTL2.0

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* updating values of self._validation_dl and self._test_dl as well

Signed-off-by: KunalDhawan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [ASR] Confidence measure -> method renames (#7434)

* measure -> method

Signed-off-by: Aleksandr Laptev <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Aleksandr Laptev <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add steps for document of getting dataset 'SF Bilingual Speech' (#7378)

* Add steps for document of getting dataset 'SF Bilingual Speech'

Signed-off-by: Robin Dong <[email protected]>

* Update datasets.rst

added a link to a tutorial demonstrating detailed data prep steps.

Signed-off-by: Xuesong Yang <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* RNN-T confidence and alignment bugfix (#7381)

* new frame_confidence and alignments lists are now always created after the while loop

Signed-off-by: Aleksandr Laptev <[email protected]>

* tests added

Signed-off-by: Aleksandr Laptev <[email protected]>

---------

Signed-off-by: Aleksandr Laptev <[email protected]>

* Fix resume from checkpoint in exp_manager (#7424) (#7426)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>
Co-authored-by: Eric Harper <[email protected]>

* Fix checking of cuda/cpu device for inputs of Decoder (#7444)

* Fix checking of cuda/cpu device for inputs of Decoder

Signed-off-by: Robin Dong <[email protected]>

* Update tacotron2.py

Signed-off-by: Jason <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Jason <[email protected]>
Co-authored-by: Jason <[email protected]>

* Fix failure of ljspeech's get_data.py (#7430)

* Fix failure of ljspeech's get_data.py

Signed-off-by: Robin Dong <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Robin Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* [TTS] Fix audio codec type checks (#7373)

* [TTS] Fix audio codec type checks

Signed-off-by: Ryan <[email protected]>

* [TTS] Fix audio codec tests

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* [TTS] Add dataset to path of logged artifacts (#7462)

* [TTS] Add dataset to path of logged artifacts

Signed-off-by: Ryan <[email protected]>

* [TTS] Revert axis name back to Audio Frames

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Fix sft dataset truncation (#7464)

* Add fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Automatic Lip Reading Recognition (ALR) - ASR/CV (Visual ASR) (#7330)

* striding_conv1d_k5 and dw_striding_conv1d_k5 subsampling

Signed-off-by: mburchi <[email protected]>

* transpose conv1d inputs

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: mburchi <[email protected]>

* Update subsampling.py

change striding_conv1d_k5 to striding_conv1d

Signed-off-by: Maxime Burchi <[email protected]>

* cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* video manifest

Signed-off-by: mburchi <[email protected]>

* add collection classes

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add test_step_outputs

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* correct manifest bug when having only audio or only videos

Signed-off-by: mburchi <[email protected]>

* clean references

Signed-off-by: mburchi <[email protected]>

* freeze unfreeze transcribe cv models

Signed-off-by: mburchi <[email protected]>

* correct manifest get_full_path bug

Signed-off-by: mburchi <[email protected]>

* update for PR

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* guard torchvision

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>

* _video_speech_collate_fn in cv/data/video_to_text.py

Signed-off-by: mburchi <[email protected]>

* add self.out = None to asr subsampling

Signed-off-by: mburchi <[email protected]>

* Update nemo/collections/cv/data/video_to_text_dataset.py

Co-authored-by: Igor Gitman <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>

* cv -> multimodal/speech_cv branch

Signed-off-by: mburchi <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: mburchi <[email protected]>
Signed-off-by: Maxime Burchi <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Igor Gitman <[email protected]>

* HF StarCoder to NeMo conversion script (#7421)

* Script to convert HF StarCoder checkpoint to NeMo

Signed-off-by: Jan Lasek <[email protected]>

* StarCoder conversion test

Signed-off-by: Jan Lasek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: Jan Lasek <[email protected]>

* Fix test

Signed-off-by: Jan Lasek <[email protected]>

* Catch up with save_to changes

Signed-off-by: Jan Lasek <[email protected]>

* Don't abbreviate args for clarity

Signed-off-by: Jan Lasek <[email protected]>

* Configurable precision: BF16 vs FP32

Signed-off-by: Jan Lasek <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Jan Lasek <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix bug when loading dist ckpt in peft (#7452)

Signed-off-by: Hongbin Liu <[email protected]>
Co-authored-by: Hongbin Liu <[email protected]>

* Fix adding positional embeddings in-place in transformer module (#7440)

Signed-off-by: Tamerlan Tabolov <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Fix (#7478)

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* add sleep (#7498) (#7499)

* add sleep



* add sleep onto config instead



* add comment



---------

Signed-off-by: Gerald Shen <[email protected]>
Co-authored-by: Gerald Shen <[email protected]>

* Fix exp manager check for sleep (#7503) (#7504)

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* bugfix: trainer.accelerator=auto from None. (#7492) (#7493)

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* [doc] fix broken link (#7481)

Signed-off-by: Stas Bekman <[email protected]>

* [TTS] Read audio as int32 to avoid flac read errors (#7477)

* [TTS] Read audio as int32 to avoid flac read errors

Signed-off-by: Ryan <[email protected]>

* [TTS] Add comment about read failures

Signed-off-by: Ryan <[email protected]>

---------

Signed-off-by: Ryan <[email protected]>

* Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS (#7409)

* Add dataset 'AISHELL-3' from OpenSLR for training mandarin TTS
* Train 'AISHELL-3' dataset with multiple speakers

Signed-off-by: Robin Dong <[email protected]>

* Update get_data.py

update copyright header

Signed-off-by: Xuesong Yang <[email protected]>

* Update get_data.py

added a disclaimer

Signed-off-by: Xuesong Yang <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add new configuration file for AISHELL3 with multispeaker of fastpitch

Signed-off-by: Robin Dong <[email protected]>

---------

Signed-off-by: Robin Dong <[email protected]>
Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuesong Yang <[email protected]>

* dllogger - log on rank 0 only (#7513)

Signed-off-by: Stas Bekman <[email protected]>

* Fix TTS FastPitch tutorial (#7494) (#7516)

* Fix

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: Cheng-Ping Hsieh <[email protected]>

* Fix get_dist() tensor dimension (#7506) (#7515)

Signed-off-by: Jocelyn Huang <[email protected]>
Co-authored-by: Jocelyn <[email protected]>

* bugfix: specify trainer.strategy=auto when devices=1 (#7509) (#7512)

Signed-off-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>
Co-authored-by: Xuesong Yang <[email protected]>

* fix (#7511)

Signed-off-by: Abhinav Khattar <[email protected]>

* [TTS] Fix FastPitch data prep tutorial (#7524)

Signed-off-by: Ryan <[email protected]>

* add italian tokenization (#7486)

* add italian tokenization

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add more ipa lexicon it

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix error deletion

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* add test

Signed-off-by: GiacomoLeoneMaria <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: GiacomoLeoneMaria <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Replace None strategy with auto in tutorial notebooks (#7521) (#7527)

Signed-off-by: Abhishree <[email protected]>
Co-authored-by: Abhishree Thittenamane <[email protected]>

* unpin setuptools (#7534) (#7535)

Signed-off-by: fayejf <[email protected]>
Co-authored-by: fayejf <[email protected]>

* remove auto generated examples (#7510)

* explicitly remove autogenerated examples for data parallel evaluation

Signed-off-by: arendu <[email protected]>

* mark autogenerated and remove it for test

Signed-off-by: arendu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: arendu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Add the `strategy` argument to `MegatronGPTModel.generate()` (#7264)

It is passed as an explicit argument rather than through
`**strategy_args` so as to ensure someone cannot accidentally pass other
arguments that would end up being ignored.

It is a keyword-only argument to ensure that if in the future we want to
update the signature to `**strategy_args`, we can do it without breaking
code.

Signed-off-by: Olivier Delalleau <[email protected]>
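
An illustrative signature sketch (names other than `strategy` are assumptions): placing `strategy` after `*` makes it keyword-only, so call sites stay explicit and the signature can later be widened to `**strategy_args` without breaking existing callers.

```python
class MegatronGPTModelSketch:
    # Hypothetical stand-in for the real method; only the keyword-only pattern matters.
    def generate(self, inputs, length_params, *, strategy=None):
        ...


model = MegatronGPTModelSketch()
model.generate(["prompt"], {"max_length": 32}, strategy=None)   # OK: passed by name
# model.generate(["prompt"], {"max_length": 32}, None)          # TypeError: too many positional arguments
```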

* Fix PTL2.0 related ASR bugs in r1.21.0: Val metrics logging, None dataloader issue (#7531) (#7533)

* fix none dataloader issue ptl2



* ptl2.0 logging fixes for rnnt_models



---------

Signed-off-by: KunalDhawan <[email protected]>
Co-authored-by: Kunal Dhawan <[email protected]>
Co-authored-by: Nithin Rao <[email protected]>

* gpus -> devices (#7542) (#7545)

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: Nithin Rao <[email protected]>

* Update FFMPEG version to fix issue with torchaudio (#7551) (#7553)

Signed-off-by: smajumdar <[email protected]>
Co-authored-by: Somshubra Majumdar <[email protected]>

* PEFT GPT & T5 Refactor (#7308)

* initial implementation of add_adapters API

* correct type hint

* Add config in add_adapters for save and load (@author bobchen)

* Remove AdapterConfig to avoid import error

* Add AdapterConfig back and move adapter mixin to SFT model

* Add NLPSaveRestoreConnector as default in NLPModel.restore_from

* Add restore_from_nemo_with_adapter and test script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* rename t5 file and classes to be consistent with GPT

* add t5 sft dataset

* add support for single-file format with T5SFTDataset

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Various small changes to make T5 SFT work like GPT SFT

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add adapter evaluation test script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add MultiAdapterConfig for IA3 and fix builder issue

* Make ptuning for T5SFTModel work using mixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add IA3_Adapter for AdapterName

* Add adapter name for ptuning and attention adapter

* Make test script GPT/T5 agnostic

* Add layer selection feature

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Integrate adapter name and config

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update gpt peft tuning script to new API

* add t5 peft tuning script with new API

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix IA3 layer selection issue

* Override state_dict on SFT model instead of mixin

* Add load adapter by adapter config

* move peft config map away from example script

* auto get config from nemo adapter

* Move PEFTConfig to new file

* fix ckpt save/load for t5

* name change: add_adapters -> add_adapter

* variable name change

* update t5 script

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix t5 issues

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add weight tying

* update gpt tuning script

* PEFT-API proposal

* Fix according to comments

* update tuning scripts

* move merge_cfg_with to mixin class since it applies to both gpt and t5 and requires the model class for restore

* Add mcore_gpt support for NLPAdapterMixin

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix typo

* variable name change to distinguish "peft" and "adapter"

* override `load_adapters` to support `add_adapter` name change

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* update tuning and eval script for adapter save/load

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add Ptuning on first stage only

* add lora tutorial for review

* Fix layer selection for mcore

* add landing page

* fix resume training

Signed-off-by: jasonwan <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add mcore condition in sharded_state_dict to make sft work

* Update lora_tutorial.md

First edit of this file for PEFT documentation for NeMo

Signed-off-by: hkelly33 <[email protected]>

* rename Adapter to AttentionAdapter to avoid confusion in doc

* Change load_adapters to load .nemo

* add quick start guide

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add load_adapters with .ckpt

* Remove setup_complete changes in load_adapters

* update landing page

* remove typo

* Updated quick_start.md per Chen Cui

Signed-off-by: hkelly33 <[email protected]>

* Add inference config merger and tutorial

* Add docstring for NLPAdapterModelMixin and deprecation warning on MegatronGPTPEFTModel

* add suppor…