placeholder from Speechllm selene to main #13
Conversation
approve the nit change
…#2) approve as no need to review
… to guard the llm input
* Merge heh and zhehuai's initial version of frozen AM+LLM.
The previous differences are summarized here: https://docs.google.com/document/d/1zNI4hC6vJtUfcHbrUSPaMuYWRBQdN_36H0P2NiBiuPY/edit
This PR includes:
1. Finish merging the model, dataset, and config code.
2. Previous tests are still enabled and pass (prepare_llm_input, training_step, validation_step).
3. The example training script with LS960 has been run to make sure the training pipeline works.
The major remaining work is listed here: https://docs.google.com/document/d/1o0AM7v4gcTQkPZjE0Vl9TTX4vYnGTrbXEFGWh0UhGlk/edit#bookmark=id.pzvdadt5oxyw
Co-authored-by: He Huang (Steve) <[email protected]>
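For readers skimming the conversation, the "frozen AM + LLM" idea this commit merges can be sketched roughly as follows. The name prepare_llm_input comes from the tests mentioned above, but the splice layout and dimensions here are illustrative assumptions, not the PR's actual code:

```python
# Hypothetical sketch: audio-encoder (AM) outputs are spliced between
# prompt and answer token embeddings to form the LLM input sequence,
# so a small connector can be trained while the AM and LLM stay frozen.
# Plain lists stand in for tensors; sizes are illustrative.

def prepare_llm_input(prompt_emb, audio_emb, answer_emb):
    """Concatenate [prompt; audio; answer] along the sequence axis."""
    return prompt_emb + audio_emb + answer_emb

prompt = [[0.1, 0.2]] * 3    # 3 prompt tokens, embedding dim 2
audio = [[0.5, 0.5]] * 10    # 10 audio frames projected to the LLM dim
answer = [[0.9, 0.0]] * 4    # 4 answer tokens
seq = prepare_llm_input(prompt, audio, answer)
print(len(seq))  # -> 17
```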
Signed-off-by: zhehuaichen <[email protected]>
init them from the ckpt
Signed-off-by: zhehuaichen <[email protected]>
…into speechllm_selene_he
Signed-off-by: stevehuang52 <[email protected]>
add tarred datasets
Signed-off-by: zhehuaichen <[email protected]>
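For context on the "add tarred datasets" commit: packing samples into tar shards lets training read sequentially instead of opening millions of small files. A self-contained illustration of the general idea using only the standard library — the one-JSON-per-utterance layout is an assumption, not NeMo's actual shard format:

```python
import io
import json
import os
import tarfile
import tempfile

def write_shard(path, samples):
    """Pack a list of JSON-serializable samples into one tar shard."""
    with tarfile.open(path, "w") as tar:
        for i, sample in enumerate(samples):
            data = json.dumps(sample).encode()
            info = tarfile.TarInfo(name=f"utt_{i}.json")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))

def iter_shard(path):
    """Stream samples back out of a shard in archive order."""
    with tarfile.open(path, "r") as tar:
        for member in tar:
            yield json.loads(tar.extractfile(member).read())

shard = os.path.join(tempfile.mkdtemp(), "shard_0.tar")
write_shard(shard, [{"text": "hello"}, {"text": "world"}])
print([s["text"] for s in iter_shard(shard)])  # -> ['hello', 'world']
```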
…peechllm_selene
Signed-off-by: zhehuaichen <[email protected]>
fix bucketing dataset
Signed-off-by: zhehuaichen <[email protected]>
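On the "fix bucketing dataset" commit: bucketing groups utterances of similar duration so that batches need minimal padding. A minimal sketch of the general technique — the boundaries and grouping rule here are illustrative, not the actual fix:

```python
# Hypothetical duration bucketing: assign each utterance index to the
# bucket whose duration range contains it, so per-batch padding stays
# small. Boundaries are in seconds and purely illustrative.

def bucket_by_duration(durations, boundaries):
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for idx, dur in enumerate(durations):
        # Count how many boundaries this duration exceeds.
        b = sum(dur > edge for edge in boundaries)
        buckets[b].append(idx)
    return buckets

durs = [1.2, 9.8, 3.4, 2.9, 15.0]
print(bucket_by_duration(durs, boundaries=[3.0, 10.0]))
# -> [[0, 3], [1, 2], [4]]
```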
multitask understanding; also fix BLEU implementation
Signed-off-by: zhehuaichen <[email protected]>
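Since this commit mentions fixing a BLEU implementation, here is the textbook sentence-level formula (modified n-gram precision with a brevity penalty) as a reference point. This is a simplified sketch, not NeMo's code or the fix itself:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(count, r[g]) for g, count in h.items())
        total = max(sum(h.values()), 1)
        if match == 0:
            return 0.0
        log_precisions.append(math.log(match / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("the cat sat on the mat".split(),
                 "the cat sat on the mat".split()), 2))  # -> 1.0
```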
…models
Signed-off-by: zhehuaichen <[email protected]>
left some items to run down
…g update in training
Signed-off-by: zhehuaichen <[email protected]>
8c98989 to 3f5a351
@@ -541,46 +593,88 @@ def inference_epoch_end(self, outputs, mode, data_cfg):
         averaged_metric = 0.0 if monitor_mode == 'max' else 1e5

         if mode == 'validation':
-            self.log("validation_loss", averaged_loss)
+            self.log("validation_loss", averaged_loss, batch_size=1, sync_dist=True)
This may not be necessary; I added batch_size=1, sync_dist=True just because other places have them. It's fine to keep the original code.
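For context on the two flags being discussed: in PyTorch Lightning, sync_dist=True averages the logged scalar across distributed ranks before it is recorded, and batch_size weights the running epoch average. A rough pure-Python illustration of the reduction step only — Lightning's real implementation goes through torch.distributed, not this:

```python
def sync_dist_mean(per_rank_values):
    # Mimic what sync_dist=True does conceptually: average the logged
    # scalar across all ranks before recording it, so every rank logs
    # the same value instead of only its local one.
    return sum(per_rank_values) / len(per_rank_values)

# Two ranks logged different local validation losses.
print(sync_dist_mean([0.5, 0.75]))  # -> 0.625
```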
-    def on_test_epoch_end(self):
-        self.on_inference_epoch_end(self.cfg.data.test_ds)
-        return super().on_test_epoch_end()
+    # def on_test_epoch_end(self):
Commented out since these hooks are not actually called in PTL 1.8.
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information