Adding RETRO tests to Action Tests (cicd-main.yml) #8942

Merged Apr 17, 2024 (138 commits)
Commits
e91a66d
update branch
ericharper Jan 29, 2024
305ad9c
Add dist ckpt support for regular optimizers (#7749)
mikolajblaz Jan 31, 2024
40da002
Pin lhotse=1.19.2 in r1.23.0 (#8303)
pzelasko Feb 1, 2024
d3bad4b
Cache Aware Streaming tutorial notebook (#8296)
erastorgueva-nv Feb 1, 2024
17f09e4
fix path location and branch (#8304)
nithinraok Feb 2, 2024
991dad9
add deallocate pipeline output optimization (#8279)
JimmyZhang12 Feb 2, 2024
e9320ed
Fix memory leak caused by context parallelism hanging references by o…
JimmyZhang12 Feb 2, 2024
8b18cfc
remove assertion (#8302)
dimapihtar Feb 2, 2024
d9f1409
Update PEFT Doc (#8262)
cuichenx Feb 3, 2024
a592517
Attention encoder-decoder models for multiple speech-to-text tasks …
titu1994 Feb 3, 2024
3ef5513
add code for calling mcore_retro in NeMo
huvunvidia Nov 27, 2023
6218b8c
add code for calling mcore_retro in NeMo
huvunvidia Nov 27, 2023
c5907ac
runnable, training curve match retro mcore and nemo
huvunvidia Dec 15, 2023
5f10619
working on retro inference
huvunvidia Jan 10, 2024
ecc061e
working on megatron_retro_eval.py and megatron_retro_inference.yaml
huvunvidia Jan 10, 2024
c1e99d3
refactoring text_generation_utils code and retro inference relevant f…
Jan 20, 2024
2bf5d2c
clean PR
Jan 26, 2024
db6ffe3
resolving quick hacks (reading number of train/valid samples from wor…
Jan 30, 2024
1d1021c
clean repository
Jan 31, 2024
9b3cd36
revert changes to inference/eval code to original in main
Jan 31, 2024
31834d5
clean code
Jan 31, 2024
43c4af9
runable training code, with already implemented eval code
Jan 31, 2024
45ea217
[tutorial] fixed missing RIR scripts file. (#8257)
XuesongYang Jan 29, 2024
186a369
add values to en tts dict (#7879)
mgrafu Jan 30, 2024
7aea8c9
Add Bert HF checkpoint converter (#8088)
yaoyu-33 Jan 31, 2024
11787e7
revert to original eval code files
Jan 31, 2024
a1faebf
revert to original eval code files 2
Jan 31, 2024
78070f5
revert to original eval code files 3
Jan 31, 2024
7f2a889
revert to original eval code files 4
Jan 31, 2024
c0e4ea2
clean code
Feb 1, 2024
5dbee42
clean code
Feb 1, 2024
a9fb106
update my code to support changes from lastest main
Feb 2, 2024
769605c
commit before rebase r1.23.0
Feb 6, 2024
c3c766e
Multimodal r1.23.0 bug fix (#8315)
yaoyu-33 Feb 6, 2024
53bac6e
copy paste files from r1.23.0
Feb 6, 2024
9830475
clean PR
Feb 6, 2024
1434979
Fixes for MoE parameter passing & use of AutoTokenizer/Model for mist…
akoumpa Feb 6, 2024
ec8f413
Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP…
erhoo82 Feb 6, 2024
50864db
Remove asr webapp (#8347)
titu1994 Feb 6, 2024
498e9e4
remove _target_ at model level in aed config (#8351)
krishnacpuvvada Feb 6, 2024
c4f38cd
revert changes for tts and asr
Feb 7, 2024
2f72846
Add change_vocabulary and save_tokenizers() support to Multitask ASR …
titu1994 Feb 7, 2024
931c53c
Change default (#8371)
titu1994 Feb 8, 2024
7c75022
implement retro's own fwd_bwd_step() and validation_step() to not hav…
Feb 8, 2024
40bb4a2
adding megatron compile_helpers(), in future can be fixed with correc…
Feb 9, 2024
0e13348
bug fix in fast-conformer-aed.yaml and adding jenkins test for speech…
krishnacpuvvada Feb 9, 2024
138a7ab
Enable megatron core loggers for GPT pretraining (#8354)
ashbhandare Feb 9, 2024
4ee9c58
mcore ds fix (#8283)
dimapihtar Feb 9, 2024
de96b6e
addressing Eric's reviews
Feb 9, 2024
0e806b9
adding existing implementation RETRO files
Feb 9, 2024
09d9ce2
adding existing implementation RETRO files
Feb 9, 2024
02ec761
Add Finetuning tutorial with HF Datasets (#8356)
nithinraok Feb 9, 2024
88d7b21
release updates (#8378)
dimapihtar Feb 9, 2024
400c4a1
MCore dataset compatibility for tokenizers (#8390)
vysarge Feb 11, 2024
3112091
Mcore customization doc (#8298)
HuiyingLi Feb 12, 2024
68eba36
wer fix (#8404)
tbartley94 Feb 12, 2024
5b8f18c
updated link to pubmed (#8402)
nithinraok Feb 13, 2024
0f7b49b
Update NFA video download link (#8406)
erastorgueva-nv Feb 13, 2024
f897a77
revert changes (#8410)
cuichenx Feb 13, 2024
371de5b
Fix dreambooth data sampler issue (#8400)
yaoyu-33 Feb 13, 2024
98186c2
Fixed errors in the CTM gen functions (#8416)
tango4j Feb 14, 2024
8689bc0
add ensemble decoding fix (#8427)
nithinraok Feb 15, 2024
770f73b
SDE bugfix log (#8430)
Jorjeous Feb 15, 2024
05122bd
mcore customization doc minor fix (#8421)
HuiyingLi Feb 16, 2024
2e77f20
NeMo-Mistral to HF converter bugfix. (#8353)
akoumpa Feb 16, 2024
9588494
Fixing mcore bert for TP, PP and SP (#8336)
shanmugamr1992 Feb 16, 2024
71ce00c
Add settings to suppress bf16 compile errors in CI on V100 (#8481)
athitten Feb 22, 2024
c98b9c1
MoE parameter passing (#8255)
akoumpa Feb 23, 2024
a836fce
Update k2 version (#8478) (#8492)
artbataev Feb 23, 2024
0dc8a19
Add fp8 support for SD/Update notebook paths (#8489)
Victor49152 Feb 25, 2024
1d80d00
pin to 0.5.0 (#8465)
ericharper Feb 26, 2024
fcf1044
Update NeMo Multimodal Requirements (#8515)
yaoyu-33 Feb 26, 2024
d2283e3
update github raw content link (#8517)
cuichenx Feb 26, 2024
e6b7354
Add dep notice for notebooks (#8522)
ericharper Feb 27, 2024
ae9a2aa
Revert FP8 integration (#8520)
Victor49152 Feb 27, 2024
e772dbf
Update data prep notebook (#8532)
Victor49152 Feb 27, 2024
21984a1
before update branch with latest r1.23.0
Mar 4, 2024
52ee601
Merge remote-tracking branch 'origin/r1.23.0' into huvu/mcore_retro
Mar 4, 2024
b5d8aec
update to run with MLM ae2817b3dde4efb1515061a5311d01d8f85bd99c (runn…
Mar 5, 2024
4199bc7
remove compile_helpers
Mar 5, 2024
0ff5673
reverse changes from main branch to r1.23.0
Mar 5, 2024
74994c2
adding *_legacy files
Mar 5, 2024
061b632
update MLM commit in Jenkinsfile to latest
Mar 6, 2024
41b0178
debugging Jenkinstest: test different mcore import in retro_dataset
Mar 6, 2024
f9c3293
update Jenkinsfile edit megatron_retro_mutransfer_pretrain_legacy.py
Mar 7, 2024
88ef4d4
removing all mcore RETRO to pass the Jenkinstest
Mar 7, 2024
251762b
fixing import legacy problem for tests/collections/nlp/test_indexed_r…
Mar 7, 2024
8a6452d
update Jenkinsfile file to use TE v0.7
Mar 7, 2024
12263c0
update NeMo to work with latest mcore RETRO (solving TE problems)
Mar 20, 2024
188dd43
update TE commit Jenkinsfile to be the same with r1.23.0's Jenkinsfile
Mar 20, 2024
26c44a2
update commit for MLM
Mar 20, 2024
9068890
jenkinstest debugging
Mar 20, 2024
adff1f8
temporary fix RETRO's __init__ for jenkinstest
Mar 21, 2024
fec1852
edit splits_string in jenkinsfile to correct format; put RETRO test i…
Mar 21, 2024
ab4c6c0
edit splits_string in jenkinsfile to correct format; put RETRO test i…
Mar 21, 2024
5cedf92
edit splits_string in jenkinsfile to correct format; put RETRO test i…
Mar 21, 2024
fadd10b
edit splits_string in jenkinsfile to correct format; put RETRO test i…
Mar 21, 2024
08d2d73
add model.data.dataloader_type=cyclic to jenkinsfile
Mar 22, 2024
92ade73
update code to work with latest megatron-lm main 81dab6067
Apr 4, 2024
4f547fb
update M-LM commit in Jenkinsfile to latest main M-LM 81dab6067
Apr 4, 2024
e8a83ed
fix to by pass CI test bf16 problem (following this PR https://github…
Apr 8, 2024
b33d8a0
isort and black
Apr 8, 2024
2402e66
adjusting model.micro_batch_size to 1
Apr 8, 2024
d050eb9
fix conflicts
Apr 9, 2024
7b13e95
fix BRANCH = 'r1.23.0'
Apr 9, 2024
35dc730
replace tutorials dir from main branch to huvu/mcore_retro
Apr 9, 2024
003b4b3
fix minor merges conflict
Apr 9, 2024
964f366
update Jenkinsfile
Apr 9, 2024
0787a20
runnable with a temporary fix from Jacek (unfound -unfinished problem)
Apr 10, 2024
39b2d76
runnable with a temporary fix from Jacek (unfound -unfinished problem)
Apr 10, 2024
17beaf8
merged from main on 10apr
Apr 10, 2024
19bfae0
modified nlp_overrides.py back to original
Apr 10, 2024
1ea089d
fix checkpoint from Jacek Bieniusiewicz
Apr 10, 2024
ac3e2b9
config Jenkinsfile test
Apr 10, 2024
a522627
set RETRO Jenkins MBS to 1
Apr 11, 2024
72ff280
black fix
Apr 11, 2024
4fbb7b8
isort fix
Apr 11, 2024
62f6d7e
update TE commit
Apr 11, 2024
41de539
update to latest Jenkinsfile with latest container and commits
Apr 11, 2024
32ab269
remove new RETRO jenkinstest
Apr 11, 2024
04844cd
Merge remote-tracking branch 'origin/main' into huvu/mcore_retro
Apr 11, 2024
637448c
merge latest main
Apr 11, 2024
6022d66
put RETRO Jenkinstest to the right place
Apr 11, 2024
a4952cc
update code for megatron_retro_pretraining_legacy.py
Apr 11, 2024
e85a849
untrack ipa_cmudict-0.7b_nv23.01.txt
Apr 11, 2024
cc0d537
untrack ipa_cmudict-0.7b_nv23.01.txt
Apr 11, 2024
974aa91
set config in megatron_retro_pretraining_legacy.py to megatron_retro_…
Apr 12, 2024
a0e821a
update new RETRO jenkinstest to run faster
Apr 12, 2024
18e8cef
Merge remote-tracking branch 'origin/main' into huvu/mcore_retro
Apr 14, 2024
1928227
merging latest main, and edit Jenkinstest
Apr 14, 2024
4570557
Merge remote-tracking branch 'origin/main' into huvu/mcore_retro
Apr 16, 2024
d14c327
update Jenkinstest for new RETRO to run faster
Apr 16, 2024
abfebe2
fix isort
Apr 16, 2024
9c20ae8
adding RETRO tests to cicd-main.yml action tests
Apr 16, 2024
4e60aa7
Merge remote-tracking branch 'origin/main' into huvu/mcore_retro
Apr 16, 2024
ac5292e
update ipa_cmudict-0.7b_nv23.01.txt
Apr 16, 2024
dfea71e
remove quotes for model.data for legacy RETRO action tests
Apr 17, 2024
a03c328
Merge branch 'main' into huvu/mcore_retro
pablo-garay Apr 17, 2024

77 changes: 73 additions & 4 deletions .github/workflows/cicd-main.yml
@@ -3690,6 +3690,75 @@ jobs:
uses: actions/checkout@v2
- run: |
python examples/nlp/language_modeling/megatron_retro_pretraining.py \
trainer.num_nodes=1 \
trainer.devices=2 \
trainer.precision=bf16 \
trainer.accelerator=gpu \
model.data.data_prefix=['none'] \
exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
model.mcore_gpt=True \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.optim.name=distributed_fused_adam \
model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
model.data.num_workers=4 \
model.micro_batch_size=1 \
model.data.shuffle_documents=False \
trainer.val_check_interval=30 \
+trainer.num_sanity_val_steps=0 \
model.init_method_std=0.023 \
model.optim.lr=6.0e-4 \
model.megatron_amp_O2=True \
model.data.splits_string=\'\"98,2,0\"\' \
model.data.dataloader_type=cyclic \
trainer.max_steps=10

python examples/nlp/language_modeling/megatron_retro_pretraining.py \
trainer.num_nodes=1 \
trainer.devices=2 \
trainer.precision=bf16 \
trainer.accelerator=gpu \
model.data.data_prefix=['none'] \
exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
model.mcore_gpt=True \
model.tensor_model_parallel_size=1 \
model.pipeline_model_parallel_size=1 \
model.optim.name=distributed_fused_adam \
model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
model.data.num_workers=4 \
model.micro_batch_size=1 \
model.data.shuffle_documents=False \
trainer.val_check_interval=30 \
+trainer.num_sanity_val_steps=0 \
model.init_method_std=0.023 \
model.optim.lr=6.0e-4 \
model.megatron_amp_O2=True \
model.data.splits_string=\'\"98,2,0\"\' \
model.data.dataloader_type=cyclic \
trainer.max_steps=20

rm -rf examples/nlp/language_modeling/mcore_retro_results
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"
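
The job above follows a train-then-resume pattern: the same script runs twice against the same `exp_manager.exp_dir`, first with `trainer.max_steps=10` and then with `trainer.max_steps=20`, and the results directory is removed afterwards. A minimal shell sketch of that pattern follows. It only prints the commands rather than launching training, the `retro_cmd` helper is hypothetical (it is not part of the workflow), and only a subset of the overrides is shown. That the second run picks up the first run's checkpoint is an assumption based on the job name; it depends on the resume settings in the script's config.

```shell
#!/bin/sh
set -eu

SCRIPT=examples/nlp/language_modeling/megatron_retro_pretraining.py
EXP_DIR=examples/nlp/language_modeling/mcore_retro_results

# Hypothetical helper (not in the workflow): prints the command line
# instead of launching training, so the sketch runs without NeMo or GPUs.
retro_cmd() {
  # $1 = value for trainer.max_steps; every other override is shared.
  printf '%s ' python "$SCRIPT" \
    trainer.devices=2 \
    trainer.precision=bf16 \
    "exp_manager.exp_dir=$EXP_DIR" \
    model.micro_batch_size=1 \
    model.data.splits_string=\'\"98,2,0\"\' \
    model.data.dataloader_type=cyclic \
    "trainer.max_steps=$1"
  printf '\n'
}

retro_cmd 10               # first run: train for 10 steps
retro_cmd 20               # second run: same exp_dir, higher step limit
echo "rm -rf $EXP_DIR"     # cleanup, mirroring the last line of the run block
```

After shell unescaping, the `splits_string` override reaches the script as `model.data.splits_string='"98,2,0"'`, so both layers of quotes survive the shell.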

L2_Legacy_Megatron_RETRO_Pretraining_and_Resume_Training:
needs: [cicd-test-container-setup]
runs-on: self-hosted-azure
container:
image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
options:
# --user 0:128
--device=/dev/nvidia0
--gpus all
--shm-size=8g
--env TRANSFORMERS_OFFLINE=0
--env HYDRA_FULL_ERROR=1
--volume /mnt/datadrive/TestData:/home/TestData
steps:
- name: Checkout repository
uses: actions/checkout@v2
- run: |
python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
trainer.devices=2 \
trainer.num_nodes=1 \
trainer.accelerator=gpu \
@@ -3700,7 +3769,7 @@ jobs:
trainer.precision=16 \
trainer.gradient_clip_val=1.0 \
trainer.val_check_interval=10 \
- exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
+ exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
model.data.data_prefix= \
model.data.knn_index= \
model.data.retrieval_prefix= \
@@ -3720,7 +3789,7 @@ jobs:
model.dec_cross_attention=[1] \
+model.data.mock=True

- python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+ python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
trainer.devices=2 \
trainer.num_nodes=1 \
trainer.accelerator=gpu \
@@ -3731,7 +3800,7 @@ jobs:
trainer.precision=16 \
trainer.gradient_clip_val=1.0 \
trainer.val_check_interval=10 \
- exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
+ exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
model.data.data_prefix= \
model.data.knn_index= \
model.data.retrieval_prefix= \
@@ -3751,7 +3820,7 @@ jobs:
model.dec_cross_attention=[1] \
+model.data.mock=True

- rm -rf examples/nlp/language_modeling/retro_results
+ rm -rf examples/nlp/language_modeling/retro_legacy_results
- uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
if: "failure()"
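
The legacy job's rename touches three places that have to stay in sync: the `exp_manager.exp_dir` of both runs and the final `rm -rf` target. A rough way to check that, assuming the run block is available as plain text, is sketched below. The `check_cleanup` function is hypothetical and the run block is abbreviated to the lines that matter; it is not part of this PR.

```shell
#!/bin/sh
set -eu

# Abbreviated copy of the legacy job's run block after this change.
run_block='python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
rm -rf examples/nlp/language_modeling/retro_legacy_results'

# Hypothetical check: prints "ok" when every exp_dir in the block equals
# the directory removed by rm -rf, and "mismatch" otherwise.
check_cleanup() {
  dirs=$(printf '%s\n' "$1" \
    | sed -n 's/^exp_manager\.exp_dir=\([^ ]*\).*/\1/p' | sort -u)
  target=$(printf '%s\n' "$1" | sed -n 's/^rm -rf \(.*\)$/\1/p')
  if [ "$dirs" = "$target" ]; then echo ok; else echo mismatch; fi
}

check_cleanup "$run_block"
```

Keeping the legacy directory (`retro_legacy_results`) distinct from the one used by the new job (`mcore_retro_results`) means neither job's cleanup step can delete the other's output.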
