Upgrade to pytorch lightning 2.0 #6433

Merged
merged 90 commits on Aug 5, 2023

Changes from all commits
Commits (90)
bca054e
Upgrade pytorch lightning version in requirements
athitten Apr 16, 2023
ae35807
Initial fixes for PTL2.0
athitten Apr 17, 2023
2d39e48
Add further fixes to support lightning 2.0
athitten Apr 18, 2023
47c1c87
Add replacements for replace_sampler_ddp, resume_from_checkpoint_fit_…
athitten Apr 20, 2023
5e35bcc
Replace all occurances of validation_epoch_end to on_validation_epoch…
athitten Apr 20, 2023
5d6747d
Replace training_epoch_end, test_epoch_end with on_train_epoch_end an…
athitten Apr 22, 2023
c2e3e76
Change logger=None to logger=False in Trainer object
athitten Apr 24, 2023
2a4274a
Remove PTL2.0 deprecated Trainer args from TrainerConfig dataclass
athitten Apr 25, 2023
e1c71a6
Modify trainer.precision check and other small edits
athitten May 1, 2023
c206bb0
Replace logger=None with logger=False in test_ptl_stateless_timer.py …
athitten May 23, 2023
1d15c94
Add default values for args to fix Attribute Error
athitten Jun 1, 2023
57d8af7
Add the following modifications
athitten Jun 13, 2023
2cac147
Remove outputs arg from on_validation_epoch_end, on_test_epoch_end
athitten Jun 14, 2023
cfb9c70
Remove outputs arg in on_validation_epoch_end in MultiBinaryAccuracy …
athitten Jun 14, 2023
9e0f6c5
Add val, test outputs as instance vars in PunctuationCapitalizationMo…
athitten Jun 15, 2023
5c6ae50
Replace trainer.fit_loop.max_steps with trainer.fit_loop.epoch_loop.m…
athitten Jun 20, 2023
6974037
Revert an extra space that was mistakenly added
athitten Jun 20, 2023
ed1af3d
Use self.validation_step_outputs and self.test_step_outputs in test_e…
athitten Jun 20, 2023
ea2fd23
Use self.validation_step_outputs and self.test_step_outputs in test_p…
athitten Jun 21, 2023
c334149
Add self.validation_step_outputs.clear() and self.test_step_outputs.c…
athitten Jun 21, 2023
786e962
Remove outputs arg from on_train_epoch_end
athitten Jun 21, 2023
74147b7
Remove outputs from on_validation_epoch_end in multi_binary_acc.py
athitten Jun 21, 2023
792234e
Remove output args from on_validation_epoch_end in the docstrings of …
athitten Jun 22, 2023
a7a9c3f
Remove output args from on_validation_epoch_end and clear memory from…
athitten Jun 22, 2023
776fd7d
Add on_validation_epoch_end and remove outputs args for nlp models
athitten Jun 23, 2023
41e5dda
Append output of validation_step to validation_step_outputs in EncDec…
athitten Jun 28, 2023
3276dd5
Add the following changes
athitten Jun 30, 2023
ec4fb5d
Add default value dataloader_idx=0 for on_validation_batch_end() in m…
athitten Jul 1, 2023
b635c16
TypeCast precision to str in attention.py and utils_funcs.py to avoid…
athitten Jul 2, 2023
11a6e13
Add if condition check for multiple dataloaders when appending to val…
athitten Jul 2, 2023
876dba9
Separate validation pass to be used with both validation_step and tes…
athitten Jul 2, 2023
7a0ed39
Add if condition check for multiple dataloader while appending to tes…
athitten Jul 2, 2023
4a35223
Add condition check for multiple dataloaders based on type of trainer…
athitten Jul 4, 2023
d368515
Comment Megatron T5 IA3 PP=2 in CI pipeline due to dataloader_iter is…
athitten Jul 5, 2023
c8d28ff
Modify precision checks to account for 16-mixed and bf16-mixed
athitten Jul 5, 2023
24e77f3
Append output of validation/test_step to self.validation/test_step_ou…
athitten Jul 6, 2023
81e3868
Modify find_unused_parameters=True in g2p_heteronym model
athitten Jul 6, 2023
ae04106
Remove outputs from on_test_epoch_end in DialogueGPTClassificationModel
athitten Jul 6, 2023
5765625
Add validation/test outputs in sgdqa_model and modify dialogue_config…
athitten Jul 7, 2023
0edd455
Add split arg self.test_step_outputs to TextClassificationModel
athitten Jul 7, 2023
a73c0f7
Add test_step_outputs to dialogue and text classification models
athitten Jul 9, 2023
0754dc9
Change condition check for multiple dataloaders:
athitten Jul 11, 2023
7d0396b
Add additional condition for multi dataloaders
athitten Jul 11, 2023
b800f93
Add val step outputs and default val for dataloader_idx
athitten Jul 11, 2023
6617008
Add val/test_step_outputs to S2SQAModel and GPTQAModel
athitten Jul 11, 2023
03a8fd9
Edit JenkinsFile for bert_pretrainig.py
athitten Jul 11, 2023
f99433a
Modify precision to support 16-mixed, bf16-mixed in megatron_gpt_pret…
athitten Jul 11, 2023
52d43bd
Add ddp_find_unused_parameters_true and remove output args
athitten Jul 12, 2023
be6875a
Precision fix in megatron_nmt_training.py for 16-mixed, bf16-mixed
athitten Jul 12, 2023
f519c8a
Precision fix for megatron_bert_pretraining.py and megatron_bert_mode…
athitten Jul 12, 2023
8988355
Precision fix and validation/test_step_outputs
athitten Jul 12, 2023
00e8150
Precision fix and skip few failing tests
athitten Jul 12, 2023
e0dfba2
Add missing comment lines in JenkinsFile
athitten Jul 12, 2023
104c48a
Comment jenkin tests and super().on_validation_epoch_end() in megatro…
athitten Jul 13, 2023
828df4f
Minor edit JenkinsFile
athitten Jul 13, 2023
0f0712b
Minor edit in jenkins file
athitten Jul 13, 2023
76a9a59
Edit in Jenkins file
athitten Jul 13, 2023
cc14249
Comment missed lines in Jenkins file
athitten Jul 13, 2023
c0963c4
Fix precision and validation/test outputs
athitten Jul 13, 2023
16fe164
Fix precision and validation/test/predict errors in megatron_t5_promp…
athitten Jul 13, 2023
f8fe51f
Precision fix and edit precision typo in all files
athitten Jul 13, 2023
6c6f875
Fix all CI TTS tests and comment few Jenkins tests
athitten Jul 18, 2023
0480d9a
Combine xx_epoch_end and on_xx_epoch_end
athitten Jul 20, 2023
697571d
Add a missing comment in JenkinsFile
athitten Jul 20, 2023
0ab3c3a
Add try except StopIteration in validation_step for models with datal…
athitten Jul 22, 2023
3bdeb93
Remove pyyaml from requirements
athitten Jul 24, 2023
363fc7e
Add try except for inference_step in megatron_finetune_model.py
athitten Jul 24, 2023
08c874d
Remove limit_val_batches for mockGPTDataset test
athitten Jul 24, 2023
109402b
Add new self.validation_step_outputs for MegatronGPTSFTModel
athitten Jul 25, 2023
70d6e3e
Minor edit Jenkinsfile
athitten Jul 25, 2023
f5d4307
Initialize self.validation/test_step_outputs in megatron_gpt_sft_mode…
athitten Jul 25, 2023
c5d42a6
Remove resume_from_checkpoint if trainer arg in conf yaml files
athitten Jul 26, 2023
4cf8704
Remove resume_from_checkpoint as trainer arg in GPT, T5 configs
athitten Jul 26, 2023
df45646
Remove resume_from_checkpoint in duplex_tn_config.yaml
athitten Jul 26, 2023
a5ac49d
Fix typos, unused imports and refactor code to remove redundant funcs
athitten Jul 27, 2023
52f67c4
Remove commented code in megatron_nmt_model.py
athitten Jul 27, 2023
cbfd81f
Fix overriden functions to match parent class functions
athitten Jul 27, 2023
42495b2
Prefetch dataloader_iter to prevent hang for PP>1
athitten Jul 31, 2023
fd177b2
Override setup() in NLPDDPStrategy to avoid hang during predict with …
athitten Aug 3, 2023
0e0f9f3
Uncomment tests in JenkinsFile
athitten Aug 3, 2023
22b8cfd
Add '16' to precision checks and other minor fixes
athitten Aug 3, 2023
39fa73a
Clear validation/test_step_outputs with dataloader_idx for multi data…
athitten Aug 4, 2023
ef78022
Minor edits
athitten Aug 4, 2023
7926361
Modify precision checks to avoid indexing
athitten Aug 4, 2023
538b733
Remove self.validation_step_outputs_sft and add dataloader_idx to cle…
athitten Aug 4, 2023
55353a1
Reference checkpoint with trainer.ckpt_path
athitten Aug 5, 2023
cbec318
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 5, 2023
b5b2abf
Add _prefetch to NLPModel and minor fixes
athitten Aug 5, 2023
5a60f11
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 5, 2023
c8dab2d
Add limit_val_batches in JenkinsFile for NMT
athitten Aug 5, 2023
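
Taken together, many of these commits apply the same PyTorch Lightning 2.0 migration pattern: the `validation_epoch_end(outputs)` / `test_epoch_end(outputs)` hooks are replaced by `on_validation_epoch_end()` / `on_test_epoch_end()` (which take no `outputs` argument), so each step appends its result to an instance-level list that is cleared at the end of the epoch. A minimal sketch of that pattern on a hypothetical toy module (not any NeMo class touched by this PR):

```python
import torch
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)
        # PTL 2.0: epoch-end hooks no longer receive `outputs`,
        # so step results are accumulated on the module itself.
        self.validation_step_outputs = []

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.validation_step_outputs.append(loss)
        return loss

    # PTL 1.x equivalent was: def validation_epoch_end(self, outputs): ...
    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log("val_loss", avg_loss)
        self.validation_step_outputs.clear()  # free memory between epochs

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

For models with multiple validation dataloaders, the same idea applies with one list per `dataloader_idx`, which is what the "if condition check for multiple dataloaders" commits above implement. The Trainer-side renames in 2.0 (`replace_sampler_ddp` to `use_distributed_sampler`, `resume_from_checkpoint` to `ckpt_path` in `fit()`, `strategy=null` to `strategy=auto`) account for most of the remaining config changes in this PR.
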
63 changes: 30 additions & 33 deletions Jenkinsfile
@@ -2234,7 +2234,10 @@ pipeline {
trainer.devices=[1] \
trainer.accelerator="gpu" \
trainer.precision=16 \
+trainer.fast_dev_run=true \
+trainer.fast_dev_run=false \
+trainer.max_epochs=1 \
+trainer.limit_val_batches=0 \
+trainer.limit_train_batches=1 \
model.train_ds.data_file=/home/TestData/nlp/wiki_book_mini/training \
model.train_ds.batch_size=8 \
model.language_model.lm_checkpoint=/home/TestData/nlp/bert_ckpts/nemo1.0/bert_base_uncased_mlm_final_1074591_nemo1.0.pt \
@@ -2626,7 +2629,6 @@ pipeline {
sh "rm -rf examples/nlp/machine_translation/megatron_nmt_results"
}
}

// stage('L2: NMT Bottleneck Fallback') {
// when {
// anyOf {
@@ -3202,7 +3204,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.accelerator=gpu \
trainer.log_every_n_steps=1 \
trainer.val_check_interval=2 \
trainer.limit_val_batches=1 \
Collaborator:
Is there a reason for removing this line? Removing it can significantly increase the time for this CI test.

Collaborator Author:
@aklife97 thanks, adding it back in. During the migration to 2.0 there was an issue using this, hence it was removed temporarily; I missed adding it back.

Collaborator Author (@athitten, Aug 4, 2023):
@aklife97 sorry, my bad. Actually this trainer.limit_val_batches=1 needs to be removed or set to 2 (which is global batch size / micro batch size). Otherwise, there's an error with the dataloader_iter.

Collaborator:
Setting it to 2 sounds good; can we also change the other tests that remove it?
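
For context on the value 2 agreed on above, a tiny sketch of the arithmetic described in the comment (the batch sizes here are illustrative assumptions, not values read from this CI config):

```python
# Illustrative values only; the comment above states the target ratio is 2.
global_batch_size = 8
micro_batch_size = 4

# limit_val_batches counts dataloader batches (micro-batches), while each
# Megatron-style validation step presumably consumes a full global batch,
# so the limit is set to global/micro rather than 1 to avoid the
# dataloader_iter error mentioned above.
limit_val_batches = global_batch_size // micro_batch_size
assert limit_val_batches == 2
```
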

trainer.limit_val_batches=2 \
trainer.accumulate_grad_batches=1 \
trainer.max_steps=6 \
trainer.precision=16 \
@@ -3319,10 +3321,10 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
//model.activations_checkpoint_num_layers=1 \
//model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \
//model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings"
sh "rm -rf examples/nlp/language_modeling/gpt_pretrain_results"
sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings"
}
}
sh "rm -rf examples/nlp/language_modeling/gpt_pretrain_results"
sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings"
}
}
stage('L2: Megatron GPT with Rope Pretraining using Flash Attention and Resume Training TP=2') {
when {
anyOf {
@@ -3578,8 +3580,8 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
//model.activations_checkpoint_num_layers=1 \
//model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \
//model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings"
sh "rm -rf examples/nlp/language_modeling/gpt_pretrain_results"
sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings"
//sh "rm -rf examples/nlp/language_modeling/gpt_pretrain_results"
//sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings"
}
}
stage('L2: Megatron GPT Pretraining and Resume Training PP=2') {
@@ -3666,6 +3668,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
sh "rm -rf examples/nlp/language_modeling/gpt_index_mappings"
}
}
// @athitten Remove /home/TestData/nlp/megatron_sft/trec.jsonl for validation and test file until we have support for multiple dataloaders in lightning 2.0
stage('L2: Megatron GPT Finetuning PP=2') {
when {
anyOf {
@@ -3696,13 +3699,13 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
model.data.train_ds.num_workers=0 \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.global_batch_size=4 \
model.data.test_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl,/home/TestData/nlp/megatron_sft/trec.jsonl] \
model.data.test_ds.names=[quarel,trec] \
model.data.test_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl] \
model.data.test_ds.names=[quarel] \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=4 \
model.data.validation_ds.num_workers=0 \
model.data.validation_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl,/home/TestData/nlp/megatron_sft/trec.jsonl] \
model.data.validation_ds.names=[quarel,trec]"
model.data.validation_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl] \
model.data.validation_ds.names=[quarel]"
sh "python examples/nlp/language_modeling/tuning/megatron_gpt_sft.py \
trainer.devices=2 \
trainer.log_every_n_steps=1 \
@@ -3724,13 +3727,13 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
model.data.train_ds.num_workers=0 \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.global_batch_size=4 \
model.data.test_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl,/home/TestData/nlp/megatron_sft/trec.jsonl] \
model.data.test_ds.names=[quarel,trec] \
model.data.test_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl] \
model.data.test_ds.names=[quarel] \
model.data.validation_ds.micro_batch_size=1 \
model.data.validation_ds.global_batch_size=4 \
model.data.validation_ds.num_workers=0 \
model.data.validation_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl,/home/TestData/nlp/megatron_sft/trec.jsonl] \
model.data.validation_ds.names=[quarel,trec]"
model.data.validation_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl] \
model.data.validation_ds.names=[quarel]"
sh "rm -rf examples/nlp/language_modeling/gpt_sft_results"
}
}
@@ -3912,7 +3915,6 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
// }
// }
//}

stage('L2: Megatron GPT Prompt Tuning TP2 PP1') {
when {
anyOf {
@@ -3955,7 +3957,6 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
}
}
}

stage('L2: Megatron GPT Prompt Tuning TP1 PP2') {
when {
anyOf {
@@ -3995,10 +3996,10 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
data_paths=['/home/TestData/nlp/prompt_learning/boolq_CI_test.jsonl']"
sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp.nemo"
sh "rm -rf /home/TestData/nlp/prompt_learning/p_tuning_test_pp_preds.txt"
}
}
}
}
}
}
}
}

// TODO: Add this test back. Test was failing on CI machines due to HW error
// stage('L2: Megatron GPT Convert from Megatron-LM checkpoing and Eval') {
@@ -4608,7 +4609,6 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
// }
// }
// }

stage('L2: Megatron UL2 Pretraining and Resume Training TP=2') {
when {
anyOf {
@@ -4748,7 +4748,6 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.accelerator=gpu \
trainer.log_every_n_steps=1 \
trainer.val_check_interval=2 \
trainer.limit_val_batches=1 \
trainer.accumulate_grad_batches=1 \
trainer.max_steps=6 \
trainer.precision=16 \
@@ -4934,7 +4933,6 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
steps {
sh "python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
trainer.max_steps=10 \
trainer.limit_val_batches=1 \
trainer.val_check_interval=10 \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
model.data.data_impl=mock \
@@ -4947,7 +4945,6 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
steps {
sh "python examples/nlp/language_modeling/megatron_t5_pretraining.py \
trainer.max_steps=10 \
trainer.limit_val_batches=1 \
trainer.val_check_interval=10 \
exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \
model.data.data_impl=mock \
@@ -4974,7 +4971,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.devices=[0] \
trainer.accelerator="gpu" \
+trainer.limit_train_batches=1 +trainer.limit_val_batches=1 trainer.max_epochs=1 \
trainer.strategy=null \
trainer.strategy=auto \
model.decoder.decoder_rnn_dim=256 \
model.decoder.attention_rnn_dim=1024 \
model.decoder.prenet_dim=128 \
@@ -4996,7 +4993,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
validation_datasets=/home/TestData/an4_dataset/an4_val.json \
trainer.devices="[0]" \
+trainer.limit_train_batches=1 +trainer.limit_val_batches=1 trainer.max_epochs=1 \
trainer.strategy=null \
trainer.strategy=auto \
model.train_ds.dataloader_params.batch_size=4 \
model.train_ds.dataloader_params.num_workers=0 \
model.validation_ds.dataloader_params.batch_size=4 \
@@ -5018,7 +5015,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
+trainer.limit_train_batches=1 \
+trainer.limit_val_batches=1 \
trainer.max_epochs=1 \
trainer.strategy=null \
trainer.strategy=auto \
model.pitch_mean=212.35873413085938 \
model.pitch_std=68.52806091308594 \
model.train_ds.dataloader_params.batch_size=4 \
@@ -5045,7 +5042,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
+trainer.limit_train_batches=1 \
+trainer.limit_val_batches=1 \
trainer.max_epochs=1 \
trainer.strategy=null \
trainer.strategy=auto \
model.pitch_mean=212.35873413085938 \
model.pitch_std=68.52806091308594 \
model.train_ds.dataloader_params.batch_size=4 \
@@ -5070,7 +5067,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
+trainer.limit_train_batches=1 \
+trainer.limit_val_batches=1 \
trainer.max_epochs=1 \
trainer.strategy=null \
trainer.strategy=auto \
model.pitch_mean=212.35873413085938 \
model.pitch_std=68.52806091308594 \
model.train_ds.dataloader_params.batch_size=4 \
@@ -5091,7 +5088,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
+trainer.limit_train_batches=1 \
+trainer.limit_val_batches=1 \
+trainer.max_epochs=1 \
trainer.strategy=null \
trainer.strategy=auto \
model.train_ds.dataloader_params.batch_size=4 \
model.train_ds.dataloader_params.num_workers=0 \
model.validation_ds.dataloader_params.batch_size=4 \
20 changes: 10 additions & 10 deletions docs/source/tts/api.rst
@@ -8,22 +8,22 @@ Mel-Spectrogram Generators
.. autoclass:: nemo.collections.tts.models.FastPitchModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.MixerTTSModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.RadTTSModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.Tacotron2Model
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.SpectrogramEnhancerModel
:show-inheritance:
@@ -36,38 +36,38 @@ Speech-to-Text Aligner Models
.. autoclass:: nemo.collections.tts.models.AlignerModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start


Two-Stage Models
~~~~~~~~~~~~~~~~~
.. autoclass:: nemo.collections.tts.models.TwoStagesModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start


Vocoders
~~~~~~~~
.. autoclass:: nemo.collections.tts.models.GriffinLimModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.HifiGanModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.UnivNetModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start

.. autoclass:: nemo.collections.tts.models.WaveGlowModel
:show-inheritance:
:members:
:exclude-members: setup_training_data, setup_validation_data, training_step, validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start
:exclude-members: setup_training_data, setup_validation_data, training_step, on_validation_epoch_end, validation_step, setup_test_data, on_train_epoch_start


Base Classes
1 change: 0 additions & 1 deletion examples/asr/conf/asr_adapters/asr_adaptation.yaml
@@ -187,7 +187,6 @@ trainer:
precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
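
This and the following config diffs all drop the `resume_from_checkpoint` trainer argument, which Lightning 2.0 removed. As a rough sketch of the replacement pattern (a hypothetical toy module and placeholder checkpoint path, not NeMo's exp_manager auto-resume logic), the path is now passed to `fit()` instead:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class TinyModel(pl.LightningModule):
    """Placeholder module standing in for a real NeMo model."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(2, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def train_dataloader(self):
        x, y = torch.randn(32, 2), torch.randn(32, 1)
        return DataLoader(TensorDataset(x, y), batch_size=8)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())


# PTL 1.x (removed in 2.0): pl.Trainer(resume_from_checkpoint="last.ckpt")
trainer = pl.Trainer(max_epochs=2, logger=False)
# PTL 2.0: pass the checkpoint path when starting the run instead.
# "last.ckpt" is a placeholder path, not one used anywhere in this PR.
trainer.fit(TinyModel(), ckpt_path="last.ckpt")
```
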
1 change: 0 additions & 1 deletion examples/asr/conf/conformer/conformer_ctc_bpe.yaml
@@ -204,7 +204,6 @@ trainer:
precision: 32 # 16, 32, or bf16
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
@@ -239,7 +239,6 @@ trainer:
precision: 32 # 16, 32, or bf16
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
1 change: 0 additions & 1 deletion examples/asr/conf/squeezeformer/squeezeformer_ctc_bpe.yaml
@@ -179,7 +179,6 @@ trainer:
precision: 32 # 16, 32, or bf16
log_every_n_steps: 10 # Interval of logging.
enable_progress_bar: True
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: true
1 change: 0 additions & 1 deletion examples/asr/conf/ssl/wav2vec/wav2vec_ci.yaml
@@ -138,7 +138,6 @@ trainer:
gradient_clip_val: 0.0
precision: 32 # 16, 32, or bf16
log_every_n_steps: 100 # Interval of logging.
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
check_val_every_n_epoch: 1 # number of evaluations on validation every n epochs
sync_batchnorm: false
1 change: 0 additions & 1 deletion examples/nlp/dialogue/conf/dialogue_config.yaml
@@ -25,7 +25,6 @@ trainer:
accelerator: gpu
log_every_n_steps: 5 # Interval of logging.
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
num_sanity_val_steps: 0 # number of steps to perform validation steps for sanity check the validation process before starting the training, setting to 0 disables it
enable_checkpointing: False # Provided by exp_manager
logger: False # Provided by exp_manager
@@ -71,7 +71,6 @@ decoder_trainer:
strategy: ddp
log_every_n_steps: 1 # Interval of logging.
val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.

decoder_model:
do_training: true
4 changes: 4 additions & 0 deletions examples/nlp/entity_linking/self_alignment_pretraining.py
@@ -27,6 +27,10 @@

@hydra_runner(config_path="conf", config_name="umls_medical_entity_linking_config.yaml")
def main(cfg: DictConfig) -> None:
# PTL 2.0 has find_unused_parameters as False by default, so its required to set it to True
# when there are unused parameters here
if cfg.trainer.strategy == 'ddp':
Collaborator:
@ericharper @okuchaiev - This caused an accuracy / BLEU score regression a few years back. We should train a small model with plain ddp to see whether the model is fine or accuracy degrades before switching to this specific "ddp_find_unused_parameters_true".

If it doesn't hurt memory or execution speed, maybe we can just make this value the default everywhere so we don't lose accuracy.

Collaborator Author (@athitten, Aug 2, 2023):
@titu1994 quick FYI: in previous Lightning versions like 1.9, find_unused_parameters was True by default. In Lightning 2.0 it was made False by default to improve performance. Did we observe an accuracy regression with the 1.9 versions? If not, then setting "ddp_find_unused_parameters_true" should be okay, since that was the default behavior in 1.9. More details in Lightning PR 16611.

cfg.trainer.strategy = "ddp_find_unused_parameters_true"
logging.info(f"\nConfig Params:\n{OmegaConf.to_yaml(cfg)}")
trainer = Trainer(**cfg.trainer)
exp_manager(trainer, cfg.get("exp_manager", None))
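
As the review thread above notes, Lightning 2.0 flips the DDP default for find_unused_parameters to False. For reference, a short sketch of two equivalent ways to re-enable it in plain PyTorch Lightning (this PR uses the shorthand strategy string; the explicit DDPStrategy form is shown only for comparison):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Shorthand strategy string, as set in this PR's scripts and configs.
trainer_a = pl.Trainer(accelerator="gpu", devices=2,
                       strategy="ddp_find_unused_parameters_true")

# Equivalent explicit form using DDPStrategy.
trainer_b = pl.Trainer(accelerator="gpu", devices=2,
                       strategy=DDPStrategy(find_unused_parameters=True))
```
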