Support tp pp conversion #6218

Merged: 44 commits, Mar 25, 2023
Changes from 37 commits
ee50d70
Add required flags to partially load model
titu1994 Mar 15, 2023
1ae6ae9
Add cleaned up script for tp pp change
titu1994 Mar 16, 2023
070cfa5
Add cleaned up script for tp pp change
titu1994 Mar 16, 2023
d8aa228
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 16, 2023
e0bd0f1
Add support to change parameter dtypes during conversion
titu1994 Mar 16, 2023
07f0cca
Add Debug Prints flag
titu1994 Mar 17, 2023
acd408d
Improve error logs
titu1994 Mar 17, 2023
c5f7a2e
Fix issues with TP > 1 for Megatron T5
titu1994 Mar 17, 2023
f82a00f
Finalize splitting of T5 models
titu1994 Mar 17, 2023
4b5b252
Update docstrings
titu1994 Mar 17, 2023
8e9ae21
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 17, 2023
277274f
Finalize pp tp change for T5 models
titu1994 Mar 17, 2023
a5af3f3
Fix CodeQL issue
titu1994 Mar 17, 2023
665740b
Fix dtype cast of num_gpu_per_node
titu1994 Mar 17, 2023
770a18c
Merge branch 'main' into support_tp_pp_conversion
arendu Mar 20, 2023
8764f27
Update config
titu1994 Mar 20, 2023
260b2fc
Remove block for config checks
titu1994 Mar 20, 2023
b7f8f36
Merge remote-tracking branch 'origin/support_tp_pp_conversion' into s…
titu1994 Mar 20, 2023
e94b3da
Reduce shared embedding check for older configs
titu1994 Mar 20, 2023
3efccaa
Add support for extracted directory path
titu1994 Mar 22, 2023
56f109f
Force CPU init for TP 1 PP 1 temp model
titu1994 Mar 22, 2023
5be7f9a
Patch T5 models to init fully on CPU
titu1994 Mar 23, 2023
adc2315
Update docstring
titu1994 Mar 23, 2023
180d19d
Update docstring
titu1994 Mar 23, 2023
ad18766
Update prints to logging
titu1994 Mar 23, 2023
3298398
Patch apex code
titu1994 Mar 23, 2023
f720fa5
Merge branch 'main' into support_tp_pp_conversion
titu1994 Mar 23, 2023
0e76920
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 23, 2023
9f297d1
Patch typo
titu1994 Mar 23, 2023
c845321
Fix import test of ModelType
titu1994 Mar 24, 2023
035483b
Add docstring comment for nlp override
titu1994 Mar 24, 2023
2a1981f
Merge new file with old file
titu1994 Mar 24, 2023
aba99e3
Update script call signature
titu1994 Mar 24, 2023
fbef6eb
Remove comments
titu1994 Mar 24, 2023
76541b5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
8c9ae9e
Merge branch 'main' into support_tp_pp_conversion
titu1994 Mar 24, 2023
64ac539
Update jenkins test
titu1994 Mar 24, 2023
a54353c
Fix formatting
titu1994 Mar 24, 2023
383a98c
Add open_dict hooks
titu1994 Mar 24, 2023
c268a70
Fix unit test
titu1994 Mar 24, 2023
c4edbb0
Fix unit test
titu1994 Mar 24, 2023
9f6857c
Retry in another directory
titu1994 Mar 24, 2023
0cb2e75
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 24, 2023
ea9e4b1
Revert second test cause of shutil.rename error on CI
titu1994 Mar 25, 2023
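
For readers unfamiliar with what a tensor-parallel (TP) size change involves, here is a minimal sketch of the general idea: shards held by the source ranks are merged into one full weight, which is then re-split for the target TP size. This is not the code from `megatron_change_num_partitions.py`. The helper names (`merge_tp_shards`, `split_for_tp`, `change_tp`) are hypothetical, weights are plain nested lists, and the partition is assumed to always run along the first dimension, which is a simplification.

```python
from typing import List

Matrix = List[List[float]]


def merge_tp_shards(shards: List[Matrix]) -> Matrix:
    """Concatenate row-partitioned shards back into one full weight.

    Assumption: every shard is a partition along dimension 0 (rows).
    """
    full: Matrix = []
    for shard in shards:
        full.extend(shard)
    return full


def split_for_tp(full: Matrix, target_tp: int) -> List[Matrix]:
    """Split a full weight into `target_tp` equal row partitions."""
    if target_tp < 1:
        raise ValueError("target_tp must be >= 1")
    if len(full) % target_tp != 0:
        raise ValueError(
            f"{len(full)} rows cannot be split evenly across {target_tp} ranks"
        )
    rows_per_rank = len(full) // target_tp
    return [
        full[i * rows_per_rank:(i + 1) * rows_per_rank]
        for i in range(target_tp)
    ]


def change_tp(shards: List[Matrix], target_tp: int) -> List[Matrix]:
    """Hypothetical helper: merge the source shards, then re-split them."""
    return split_for_tp(merge_tp_shards(shards), target_tp)


# Two TP ranks, each holding 2 rows of a 4-row weight.
tp2 = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
tp1 = change_tp(tp2, 1)  # reduce: 2 -> 1 partition
tp4 = change_tp(tp2, 4)  # increase: 2 -> 4 partitions
```

Merging to a single full copy first keeps the sketch simple, since reducing and increasing the partition count then share one code path.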
16 changes: 12 additions & 4 deletions Jenkinsfile
@@ -3494,7 +3494,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
}
failFast true
parallel{
-stage('Reduce Num Partitions (2 to 1)'){
+stage('Reduce TP Num Partitions (2 to 1) and PP Num Partitions (1 to 2)'){
steps{
sh "python examples/nlp/language_modeling/megatron_change_num_partitions.py \
--model_file \
@@ -3504,11 +3504,15 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
--tensor_model_parallel_size \
2 \
--target_tensor_model_parallel_size \
-1"
+1 \
+--pipeline_model_parallel_size \
+1 \
+--target_pipeline_model_parallel_size \
+2"
sh "rm /home/TestData/nlp/megatron_gpt/TP2/test-reduce.nemo"
}
}
-stage('Increase Num Partitions (2 to 4)'){
+stage('Increase TP Num Partitions (2 to 4) and PP Num Partitions (1 to 3)'){
steps{
sh "python examples/nlp/language_modeling/megatron_change_num_partitions.py \
--model_file \
@@ -3518,7 +3522,11 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
--tensor_model_parallel_size \
2 \
--target_tensor_model_parallel_size \
-4"
+4 \
+--pipeline_model_parallel_size \
+1 \
+--target_pipeline_model_parallel_size \
+3"
sh "rm /home/TestData/nlp/megatron_gpt/TP2/test-increase.nemo"
}
}
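
The two stages above differ only in their target sizes, so the invocation can be sketched as one small helper that assembles the argument list. This is an illustration only: `build_conversion_args` and the example model path are hypothetical, and only the flags visible in the diff are included. The arguments hidden in the collapsed part of each hunk are left out, so the resulting command is likely incomplete for a real run.

```python
import shlex
import sys
from typing import List

# Script path as it appears in the Jenkinsfile diff.
SCRIPT = "examples/nlp/language_modeling/megatron_change_num_partitions.py"


def build_conversion_args(
    model_file: str, tp: int, target_tp: int, pp: int, target_pp: int
) -> List[str]:
    """Hypothetical helper: assemble argv using only flags shown in the diff."""
    sizes = {
        "--tensor_model_parallel_size": tp,
        "--target_tensor_model_parallel_size": target_tp,
        "--pipeline_model_parallel_size": pp,
        "--target_pipeline_model_parallel_size": target_pp,
    }
    args = [sys.executable, SCRIPT, "--model_file", model_file]
    for flag, value in sizes.items():
        if value < 1:
            raise ValueError(f"{flag} must be >= 1, got {value}")
        args += [flag, str(value)]
    return args


# Mirrors the 'Reduce' stage: TP 2 -> 1, PP 1 -> 2.
# The model path is a placeholder, not the path used in CI.
reduce_args = build_conversion_args("/path/to/model.nemo", 2, 1, 1, 2)
print(shlex.join(reduce_args))
```

Passing the list to `subprocess.run(reduce_args, check=True)` would execute it, once the arguments omitted here are supplied.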