Fix curriculum learning support #134
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
OK, so testing this branch: the training hangs on startup, so it appears that there is perhaps a deadlock somewhere. All GPUs spin at 100%.
So, besides the discussion on Slack confirming that once deepspeedai/DeepSpeed#1473 is merged we can merge this one as well, I was able to use these 2 PRs to launch a Meg-DS training w/o problems. And additionally @conglongli ran …
Fix the curriculum learning (CL) + pipeline parallelism (PP) case when pp >= 4.
The error that this PR fixes can be reproduced by changing pp_size to 4 and num-layers to 4 in https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/tests/test_training.py.
@stas00 please test this on your side.
Also fixes backward compatibility for the new checkpoint keys introduced by CL.
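For readers unfamiliar with this kind of fix, here is a minimal sketch of one common way to keep older checkpoints loadable after new keys are added: fill in defaults for anything missing. This is not the code from this PR. The helper `load_with_defaults` and the key name `cl_state` are hypothetical and only illustrate the pattern.

```python
from typing import Any, Dict

# Hypothetical key and default, for illustration only; these are not the
# actual checkpoint keys introduced by this PR.
CL_DEFAULTS: Dict[str, Any] = {"cl_state": None}


def load_with_defaults(state: Dict[str, Any],
                       defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `state` with missing keys filled from `defaults`.

    Keys already present in `state` are never overwritten, so checkpoints
    written after the new keys were added load unchanged.
    """
    merged = dict(defaults)   # start from the defaults
    merged.update(state)      # values from the checkpoint take precedence
    return merged


# An "old" checkpoint that predates the new key still loads.
old_ckpt = {"iteration": 10}
loaded = load_with_defaults(old_ckpt, CL_DEFAULTS)
assert loaded["cl_state"] is None
assert loaded["iteration"] == 10
```

The point of the pattern is that the loading code can read the new key unconditionally, while checkpoints saved before the change do not raise a `KeyError`.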