Support tp pp conversion #6218
Signed-off-by: smajumdar <[email protected]>
pretty cool PR. just have one minor comment.
examples/nlp/language_modeling/megatron_change_num_partitions_pp.py
LGTM. Thanks!
LGTM! Just a couple of minor comments; everything else looks great!
Also, can we add a PP change CI test as well? It would be helpful to keep testing that, since the PR brings in global overrides that may cause issues if someone changes them.
Good point about the Jenkins test. I updated the old one from TP reduce and increase to jointly increase PP by an even or odd number.
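For readers unfamiliar with what such a test guards against, here is a minimal sketch of a round-trip check on tensor-parallel shards. It is not the Jenkins test from this PR: the helper names (`split_rows`, `merge_rows`, `change_tp`, `round_trip_ok`) are hypothetical, plain Python lists stand in for tensors, and the even row split is an assumption. The same pattern would apply to pipeline stages.

```python
from typing import List

Matrix = List[List[float]]


def split_rows(weight: Matrix, tp_size: int) -> List[Matrix]:
    """Split a weight matrix into tp_size equal row blocks, one per rank."""
    if tp_size < 1 or len(weight) % tp_size != 0:
        raise ValueError(
            f"{len(weight)} rows cannot be split evenly across {tp_size} ranks"
        )
    chunk = len(weight) // tp_size
    return [weight[r * chunk:(r + 1) * chunk] for r in range(tp_size)]


def merge_rows(shards: List[Matrix]) -> Matrix:
    """Concatenate per-rank row blocks back into the full matrix."""
    return [row for shard in shards for row in shard]


def change_tp(shards: List[Matrix], new_tp: int) -> List[Matrix]:
    """Re-partition by merging to a single matrix and splitting again."""
    return split_rows(merge_rows(shards), new_tp)


def round_trip_ok(weight: Matrix, start_tp: int, target_tp: int) -> bool:
    """Convert start_tp -> target_tp -> start_tp and check nothing was lost."""
    shards = split_rows(weight, start_tp)
    converted = change_tp(shards, target_tp)
    restored = change_tp(converted, start_tp)
    return len(converted) == target_tp and merge_rows(restored) == weight


# 12 rows divide evenly by 2, 3 and 4, so both an odd and an even target work.
weight = [[float(r * 4 + c) for c in range(4)] for r in range(12)]
results = {target: round_trip_ok(weight, 2, target) for target in (3, 4)}
print(results)
```

A target size that does not divide the row count (for example 5 here) raises `ValueError` in this sketch; how the real script handles such cases is not stated in this thread.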
LGTM, thanks!
* Add required flags to partially load model
* Add cleaned up script for tp pp change
* Add cleaned up script for tp pp change
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Add support to change parameter dtypes during conversion
* Add Debug Prints flag
* Improve error logs
* Fix issues with TP > 1 for Megatron T5
* Finalize splitting of T5 models
* Update docstrings
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Finalize pp tp change for T5 models
* Fix CodeQL issue
* Fix dtype cast of num_gpu_per_node
* Update config
* Remove block for config checks
* Reduce shared embedding check for older configs
* Add support for extracted directory path
* Force CPU init for TP 1 PP 1 temp model
* Patch T5 models to init fully on CPU
* Update docstring
* Update docstring
* Update prints to logging
* Patch apex code
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Patch typo
* Fix import test of ModelType
* Add docstring comment for nlp override
* Merge new file with old file
* Update script call signature
* Remove comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Update jenkins test
* Fix formatting
* Add open_dict hooks
* Fix unit test
* Fix unit test
* Retry in another directory
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Revert second test cause of shutil.rename error on CI
---------
Signed-off-by: smajumdar <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Adi Renduchintala <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
What does this PR do?
Adds support for changing the pipeline parallel (PP) size of a GPT model after it has been constructed.
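To make the idea concrete, here is a minimal sketch of re-partitioning layers across pipeline stages. It is not the code from `megatron_change_num_partitions_pp.py`: the helper names are hypothetical, strings stand in for layer weights, and the even, contiguous layer split is an assumption. It merges all stages into one full view first, loosely mirroring the "TP 1 PP 1 temp model" mentioned in the commit list, then splits that view for the new PP size.

```python
from typing import Dict, List

# Global layer index -> placeholder for that layer's weights (hypothetical type).
Checkpoint = Dict[int, str]


def layers_per_stage(num_layers: int, pp_size: int) -> int:
    """Number of layers each stage holds, assuming an even split."""
    if pp_size < 1 or num_layers % pp_size != 0:
        raise ValueError(
            f"{num_layers} layers cannot be split evenly across {pp_size} stages"
        )
    return num_layers // pp_size


def merge_stages(stages: List[Checkpoint]) -> Checkpoint:
    """Collect every stage's layers into a single PP=1 view."""
    merged: Checkpoint = {}
    for stage in stages:
        merged.update(stage)
    return merged


def split_stages(full: Checkpoint, pp_size: int) -> List[Checkpoint]:
    """Assign contiguous blocks of layers to each pipeline stage."""
    per_stage = layers_per_stage(len(full), pp_size)
    ordered = sorted(full)
    return [
        {idx: full[idx] for idx in ordered[s * per_stage:(s + 1) * per_stage]}
        for s in range(pp_size)
    ]


def change_pp(stages: List[Checkpoint], new_pp: int) -> List[Checkpoint]:
    """Re-partition by going through the merged PP=1 view."""
    return split_stages(merge_stages(stages), new_pp)


# A 12-layer model stored with PP=2, converted to PP=3.
old = split_stages({i: f"layer_{i}" for i in range(12)}, 2)
new = change_pp(old, 3)
print([sorted(stage) for stage in new])
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Going through a merged view keeps the logic simple at the cost of holding the whole model in one place, which is presumably why the commit list mentions forcing CPU initialization for the temporary model.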
Collection: [Core, NLP]