
Fix CPU initialization of GPT models #7889

Merged
cuichenx merged 2 commits into main from chcui/fix_gpt_cpu_init on Nov 21, 2023

Conversation

cuichenx (Collaborator) commented:

What does this PR do?

Adds a use_cpu_initialization check in MegatronGPTModel. This allows conversion scripts such as convert_nemo_gpt_to_mcore.py to run in CPU mode for larger models.
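
For context, here is a minimal Python sketch of the kind of guard described above. Only the `use_cpu_initialization` config key and the conversion-script name come from this PR; the helper name, config structure, and demo values are illustrative assumptions, not the actual NeMo diff.

```python
from typing import Any, Dict


def build_gpt_model_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Collect GPT constructor kwargs, honoring CPU initialization when requested.

    Hypothetical helper for illustration; NeMo wires this flag through its own
    model-construction path inside MegatronGPTModel.
    """
    # Default to False so existing GPU-initialized runs keep their behavior.
    use_cpu_init = cfg.get("use_cpu_initialization", False)

    return {
        "num_layers": cfg["num_layers"],
        "hidden_size": cfg["hidden_size"],
        # When True, weights are allocated on the CPU instead of a CUDA device,
        # which lets conversion scripts (e.g. convert_nemo_gpt_to_mcore.py)
        # handle checkpoints that do not fit in a single GPU's memory.
        "use_cpu_initialization": use_cpu_init,
    }


if __name__ == "__main__":
    cfg = {"num_layers": 2, "hidden_size": 8, "use_cpu_initialization": True}
    print(build_gpt_model_kwargs(cfg))
```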

Collection: NLP

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (e.g., Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Chen Cui <[email protected]>
github-actions bot added the NLP label on Nov 15, 2023
@cuichenx (Collaborator, Author) commented:

jenkins

@aklife97 (Collaborator) left a comment:


LGTM, thank you!

cuichenx merged commit ef19e02 into main on Nov 21, 2023
15 checks passed
cuichenx deleted the chcui/fix_gpt_cpu_init branch on November 21, 2023 at 16:31
erhoo82 pushed a commit to erhoo82/NeMo that referenced this pull request Dec 2, 2023
Signed-off-by: Chen Cui <[email protected]>

support packed dataset

Signed-off-by: Chen Cui <[email protected]>

[Codec] Finite scalar quantizer (NVIDIA#7886)

* Finite scalar quantizer

Signed-off-by: Ante Jukić <[email protected]>

* Updated test

Signed-off-by: Ante Jukić <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>

upgrade to latest mcore and TE (NVIDIA#7908)

* reimport module

Signed-off-by: dimapihtar <[email protected]>

* update mcore and TE commits

Signed-off-by: dimapihtar <[email protected]>

---------

Signed-off-by: dimapihtar <[email protected]>

Tar codec (NVIDIA#7867)

added missing torch import (NVIDIA#7913)

Signed-off-by: David Mosallanezhad <[email protected]>

add cpu init check (NVIDIA#7889)

Signed-off-by: Chen Cui <[email protected]>

Fix pinned triton version (NVIDIA#7925)

* Fix pinned triton version

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove comment

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change README

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Remove flash-attn in Dockerfile

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

* Revert

Signed-off-by: Cheng-Ping Hsieh <[email protected]>

---------

Signed-off-by: Cheng-Ping Hsieh <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix tp_overlap config var name (NVIDIA#7928)

Signed-off-by: Xiaowei Ren <[email protected]>

add Dutch P&C FC model info (NVIDIA#7892)

* add Dutch P&C FC model info

Signed-off-by: zhehuaichen <[email protected]>

* update order of the results

Signed-off-by: zhehuaichen <[email protected]>

---------

Signed-off-by: zhehuaichen <[email protected]>

fix issues with convert_nemo_llama_to_hf.py (NVIDIA#7922)

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

fix collate_fn bug for TP > 1

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

make packed dataset work

Signed-off-by: Chen Cui <[email protected]>

fix nan bug

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

support answer only loss

Signed-off-by: Chen Cui <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

account for padding in cu_seqlens during dataloading for attn kernel

Signed-off-by: Chen Cui <[email protected]>

fix path for answer_only_loss = false

Signed-off-by: Chen Cui <[email protected]>

Modify GPTSFTPackedDataset to respond to pad_to_max_length setting

Signed-off-by: Valerie Sarge <[email protected]>
pzelasko pushed a commit to pzelasko/NeMo that referenced this pull request Jan 3, 2024
Signed-off-by: Chen Cui <[email protected]>
Signed-off-by: Piotr Żelasko <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024