Update dependency install for LLM and MM #8990
Conversation
Signed-off-by: eharper <[email protected]>
Very nice!
* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions.
Change "LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions" to:
LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for installation instructions
please refer to the `Software Component Versions <https://docs.nvidia.com/nemo-framework/user-guide/latest/softwarecomponentversions.html>`_
for the correct versions.

If starting with a base NVIDIA PyTorch container first launch the container:
Change "If starting with a base NVIDIA PyTorch container first launch the container:" to:
If starting with a base NVIDIA PyTorch container, first launch the container:
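The launch command itself is truncated in this diff view. A minimal sketch of what launching the base container might look like, using the image tag named elsewhere in this PR (the flags are common choices, not mandated by the docs):

```shell
# Illustrative only: launch the base NVIDIA PyTorch container with GPU access,
# mounting the current directory as the working directory inside the container.
IMAGE=nvcr.io/nvidia/pytorch:24.02-py3
docker run --gpus all -it --rm \
  -v "$PWD":/workspace -w /workspace \
  "$IMAGE"
```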
NeMo LLM Domain training requires NVIDIA Apex to be installed.
Install it manually if not using the NVIDIA PyTorch container.
NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed.
Apex comes installed in the NVIDIA PyTorch container but it's possible that
Change "Apex comes installed in the NVIDIA PyTorch container but it's possible that" to:
Apex comes installed in the NVIDIA PyTorch container, but it's possible that
While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
While installing Apex outside of the NVIDIA PyTorch container,
it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
Change "While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with." to:
While installing Apex, you may encounter an error if the CUDA version on your system does not align with the CUDA version used to compile PyTorch binaries.
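A quick way to surface the mismatch this comment describes before attempting the Apex build is to print both CUDA versions side by side. A sketch, assuming `nvcc` and a working `torch` install are on the path:

```shell
# Sketch: compare the system CUDA toolkit release with the CUDA version
# PyTorch was compiled against. A mismatch here is what triggers Apex's error.
sys_cuda=$(nvcc --version | sed -n 's/^.*release \([0-9.]*\),.*$/\1/p')
torch_cuda=$(python -c "import torch; print(torch.version.cuda)")
echo "system CUDA: ${sys_cuda}  torch CUDA: ${torch_cuda}"
```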
pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed.
Change: "The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed" to:
The NeMo LLM Multimodal Domains require that the NVIDIA Transformer Engine be installed.
@@ -366,35 +405,43 @@ With the latest versions of Apex, the `pyproject.toml` file in Apex may need to

Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
Transformer Engine enables FP8 training on NVIDIA Hopper GPUs.
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container.
Change "`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container." to:
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ manually if not using the NVIDIA PyTorch container.
pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed.
Transformer Engine comes installed in the NVIDIA PyTorch container but it's possible that
Change: "Transformer Engine comes installed in the NVIDIA PyTorch container but it's possible that" to:
Transformer Engine comes installed in the NVIDIA PyTorch container, but it's possible that
Flash Attention
~~~~~~~~~~~~~~~
When traning Large Language Models in NeMo, users may opt to use Flash Attention for efficient training. Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models, please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_. If you want to use Flash Attention with attention bias (introduced from position encoding, e.g. Alibi), please also install triton pinned version following the `implementation <https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3>`_.
Change: "When traning Large Language Models in NeMo, users may opt to use Flash Attention for efficient training. " to:
When training Large Language Models in NeMo, users may opt to use Flash Attention for efficient training.
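For the non-causal case discussed above, the install step would look roughly like the following. This is a sketch, not a tested command: the package name comes from the linked repository, and building it requires a CUDA toolchain on the host.

```shell
# Sketch: install flash-attn for non-causal models. Compiles CUDA kernels,
# so a CUDA toolchain must be present; --no-build-isolation is commonly needed
# so the build sees the already-installed torch.
pip install flash-attn --no-build-isolation
```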
The NeMo LLM Multimodal Domains require that NVIDIA Megatron Core to be installed.
Megatron core is a library for scaling large transfromer base models.
NeMo LLM and Multimodal models leverage Megatron Core for model parallelism,
transformer architectures, and optimized pytorch datasets.
Change: "transformer architectures, and optimized pytorch datasets." to:
transformer architectures, and optimized PyTorch datasets.
@@ -404,7 +451,7 @@ Docker containers
~~~~~~~~~~~~~~~~~
We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.23.0`` comes with container ``nemo:24.01.speech``, you may find more details about released containers in `releases page <https://github.com/NVIDIA/NeMo/releases>`_.

To use built container, please run
Change: "To use built container, please run" to:
To use a built container, please run
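The run command is cut off in this diff view. One plausible form, using the released container named earlier in the thread (flags are illustrative, not from the PR):

```shell
# Illustrative: pull and run the released NeMo speech container named in these docs.
docker pull nvcr.io/nvidia/nemo:24.01.speech
docker run --gpus all -it --rm nvcr.io/nvidia/nemo:24.01.speech
```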
* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions.
* It's higly recommended to start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`
Change: "It's higly recommended to start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`" to:
It's highly recommended that you start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`
The LLM and Multimodal domains require three additional dependencies:
NVIDIA Apex, NVIDIA Transformer Engine, and NVIDIA Megatron Core.

When working with the `main` branch these dependencies may require a recent commit.
Change: "When working with the `main` branch these dependencies may require a recent commit." to:
When working with the `main` branch, these dependencies may require a recent commit.
Apex
~~~~
NeMo LLM Domain training requires NVIDIA Apex to be installed.
Install it manually if not using the NVIDIA PyTorch container.
NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed.
Change: "NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed." to:
NeMo LLM Multimodal Domains require that NVIDIA Apex be installed.
While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
While installing Apex outside of the NVIDIA PyTorch container,
it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
This raise can be avoided by commenting it here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32
Change: "This raise can be avoided by commenting it here" to:
This raised error can be avoided by commenting about it here
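For context on the check being discussed, a from-source Apex build typically looks like the following. This is a hedged sketch: the exact flags vary by Apex version, so verify against the Apex README before use.

```shell
# Sketch (hedged): build Apex from source with CUDA extensions enabled.
# The version check in setup.py runs during this install step; flags below
# reflect one commonly documented form, not a guaranteed invocation.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" \
  --config-settings "--build-option=--cuda_ext" ./
```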
@@ -188,12 +188,15 @@ The NeMo Framework can be installed in a variety of ways, depending on your need
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
* When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains.
Change: "When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains." to:
When using an NVIDIA PyTorch container as the base, this is the recommended installation method for all domains.
@@ -188,12 +188,15 @@ The NeMo Framework can be installed in a variety of ways, depending on your need
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
* When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains.

* Docker - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions.
* Docker Containers - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions.

* This is recommended for Large Language Models (LLM), Multimodal and Vision domains.
Change: "This is recommended for Large Language Models (LLM), Multimodal and Vision domains." to:
This is recommended for Large Language Models (LLM), Multimodal (MM), and Vision domains.
@@ -366,35 +405,43 @@ With the latest versions of Apex, the `pyproject.toml` file in Apex may need to

Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
Change: "NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_" to:
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_.
Copyedited content
@@ -10,7 +10,7 @@ ijson
jieba
markdown2
matplotlib>=3.3.2
megatron_core==0.5.0
megatron_core>0.6.0
The latest version as of now is megatron_core 0.6.0, and this change makes it impossible to install NeMo from source.
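This comment's point can be checked mechanically: `>0.6.0` excludes version 0.6.0 itself, so if 0.6.0 is the newest release on PyPI, no published version satisfies the pin. A small check using the `packaging` library (assumed available; it ships alongside most pip installations):

```shell
python - <<'EOF'
from packaging.specifiers import SpecifierSet

# ">0.6.0" excludes 0.6.0 itself, so the pin is unsatisfiable
# when 0.6.0 is the latest published release.
print(SpecifierSet(">0.6.0").contains("0.6.0"))    # False
print(SpecifierSet(">=0.6.0").contains("0.6.0"))   # True
EOF
```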
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment `jenkins` on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information