Update dependency install for LLM and MM #8990
Conversation
Signed-off-by: eharper <[email protected]>
Very nice!
* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions.
Change "LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions" to:
LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for installation instructions
please refer to the `Software Component Versions <https://docs.nvidia.com/nemo-framework/user-guide/latest/softwarecomponentversions.html>`_
for the correct versions.

If starting with a base NVIDIA PyTorch container first launch the container:
Change "If starting with a base NVIDIA PyTorch container first launch the container:" to:
If starting with a base NVIDIA PyTorch container, first launch the container:
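The launch command itself is truncated in this diff view. A minimal sketch of what launching the base container might look like, using the image tag named elsewhere in this PR (the flags are common choices, not mandated by the docs):

```shell
# Illustrative only: launch the base NVIDIA PyTorch container with GPU access,
# mounting the current directory as the working directory inside the container.
IMAGE=nvcr.io/nvidia/pytorch:24.02-py3
docker run --gpus all -it --rm \
  -v "$PWD":/workspace -w /workspace \
  "$IMAGE"
```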
NeMo LLM Domain training requires NVIDIA Apex to be installed.
Install it manually if not using the NVIDIA PyTorch container.
NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed.
Apex comes installed in the NVIDIA PyTorch container but it's possible that
Change "Apex comes installed in the NVIDIA PyTorch container but it's possible that" to:
Apex comes installed in the NVIDIA PyTorch container, but it's possible that
While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
While installing Apex outside of the NVIDIA PyTorch container,
it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
Change "While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with." to:
While installing Apex, you may encounter an error if the CUDA version on your system does not align with the CUDA version used to compile PyTorch binaries.
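A quick way to surface the mismatch this comment describes before attempting the Apex build is to print both CUDA versions side by side. A sketch, assuming `nvcc` and a working `torch` install are on the path:

```shell
# Sketch: compare the system CUDA toolkit release with the CUDA version
# PyTorch was compiled against. A mismatch here is what triggers Apex's error.
sys_cuda=$(nvcc --version | sed -n 's/^.*release \([0-9.]*\),.*$/\1/p')
torch_cuda=$(python -c "import torch; print(torch.version.cuda)")
echo "system CUDA: ${sys_cuda}  torch CUDA: ${torch_cuda}"
```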
pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed.
Change: "The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed" to:
The NeMo LLM Multimodal Domains require that the NVIDIA Transformer Engine be installed.
@@ -366,35 +405,43 @@ With the latest versions of Apex, the `pyproject.toml` file in Apex may need to

Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
Transformer Engine enables FP8 training on NVIDIA Hopper GPUs.
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container.
Change "`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container." to:
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ manually if not using the NVIDIA PyTorch container.
pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed.
Transformer Engine comes installed in the NVIDIA PyTorch container but it's possible that
Change: "Transformer Engine comes installed in the NVIDIA PyTorch container but it's possible that" to:
Transformer Engine comes installed in the NVIDIA PyTorch container, but it's possible that
Flash Attention
~~~~~~~~~~~~~~~
When traning Large Language Models in NeMo, users may opt to use Flash Attention for efficient training. Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models, please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_. If you want to use Flash Attention with attention bias (introduced from position encoding, e.g. Alibi), please also install triton pinned version following the `implementation <https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3>`_.
Change: "When traning Large Language Models in NeMo, users may opt to use Flash Attention for efficient training. " to:
When training Large Language Models in NeMo, users may opt to use Flash Attention for efficient training.
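For the non-causal case discussed above, the install step would look roughly like the following. This is a sketch, not a tested command: the package name comes from the linked repository, and building it requires a CUDA toolchain on the host.

```shell
# Sketch: install flash-attn for non-causal models. Compiles CUDA kernels,
# so a CUDA toolchain must be present; --no-build-isolation is commonly needed
# so the build sees the already-installed torch.
pip install flash-attn --no-build-isolation
```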
The NeMo LLM Multimodal Domains require that NVIDIA Megatron Core to be installed.
Megatron core is a library for scaling large transfromer base models.
NeMo LLM and Multimodal models leverage Megatron Core for model parallelism,
transformer architectures, and optimized pytorch datasets.
Change: "transformer architectures, and optimized pytorch datasets." to:
transformer architectures, and optimized PyTorch datasets.
@@ -404,7 +451,7 @@ Docker containers
~~~~~~~~~~~~~~~~~
We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.23.0`` comes with container ``nemo:24.01.speech``, you may find more details about released containers in `releases page <https://github.com/NVIDIA/NeMo/releases>`_.

To use built container, please run
Change: "To use built container, please run" to:
To use a built container, please run
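The run command is cut off in this diff view. One plausible form, using the released container named earlier in the thread (flags are illustrative, not from the PR):

```shell
# Illustrative: pull and run the released NeMo speech container named in these docs.
docker pull nvcr.io/nvidia/nemo:24.01.speech
docker run --gpus all -it --rm nvcr.io/nvidia/nemo:24.01.speech
```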
* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions.
* It's higly recommended to start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`
Change: "It's higly recommended to start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`" to:
It's highly recommended that you start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`
The LLM and Multimodal domains require three additional dependencies:
NVIDIA Apex, NVIDIA Transformer Engine, and NVIDIA Megatron Core.

When working with the `main` branch these dependencies may require a recent commit.
Change: "When working with the `main` branch these dependencies may require a recent commit." to:
When working with the `main` branch, these dependencies may require a recent commit.
Apex
~~~~
NeMo LLM Domain training requires NVIDIA Apex to be installed.
Install it manually if not using the NVIDIA PyTorch container.
NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed.
Change: "NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed." to:
NeMo LLM Multimodal Domains require that NVIDIA Apex be installed.
While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
While installing Apex outside of the NVIDIA PyTorch container,
it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
This raise can be avoided by commenting it here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32
Change: "This raise can be avoided by commenting it here" to:
This raised error can be avoided by commenting about it here
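For context on the check being discussed, a from-source Apex build typically looks like the following. This is a hedged sketch: the exact flags vary by Apex version, so verify against the Apex README before use.

```shell
# Sketch (hedged): build Apex from source with CUDA extensions enabled.
# The version check in setup.py runs during this install step; flags below
# reflect one commonly documented form, not a guaranteed invocation.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" \
  --config-settings "--build-option=--cuda_ext" ./
```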
@@ -188,12 +188,15 @@ The NeMo Framework can be installed in a variety of ways, depending on your need
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
* When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains.
Change: "When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains." to:
When using an NVIDIA PyTorch container as the base, this is the recommended installation method for all domains.
@@ -188,12 +188,15 @@ The NeMo Framework can be installed in a variety of ways, depending on your need
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
* When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains.

* Docker - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions.
* Docker Containers - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions.

* This is recommended for Large Language Models (LLM), Multimodal and Vision domains.
Change: "This is recommended for Large Language Models (LLM), Multimodal and Vision domains." to:
This is recommended for Large Language Models (LLM), Multimodal (MM), and Vision domains.
@@ -366,35 +405,43 @@ With the latest versions of Apex, the `pyproject.toml` file in Apex may need to

Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
Change: "NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_" to:
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_.
Copyedited content
@@ -10,7 +10,7 @@ ijson
jieba
markdown2
matplotlib>=3.3.2
megatron_core==0.5.0
megatron_core>0.6.0
The latest version as of now is megatron_core 0.6.0, and this change makes it impossible to install NeMo from source.
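This comment's point can be checked mechanically: `>0.6.0` excludes version 0.6.0 itself, so if 0.6.0 is the newest release on PyPI, no published version satisfies the pin. A small check using the `packaging` library (assumed available; it ships alongside most pip installations):

```shell
python - <<'EOF'
from packaging.specifiers import SpecifierSet

# ">0.6.0" excludes 0.6.0 itself, so the pin is unsatisfiable
# when 0.6.0 is the latest published release.
print(SpecifierSet(">0.6.0").contains("0.6.0"))    # False
print(SpecifierSet(">=0.6.0").contains("0.6.0"))   # True
EOF
```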
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
Jenkins CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
There's no need to comment `jenkins` on the PR to trigger Jenkins CI. The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information