Update dependency install for LLM and MM #8990

Merged
ericharper merged 8 commits into main from update_install on Apr 22, 2024

Conversation

ericharper
Collaborator

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line-by-line info of high-level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Jenkins CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

There's no need to comment "jenkins" on the PR to trigger Jenkins CI.
The GitHub Actions CI will run automatically when the PR is opened.
To run CI on an untrusted fork, a NeMo user with write access must click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex, etc.)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: eharper <[email protected]>
titu1994 (Collaborator) previously approved these changes Apr 19, 2024

@titu1994 left a comment:
Very nice !

Signed-off-by: eharper <[email protected]>
@ericharper ericharper merged commit a452a4f into main Apr 22, 2024
128 checks passed
@ericharper ericharper deleted the update_install branch April 22, 2024 17:45
* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions.
@jgerh (Collaborator) commented on Apr 22, 2024:

Change "LLM and Multimodal Dependencies - Refer to the LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>_ section for isntallation instructions" to:

LLM and Multimodal Dependencies - Refer to the LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>_ section for installation instructions

please refer to the `Software Component Versions <https://docs.nvidia.com/nemo-framework/user-guide/latest/softwarecomponentversions.html>`_
for the correct versions.

If starting with a base NVIDIA PyTorch container first launch the container:
@jgerh (Collaborator) commented on Apr 22, 2024:

Change "If starting with a base NVIDIA PyTorch container first launch the container:" to:

If starting with a base NVIDIA PyTorch container, first launch the container:
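For reference, a minimal sketch of launching such a container (assuming the nvcr.io/nvidia/pytorch:24.02-py3 image referenced elsewhere in this diff and a host with the NVIDIA Container Toolkit set up; adjust mounts to your environment):

docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/pytorch:24.02-py3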

NeMo LLM Domain training requires NVIDIA Apex to be installed.
Install it manually if not using the NVIDIA PyTorch container.
NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed.
Apex comes installed in the NVIDIA PyTorch container but it's possible that
@jgerh (Collaborator) commented on Apr 22, 2024:

Change "Apex comes installed in the NVIDIA PyTorch container but it's possible that" to:

Apex comes installed in the NVIDIA PyTorch container, but it's possible that

While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
While installing Apex outside of the NVIDIA PyTorch container,
it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
@jgerh (Collaborator) commented on Apr 22, 2024:

Change "While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with." to:

While installing Apex, you may encounter an error if the CUDA version on your system does not align with the CUDA version used to compile PyTorch binaries.
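As a quick sanity check for the mismatch described above, the system CUDA toolkit can be compared with the CUDA version PyTorch was built against (a sketch assuming nvcc and PyTorch are already on the path):

# CUDA toolkit visible to the build
nvcc --version
# CUDA version PyTorch was compiled with
python -c "import torch; print(torch.version.cuda)"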


pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed.
Collaborator comment:

Change: "The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed" to:

The NeMo LLM Multimodal Domains require that the NVIDIA Transformer Engine be installed.
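A minimal way to confirm the install succeeded (a sketch assuming the transformer_engine package exposes a __version__ attribute):

python -c "import transformer_engine; print(transformer_engine.__version__)"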

@@ -366,35 +405,43 @@ With the latest versions of Apex, the `pyproject.toml` file in Apex may need to

Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
Transformer Engine enables FP8 training on NVIDIA Hopper GPUs.
`Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>`_ it manually if not using the NVIDIA PyTorch container.
Collaborator comment:

Change "Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>_ it manually if not using the NVIDIA PyTorch container." to:

Install <https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/installation.html>_ manually if not using the NVIDIA PyTorch container.


pip install --upgrade git+https://github.com/NVIDIA/TransformerEngine.git@stable
The NeMo LLM Multimodal Domains require that NVIDIA Transformer Engine to be installed.
Transformer Engine comes installed in the NVIDIA PyTorch container but it's possible that
@jgerh (Collaborator) commented on Apr 22, 2024:

Change: "Transformer Engine comes installed in the NVIDIA PyTorch container but it's possible that" to:

Transformer Engine comes installed in the NVIDIA PyTorch container, but it's possible that

Flash Attention
~~~~~~~~~~~~~~~
When traning Large Language Models in NeMo, users may opt to use Flash Attention for efficient training. Transformer Engine already supports Flash Attention for GPT models. If you want to use Flash Attention for non-causal models, please install `flash-attn <https://github.com/HazyResearch/flash-attention>`_. If you want to use Flash Attention with attention bias (introduced from position encoding, e.g. Alibi), please also install triton pinned version following the `implementation <https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py#L3>`_.
Collaborator comment:

Change: "When traning Large Language Models in NeMo, users may opt to use Flash Attention for efficient training. " to:

When training Large Language Models in NeMo, users may opt to use Flash Attention for efficient training.
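For non-causal models, flash-attn is typically installed from PyPI roughly as follows (a sketch only; the flash-attention README is the authoritative source for the current command):

pip install flash-attn --no-build-isolation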

The NeMo LLM Multimodal Domains require that NVIDIA Megatron Core to be installed.
Megatron core is a library for scaling large transfromer base models.
NeMo LLM and Multimodal models leverage Megatron Core for model parallelism,
transformer architectures, and optimized pytorch datasets.
Collaborator comment:

Change: "transformer architectures, and optimized pytorch datasets." to:

transformer architectures, and optimized PyTorch datasets.
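A hedged sketch of installing Megatron Core from PyPI; the exact pin should follow the requirements file changed in this PR rather than this example:

pip install megatron_core   # pin to the version required by the requirements file in this PR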

@@ -404,7 +451,7 @@ Docker containers
~~~~~~~~~~~~~~~~~
We release NeMo containers alongside NeMo releases. For example, NeMo ``r1.23.0`` comes with container ``nemo:24.01.speech``, you may find more details about released containers in `releases page <https://github.com/NVIDIA/NeMo/releases>`_.

To use built container, please run
@jgerh (Collaborator) commented on Apr 22, 2024:

Change: "To use built container, please run" to:

To use a built container, please run
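A minimal example of running the released container (flags are illustrative; adjust GPU and mount options as needed):

docker run --gpus all -it --rm nvcr.io/nvidia/nemo:24.01.speech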

* NeMo Speech Container - `nvcr.io/nvidia/nemo:24.01.speech`

* LLM and Multimodal Dependencies - Refer to the `LLM and Multimodal dependencies <#llm-and-multimodal-dependencies>`_ section for isntallation instructions.
* It's higly recommended to start with a base NVIDIA PyTorch container: `nvcr.io/nvidia/pytorch:24.02-py3`
@jgerh (Collaborator) commented on Apr 22, 2024:

Change: "It's higly recommended to start with a base NVIDIA PyTorch container: nvcr.io/nvidia/pytorch:24.02-py3" to:

It's highly recommended that you start with a base NVIDIA PyTorch container: nvcr.io/nvidia/pytorch:24.02-py3

The LLM and Multimodal domains require three additional dependencies:
NVIDIA Apex, NVIDIA Transformer Engine, and NVIDIA Megatron Core.

When working with the `main` branch these dependencies may require a recent commit.
Collaborator comment:

Change: "When working with the main branch these dependencies may require a recent commit." to:

When working with the main branch, these dependencies may require a recent commit.

Apex
~~~~
NeMo LLM Domain training requires NVIDIA Apex to be installed.
Install it manually if not using the NVIDIA PyTorch container.
NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed.
Collaborator comment:

Change: "NeMo LLM Multimodal Domains require that NVIDIA Apex to be installed." to:

NeMo LLM Multimodal Domains require that NVIDIA Apex be installed.

While installing Apex, it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
While installing Apex outside of the NVIDIA PyTorch container,
it may raise an error if the CUDA version on your system does not match the CUDA version torch was compiled with.
This raise can be avoided by commenting it here: https://github.com/NVIDIA/apex/blob/master/setup.py#L32
@jgerh (Collaborator) commented on Apr 22, 2024:

Change: "This raise can be avoided by commenting it here" to:

This raised error can be avoided by commenting about it here

@@ -188,12 +188,15 @@ The NeMo Framework can be installed in a variety of ways, depending on your need
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
* When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains.
Collaborator comment:

Change: "When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains." to:

When using an NVIDIA PyTorch container as the base, this is the recommended installation method for all domains.

@@ -188,12 +188,15 @@ The NeMo Framework can be installed in a variety of ways, depending on your need
* This is recommended for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) domains.
* When using a Nvidia PyTorch container as the base, this is the recommended installation method for all domains.

* Docker - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions.
* Docker Containers - Refer to the `Docker containers <#docker-containers>`_ section for installation instructions.

* This is recommended for Large Language Models (LLM), Multimodal and Vision domains.
@jgerh (Collaborator) commented on Apr 22, 2024:

Change: "This is recommended for Large Language Models (LLM), Multimodal and Vision domains." to:

This is recommended for Large Language Models (LLM), Multimodal (MM), and Vision domains.

@@ -366,35 +405,43 @@ With the latest versions of Apex, the `pyproject.toml` file in Apex may need to

Transformer Engine
~~~~~~~~~~~~~~~~~~
NeMo LLM Domain has been integrated with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
Collaborator comment:

Change: "NeMo LLM Domain has been integrated with NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>_ to:

NeMo LLM Domain has been integrated with NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>_.

@jgerh (Collaborator) left a comment:

Copyedited content

xingyaoww pushed a commit to xingyaoww/NeMo that referenced this pull request Apr 23, 2024
* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* update

Signed-off-by: eharper <[email protected]>

* typo

Signed-off-by: eharper <[email protected]>

---------

Signed-off-by: eharper <[email protected]>
Co-authored-by: Pablo Garay <[email protected]>
@@ -10,7 +10,7 @@ ijson
jieba
markdown2
matplotlib>=3.3.2
megatron_core==0.5.0
megatron_core>0.6.0

The latest released version as of now is megatron_core 0.6.0, so this change makes it impossible to install NeMo from source.
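One quick way to check which megatron_core releases are actually published before pinning (a sketch; pip index is available in recent pip versions and is marked experimental):

pip index versions megatron-core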

alxzhang-amazon pushed a commit to alxzhang-amazon/NeMo that referenced this pull request Apr 26, 2024

galv pushed a commit to galv/NeMo that referenced this pull request Apr 29, 2024

suiyoubi pushed a commit that referenced this pull request May 2, 2024

rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024