Update dev doc for features #10049
Conversation
docs/source/features/optimizations/activation_recomputation.rst
Overview
--------

NeMo supports Mixture of Experts (MoE) in the transformer layer for NLP models.
NeMo Framework supports Mixture of Experts (MoE) in the transformer layer for Natural Language Processing (NLP) models.
the most appropriate expert based on the current input.

To use MoE in the NeMo Framework, adjust the ``num_moe_experts`` parameter in the model configuration:
To use MoE in the NeMo Framework, you need to adjust the num_moe_experts parameter in the model configuration.
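For reference, a minimal sketch of what that adjustment might look like, assuming the MoE keys sit under the model section as in the YAML fragment quoted later in this review (values are illustrative, not the file content):

.. code-block:: yaml

    model:
      num_moe_experts: 8     # Replace each MLP with a mixture of 8 experts.
      moe_router_topk: 2     # Processes each token using 2 experts.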
focuses on a specific subtask or domain, while a gating network dynamically activates
the most appropriate expert based on the current input.
Add a heading for the procedure:
Use MoE
.. code-block:: yaml

    moe_router_topk: 2  # Processes each token using 2 experts.
Add a heading for the procedure.
Configure MoE-specific Loss Function
    moe_router_topk: 2  # Processes each token using 2 experts.

In addition, NeMo provides options to configure MoE-specific loss function.
To balance token distribution across experts:
Add a new paragraph between Lines 29 and 30 like this:
29 In addition, NeMo provides options to configure MoE-specific loss function.
30
31 To balance token distribution across experts:
2. ``moe_token_dropping`` enables selectively dropping and padding tokens for each expert to achieve
   a specified capacity.

3. ``moe_token_dropping`` specifies the token dispatcher type, options include 'allgather' and 'alltoall'.
Use a semicolon.
moe_token_dropping specifies the token dispatcher type; options include 'allgather' and 'alltoall'.
3. ``moe_token_dropping`` specifies the token dispatcher type, options include 'allgather' and 'alltoall'.

4. ``moe_per_layer_logging`` enables per-layer logging for MoE, currently support aux-loss and z-loss.
moe_per_layer_logging enables per-layer logging for MoE, which currently supports 'aux-loss' and 'z-loss'.
5. ``moe_expert_capacity_factor`` the capacity factor for each expert, None means no token will be dropped. The default is None.

6. ``moe_pad_expert_input_to_capacity`` if True, pads the input for each expert to match the expert capacity length, effective only after the moe_expert_capacity_factor is set. The default setting is False.
moe_pad_expert_input_to_capacity: if True, pads the input for each expert to match the expert capacity length. This is effective only after the moe_expert_capacity_factor is set. The default setting is False.
6. ``moe_pad_expert_input_to_capacity`` if True, pads the input for each expert to match the expert capacity length, effective only after the moe_expert_capacity_factor is set. The default setting is False.

7. ``moe_token_drop_policy`` the policy to drop tokens. Can be either "probs" or "position". If "probs", the tokens with the lowest probabilities will be dropped. If "position", tokens at the end of each batch will be dropped. Default value is "probs".
add "The"
moe_token_drop_policy
the policy to drop tokens. Can be either "probs" or "position". If "probs", the tokens with the lowest probabilities will be dropped. If "position", tokens at the end of each batch will be dropped. The default value is "probs".
7. ``moe_token_drop_policy`` the policy to drop tokens. Can be either "probs" or "position". If "probs", the tokens with the lowest probabilities will be dropped. If "position", tokens at the end of each batch will be dropped. Default value is "probs".

8. ``moe_layer_recompute`` if True, checkpointing moe_layer to save activation memory, default is False.
moe_layer_recompute: if True, checkpointing moe_layer to save activation memory. The default value is False.
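Pulling the options quoted above together, a combined model configuration could look roughly like the sketch below. This is an illustrative example assembled only from the parameter names and descriptions in this review, not the exact file content.

.. code-block:: yaml

    model:
      num_moe_experts: 8                       # Enable MoE in the transformer layer.
      moe_router_topk: 2                       # Processes each token using 2 experts.
      moe_token_dropping: false                # Selectively drop and pad tokens to a specified expert capacity.
      moe_per_layer_logging: true              # Per-layer logging for MoE ('aux-loss' and 'z-loss').
      moe_expert_capacity_factor: null         # Capacity factor per expert; null means no token is dropped.
      moe_pad_expert_input_to_capacity: false  # Effective only when a capacity factor is set.
      moe_token_drop_policy: probs             # 'probs' drops lowest-probability tokens; 'position' drops tokens at the end of each batch.
      moe_layer_recompute: false               # Checkpoint the MoE layer to save activation memory.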
.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-activation-recomputation-exampe-2.jpg
    :align: center
    :alt: activation-recomputation-example-2
The following is a technical edit of activation_recomputation.rst by line numbers. This file was read-only, so I had to add edits here in the comment field.
Line 3. Suggest adding an introductory sentence that explains what activation recomputation is (summarized from arxiv and HF). For example:
Activation recomputation is a technique used in training large neural network models, such as transformer models, to manage the memory constraints of the device. It addresses the challenge of storing large numbers of intermediate activations—outputs from each layer of the network—required during the back-propagation phase for gradient computation.
Line 11 - 13. suggested revision. Please verify change from "NeMo" to "NeMo Framework".
NeMo Framework enables transformer layer recomputation, a technique that checkpoints the input at each transformer layer and then recalculates the activations for subsequent layers. Transformer layer recomputation significantly reduces the activation memory usage. However, this approach also leads to a 30% increase in the computational demand for each transformer layer, attributable to the necessity of reprocessing the full forward computation for the layer.
Line 14-15 suggested revision.
NeMo Framework also supports partial transformer layer recomputation. This feature is beneficial when recomputing a few transformer layers to fit the training workload on GPU memory. By doing so, it eliminates the need to recompute the remaining layers.
Lines 17-18. fix capitalization, use active voice, suggested revision.
You can enable transformer layer recomputation by setting activations_checkpoint_granularity=full. Additionally, you can specify the number of transformer layers to recompute by setting activations_checkpoint_num_layers along with activations_checkpoint_method=block.
Line 19. fix capitalization, use active voice, suggested revision.
If you set activations_checkpoint_num_layers as the total number of layers, the inputs of all transformer layers are checkpointed and recomputed.
Line 21. suggested revision.
If virtual pipelining is used, activations_checkpoint_num_layers indicates the layers per virtual pipeline stage.
Line 23. fix capitalization, suggested revision.
NeMo Framework also supports checkpointing the input to a block of multiple consecutive transformer layers, effectively making this block the unit of granularity for recomputation.
Line 25. fix capitalization
Thus, it is only beneficial for memory savings when the model has many transformer layers or the intermediate layers of a transformer layer hold relatively small activation stores.
Line 26. fix capitalization, use active voice, suggested revision.
You can enable this recomputation mode by setting activations_checkpoint_method=uniform. Additionally, you can set the number of transformer layers per recomputation block using activations_checkpoint_num_layers.
Line 31 - 35. various corrections, suggested revision.
NeMo Framework supports the self-attention recomputation that checkpoints the inputs of each self-attention block and recomputes the intermediate input activations. This is a cost-efficient recomputation method that achieves higher memory savings with lower recomputation cost.
The intermediate layers within the self-attention block are responsible for the majority of activation memory usage.
This is because the input sizes of softmax, dropout, and qkv dot-product attention layers have a memory complexity that is proportional to the square of the sequence length.
However, the recomputation cost of these layers is relatively small compared to that of the other linear projection layers, which scale with the square of the hidden size.
Line 37. suggested revision.
Self-attention recomputation is hard-enabled when using Flash Attention, which the Transformer Engine supports. Additionally, you can opt for self-attention recomputation without Flash Attention by configuring activations_checkpoint_granularity=selective.
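To make the suggested wording concrete, the recomputation settings referenced in these edits could be collected in a model configuration roughly as follows. This is an illustrative sketch; the values and the nesting under model are assumptions to verify against the file under review.

.. code-block:: yaml

    model:
      # Full transformer layer recomputation of a block of layers.
      activations_checkpoint_granularity: full
      activations_checkpoint_method: block     # 'uniform' checkpoints evenly sized blocks of layers instead
      activations_checkpoint_num_layers: 4     # per virtual pipeline stage when virtual pipelining is used

      # Alternative: self-attention (selective) recomputation without Flash Attention.
      # activations_checkpoint_granularity: selective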
@@ -179,24 +81,3 @@ Implement MQA or GQA
NeMo's support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class:

Check implementation details from Attention Class in Megatron Core Repo: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49.
Technical edit of attention_optimizations.rst by line numbers.
Line 3. Need to add summary text for this topic. For example:
Attention optimizations seek to improve the efficiency and effectiveness of transformer models, particularly in language processing tasks. They address the computational demands of the attention mechanism by altering the way attention is calculated across tokens in a sequence. This section describes the following attention optimization techniques: flash attention, Multi-Query Attention (MQA), and Grouped-Query Attention (GQA) and explains how to use MQA and GQA with NeMo Framework.
Line 7-8. Delete "Overview" and the ^^^^^^^^ characters under it. This heading is not needed.
Line 18. suggested revision.
Flash attention significantly reduces the memory footprint and computational complexity of processing sequences in large language models. By transforming the complexity from quadratic to linear, it enables these models to handle much longer sequence lengths efficiently.
Line 25. fix capitalization, punctuation, suggested revision.
In the NeMo Framework, flash attention is supported through `Transformer Engine <https://github.com/NVIDIA/TransformerEngine/tree/main>`_, including both of the implementations mentioned above. Transformer Engine selects the appropriate implementation based on input information such as sequence length, number of heads, and head dimension. When both implementations are applicable, Transformer Engine prefers cuDNN flash attention on Hopper+ architectures and Tri Dao flash attention on Ampere architectures.
Line 31. fix capitalization.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Line 36-37. Delete "Overview" and the ^^^^^^^^ characters. A heading is not needed here.
Line 34, 39 - 43. Remove bold format, delete Line 39 and Line 42, combine Lines 40 and 43 into one paragraph. Revise text as follows.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) are modifications of the traditional multi-head attention mechanism in Transformer models. These methods improve the efficiency and effectiveness of attention mechanisms. MQA treats all attention heads as a single group, reducing computational complexity and accelerating training times. It is beneficial when model scalability or limited computational resources are concerns. GQA groups the heads into clusters, each processing a subset of queries independently. This method balances the detailed focus of traditional multi-head attention with the broad approach of MQA, enhancing nuanced input data processing.
Line 56. remove bold and punctuation.
- For MQA, set num_query_groups to 1 to treat all attention heads as a single group.
Line 62. remove bold and punctuation.
- For GQA, set num_query_groups to a number that is a divisor of the total number of attention heads (more than one but less than the total heads).
Line 70. make this a step in the procedure.
- For flash attention, set this parameter to None or match it with the number of heads.
Line 81, 83. revise.
NeMo Framework support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class. Check the implementation details for the Attention Class in the Megatron Core repo: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49.
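As a quick illustration of the num_query_groups guidance in these edits, a hypothetical attention configuration might look like this. The num_attention_heads key and its value are assumed for context and are not part of the quoted text.

.. code-block:: yaml

    model:
      num_attention_heads: 32
      num_query_groups: 8       # GQA: a divisor of the head count, more than one but less than the total heads
      # num_query_groups: 1     # MQA: all attention heads form a single group
      # num_query_groups: null  # flash attention: null, or match the number of heads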
@@ -1,5 +1,5 @@
Communication Overlap |
Technical edit of communication_overlap.rst by line numbers.
Line 3. Need to add summary text for this topic. Something like:
Communication overlap strategies are crucial for training large-scale models efficiently, as they help to utilize computational resources better and reduce training time. This section describes strategies for overlapping the data-parallel (DP), tensor-parallel (TP), and pipeline-parallel (PP) communications with computation operations.
Lines 7-8 suggested revision.
NeMo enables the overlap of data-parallel (DP) communications with computations during LLM training. It also includes a Distributed Optimizer that distributes optimizer states and high-precision master parameters across GPUs.
Line 9. fix capitalization.
The DP communication is chunked by the granularity of a transformer layer and overlaps each communication chunk with computation.
Line 10. fix punctuation.
This overlap method exposes only one DP communication chunk, ensuring efficient large-scale LLM training.
Lines 13 - 15. suggested revision.
DP gradient reduce-scatter and parameter all-gather overlaps are enabled when setting overlap_grad_sync=true and overlap_param_sync=true, respectively. The precision of the gradient reduce-scatter is set by grad_sync_dtype. Reducing in bf16 improves performance at large-scale training compared to the default precision of fp32. When training in fp8 computing precision (with fp8=true), setting fp8_params=true conducts the parameter all-gather in fp8, thereby reducing the all-gather overhead by half.
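A sketch of how the data-parallel overlap flags named in this revision might sit together in a training configuration; the nesting under model/optim and the distributed optimizer name are assumptions, not quoted text.

.. code-block:: yaml

    model:
      optim:
        name: distributed_fused_adam   # assumed name for the Distributed Optimizer
        overlap_grad_sync: true        # overlap gradient reduce-scatter with computation
        overlap_param_sync: true       # overlap parameter all-gather with computation
        grad_sync_dtype: bf16          # reduce-scatter precision; bf16 helps at large scale vs. the default fp32
      fp8: true                        # fp8 computing precision
      fp8_params: true                 # conduct the parameter all-gather in fp8, halving its overhead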
Line 20. revise.
Tensor parallelism, used with the sequence-parallel activation sharding (sequence_parallel=true), introduces activation (gradient) all-gather and reduce-scatter as shown in the figure below.
Line 22. revise.
The TP communication without direct computation dependency is overlapped with the computation in bulk (the linear layer and TP communication pairs in the yellow boxes).
Line 25. revise.
The TP communication and computation are chunked and the chunks are overlapped in the pipeline.
Line 26. revise run-on sentence.
In the pipelined overlap, the activation (gradient) tensor all-gather is replaced with multiple steps of input P2P ring exchanges. Similarly, reduce-scatter is replaced with multiple steps of GEMM output P2P ring exchanges followed by a reduction of the received outputs.
Lines 29 - 32. Please check .. image:: ../nlp/nemo_megatron/images/tp_comm_overlap.png. It is not showing up in the HTML file.
Lines 34 -36. revise.
The pipelined TP communication overlap is implemented in Transformer Engine and enabled by setting ub_tp_comm_overlap=true. The specific overlap methods can be set in a config dictionary and passed as a YAML file. The individual bulk, pipelined all-gather, and reduce-scatter operations can be enabled or disabled by using tp_comm_bulk_wgrad, tp_comm_bulk_dgrad, tp_comm_overlap_ag, and tp_comm_overlap_rs, respectively.
Lines 41 - 43. Please check the original sentence: "This increasing PP communication overhead and it cancels off the reduced the pipeline bubbles with virtual pipelining." Here is my suggested revision.
Pipelining enables P2P activation (gradient) sends and receives between GPUs in a pipeline-parallel (PP) configuration. As the virtual pipeline parallel size increases, the frequency of PP communications also increases due to a reduction in the number of transformer layers processed per micro-batch. However, this increase in PP communication overhead counteracts the reduction in pipeline stalls that virtual pipelining is supposed to mitigate.
Line 52. revise.
The PP communication overlap is enabled by setting overlap_p2p_comm=true. In addition, setting batch_p2p_comm=false uses separate kernels for the send and the receive operations, which further improves the communication efficiency and GPU resource utilization.
Line 59. suggested revision.
Context parallelism divides the activations (gradients) across all layers within the sequence domain. This division introduces all-gather and reduce-scatter operations for activations (gradients) during the forward and backward propagations of self-attention mechanisms.
Line 63. revise.
By default, the CP communication overlap is enabled when context parallelism is used (context_parallel_size > 1).
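Similarly, the TP/PP/CP overlap settings discussed in these edits could be sketched as follows. The layout is illustrative only and the keys should be checked against the doc and the Transformer Engine config dictionary it mentions.

.. code-block:: yaml

    model:
      sequence_parallel: true      # sequence-parallel activation sharding used with tensor parallelism
      ub_tp_comm_overlap: true     # pipelined TP communication overlap (Transformer Engine)
      # Individual bulk / pipelined overlaps, set via the TP overlap config dictionary (YAML):
      # tp_comm_bulk_wgrad: true
      # tp_comm_bulk_dgrad: true
      # tp_comm_overlap_ag: true
      # tp_comm_overlap_rs: true
      overlap_p2p_comm: true       # overlap PP send/receive with computation
      batch_p2p_comm: false        # separate kernels for send and receive operations
      context_parallel_size: 2     # CP communication overlap is enabled by default when > 1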
@@ -0,0 +1,19 @@
CPU Offloading |
No comments on this file.
@@ -1,5 +1,5 @@
Throughput Optimizations
========================
Sequence Packing
Technical edit of sequence_packing.rst by line numbers.
Line 3. Need to add summary text for this topic. Something like:
Sequence packing in Large Language Models (LLMs) is a technique used to improve the efficiency of training by better managing variable-length sequences and optimizing computational resources. This section describes sequence packing for Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) methods and provides instructions for running it.
Lines 7 - 8. Delete "Overview" and ^^^^^^^^ characters. This heading is not needed.
Lines 10 - 14 fix punctuation.
When fine-tuning an LLM with either SFT or PEFT, GPU underutilization is a common problem due to an inefficient data pipeline. This problem occurs because most fine-tuning datasets have a skewed distribution of sequence lengths, with many short sequences and a few long sequences, following Zipf’s Law.
Transformer models can only take in fixed-length inputs, so the input has to be padded with many unused pad tokens. This method is inefficient in two ways:
Lines 24 - 25. fix punctuation.
While sequences for pretraining can be concatenated naively, this is not the case for SFT and instruction fine-tuning, where each input sequence should be treated individually.
Line 30. fix spelling.
kernels in Flash Attention and Transformer Engine. With this, attention values between sequences are never calculated,
Lines 34 - 38. revise.
All things considered, the NeMo Framework implementation of sequence packing provides:
- Up to 10X performance improvement in terms of FLOPs.
- Up to 6X performance improvement in terms of training time.
- No impact on model convergence.
Lines 42 - 43. Use imperative verb.
Run SFT/PEFT with Packed Sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Line 44. Add lead-in text for this section. suggested addition.
This section provides procedures for preparing the dataset and adjusting the training configuration for SFT and PEFT methods.
Lines 57 - 59. suggested revision.
- The online processing code in GPTSFTDataset is run. This process includes operations such as prompt template manipulation, sequence length truncation, and tokenization. The result is an array of tokenized sequences represented by indices.
- The sequences are grouped by length and a packing algorithm is run.
Lines 62 - 65. fix punctuation and syntax.
Currently, two variants of first fit are supported:
- first_fit_decreasing sorts the sequences in decreasing order before applying the first-fit algorithm. It generates a more optimal packing, but it tends to keep all short sequences together, which may have an impact on convergence.
- first_fit_shuffle runs first-fit in a random order. Packing is less optimal, but it keeps the dataset order random. The recommendation is to run first_fit_shuffle and check the packed sequence lengths. If they are similar to the target length (i.e., efficient packing), then use first_fit_shuffle. Otherwise, try first_fit_decreasing.
Lines 80 - 93. suggested revision
.. note::
- If your model or dataset requires non-default configs for conventional SFT/PEFT training in NeMo, you will need to pass in the same configs to model.data.train_ds as you would for training with an unpacked dataset.
- model.data.train_ds.max_seq_length is the length to truncate each sequence before packing multiple sequences to the size of the packed sequence (pack_size). max_seq_length should be set to the same value as for unpacked data, and can be determined by examining the distribution of sequence lengths in the dataset.
- pack_sizes is a list of packed sequence lengths. In this example, there will be three output files, one for each pack size. The output files are named <output_folder>/packed_{pack_size}_seed{seed}.npy. This argument is a list because you will likely want to experiment with a few pack_sizes to find out which length can fill the GPU memory without exceeding it. Adjusting pack_size is analogous to adjusting the micro batch size in the unpacked case.
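For orientation only, the preparation inputs described in this note might be expressed as overrides along these lines; the key names other than max_seq_length and pack_sizes, and all paths, are hypothetical placeholders rather than quoted content.

.. code-block:: yaml

    model:
      data:
        train_ds:
          file_names: [/path/to/training.jsonl]   # assumed key for the unpacked SFT/PEFT dataset
          max_seq_length: 2048                    # same value used for unpacked training
    output_dir: /path/to/output_folder            # packed files land here as packed_{pack_size}_seed{seed}.npy
    pack_sizes: [2048, 4096, 8192]                # three pack sizes, so three output files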
Line 99. fix punctuation.
To train with packed sequences, you need to change four items in the SFT/PEFT config file.
Line 101. revise.
- Enable the packed_sequence flag:
Line 107. fix punctuation.
- Use the new dataset file instead of the original jsonl file:
Line 109. fix punctuation.
- Adjust the batch sizes:
Line 133. revise.
Now, you are all set to fine-tune your model with a much improved throughput!
Lines 138 - 139. revise. Also, verify that the link works in the rendered HTML file.
Sequence packing in NeVA, which is a part of Multimodal Language Models (MLLMs), differs from the SFT/PEFT approach. For details, please refer to :doc:`../multimodal/mllm/sequence_packing`.
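And for the four config changes called out around Lines 99-109, a rough sketch of a packed-sequence fine-tuning config; everything beyond the packed_sequence flag, the swap to the packed .npy file, and the batch-size adjustment is an assumption with placeholder values.

.. code-block:: yaml

    model:
      data:
        train_ds:
          packed_sequence: true                          # enable training on packed sequences
          file_names: [/path/to/packed_4096_seed0.npy]   # packed dataset file replaces the original jsonl
          max_seq_length: 4096                           # assumed to match the chosen pack_size
      micro_batch_size: 1      # each packed sample already holds several sequences
      global_batch_size: 8     # placeholder; scale down from the unpacked setting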
Completed technical edit of 5 files. I left some comments inline, but had to add other comments in bulk because the files were read-only. Please review the edits, make the changes you agree with, and resolve the threads.
* update structure Signed-off-by: yaoyu-33 <[email protected]> * update structure Signed-off-by: yaoyu-33 <[email protected]> * add image Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]>
What does this PR do?
Update dev doc for features section
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information