Update dev doc for features #10049
Conversation
docs/source/features/optimizations/activation_recomputation.rst
Overview
--------

NeMo supports Mixture of Experts (MoE) in the transformer layer for NLP models.
NeMo Framework supports Mixture of Experts (MoE) in the transformer layer for Natural Language Processing (NLP) models.
the most appropriate expert based on the current input.

To use MoE in the NeMo Framework, adjust the ``num_moe_experts`` parameter in the model configuration:
To use MoE in the NeMo Framework, you need to adjust the num_moe_experts parameter in the model configuration.
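For reference, a minimal sketch of what that adjustment might look like, assuming the MoE keys sit under the model section as in the YAML fragment quoted later in this review (values are illustrative, not the file content):

.. code-block:: yaml

    model:
      num_moe_experts: 8     # Replace each MLP with a mixture of 8 experts.
      moe_router_topk: 2     # Processes each token using 2 experts.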
focuses on a specific subtask or domain, while a gating network dynamically activates
the most appropriate expert based on the current input.
Add a heading for the procedure:
Use MoE
.. code-block:: yaml

    moe_router_topk: 2  # Processes each token using 2 experts.
Add a heading for the procedure.
Configure MoE-specific Loss Function
    moe_router_topk: 2  # Processes each token using 2 experts.

In addition, NeMo provides options to configure MoE-specific loss function.
To balance token distribution across experts:
Add a new paragraph between Lines 29 and 30 like this:
29 In addition, NeMo provides options to configure MoE-specific loss function.
30
31 To balance token distribution across experts:
2. ``moe_token_dropping`` enables selectively dropping and padding tokens for each expert to achieve
   a specified capacity.

3. ``moe_token_dropping`` specifies the token dispatcher type, options include 'allgather' and 'alltoall'.
Use a semicolon.
moe_token_dropping specifies the token dispatcher type; options include 'allgather' and 'alltoall'.
3. ``moe_token_dropping`` specifies the token dispatcher type, options include 'allgather' and 'alltoall'.

4. ``moe_per_layer_logging`` enables per-layer logging for MoE, currently support aux-loss and z-loss.
moe_per_layer_logging enables per-layer logging for MoE, which currently supports 'aux-loss' and 'z-loss'.
5. ``moe_expert_capacity_factor`` the capacity factor for each expert, None means no token will be dropped. The default is None.

6. ``moe_pad_expert_input_to_capacity`` if True, pads the input for each expert to match the expert capacity length, effective only after the moe_expert_capacity_factor is set. The default setting is False.
moe_pad_expert_input_to_capacity: if True, pads the input for each expert to match the expert capacity length. This is effective only after the moe_expert_capacity_factor is set. The default setting is False.
6. ``moe_pad_expert_input_to_capacity`` if True, pads the input for each expert to match the expert capacity length, effective only after the moe_expert_capacity_factor is set. The default setting is False.

7. ``moe_token_drop_policy`` the policy to drop tokens. Can be either "probs" or "position". If "probs", the tokens with the lowest probabilities will be dropped. If "position", tokens at the end of each batch will be dropped. Default value is "probs".
add "The"
moe_token_drop_policy
the policy to drop tokens. Can be either "probs" or "position". If "probs", the tokens with the lowest probabilities will be dropped. If "position", tokens at the end of each batch will be dropped. The default value is "probs".
7. ``moe_token_drop_policy`` the policy to drop tokens. Can be either "probs" or "position". If "probs", the tokens with the lowest probabilities will be dropped. If "position", tokens at the end of each batch will be dropped. Default value is "probs".

8. ``moe_layer_recompute`` if True, checkpointing moe_layer to save activation memory, default is False.
moe_layer_recompute: if True, checkpointing moe_layer to save activation memory. The default value is False.
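Pulling the options quoted above together, a combined model configuration could look roughly like the sketch below. This is an illustrative example assembled only from the parameter names and descriptions in this review, not the exact file content.

.. code-block:: yaml

    model:
      num_moe_experts: 8                       # Enable MoE in the transformer layer.
      moe_router_topk: 2                       # Processes each token using 2 experts.
      moe_token_dropping: false                # Selectively drop and pad tokens to a specified expert capacity.
      moe_per_layer_logging: true              # Per-layer logging for MoE ('aux-loss' and 'z-loss').
      moe_expert_capacity_factor: null         # Capacity factor per expert; null means no token is dropped.
      moe_pad_expert_input_to_capacity: false  # Effective only when a capacity factor is set.
      moe_token_drop_policy: probs             # 'probs' drops lowest-probability tokens; 'position' drops tokens at the end of each batch.
      moe_layer_recompute: false               # Checkpoint the MoE layer to save activation memory.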
.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-activation-recomputation-exampe-2.jpg
    :align: center
    :alt: activation-recomputation-example-2
The following is a technical edit of activation_recomputation.rst by line numbers. This file was read-only, so I had to add edits here in the comment field.
Line 3. Suggest adding an introductory sentence that explains what activation recomputation is (summarized from arxiv and HF). For example:
Activation recomputation is a technique used in training large neural network models, such as transformer models, to manage the memory constraints of the device. It addresses the challenge of storing large numbers of intermediate activations—outputs from each layer of the network—required during the back-propagation phase for gradient computation.
Line 11 - 13. suggested revision. Please verify change from "NeMo" to "NeMo Framework".
NeMo Framework enables transformer layer recomputation, a technique that checkpoints the input at each transformer layer and then recalculates the activations for subsequent layers. Transformer layer recomputation significantly reduces the activation memory usage. However, this approach also leads to a 30% increase in the computational demand for each transformer layer, attributable to the necessity of reprocessing the full forward computation for the layer.
Line 14-15 suggested revision.
NeMo Framework also supports partial transformer layer recomputation. This feature is beneficial when recomputing a few transformer layers to fit the training workload on GPU memory. By doing so, it eliminates the need to recompute the remaining layers.
Lines 17-18. fix capitalization, use active voice, suggested revision.
You can enable transformer layer recomputation by setting activations_checkpoint_granularity=full. Additionally, you can specify the number of transformer layers to recompute by setting activations_checkpoint_num_layers along with activations_checkpoint_method=block.
Line 19. fix capitalization, use active voice, suggested revision.
If you set activations_checkpoint_num_layers as the total number of layers, the inputs of all transformer layers are checkpointed and recomputed.
Line 21. suggested revision.
If virtual pipelining is used, activations_checkpoint_num_layers indicates the layers per virtual pipeline stage.
Line 23. fix capitalization, suggested revision.
NeMo Framework also supports checkpointing the input to a block of multiple consecutive transformer layers, effectively making this block the unit of granularity for recomputation.
Line 25. fix capitalization
Thus, it is only beneficial for memory savings when the model has many transformer layers or the intermediate layers of a transformer layer hold relatively small activation stores.
Line 26. fix capitalization, use active voice, suggested revision.
You can enable this recomputation mode by setting activations_checkpoint_method=uniform. Additionally, you can set the number of transformer layers per recomputation block using activations_checkpoint_num_layers.
Line 31 - 35. various corrections, suggested revision.
NeMo Framework supports the self-attention recomputation that checkpoints the inputs of each self-attention block and recomputes the intermediate input activations. This is a cost-efficient recomputation method that achieves higher memory savings with lower recomputation cost.
The intermediate layers within the self-attention block are responsible for the majority of activation memory usage.
This is because the input sizes of softmax, dropout, and qkv dot-product attention layers have a memory complexity that is proportional to the square of the sequence length.
However, the recomputation cost of these layers is relatively small compared to that of the other linear projection layers, which scale with the square of the hidden size.
Line 37. suggested revision.
Self-attention recomputation is hard-enabled when using Flash Attention, which the Transformer Engine supports. Additionally, you can opt for self-attention recomputation without Flash Attention by configuring activations_checkpoint_granularity=selective.
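To make the suggested wording concrete, the recomputation settings referenced in these edits could be collected in a model configuration roughly as follows. This is an illustrative sketch; the values and the nesting under model are assumptions to verify against the file under review.

.. code-block:: yaml

    model:
      # Full transformer layer recomputation of a block of layers.
      activations_checkpoint_granularity: full
      activations_checkpoint_method: block     # 'uniform' checkpoints evenly sized blocks of layers instead
      activations_checkpoint_num_layers: 4     # per virtual pipeline stage when virtual pipelining is used

      # Alternative: self-attention (selective) recomputation without Flash Attention.
      # activations_checkpoint_granularity: selective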
@@ -179,24 +81,3 @@ Implement MQA or GQA
NeMo's support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class:

Check implementation details from Attention Class in Megatron Core Repo: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49.
Technical edit of attention_optimizations.rst by line numbers.
Line 3. Need to add summary text for this topic. For example:
Attention optimizations seek to improve the efficiency and effectiveness of transformer models, particularly in language processing tasks. They address the computational demands of the attention mechanism by altering the way attention is calculated across tokens in a sequence. This section describes the following attention optimization techniques: flash attention, Multi-Query Attention (MQA), and Grouped-Query Attention (GQA) and explains how to use MQA and GQA with NeMo Framework.
Line 7-8. Delete "Overview" and the ^^^^^^^^ characters under it. This heading is not needed.
Line 18. suggested revision.
Flash attention significantly reduces the memory footprint and computational complexity of processing sequences in large language models. By transforming the complexity from quadratic to linear, it enables these models to handle much longer sequence lengths efficiently.
Line 25. fix capitalization, punctuation, suggested revision.
In the NeMo Framework, flash attention is supported through `Transformer Engine <https://github.com/NVIDIA/TransformerEngine/tree/main>`_, including both of the implementations mentioned above. Transformer Engine selects the appropriate implementation based on input information such as sequence length, number of heads, and head dimension. When both implementations are applicable, Transformer Engine prefers cuDNN flash attention on Hopper+ architectures and Tri Dao flash attention on Ampere architectures.
Line 31. fix capitalization.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Line 36-37. Delete "Overview" and the ^^^^^^^^ characters. A heading is not needed here.
Line 34, 39 - 43. Remove bold format, delete Line 39 and Line 42, combine Lines 40 and 43 into one paragraph. Revise text as follows.
Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) are modifications of the traditional multi-head attention mechanism in Transformer models. These methods improve the efficiency and effectiveness of attention mechanisms. MQA treats all attention heads as a single group, reducing computational complexity and accelerating training times. It is beneficial when model scalability or limited computational resources are concerns. GQA groups the heads into clusters, each processing a subset of queries independently. This method balances the detailed focus of traditional multi-head attention with the broad approach of MQA, enhancing nuanced input data processing.
Line 56. remove bold and punctuation.
- For MQA, set num_query_groups to 1 to treat all attention heads as a single group.
Line 62. remove bold and punctuation.
- For GQA, set num_query_groups to a number that is a divisor of the total number of attention heads (more than one but less than the total heads).
Line 70. make this a step in the procedure.
- For flash attention, set this parameter to None or match it with the number of heads.
Line 81, 83. revise.
NeMo Framework support for GQA and MQA is enabled through the integration of Megatron Core's Attention mechanism. The underlying implementation details can be explored within the Attention class of Megatron Core, which provides the functional backbone for these advanced attention methods. To understand the specific modifications and implementations of MQA and GQA, refer to the source code in the Attention class. Check the implementation details for the Attention Class in the Megatron Core repo: https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/transformer/attention.py#L49.
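As a quick illustration of the num_query_groups guidance in these edits, a hypothetical attention configuration might look like this. The num_attention_heads key and its value are assumed for context and are not part of the quoted text.

.. code-block:: yaml

    model:
      num_attention_heads: 32
      num_query_groups: 8       # GQA: a divisor of the head count, more than one but less than the total heads
      # num_query_groups: 1     # MQA: all attention heads form a single group
      # num_query_groups: null  # flash attention: null, or match the number of heads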
@@ -1,5 +1,5 @@
Communication Overlap |
Technical edit of communication_overlap.rst by line numbers.
Line 3. Need to add summary text for this topic. Something like:
Communication overlap strategies are crucial for training large-scale models efficiently, as they help to utilize computational resources better and reduce training time. This section describes strategies for overlapping the data-parallel (DP), tensor-parallel (TP), and pipeline-parallel (PP) communications with computation operations.
Lines 7-8 suggested revision.
NeMo enables the overlap of data-parallel (DP) communications with computations during LLM training. It also includes a Distributed Optimizer that distributes optimizer states and high-precision master parameters across GPUs.
Line 9. fix capitalization.
The DP communication is chunked by the granularity of a transformer layer and overlaps each communication chunk with computation.
Line 10. fix punctuation.
This overlap method exposes only one DP communication chunk, ensuring efficient large-scale LLM training.
Lines 13 - 15. suggested revision.
DP gradient reduce-scatter and parameter all-gather overlaps are enabled when setting overlap_grad_sync=true and overlap_param_sync=true, respectively. The precision of the gradient reduce-scatter is set by grad_sync_dtype. Reducing in bf16 improves performance at large-scale training compared to the default precision of fp32. When training in fp8 computing precision (with fp8=true), setting fp8_params=true conducts the parameter all-gather in fp8, thereby reducing the all-gather overhead by half.
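A sketch of how the data-parallel overlap flags named in this revision might sit together in a training configuration; the nesting under model/optim and the distributed optimizer name are assumptions, not quoted text.

.. code-block:: yaml

    model:
      optim:
        name: distributed_fused_adam   # assumed name for the Distributed Optimizer
        overlap_grad_sync: true        # overlap gradient reduce-scatter with computation
        overlap_param_sync: true       # overlap parameter all-gather with computation
        grad_sync_dtype: bf16          # reduce-scatter precision; bf16 helps at large scale vs. the default fp32
      fp8: true                        # fp8 computing precision
      fp8_params: true                 # conduct the parameter all-gather in fp8, halving its overhead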
Line 20. revise.
Tensor parallelism, used with the sequence-parallel activation sharding (sequence_parallel=true), introduces activation (gradient) all-gather and reduce-scatter as shown in the figure below.
Line 22. revise.
The TP communication without direct computation dependency is overlapped with the computation in bulk (the linear layer and TP communication pairs in the yellow boxes).
Line 25. revise.
The TP communication and computation are chunked and the chunks are overlapped in the pipeline.
Line 26. revise run-on sentence.
In the pipelined overlap, the activation (gradient) tensor all-gather is replaced with multiple steps of input P2P ring exchanges. Similarly, reduce-scatter is replaced with multiple steps of GEMM output P2P ring exchanges followed by a reduction of the received outputs.
Lines 29 - 32. Please check .. image:: ../nlp/nemo_megatron/images/tp_comm_overlap.png. It is not showing up in the HTML file.
Lines 34 -36. revise.
The pipelined TP communication overlap is implemented in Transformer Engine and enabled by setting ub_tp_comm_overlap=true. The specific overlap methods can be set in a config dictionary and passed as a YAML file. The individual bulk, pipelined all-gather, and reduce-scatter operations can be enabled or disabled by using tp_comm_bulk_wgrad, tp_comm_bulk_dgrad, tp_comm_overlap_ag, and tp_comm_overlap_rs, respectively.
Lines 41 - 43. Please check the original sentence: "This increasing PP communication overhead and it cancels off the reduced the pipeline bubbles with virtual pipelining." Here is my suggested revision.
Pipelining enables P2P activation (gradient) sends and receives between GPUs in a pipeline-parallel (PP) configuration. As the virtual pipeline parallel size increases, the frequency of PP communications also increases due to a reduction in the number of transformer layers processed per micro-batch. However, this increase in PP communication overhead counteracts the reduction in pipeline stalls that virtual pipelining is supposed to mitigate.
Line 52. revise.
The PP communication overlap is enabled by setting overlap_p2p_comm=true. In addition, setting batch_p2p_comm=false uses separate kernels for the send and the receive operations, which further improves the communication efficiency and GPU resource utilization.
Line 59. suggested revision.
Context parallelism divides the activations (gradients) across all layers within the sequence domain. This division introduces all-gather and reduce-scatter operations for activations (gradients) during the forward and backward propagations of self-attention mechanisms.
Line 63. revise.
By default, the CP communication overlap is enabled when context parallelism is used (context_parallel_size > 1).
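Similarly, the TP/PP/CP overlap settings discussed in these edits could be sketched as follows. The layout is illustrative only and the keys should be checked against the doc and the Transformer Engine config dictionary it mentions.

.. code-block:: yaml

    model:
      sequence_parallel: true      # sequence-parallel activation sharding used with tensor parallelism
      ub_tp_comm_overlap: true     # pipelined TP communication overlap (Transformer Engine)
      # Individual bulk / pipelined overlaps, set via the TP overlap config dictionary (YAML):
      # tp_comm_bulk_wgrad: true
      # tp_comm_bulk_dgrad: true
      # tp_comm_overlap_ag: true
      # tp_comm_overlap_rs: true
      overlap_p2p_comm: true       # overlap PP send/receive with computation
      batch_p2p_comm: false        # separate kernels for send and receive operations
      context_parallel_size: 2     # CP communication overlap is enabled by default when > 1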
@@ -0,0 +1,19 @@
CPU Offloading |
No comments on this file.
@@ -1,5 +1,5 @@
Throughput Optimizations
========================
Sequence Packing
Technical edit of sequence_packing.rst by line numbers.
Line 3. Need to add summary text for this topic. Something like:
Sequence packing in Large Language Models (LLMs) is a technique used to improve the efficiency of training by better managing variable-length sequences and optimizing computational resources. This section describes sequence packing for Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) methods and provides instructions for running it.
Lines 7 - 8. Delete "Overview" and ^^^^^^^^ characters. This heading is not needed.
Lines 10 - 14 fix punctuation.
When fine-tuning an LLM with either SFT or PEFT, GPU underutilization is a common problem due to an inefficient data pipeline. This problem occurs because most fine-tuning datasets have a skewed distribution of sequence lengths, with many short sequences and a few long sequences, following Zipf’s Law.
Transformer models can only take in fixed-length inputs, so the input has to be padded with many unused pad tokens. This method is inefficient in two ways:
Lines 24 - 25. fix punctuation.
While sequences for pretraining can be concatenated naively, this is not the case for SFT and instruction fine-tuning, where each input sequence should be treated individually.
Line 30. fix spelling.
kernels in Flash Attention and Transformer Engine. With this, attention values between sequences are never calculated,
Lines 34 - 38. revise.
All things considered, the NeMo Framework implementation of sequence packing provides:
- Up to 10X performance improvement in terms of FLOPs.
- Up to 6X performance improvement in terms of training time.
- No impact on model convergence.
Lines 42 - 43. Use imperative verb.
Run SFT/PEFT with Packed Sequence
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Line 44. Add lead-in text for this section. suggested addition.
This section provides procedures for preparing the dataset and adjusting the training configuration for SFT and PEFT methods.
Lines 57 - 59. suggested revision.
- The online processing code in GPTSFTDataset is run. This process includes operations such as prompt template manipulation, sequence length truncation, and tokenization. The result is an array of tokenized sequences represented by indices.
- The sequences are grouped by length and a packing algorithm is run.
Lines 62 - 65. fix punctuation and syntax.
Currently, two variants of first fit are supported:
- first_fit_decreasing sorts the sequences in decreasing order before applying the first-fit algorithm. It generates a more optimal packing, but it tends to keep all short sequences together, which may have an impact on convergence.
- first_fit_shuffle runs first-fit in a random order. Packing is less optimal, but it keeps the dataset order random. The recommendation is to run first_fit_shuffle and check the packed sequence lengths. If they are similar to the target length (i.e., efficient packing), then use first_fit_shuffle. Otherwise, try first_fit_decreasing.
Lines 80 - 93. suggested revision
.. note::
- If your model or dataset requires non-default configs for conventional SFT/PEFT training in NeMo, you will need to pass in the same configs to model.data.train_ds as you would for training with an unpacked dataset.
- model.data.train_ds.max_seq_length is the length to truncate each sequence before packing multiple sequences to the size of the packed sequence (pack_size). max_seq_length should be set to the same value as for unpacked data, and can be determined by examining the distribution of sequence lengths in the dataset.
- pack_sizes is a list of packed sequence lengths. In this example, there will be three output files, one for each pack size. The output files are named <output_folder>/packed_{pack_size}_seed{seed}.npy. This argument is a list because you will likely want to experiment with a few pack_sizes to find out which length can fill the GPU memory without exceeding it. Adjusting pack_size is analogous to adjusting the micro batch size in the unpacked case.
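For orientation only, the preparation inputs described in this note might be expressed as overrides along these lines; the key names other than max_seq_length and pack_sizes, and all paths, are hypothetical placeholders rather than quoted content.

.. code-block:: yaml

    model:
      data:
        train_ds:
          file_names: [/path/to/training.jsonl]   # assumed key for the unpacked SFT/PEFT dataset
          max_seq_length: 2048                    # same value used for unpacked training
    output_dir: /path/to/output_folder            # packed files land here as packed_{pack_size}_seed{seed}.npy
    pack_sizes: [2048, 4096, 8192]                # three pack sizes, so three output files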
Line 99. fix punctuation.
To train with packed sequences, you need to change four items in the SFT/PEFT config file.
Line 101. revise.
- Enable the packed_sequence flag:
Line 107. fix punctuation.
- Use the new dataset file instead of the original jsonl file:
Line 109. fix punctuation.
- Adjust the batch sizes:
Line 133. revise.
Now, you are all set to fine-tune your model with a much improved throughput!
Lines 138 - 139. revise. Also, verify that the link works in the rendered HTML file.
Sequence packing in NeVA, which is a part of Multimodal Language Models (MLLMs), differs from the SFT/PEFT approach. For details, please refer to :doc:`../multimodal/mllm/sequence_packing`.
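And for the four config changes called out around Lines 99-109, a rough sketch of a packed-sequence fine-tuning config; everything beyond the packed_sequence flag, the swap to the packed .npy file, and the batch-size adjustment is an assumption with placeholder values.

.. code-block:: yaml

    model:
      data:
        train_ds:
          packed_sequence: true                          # enable training on packed sequences
          file_names: [/path/to/packed_4096_seed0.npy]   # packed dataset file replaces the original jsonl
          max_seq_length: 4096                           # assumed to match the chosen pack_size
      micro_batch_size: 1      # each packed sample already holds several sequences
      global_batch_size: 8     # placeholder; scale down from the unpacked setting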
Completed technical edit of 5 files. I left some comments inline, but had to add other comments in bulk because the files were read-only. Please review the edits, make the changes you agree with, and resolve the threads.
* update structure Signed-off-by: yaoyu-33 <[email protected]> * update structure Signed-off-by: yaoyu-33 <[email protected]> * add image Signed-off-by: yaoyu-33 <[email protected]> * address comments Signed-off-by: yaoyu-33 <[email protected]> --------- Signed-off-by: yaoyu-33 <[email protected]>
What does this PR do?
Update dev doc for features section
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information