
Commit 2196b10

dimapihtar authored and adityavavre committed
add rampup bs documentation (NVIDIA#9884) (NVIDIA#10289)
* create documentation for rampup bs
* fix format
* fix format
* fix config format
* move config stage
* add example
* fix table
* fix table
* fix grammar
* fix grammar

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: adityavavre <[email protected]>
1 parent 9a69162 commit 2196b10

File tree

2 files changed: +64 −1 lines changed

docs/source/nlp/nemo_megatron/intro.rst

+2-1
@@ -20,6 +20,7 @@ To learn more about using NeMo to train Large Language Models at scale, please r
     peft/landing_page
     positional_embeddings
     mcore_customization
+    rampup_batch_size
 
 
 References
@@ -28,4 +29,4 @@ References
 .. bibliography:: ../nlp_all.bib
     :style: plain
     :labelprefix: nlp-megatron
-    :keyprefix: nlp-megatron-
+    :keyprefix: nlp-megatron-
docs/source/nlp/nemo_megatron/rampup_batch_size.rst

+62
@@ -0,0 +1,62 @@
.. _rampup_batch_size:

Ramp Up Batch Size
------------------

Ramp-up batch size is a feature that allows training to start with a smaller global batch size and increase it, in fixed increments, to a target global batch size over a given number of training samples.

Usage
-----

To enable global batch size ramp-up during training, set the ``rampup_batch_size`` parameter in the ``model`` section of the training configuration. This parameter should be a list of three values:

* ``start_batch_size``: The initial batch size.
* ``batch_size_increment``: The amount by which the batch size will increase at each step.
* ``rampup_samples``: The number of training samples over which the batch size will be ramped up.

``model.global_batch_size=1024 model.rampup_batch_size=[256, 128, 50000000]``

In this example, training starts with a global batch size of 256, increases in increments of 128, and reaches the target global batch size of 1024 over 50,000,000 training samples.
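
To make the arithmetic concrete, here is a minimal sketch of the schedule these three values imply. This is not NeMo's implementation; it assumes each increment spans an equal share of ``rampup_samples``:

.. code-block:: python

    def rampup_schedule(start_batch_size, batch_size_increment,
                        rampup_samples, global_batch_size):
        """Yield (consumed_samples, batch_size) pairs, one per ramp-up stage."""
        num_increments = (global_batch_size - start_batch_size) // batch_size_increment
        samples_per_increment = rampup_samples / num_increments
        for step in range(num_increments + 1):
            yield (int(step * samples_per_increment),
                   start_batch_size + step * batch_size_increment)

    for samples, batch_size in rampup_schedule(256, 128, 50_000_000, 1024):
        print(f"after {samples:>10,} samples -> global_batch_size {batch_size}")

Under this assumption, the sketch prints seven stages, from a global batch size of 256 at 0 consumed samples up to 1024 at 50,000,000 consumed samples.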

Ramp Up Stages and Training Interruption
----------------------------------------

Once the next ramp-up stage is reached (the point in training at which the global batch size increases), NeMo stops the training. This makes it possible to rerun the training job with a larger number of GPUs or nodes for the next ramp-up stage; the workflow is sketched below.
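
Purely as an illustration of that stop-and-resubmit workflow, here is a minimal sketch. Nothing in it is a real NeMo or launcher API; ``launch_training`` and the node counts are hypothetical placeholders:

.. code-block:: python

    def launch_training(num_nodes):
        # Hypothetical placeholder for submitting one training run; NeMo
        # exits on its own once the next ramp-up stage is reached.
        print(f"submit training job on {num_nodes} nodes")

    # Illustrative node counts, one per ramp-up stage.
    for num_nodes in (8, 10, 12, 14, 16):
        launch_training(num_nodes)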

Automatic Node Scheduling
-------------------------

In the `NeMo-Framework-Launcher <https://github.com/NVIDIA/NeMo-Framework-Launcher>`_, a node scheduler is created automatically when ramp-up batch size is used. This scheduler allows a smaller number of nodes to be used for the smaller batch size stages and scales up toward the ``training.trainer.num_nodes`` parameter, which corresponds to the maximum number of nodes you want to use for the maximum global batch size.
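
The launcher's actual scheduling logic is not reproduced here. As one hypothetical illustration, a proportional rule of the following shape yields schedules like the one in the example below; ``min_nodes`` and ``max_nodes`` are assumptions for this sketch, not launcher parameters:

.. code-block:: python

    import math

    def nodes_for_stage(stage_batch_size, target_batch_size, max_nodes, min_nodes):
        """Hypothetical rule: request nodes in proportion to the current
        global batch size, with a floor for the smallest stages."""
        return max(min_nodes,
                   math.ceil(max_nodes * stage_batch_size / target_batch_size))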

Example
-------

Below is a detailed example of using the ramp-up batch size feature with the GPT-3 5B model and the `NeMo-Framework-Launcher <https://github.com/NVIDIA/NeMo-Framework-Launcher>`_. In this example, training starts with a global batch size of 256, increases by 256 at each ramp-up stage, and reaches the target global batch size of 2048 over 10,000,000 training samples.

The node schedule looks as follows:

+--------------------+--------------------+
| global_batch_size  | num_nodes          |
+====================+====================+
| 256                | 8                  |
+--------------------+--------------------+
| 512                | 8                  |
+--------------------+--------------------+
| 768                | 8                  |
+--------------------+--------------------+
| 1024               | 8                  |
+--------------------+--------------------+
| 1280               | 10                 |
+--------------------+--------------------+
| 1536               | 12                 |
+--------------------+--------------------+
| 1792               | 14                 |
+--------------------+--------------------+
| 2048               | 16                 |
+--------------------+--------------------+
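
This table is consistent with the hypothetical proportional rule sketched in the previous section. The following self-contained snippet reproduces it, with ``min_nodes=8`` and ``max_nodes=16`` read off the table rather than taken from the launcher:

.. code-block:: python

    import math

    def nodes_for_stage(stage_batch_size, target_batch_size=2048,
                        max_nodes=16, min_nodes=8):
        # Same hypothetical rule as above; the constants come from this example.
        return max(min_nodes,
                   math.ceil(max_nodes * stage_batch_size / target_batch_size))

    for batch_size in range(256, 2049, 256):
        print(f"global_batch_size {batch_size:>4} -> {nodes_for_stage(batch_size):>2} nodes")
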
Plot of ``global_batch_size`` increase during training:

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-rampup-batch-size-example.png
    :alt: Plot of global_batch_size increasing in steps during training
    :width: 1080px
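
Assuming, as in the earlier sketch, that each increment spans an equal share of the ramp-up samples (NeMo's exact bookkeeping may differ), the stage boundaries behind this plot can be computed directly:

.. code-block:: python

    start, increment, rampup_samples, target = 256, 256, 10_000_000, 2048
    num_increments = (target - start) // increment  # 7 increments
    for step in range(num_increments + 1):
        consumed = int(step * rampup_samples / num_increments)
        print(f"after {consumed:>10,} samples -> "
              f"global_batch_size {start + step * increment}")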
