
Commit 2196b10

dimapihtar authored and adityavavre committed
add rampup bs documentation (NVIDIA#9884) (NVIDIA#10289)
* create documentation for rampup bs
* fix format
* fix format
* fix config format
* move config stage
* add example
* fix table
* fix table
* fix grammar
* fix grammar

---------

Signed-off-by: dimapihtar <[email protected]>
Signed-off-by: adityavavre <[email protected]>
1 parent 9a69162 commit 2196b10

File tree

2 files changed: +64 −1 lines changed

docs/source/nlp/nemo_megatron/intro.rst

+2-1
@@ -20,6 +20,7 @@ To learn more about using NeMo to train Large Language Models at scale, please r
     peft/landing_page
     positional_embeddings
     mcore_customization
+    rampup_batch_size
 
 
 References
@@ -28,4 +29,4 @@ References
 .. bibliography:: ../nlp_all.bib
     :style: plain
     :labelprefix: nlp-megatron
-    :keyprefix: nlp-megatron-
+    :keyprefix: nlp-megatron-
docs/source/nlp/nemo_megatron/rampup_batch_size.rst

+62
@@ -0,0 +1,62 @@
.. _rampup_batch_size:

Ramp Up Batch Size
------------------

Ramp-up batch size is a feature that allows training to start with a smaller global batch size and increase it, in fixed increments, to a target global batch size over a given number of training samples.

Usage
-----

To enable global batch size ramp-up during training, set the ``rampup_batch_size`` parameter in the ``model`` section of the training configuration. This parameter should be a list of three values:

* ``start_batch_size``: The initial batch size.
* ``batch_size_increment``: The amount by which the batch size will increase at each step.
* ``rampup_samples``: The number of training samples over which the batch size will be ramped up.

``model.global_batch_size=1024 model.rampup_batch_size=[256, 128, 50000000]``

In this example, training starts with a global batch size of 256, increases in increments of 128, and reaches the target global batch size of 1024 over 50,000,000 training samples.
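
To make the arithmetic concrete, here is a minimal sketch of the schedule these three values imply. This is not NeMo's implementation; it assumes each increment spans an equal share of ``rampup_samples``:

.. code-block:: python

    def rampup_schedule(start_batch_size, batch_size_increment,
                        rampup_samples, global_batch_size):
        """Yield (consumed_samples, batch_size) pairs, one per ramp-up stage."""
        num_increments = (global_batch_size - start_batch_size) // batch_size_increment
        samples_per_increment = rampup_samples / num_increments
        for step in range(num_increments + 1):
            yield (int(step * samples_per_increment),
                   start_batch_size + step * batch_size_increment)

    for samples, batch_size in rampup_schedule(256, 128, 50_000_000, 1024):
        print(f"after {samples:>10,} samples -> global_batch_size {batch_size}")

Under this assumption, the sketch prints seven stages, from a global batch size of 256 at 0 consumed samples up to 1024 at 50,000,000 consumed samples.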

Ramp Up Stages and Training Interruption
----------------------------------------

Once the next ramp-up stage is reached (the point in training at which the global batch size increases), NeMo stops the training. This makes it possible to rerun the training job with a larger number of GPUs or nodes for the next ramp-up stage; the workflow is sketched below.
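
Purely as an illustration of that stop-and-resubmit workflow, here is a minimal sketch. Nothing in it is a real NeMo or launcher API; ``launch_training`` and the node counts are hypothetical placeholders:

.. code-block:: python

    def launch_training(num_nodes):
        # Hypothetical placeholder for submitting one training run; NeMo
        # exits on its own once the next ramp-up stage is reached.
        print(f"submit training job on {num_nodes} nodes")

    # Illustrative node counts, one per ramp-up stage.
    for num_nodes in (8, 10, 12, 14, 16):
        launch_training(num_nodes)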

Automatic Node Scheduling
-------------------------

In the `NeMo-Framework-Launcher <https://github.com/NVIDIA/NeMo-Framework-Launcher>`_, a node scheduler is created automatically when ramp-up batch size is used. This scheduler allows a smaller number of nodes to be used for the smaller batch size stages and scales up toward the ``training.trainer.num_nodes`` parameter, which corresponds to the maximum number of nodes you want to use for the maximum global batch size.
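
The launcher's actual scheduling logic is not reproduced here. As one hypothetical illustration, a proportional rule of the following shape yields schedules like the one in the example below; ``min_nodes`` and ``max_nodes`` are assumptions for this sketch, not launcher parameters:

.. code-block:: python

    import math

    def nodes_for_stage(stage_batch_size, target_batch_size, max_nodes, min_nodes):
        """Hypothetical rule: request nodes in proportion to the current
        global batch size, with a floor for the smallest stages."""
        return max(min_nodes,
                   math.ceil(max_nodes * stage_batch_size / target_batch_size))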

Example
-------

Below is a detailed example of using the ramp-up batch size feature with the GPT-3 5B model and the `NeMo-Framework-Launcher <https://github.com/NVIDIA/NeMo-Framework-Launcher>`_. In this example, training starts with a global batch size of 256, increases by 256 at each ramp-up stage, and reaches the target global batch size of 2048 over 10,000,000 training samples.

The node schedule looks as follows:

+--------------------+--------------------+
| global_batch_size  | num_nodes          |
+====================+====================+
| 256                | 8                  |
+--------------------+--------------------+
| 512                | 8                  |
+--------------------+--------------------+
| 768                | 8                  |
+--------------------+--------------------+
| 1024               | 8                  |
+--------------------+--------------------+
| 1280               | 10                 |
+--------------------+--------------------+
| 1536               | 12                 |
+--------------------+--------------------+
| 1792               | 14                 |
+--------------------+--------------------+
| 2048               | 16                 |
+--------------------+--------------------+
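
This table is consistent with the hypothetical proportional rule sketched in the previous section. The following self-contained snippet reproduces it, with ``min_nodes=8`` and ``max_nodes=16`` read off the table rather than taken from the launcher:

.. code-block:: python

    import math

    def nodes_for_stage(stage_batch_size, target_batch_size=2048,
                        max_nodes=16, min_nodes=8):
        # Same hypothetical rule as above; the constants come from this example.
        return max(min_nodes,
                   math.ceil(max_nodes * stage_batch_size / target_batch_size))

    for batch_size in range(256, 2049, 256):
        print(f"global_batch_size {batch_size:>4} -> {nodes_for_stage(batch_size):>2} nodes")
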
Plot of ``global_batch_size`` increase during training:

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-rampup-batch-size-example.png
    :alt: Plot of global_batch_size increasing in steps during training
    :width: 1080px
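
Assuming, as in the earlier sketch, that each increment spans an equal share of the ramp-up samples (NeMo's exact bookkeeping may differ), the stage boundaries behind this plot can be computed directly:

.. code-block:: python

    start, increment, rampup_samples, target = 256, 256, 10_000_000, 2048
    num_increments = (target - start) // increment  # 7 increments
    for step in range(num_increments + 1):
        consumed = int(step * rampup_samples / num_increments)
        print(f"after {consumed:>10,} samples -> "
              f"global_batch_size {start + step * increment}")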
