.. _rampup_batch_size:

Ramp Up Batch Size
------------------

Ramp up batch size is a feature that allows training to start with a smaller global batch size and increase it linearly to a target global batch size over a given number of training samples, in specified increments.

Usage
-----

To enable global batch size ramp-up during training, set the ``rampup_batch_size`` parameter under the ``model`` section of the training configuration. This parameter should be a list of three values:

* ``start_batch_size``: The initial global batch size.
* ``batch_size_increment``: The amount by which the batch size increases at each ramp-up stage.
* ``rampup_samples``: The number of training samples over which the batch size is ramped up.

For example:

``model.global_batch_size=1024 model.rampup_batch_size=[256, 128, 50000000]``

In this example, training starts with a global batch size of 256, increases it by 128 at each ramp-up stage, and reaches the target global batch size of 1024 over 50,000,000 training samples.
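
To make the schedule concrete, below is a minimal Python sketch of how such a linear ramp-up can be laid out. It assumes each increment spans an equal share of ``rampup_samples`` (an assumption borrowed from Megatron-style batch size calculators, not taken from NeMo's code), and the helper ``rampup_schedule`` is hypothetical, for illustration only:

.. code-block:: python

    def rampup_schedule(start_batch_size, batch_size_increment,
                        rampup_samples, global_batch_size):
        """Yield (consumed_samples, global_batch_size) for each ramp-up stage.

        Assumption: each increment covers an equal share of rampup_samples.
        """
        num_increments = (global_batch_size - start_batch_size) // batch_size_increment
        for i in range(num_increments + 1):
            # Integer division keeps the stage boundaries exact.
            boundary = i * rampup_samples // num_increments
            yield boundary, start_batch_size + i * batch_size_increment

    # model.global_batch_size=1024 model.rampup_batch_size=[256, 128, 50000000]
    for consumed, gbs in rampup_schedule(256, 128, 50_000_000, 1024):
        print(f"from {consumed:>10,} samples -> global batch size {gbs}")

Under this assumption, the batch size in the example above grows by 128 roughly every 8.3 million samples until it reaches 1024 at the 50,000,000-sample mark.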

Ramp Up Stages and Training Interruption
----------------------------------------

When the next ramp-up stage is reached (the point in training at which the global batch size increases), NeMo stops the training. This allows you to rerun the training job with a larger number of GPUs or nodes for the next ramp-up stage.

Automatic Node Scheduling
-------------------------

In the `NeMo-Framework-Launcher <https://github.com/NVIDIA/NeMo-Framework-Launcher>`_, a node scheduler is created automatically when ramp-up batch size is used. This scheduler allows the use of a smaller number of nodes for the smaller batch size stages and scales up according to the ``training.trainer.num_nodes`` parameter, which corresponds to the maximum number of nodes to use at the maximum global batch size.
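
The actual scheduling logic lives in the launcher, but one simple heuristic that reproduces the node schedule in the example below is to scale the node count proportionally with the global batch size, with a floor below which the job does not shrink. This is an assumption inferred from the example table, not the launcher's implementation; ``nodes_for_stage`` and the 8-node floor are illustrative:

.. code-block:: python

    import math

    def nodes_for_stage(gbs, target_gbs, max_nodes, min_nodes):
        """Scale the node count with the global batch size, never dropping
        below min_nodes (illustrative heuristic, not the launcher's code)."""
        return max(min_nodes, math.ceil(max_nodes * gbs / target_gbs))

    # Reproduces the GPT-3 5B node schedule shown in the example below
    # (target global batch size 2048 on at most 16 nodes, 8-node floor):
    for gbs in range(256, 2048 + 1, 256):
        print(gbs, nodes_for_stage(gbs, target_gbs=2048, max_nodes=16, min_nodes=8))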

Example
-------

The following is a detailed example of the ramp-up batch size feature used with a GPT-3 5B model and the `NeMo-Framework-Launcher <https://github.com/NVIDIA/NeMo-Framework-Launcher>`_. In this example, training starts with a global batch size of 256, increases by 256 at each ramp-up stage, and reaches the target global batch size of 2048 over 10,000,000 training samples.

The node schedule looks as follows:
| 37 | + |
| 38 | ++--------------------+--------------------+ |
| 39 | +| global_batch_size | num_nodes | |
| 40 | ++====================+====================+ |
| 41 | +| 256 | 8 | |
| 42 | ++--------------------+--------------------+ |
| 43 | +| 512 | 8 | |
| 44 | ++--------------------+--------------------+ |
| 45 | +| 768 | 8 | |
| 46 | ++--------------------+--------------------+ |
| 47 | +| 1024 | 8 | |
| 48 | ++--------------------+--------------------+ |
| 49 | +| 1280 | 10 | |
| 50 | ++--------------------+--------------------+ |
| 51 | +| 1536 | 12 | |
| 52 | ++--------------------+--------------------+ |
| 53 | +| 1792 | 14 | |
| 54 | ++--------------------+--------------------+ |
| 55 | +| 2048 | 16 | |
| 56 | ++--------------------+--------------------+ |

A plot of the ``global_batch_size`` increase during training:

.. image:: https://github.com/NVIDIA/NeMo/releases/download/v2.0.0rc0/asset-post-rampup-batch-size-example.png
    :alt: Global batch size ramping up from 256 to 2048 during training
    :width: 1080px