Replies: 1 comment
Hi @Basso42, to my understanding you will not get any noticeable memory savings from using more GPUs if you are already using ZeRO stage 2 with offloading. If you need to reduce GPU memory further, you need to use ZeRO stage 3. I actually made a calculator for estimating the memory requirements of the various ZeRO stages and optimizations: https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage.

ZeRO stage 2 partitions the gradients and the optimizer state across the available GPUs. When you activate offloading of the optimizer state, the main GPU memory requirement is the parameters plus the partitioned gradients. Furthermore, the memory of the partitioned gradients should be negligible (when offloading) according to the ZeRO-Offload paper:
Hence, every GPU mainly has to load the model parameters, no matter the GPU count. However, using more GPUs will speed up your training, as each GPU can process different data samples in parallel (data parallelism).
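If you want to sanity-check the numbers for your own setup, DeepSpeed also ships memory estimators for the ZeRO stages. A minimal sketch, assuming the ProtT5-XL checkpoint Rostlab/prot_t5_xl_uniref50 (swap in whatever model you are actually fine-tuning):

```python
# Rough per-GPU memory estimates for model states (parameters, gradients,
# optimizer state) under ZeRO-2 and ZeRO-3. Activations and framework
# overhead are not included in these numbers.
from transformers import AutoModel
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Assumption: ProtT5-XL checkpoint; replace with the model you fine-tune.
model = AutoModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```

Comparing the stage 2 and stage 3 estimates for 4 GPUs should make it clear where the remaining per-GPU memory goes.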
Hello,
I am new to LLM fine-tuning. I am working on a LoRA adaptation of a ProtT5 model.
Initially, I successfully trained the model on a single GPU, and now I am attempting to leverage the power of four RTX A5000 GPUs (24 GB of VRAM each) on a single machine. My objective is to speed up training by increasing the batch size, as indicated in the training requirements of the model provided here. However, despite my efforts, I cannot increase the batch size beyond what I could already use on a single GPU.
According to what I've read (the Hugging Face documentation, for instance), DeepSpeed automatically detects the GPUs, and since I am using ZeRO stage 2 optimization (see config below), the memory used during training on each GPU should be lower than when using a single GPU. However, this is not the case.
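For what it's worth, a quick way to see what each GPU actually allocates is to log peak memory per rank after a few steps (a small illustrative sketch, not from my actual script):

```python
# Illustrative helper: print peak GPU memory for the current rank.
# Call it after a few training steps (e.g. from a Trainer callback).
import torch
import torch.distributed as dist

def log_peak_memory(tag: str = "") -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[rank {rank}] {tag} peak GPU memory: {peak_gb:.2f} GB")
```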
I am using the Hugging Face Trainer, through which I pass the following DeepSpeed config:
I launch my training script with the following command:
deepspeed --master_port "$master_port" train_LoRA.py "$current_directory" "$num_gpus"
where "$current_directory" "$num_gpus" are just some variables that I use in my training script (to load data and print the setup). The ds_config is already in the training script under a dictionary form.This is what I get during training (the speed of iteration is exactly the same as in one GPU).
Would you have any idea of what the problem could be?
I have been reading documentation for days but still do not understand what the issue is...
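For completeness, the dictionary config is wired into the Trainer following essentially the pattern below (a simplified sketch; the config values and object names are illustrative, not my exact settings):

```python
# Simplified sketch: passing a dict-style DeepSpeed config to the HF Trainer.
# Values are illustrative; model and train_dataset are placeholders for the
# objects defined elsewhere in the training script.
from transformers import Trainer, TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,  # effective batch = 8 * num_GPUs * grad_accum
    deepspeed=ds_config,            # a dict works here, as does a path to a JSON file
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```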
Here are some similar questions that have been posted elsewhere: