Replies: 1 comment
Hi @Basso42, to my understanding you will not get any noticeable memory savings from using more GPUs if you are already using ZeRO stage 2 with offloading. If you need to reduce GPU memory further, you need to use ZeRO stage 3. I actually made a calculator for estimating the memory requirements of the various ZeRO stages and optimizations: https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage.

ZeRO stage 2 partitions the gradients and the optimizer state across the available GPUs. When you activate offloading of the optimizer state, the main GPU memory requirement is the parameters plus the partitioned gradients. Furthermore, the memory of the partitioned gradients should be negligible (when offloading) according to the ZeRO-Offload paper:
Hence, every GPU mainly has to load the model parameters, no matter the GPU count. However, using more GPUs will speed up your training, as each GPU can process different data samples in parallel (data parallelism).
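If you want to sanity-check the numbers for your own setup, DeepSpeed also ships memory estimators for the ZeRO stages. A minimal sketch, assuming the ProtT5-XL checkpoint Rostlab/prot_t5_xl_uniref50 (swap in whatever model you are actually fine-tuning):

```python
# Rough per-GPU memory estimates for model states (parameters, gradients,
# optimizer state) under ZeRO-2 and ZeRO-3. Activations and framework
# overhead are not included in these numbers.
from transformers import AutoModel
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Assumption: ProtT5-XL checkpoint; replace with the model you fine-tune.
model = AutoModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)
```

Comparing the stage 2 and stage 3 estimates for 4 GPUs should make it clear where the remaining per-GPU memory goes.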
Hello,
I am new to LLM fine-tuning. I am working on a LoRA adaptation of a ProtT5 model.
Initially, I successfully trained the model on a single GPU, and now I am attempting to leverage the power of four RTX A5000 GPUs (24 GB of VRAM each) on a single machine. My objective is to speed up training by increasing the batch size, as indicated in the training requirements of the model provided here. However, despite my efforts, I cannot increase the batch size beyond what I could already use on a single GPU.
According to what I've read (the Hugging Face documentation, for instance), DeepSpeed automatically detects the GPUs, and since I am using ZeRO stage 2 optimization (see config below), the memory used during training on each GPU should be lower than when using a single GPU. However, this is not the case.
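For what it's worth, a quick way to see what each GPU actually allocates is to log peak memory per rank after a few steps (a small illustrative sketch, not from my actual script):

```python
# Illustrative helper: print peak GPU memory for the current rank.
# Call it after a few training steps (e.g. from a Trainer callback).
import torch
import torch.distributed as dist

def log_peak_memory(tag: str = "") -> None:
    rank = dist.get_rank() if dist.is_initialized() else 0
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[rank {rank}] {tag} peak GPU memory: {peak_gb:.2f} GB")
```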
I am using the Hugging Face Trainer, through which I pass the following DeepSpeed config:
I launch my training script with the following command:
deepspeed --master_port "$master_port" train_LoRA.py "$current_directory" "$num_gpus"
where "$current_directory" "$num_gpus" are just some variables that I use in my training script (to load data and print the setup). The ds_config is already in the training script under a dictionary form.This is what I get during training (the speed of iteration is exactly the same as in one GPU).
Would you have any idea of what the problem could be?
I have been reading documentation for days but still do not understand what the issue is...
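For completeness, the dictionary config is wired into the Trainer following essentially the pattern below (a simplified sketch; the config values and object names are illustrative, not my exact settings):

```python
# Simplified sketch: passing a dict-style DeepSpeed config to the HF Trainer.
# Values are illustrative; model and train_dataset are placeholders for the
# objects defined elsewhere in the training script.
from transformers import Trainer, TrainingArguments

ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,  # effective batch = 8 * num_GPUs * grad_accum
    deepspeed=ds_config,            # a dict works here, as does a path to a JSON file
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```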
Here are some similar questions that have been posted elsewhere: