CPU utilization, CPU memory for ZeRO-Offload #2652
-
I am training a large model on a single host machine equipped with 4 GPUs. I noticed that, rather than there being a single master copy of the fp32 weights in CPU memory, each process has its own copy of the fp32 weights. Is this how DeepSpeed ZeRO stage 3 is implemented? Any thoughts?
Replies: 1 comment 1 reply
-
@taehyunzzz, could you please share some logs showing what you observed? The expectation for ZeRO-Offload with 4 ranks/processes is that each process keeps 1/4 of the fp32 optimizer state (including master weights) in CPU memory rather than a full copy. Thus, each of the 4 instances of CPUAdam should perform only 1/4 of the optimizer step computation. You can also refer to the paper for a more high-level discussion, if you have not already. Thanks!
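To make the 1/4 expectation concrete, here is a rough back-of-the-envelope sketch. It is not DeepSpeed code: the function name is hypothetical, and it assumes Adam with three fp32 tensors per parameter (master weight, momentum, variance) split evenly across ranks, ignoring any padding, alignment, or pinned staging buffers the real implementation may add.

```python
import math

BYTES_PER_FP32 = 4
# Assumption: Adam keeps three fp32 values per parameter on the CPU:
# master weight, momentum, and variance.
FP32_STATES_PER_PARAM = 3


def offloaded_bytes_per_rank(num_params: int, world_size: int) -> int:
    """Estimate the CPU bytes of fp32 optimizer state held by one rank.

    Hypothetical helper: assumes an even split across ranks, rounding the
    shard size up when num_params is not divisible by world_size.
    """
    shard_params = math.ceil(num_params / world_size)
    return shard_params * FP32_STATES_PER_PARAM * BYTES_PER_FP32


if __name__ == "__main__":
    num_params = 1_000_000_000  # illustrative 1B-parameter model
    for world_size in (1, 4):
        per_rank = offloaded_bytes_per_rank(num_params, world_size)
        print(f"world_size={world_size}: ~{per_rank / 1e9:.1f} GB per process")
```

If the CPU memory of each of the 4 processes looks closer to the `world_size=1` estimate than to the `world_size=4` one, that would suggest each process is holding a full copy, which is the kind of evidence worth including in the logs.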