CPU utilization, CPU memory for ZeRO-Offload #2652

taehyunzzz · 2022-12-27T05:26:52Z

I am training a large model on a single host machine equipped with 4 GPUs.
When I run training, one process is created per GPU for data-parallel training.

I noticed that, rather than keeping a single master copy of the fp32 weights in CPU memory, each process appears to hold its own copy of the fp32 weights.
This not only requires 4x the CPU memory capacity, but also makes CPUAdam take longer on the CPU, since all 4 processes perform exactly the same CPUAdam update on the shared CPU cores.

I was wondering whether my observations match how DeepSpeed ZeRO stage 3 is actually implemented.
Why not let a single master CPU process drive all of the GPUs?
Running the same CPUAdam update in parallel is redundant and a waste of CPU resources.

Any thoughts?

Answered by tjruwase

tjruwase · 2023-01-06T12:44:43Z

Maintainer

@taehyunzzz, could you please share some logs of your observation? The expectation for ZeRO-Offload with 4 ranks/processes is that each process maintains 1/4 of the fp32 optimizer state (including master weights) in CPU memory rather than a full copy. Thus, each of the 4 instances of CPUAdam should perform only 1/4 of the optimizer step computation.

You can also refer to the paper for more high-level discussion, if you have not already. Thanks!
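
To make the 1/4-per-rank expectation concrete, here is a minimal arithmetic sketch. It does not call DeepSpeed; the helper names, the even split, and the assumption of three fp32 tensors per parameter (master weight plus the two Adam moments) are illustrative only, and real partitions may include padding or alignment that this ignores.

```python
# Illustrative arithmetic only; nothing here calls DeepSpeed.
BYTES_PER_FP32 = 4
# Assumption: Adam keeps three fp32 tensors per parameter
# (master weight, first moment, second moment).
ADAM_FP32_TENSORS = 3


def partition_sizes(num_params: int, world_size: int) -> list[int]:
    """Split num_params elements as evenly as possible across ranks."""
    base, remainder = divmod(num_params, world_size)
    # The first `remainder` ranks take one extra element.
    return [base + (1 if rank < remainder else 0) for rank in range(world_size)]


def cpu_optimizer_bytes_per_rank(num_params: int, world_size: int) -> list[int]:
    """fp32 optimizer-state bytes each rank would hold in CPU memory."""
    return [
        n * ADAM_FP32_TENSORS * BYTES_PER_FP32
        for n in partition_sizes(num_params, world_size)
    ]


if __name__ == "__main__":
    num_params = 1_000_000_000  # hypothetical 1B-parameter model
    for world_size in (1, 4):
        per_rank = cpu_optimizer_bytes_per_rank(num_params, world_size)
        print(world_size, [f"{b / 2**30:.2f} GiB" for b in per_rank])
```

Under these assumptions the per-rank state shrinks with the number of ranks while the sum across ranks stays constant, so each CPUAdam instance updates only its own slice.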

1 reply

taehyunzzz Jan 9, 2023
Author

Actually, you are correct. My profiling methodology was not accurate enough. I compared CPU memory consumption for 1-GPU and 3-GPU training of a small model. Consumption appeared to increase linearly with each additional CPU subprocess, but this turned out to be partitioning overhead. I should have checked the sizes of the partitioned fp32 parameters stored on the CPU instead!
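
A rough sketch of why a process-level memory comparison can mislead for a small model: if every rank carries a fixed host-memory cost, the total grows almost linearly with the number of processes whether or not the optimizer state is partitioned. The function name and all numbers below (120 MiB of optimizer state, 2 GiB of per-process overhead) are hypothetical and chosen only for illustration.

```python
MIB = 2**20


def total_host_bytes(state_bytes: int, world_size: int,
                     overhead_per_process: int, partitioned: bool = True) -> int:
    """Total CPU memory summed over all ranks.

    partitioned=True  -> each rank holds state_bytes / world_size
    partitioned=False -> each rank holds a full copy of the state
    """
    per_rank_state = state_bytes // world_size if partitioned else state_bytes
    return world_size * (per_rank_state + overhead_per_process)


if __name__ == "__main__":
    state = 120 * MIB       # hypothetical fp32 optimizer state of a small model
    overhead = 2048 * MIB   # hypothetical fixed host memory per process
    for world_size in (1, 3):
        part = total_host_bytes(state, world_size, overhead, partitioned=True)
        repl = total_host_bytes(state, world_size, overhead, partitioned=False)
        print(world_size, part // MIB, repl // MIB)
```

With numbers like these the partitioned and replicated totals are close to each other, which is why inspecting the size of the fp32 partition itself is the more reliable check.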
