Replies: 1 comment
-
@marsggbo, thanks for the question. Without offload, ZeRO is not beneficial on a single GPU, for the following reason. ZeRO 1/2/3 rely on intermediate buffers (of configurable size) and communication operations to implement their memory optimizations. Without offload, ZeRO reduces memory consumption by de-duplicating model states across data-parallel (DP) ranks, so it is only effective when DP > 1, i.e., in multi-GPU runs. On a single GPU there is nothing to de-duplicate, so these additional buffers and communication operations are pure memory and runtime overhead.
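To make the DP > 1 point concrete, here is a rough back-of-the-envelope sketch of per-GPU model-state memory under each ZeRO stage. It assumes mixed-precision Adam as described in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations as well as the intermediate buffers mentioned above. The function name `model_state_bytes_per_gpu` is made up for illustration and is not part of the DeepSpeed API.

```python
def model_state_bytes_per_gpu(num_params, stage, dp_degree):
    """Approximate model-state bytes per GPU (illustrative helper, not DeepSpeed code).

    Assumes mixed-precision Adam: fp16 params (2 B), fp16 grads (2 B),
    fp32 optimizer states (12 B) per parameter. Activations and ZeRO's
    own communication buffers are deliberately left out.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params

    # Each stage partitions one more kind of state across the DP ranks.
    if stage >= 1:
        optim /= dp_degree   # stage 1: optimizer states
    if stage >= 2:
        grads /= dp_degree   # stage 2: + gradients
    if stage >= 3:
        params /= dp_degree  # stage 3: + parameters
    return params + grads + optim


if __name__ == "__main__":
    n = 1_000_000_000  # a hypothetical 1B-parameter model
    for dp in (1, 8):
        for stage in (0, 1, 2, 3):
            gb = model_state_bytes_per_gpu(n, stage, dp) / 1e9
            print(f"DP={dp} stage={stage}: {gb:.1f} GB")
```

With `dp_degree=1` every stage yields the same model-state footprint, so any buffers and communication that ZeRO adds on top can only increase memory and runtime. The savings show up only once the states are split across more than one rank.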
-
I compare two cases
If I understand correctly, on a single GPU ZeRO 1/2/3 should behave the same as plain FP16. However, in my experiments FP16 and ZeRO-3 each have their own advantages and disadvantages depending on the model structure. Is there any difference in how DeepSpeed handles these two modes on a single GPU?