what's the difference between FP16 and ZeRO3 when using a single GPU? #3226

marsggbo · 2023-04-14T06:53:19Z

I compared two cases on a single GPU.

Case 1: fp16 enabled

    "fp16": {
        "enabled": true,
        "fp16_master_weights_and_grads": false,
        "loss_scale": 0,
        "loss_scale_window": 500,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 15
    },

Case 2: ZeRO-3 enabled (without offload), with the same fp16 settings

    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 50000000,
        "reduce_bucket_size": 50000000,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": true,
        "fp16_master_weights_and_grads": false,
        "loss_scale": 0,
        "loss_scale_window": 500,
        "hysteresis": 2,
        "min_loss_scale": 1,
        "initial_scale_power": 15
    },

If I understand correctly, on a single GPU, ZeRO stage 1/2/3 should behave the same as plain fp16. However, in my experiments, FP16 and ZeRO3 each have their own advantages and disadvantages depending on the model structure. Does DeepSpeed handle these two modes differently on a single GPU?
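To make the comparison concrete, here is a minimal sketch that writes both configs as Python dicts and lists the top-level keys that differ. The dict contents are copied from the two fragments above; the helper name `differing_keys` is hypothetical and is not part of DeepSpeed.

```python
# Shared fp16 section, identical in both cases (copied from the configs above).
fp16_section = {
    "enabled": True,
    "fp16_master_weights_and_grads": False,
    "loss_scale": 0,
    "loss_scale_window": 500,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "initial_scale_power": 15,
}

# Case 1: fp16 only.
fp16_only = {"fp16": fp16_section}

# Case 2: ZeRO stage 3 (no offload) plus the same fp16 section.
zero3_fp16 = {
    "zero_optimization": {
        "stage": 3,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 50000000,
        "reduce_bucket_size": 50000000,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "zero_allow_untested_optimizer": True,
    "fp16": fp16_section,
}


def differing_keys(a: dict, b: dict) -> list:
    """Hypothetical helper: top-level keys whose values differ between a and b."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


if __name__ == "__main__":
    print(differing_keys(fp16_only, zero3_fp16))
```

The fp16 section is identical in both cases, so any behavioral difference should come from the `zero_optimization` block and `zero_allow_untested_optimizer`.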

tjruwase (Maintainer) · 2023-04-17T12:00:26Z

@marsggbo, thanks for the question. Without offload, ZeRO is not beneficial on a single GPU. ZeRO reduces memory consumption by de-duplicating model states across data parallelism (DP) ranks, so it is effective when DP > 1, i.e., in multi-GPU runs. To implement these memory optimizations, ZeRO 1/2/3 use intermediate buffers of configurable sizes, plus communication operations. On a single GPU there is nothing to de-duplicate, so these additional buffers and communication operations are strictly memory and runtime overhead.
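As a rough illustration of why DP = 1 yields no savings, the sketch below applies the per-GPU model-state accounting from the ZeRO paper. It assumes mixed-precision training with Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moments); other optimizers will have different constants. The function name is hypothetical, and the estimate ignores activations and the intermediate buffers mentioned above.

```python
def model_state_bytes_per_gpu(num_params: int, dp_degree: int, stage: int) -> float:
    """Estimate per-GPU bytes of model states (assumption: mixed-precision Adam).

    stage 0 = no ZeRO, 1 = partition optimizer states,
    2 = also partition gradients, 3 = also partition parameters.
    """
    if dp_degree < 1:
        raise ValueError("dp_degree must be >= 1")
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2.0 * num_params   # fp16 weights
    grads = 2.0 * num_params    # fp16 gradients
    optim = 12.0 * num_params   # fp32 master weights + Adam momentum + variance

    # Each stage shards one more component across the DP ranks.
    if stage >= 1:
        optim /= dp_degree
    if stage >= 2:
        grads /= dp_degree
    if stage >= 3:
        params /= dp_degree
    return params + grads + optim


if __name__ == "__main__":
    n = 1_000_000_000  # illustrative 1B-parameter model
    for dp in (1, 8):
        for stage in range(4):
            gib = model_state_bytes_per_gpu(n, dp, stage) / 2**30
            print(f"DP={dp} stage={stage}: {gib:.1f} GiB of model states per GPU")
```

With `dp_degree=1` every stage returns the same value, so under these assumptions the partitioning itself saves nothing on one GPU and only the extra buffers and communication remain.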
