GPT-2 training GPU memory increases with LoRA and ZeRO 3 #161
Comments
Both main and PR #145 fail.
It seems that each forward pass increases GPU memory.
My environment: PyTorch 1.12.1, DeepSpeed 0.8.2.
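To confirm that each forward pass is what grows memory, a minimal diagnostic like the following can be used (a sketch with placeholder names — `dataloader`, `model`, `optimizer`, and `accelerator` stand in for the objects in the actual training script):

```python
import torch

def log_gpu_memory(step, tag=""):
    # memory_allocated() reports tensors currently held by the caching allocator;
    # max_memory_allocated() reports the peak since the last reset_peak_memory_stats().
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"step {step} {tag}: allocated={alloc:.1f} MiB, peak={peak:.1f} MiB")

# Hypothetical training loop for illustration only.
for step, batch in enumerate(dataloader):
    log_gpu_memory(step, "before forward")
    outputs = model(**batch)
    log_gpu_memory(step, "after forward")
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    log_gpu_memory(step, "after optimizer step")
    torch.cuda.reset_peak_memory_stats()
```

If "allocated" keeps climbing from step to step instead of plateauing after the first iteration, something is retaining tensors across forward passes.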
The ZeRO Stage 2 setting works fine. My accelerate config:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
use_cpu: false
```
This might be related to microsoft/DeepSpeed#2637.
Disabling zero_init seems to work for gpt2 and gpt2-xl. Now I am facing OVERFLOW with fp16 on the 1.3B GPT-2 model.
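If "disable zero_init" means setting `zero3_init_flag: false` in the accelerate config, one way to double-check what the launcher actually picked up is to inspect the DeepSpeed plugin at runtime (a sketch; the attribute names follow accelerate's `DeepSpeedPlugin` and may differ across versions):

```python
from accelerate import Accelerator

accelerator = Accelerator()
# Populated when distributed_type is DEEPSPEED in the accelerate config.
plugin = accelerator.state.deepspeed_plugin

# These fields mirror the keys under deepspeed_config in the YAML above.
print("zero_stage:", plugin.zero_stage)
print("zero3_init_flag:", plugin.zero3_init_flag)
print("offload_param_device:", plugin.offload_param_device)
```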
With bf16 there is no OVERFLOW now!
Hello, great deep dive, thank you for raising the corresponding issue with DeepSpeed.
@pacman100 it seems the issue comes from the peft code. Please look at microsoft/DeepSpeed#3002; I have made a PR to fix this issue.
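For reference, the LoRA-over-GPT-2 setup under discussion looks roughly like the following (a sketch with hypothetical hyperparameters, not the exact arguments used in run_clm_no_trainer_lora.py):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "gpt2"  # gpt2-xl and a 1.3B variant are also mentioned in this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# GPT-2 packs the query/key/value projections into a single Conv1D layer named "c_attn",
# so that is the usual LoRA target module for this architecture.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```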
I am facing the same issue without using DeepSpeed (single-GPU training). What is the solution?
I am seeing GPU memory increase until OOM.
Run command files:
ds_zero3_cpu_fp16.yaml
zero_stage3_offload_config.json
run_clm_no_trainer_lora.py
[Screenshot omitted]