
[BUG] Cannot free parameters with ZeRO3 + parameter offload in PyTorch 1.9 #3646

Closed
Andy666G opened this issue May 31, 2023 · 5 comments
Assignees: jomayeri
Labels: bug, training

Comments

@Andy666G
Contributor

Andy666G commented May 31, 2023

Describe the bug
When training LLaMA 13B (https://github.com/ymcui/Chinese-LLaMA-Alpaca), I observed that parameter memory cannot be freed when using the ZeRO3 + parameter offload strategy on PyTorch 1.9, while parameter memory is freed as expected on PyTorch 1.13 with the same DeepSpeed configuration. The fix from issue #3002 does not resolve this bug.
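
For context, ZeRO-3 releases a gathered parameter roughly by replacing its .data with an empty tensor, so "freeing" should show up as a drop in torch.cuda.memory_allocated(). Below is a minimal PyTorch-only sketch of that pattern (illustrative only, not DeepSpeed's actual code); in the failing case the drop never happens, which suggests something still holds a reference to the gathered tensor under PyTorch 1.9.

import torch

# Illustrative stand-in for the release step ZeRO-3 performs after forward
# (not DeepSpeed's actual implementation).
p = torch.nn.Parameter(torch.randn(4096, 4096, device="cuda", dtype=torch.half))
torch.cuda.synchronize()
print("before release:", torch.cuda.memory_allocated() // 2**20, "MiB")

# Drop the full tensor; if nothing else references the old storage, the
# caching allocator gets the memory back and memory_allocated() drops.
p.data = torch.empty(0, dtype=p.dtype, device=p.device)
torch.cuda.synchronize()
print("after release: ", torch.cuda.memory_allocated() // 2**20, "MiB")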

To Reproduce
DeepSpeed 0.9.2 + PyTorch 1.9 + PEFT 0.3 + Transformers 4.28.1
ds_config

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    }
}
Run the LLaMA 13B model.
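
For a lighter-weight check than the full 13B run, here is a hedged, standalone sketch that drives a small stand-in model through deepspeed.initialize() with the same ZeRO-3 + CPU offload settings. The "auto" values in the JSON above are resolved by the HuggingFace Trainer integration, so the sketch uses concrete values instead; the model and sizes are placeholders, not the actual LLaMA setup.

# repro_sketch.py -- launch with: deepspeed --num_gpus 1 repro_sketch.py
import torch
import deepspeed

ds_config = {
    "train_batch_size": 1,
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "optimizer": {"type": "Adam", "params": {"lr": 2.34e-6}},
}

# Small stand-in model; the real report uses LLaMA 13B.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
)

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
device = torch.device("cuda", engine.local_rank)

x = torch.randn(1, 4096, dtype=torch.half, device=device)
for step in range(3):
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)
    engine.step()
    # With offload_param, allocated GPU memory should fall back to roughly
    # the activation/buffer footprint after each step.
    print(f"step {step}: allocated={torch.cuda.memory_allocated(device) / 2**20:.1f} MiB")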

Expected behavior
Parameter memory should be freed with ZeRO3 + parameter offload on PyTorch 1.9, just as it is on PyTorch 1.13.
PyTorch 1.13, DeepSpeed 0.9.2
[screenshot: memory usage]
PyTorch 1.9, DeepSpeed 0.9.2
[screenshot: memory usage]
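
One way to log the memory signal shown in these screenshots from inside the training run is a small HuggingFace TrainerCallback, sketched below. The callback name is made up and it is not part of run_clm_pt_with_peft.py; treat it as a diagnostic aid.

import torch
from transformers import TrainerCallback

class MemoryLoggerCallback(TrainerCallback):
    """Hypothetical helper: print allocated/peak CUDA memory after each step."""

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            alloc = torch.cuda.memory_allocated() / 2**30
            peak = torch.cuda.max_memory_allocated() / 2**30
            print(f"step {state.global_step}: allocated={alloc:.2f} GiB, peak={peak:.2f} GiB")

# usage: trainer.add_callback(MemoryLoggerCallback())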

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.9.0a0+c3d40fd
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.9, cuda 11.3

Screenshots
When the post-forward hook is called, parameters are freed in PyTorch 1.13 but cannot be freed in PyTorch 1.9.
PyTorch 1.13
[screenshot: parameter memory after the post-forward hook]

PyTorch 1.9
[screenshot: parameter memory after the post-forward hook]
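
For anyone trying to reproduce this, below is a hedged diagnostic that reports how many ZeRO-3-managed parameter elements are still gathered after the forward pass. It relies on DeepSpeed internals (ds_status / ds_numel from deepspeed.runtime.zero.partition_parameters), which are not a stable API and may differ between versions.

import torch
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

def report_zero3_param_state(module, tag=""):
    """Print how many ZeRO-3 parameter elements are currently gathered on GPU."""
    live, total = 0, 0
    for p in module.parameters():
        if not hasattr(p, "ds_status"):
            continue  # not managed by ZeRO-3
        total += p.ds_numel
        if p.ds_status == ZeroParamStatus.AVAILABLE:
            live += p.ds_numel
    alloc = torch.cuda.memory_allocated() / 2**30
    print(f"[{tag}] gathered {live}/{total} elements, allocated {alloc:.2f} GiB")

# e.g. after a forward pass through the DeepSpeed engine:
# report_zero3_param_state(engine.module, tag="post-forward")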

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. one machine with x1 A100 each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version 3.8
  • Any other relevant info about your setup

Launcher context
deepspeed launcher
Docker context
NGC 22.07 (PyTorch 1.13) and NGC 21.06 (PyTorch 1.9)

Andy666G added the bug and training labels on May 31, 2023
jomayeri self-assigned this on Jun 2, 2023
@jomayeri
Contributor

jomayeri commented Jun 6, 2023

Hi @Andy666G, can you provide a script that reproduces this error?

@Andy666G
Contributor Author

Andy666G commented Jun 7, 2023

Sure, here is a script, @jomayeri.
The pretrained_model is vicuna-13b, and any “.txt” file can serve as the dataset.
I have provided the DeepSpeed config above.

# export CUDA_VISIBLE_DEVICES=0
lora_rank=8
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
#modules_to_save="embed_tokens,lm_head"
# modules_to_save="lm_head"
modules_to_save=""
lora_dropout=0.1
pretrained_model="models/vicuna-13b-all-v1.1/"
chinese_tokenizer_path="models/vicuna-13b-all-v1.1/tokenizer.model"
dataset_dir="/doc"
data_cache="$PWD/cache"
per_device_batch_size=1 # 1024, from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
training_steps=7000 # 6000, from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
lr=2.34e-06     # from https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/%E8%AE%AD%E7%BB%83%E7%BB%86%E8%8A%82
gradient_accumulation_steps=1
output_dir="output"
max_train_samples=${per_device_batch_size}
max_eval_samples=${per_device_batch_size}

#TODO: deepspeed
deepspeed --include localhost:0 --master_port 12688 scripts/run_clm_pt_with_peft.py \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir $data_cache \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_batch_size} \
    --per_device_eval_batch_size ${per_device_batch_size} \
    --do_train \
    --debug_mode \
    --torch_dtype float16 \
    --seed $RANDOM \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 1000 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --deepspeed deepspeed_config.json \
    --fp16  \
    --overwrite_output_dir

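For completeness, the --deepspeed deepspeed_config.json flag in the command above is consumed by the HuggingFace Trainer; here is a hedged sketch of the equivalent wiring in Python (argument values copied from the script, everything else omitted).

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    max_steps=7000,
    learning_rate=2.34e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    weight_decay=0.01,
    fp16=True,
    deepspeed="deepspeed_config.json",  # the ZeRO-3 + offload config above
)
# Trainer(model=model, args=training_args, ...) then resolves the "auto"
# entries in the DeepSpeed config from these arguments.
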
@GuWei007

I had the same problem and was very confused.

@jomayeri
Contributor

jomayeri commented Jul 10, 2023

@Andy666G Sorry, I cannot repro this issue. In the description you say "when calling post_forward_hook"; is this a hook you added? Have you raised the issue with PyTorch?

@jomayeri
Contributor

Closing for now, please reopen if needed.
