Qwen2-VL-Instruct CUDA out of memory #66
I happen to have access to A5000 GPUs, and I can run successfully on my end without OOM. Here's my script (which should be the same as yours except for using example data and an even longer MODEL_MAX_LEN).

export CUDA_VISIBLE_DEVICES=6,7
NUM_GPUS=2
DISTRIBUTED_ARGS="
--nnodes=1 \
--nproc_per_node ${NUM_GPUS} \
--rdzv_backend c10d \
--rdzv_endpoint localhost:0
"
# arguments that are very likely to be changed
# according to your own case
MODEL_ID=qwen2-vl-2b-instruct # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/celeba_image_train.json # path to the training data json file
EVAL_DATA_PATH=./example_data/celeba_image_eval.json # path to the evaluation data json file (optional)
IMAGE_FOLDER=./example_data/images # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8 # how many frames are sampled from each video
TRAIN_VISION_ENCODER=False # whether train the vision encoder
USE_VISION_LORA=False # whether use lora for vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False # whether train the vision projector (only full finetuning is supported)
USE_LORA=True # whether use lora for llm
Q_LORA=False # whether use q-lora for llm; only effective when `USE_LORA` is True
LORA_R=8 # the lora rank (both llm and vision encoder)
LORA_ALPHA=8 # the lora alpha (both llm and vision encoder)
RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA} # a custom run id that determines the checkpoint folder and wandb run name
DS_STAGE=zero3 # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1 # batch size per GPU
GRAD_ACCUM=1 # gradient accumulation steps
NUM_EPOCHS=1 # number of training epochs
LR=2e-5 # learning rate
MODEL_MAX_LEN=1024 # maximum input length of the model
torchrun $DISTRIBUTED_ARGS train.py \
--model_id $MODEL_ID \
--data_path $TRAIN_DATA_PATH \
--eval_data_path $EVAL_DATA_PATH \
--image_folder $IMAGE_FOLDER \
--video_folder $VIDEO_FOLDER \
--num_frames $NUM_FRAMES \
--output_dir ./checkpoints/$RUN_ID \
--report_to wandb \
--run_name $RUN_ID \
--deepspeed ./ds_configs/${DS_STAGE}.json \
--bf16 True \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate ${LR} \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LEN \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--train_vision_encoder $TRAIN_VISION_ENCODER \
--use_vision_lora $USE_VISION_LORA \
--train_vision_projector $TRAIN_VISION_PROJECTOR \
--use_lora $USE_LORA \
--q_lora $Q_LORA \
--lora_r $LORA_R \
--lora_alpha $LORA_ALPHA
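If it helps to compare our two setups, a rough way to check memory is to log the per-rank peak at the end of training. This is only a sketch, not something already in train.py; it assumes you are fine adding a couple of lines there and that the process group is already initialized (torchrun/DeepSpeed does this):

import torch
import torch.distributed as dist

# Sketch: print per-rank peak GPU memory after training finishes.
# Not part of this repo; assumes torch.distributed is already initialized.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"[rank {dist.get_rank()}] peak allocated memory: {peak_gib:.2f} GiB")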
@zjysteven, thank you for your comments. I have tried what you suggested and am still running into OOM issues. I have tried using the zero2 config, which allows the training to go on for longer, but I eventually run into OOM issues again. Just for reference, my dataset can contain multiple images (maximum 3) per question in my json file, although I am not sure if that is the issue, since the data points containing multiple images seem to load just fine before the OOM occurs.
Zero2 config:
Multiple images will definitely consume more memory, so I wouldn't be surprised to see OOM. I'm afraid there won't be easy solutions like tuning the configuration here.
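If you want a rough sense of how quickly the sequence grows with extra images, here is a small sketch that counts the visual tokens per image. It is not part of this repo; the Hugging Face Qwen2-VL processor, the model id, and the image paths below are just placeholders for illustration:

from PIL import Image
from transformers import AutoProcessor

# Sketch: count visual tokens per image for Qwen2-VL.
# Model id and image paths are placeholders, not taken from this issue.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

images = [Image.open(p) for p in ["a.png", "b.png", "c.png"]]
text = "".join("<|vision_start|><|image_pad|><|vision_end|>" for _ in images)
inputs = processor(text=[text], images=images, return_tensors="pt")

merge = processor.image_processor.merge_size  # 2 by default
for i, (t, h, w) in enumerate(inputs["image_grid_thw"].tolist()):
    print(f"image {i}: {t * h * w // merge**2} visual tokens")
print(f"total input length: {inputs['input_ids'].shape[-1]} tokens")

Depending on resolution, each image can contribute from a few hundred to over a thousand tokens at the default processor settings, so three images per sample add up quickly.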
Thank you for your help. If I find a solution, I'll post it here.
I am trying to use 2 A5000 GPUs to fine-tune Qwen2-VL-Instruct on an image-text dataset; however, I am running into CUDA out-of-memory issues. Could you advise on how to set up the bash file? Here is my current file:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,2
NUM_GPUS=2
DISTRIBUTED_ARGS="
--nnodes=1 \
--nproc_per_node ${NUM_GPUS} \
--rdzv_backend c10d \
--rdzv_endpoint localhost:0
"
MODEL_ID=qwen2-vl-2b-instruct # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./dataset/data/json/multi/train_findings_labels.json # path to the training data json file
EVAL_DATA_PATH=./dataset/data/json/multi/multi_eval_findings_labels.json # path to the evaluation data json file (optional)
IMAGE_FOLDER='None' # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER='None' # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8 # how many frames are sampled from each video
TRAIN_VISION_ENCODER=False # whether train the vision encoder
USE_VISION_LORA=False # whether use lora for vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False # whether train the vision projector (only full finetuning is supported)
USE_LORA=True # whether use lora for llm
Q_LORA=False # whether use q-lora for llm; only effective when `USE_LORA` is True
LORA_R=8 # the lora rank (both llm and vision encoder)
LORA_ALPHA=8 # the lora alpha (both llm and vision encoder)
RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}_multi_findings_generation_classification
DS_STAGE=zero3 # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1 # batch size per GPU
GRAD_ACCUM=1 # gradient accumulation steps
NUM_EPOCHS=3 # number of training epochs
LR=2e-5 # learning rate
MODEL_MAX_LEN=256 # maximum input length of the model
torchrun $DISTRIBUTED_ARGS train.py \
--model_id $MODEL_ID \
--data_path $TRAIN_DATA_PATH \
--eval_data_path $EVAL_DATA_PATH \
--image_folder $IMAGE_FOLDER \
--video_folder $VIDEO_FOLDER \
--num_frames $NUM_FRAMES \
--output_dir ./checkpoints/$RUN_ID \
--report_to wandb \
--run_name $RUN_ID \
--deepspeed ./ds_configs/${DS_STAGE}.json \
--bf16 True \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate ${LR} \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LEN \
--gradient_checkpointing True \
--dataloader_num_workers 2 \
--train_vision_encoder $TRAIN_VISION_ENCODER \
--use_vision_lora $USE_VISION_LORA \
--train_vision_projector $TRAIN_VISION_PROJECTOR \
--use_lora $USE_LORA \
--q_lora $Q_LORA \
--lora_r $LORA_R \
--lora_alpha $LORA_ALPHA