Qwen2-VL-Instruct CUDA out of memory #66

Open
rsr-droid opened this issue Jan 18, 2025 · 4 comments

rsr-droid commented Jan 18, 2025

I am trying to fine-tune Qwen2-VL-Instruct on an image-text dataset using 2 A5000 GPUs, but I am running into CUDA out-of-memory issues. Could you advise on how to set up the bash file? Here is my current file:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0,2

NUM_GPUS=2
DISTRIBUTED_ARGS="
--nnodes=1
--nproc_per_node ${NUM_GPUS}
--rdzv_backend c10d
--rdzv_endpoint localhost:0
"

MODEL_ID=qwen2-vl-2b-instruct # model id; pick one by running python supported_models.py
TRAIN_DATA_PATH=./dataset/data/json/multi/train_findings_labels.json # path to the training data json file
EVAL_DATA_PATH=./dataset/data/json/multi/multi_eval_findings_labels.json # path to the evaluation data json file (optional)
IMAGE_FOLDER='None' # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER='None' # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8 # how many frames are sampled from each video

TRAIN_VISION_ENCODER=False # whether train the vision encoder
USE_VISION_LORA=False # whether use lora for vision encoder (only effective when TRAIN_VISION_ENCODER is True)
TRAIN_VISION_PROJECTOR=False # whether train the vision projector (only full finetuning is supported)

USE_LORA=True # whether use lora for llm
Q_LORA=False # whether use q-lora for llm; only effective when USE_LORA is True
LORA_R=8 # the lora rank (both llm and vision encoder)
LORA_ALPHA=8 # the lora alpha (both llm and vision encoder)

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}_multi_findings_generation_classification

DS_STAGE=zero3 # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1 # batch size per GPU
GRAD_ACCUM=1 # gradient accumulation steps
NUM_EPOCHS=3 # number of training epochs

LR=2e-5 # learning rate
MODEL_MAX_LEN=256 # maximum input length of the model

torchrun $DISTRIBUTED_ARGS train.py \
--model_id $MODEL_ID \
--data_path $TRAIN_DATA_PATH \
--eval_data_path $EVAL_DATA_PATH \
--image_folder $IMAGE_FOLDER \
--video_folder $VIDEO_FOLDER \
--num_frames $NUM_FRAMES \
--output_dir ./checkpoints/$RUN_ID \
--report_to wandb \
--run_name $RUN_ID \
--deepspeed ./ds_configs/${DS_STAGE}.json \
--bf16 True \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate ${LR} \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LEN \
--gradient_checkpointing True \
--dataloader_num_workers 2 \
--train_vision_encoder $TRAIN_VISION_ENCODER \
--use_vision_lora $USE_VISION_LORA \
--train_vision_projector $TRAIN_VISION_PROJECTOR \
--use_lora $USE_LORA \
--q_lora $Q_LORA \
--lora_r $LORA_R \
--lora_alpha $LORA_ALPHA

zjysteven (Owner) commented Jan 18, 2025

I happen to have access to A5000 GPUs and can run this successfully on my end without OOM. Here's my script, which should be the same as yours except that it uses the example data and an even longer MODEL_MAX_LEN.

export CUDA_VISIBLE_DEVICES=6,7

NUM_GPUS=2
DISTRIBUTED_ARGS="
    --nnodes=1 \
    --nproc_per_node ${NUM_GPUS} \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:0
"

# arguments that are very likely to be changed
# according to your own case
MODEL_ID=qwen2-vl-2b-instruct                                   # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/celeba_image_train.json  # path to the training data json file
EVAL_DATA_PATH=./example_data/celeba_image_eval.json    # path to the evaluation data json file (optional)
IMAGE_FOLDER=./example_data/images                      # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos                      # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8                                            # how many frames are sampled from each video

TRAIN_VISION_ENCODER=False                              # whether train the vision encoder
USE_VISION_LORA=False                                   # whether use lora for vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False                            # whether train the vision projector (only full finetuning is supported)

USE_LORA=True                                           # whether use lora for llm
Q_LORA=False                                            # whether use q-lora for llm; only effective when `USE_LORA` is True
LORA_R=8                                                # the lora rank (both llm and vision encoder)
LORA_ALPHA=8                                            # the lora alpha (both llm and vision encoder)

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}     # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero3                                          # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1                                 # batch size per GPU
GRAD_ACCUM=1                                            # gradient accumulation steps
NUM_EPOCHS=1                                            # number of training epochs

LR=2e-5                                                 # learning rate
MODEL_MAX_LEN=1024                                       # maximum input length of the model


torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder $VIDEO_FOLDER \
    --num_frames $NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --train_vision_encoder $TRAIN_VISION_ENCODER \
    --use_vision_lora $USE_VISION_LORA \
    --train_vision_projector $TRAIN_VISION_PROJECTOR \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA


@rsr-droid (Author)

@zjysteven, thank you for your comments. I have tried what you suggested and am still running into OOM issues. I also tried the zero2 config, which lets training run for longer, but I eventually hit OOM again. For reference, my dataset can contain multiple images (at most 3) per question in the JSON file, though I am not sure that is the cause, since the data points containing multiple images seem to load fine before the OOM occurs.

#!/bin/bash

# Use GPUs 0, 1, and 2
export CUDA_VISIBLE_DEVICES=0,1,2
export HF_HOME="/working/rajan/multiview-llm/.cache"
export HF_HUB_DOWNLOAD_TIMEOUT=10000

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.6,expandable_segments:True


NUM_GPUS=3
DISTRIBUTED_ARGS="
    --nnodes=1 \
    --nproc_per_node ${NUM_GPUS} \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:0
"

# arguments that are very likely to be changed
# according to your own case
MODEL_ID=qwen2-vl-2b-instruct                                                                                                   # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=/working/rajan/multiview-llm/Models/Finetune/dataset/data/json/multi/mimic_cxr_multi_train_findings_labels.json     # path to the training data json file
EVAL_DATA_PATH=/working/rajan/multiview-llm/Models/Finetune/dataset/data/json/multi/mimic_cxr_multi_eval_findings_labels.json       # path to the evaluation data json file (optional)
IMAGE_FOLDER='None'                                                                                                                 # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER='None'                                                                                                                 # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8                                                                                                                        # how many frames are sampled from each video

TRAIN_VISION_ENCODER=False                                                                                                          # whether train the vision encoder
USE_VISION_LORA=False                                                                                                               # whether use lora for vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False                                                                                                        # whether train the vision projector (only full finetuning is supported)

USE_LORA=True                                                                                                                       # whether use lora for llm
Q_LORA=False                                                                                                                         # whether use q-lora for llm; only effective when `USE_LORA` is True
LORA_R=8                                                                                                                           # the lora rank (both llm and vision encoder)
LORA_ALPHA=8                                                                                                                      # the lora alpha (both llm and vision encoder)

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}_multi_findings_generation_classification                      # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero2                                                                                                               # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1                                                                                                             # batch size per GPU
GRAD_ACCUM=1                                                                                                                      # gradient accumulation steps
NUM_EPOCHS=3                                                                                                                        # number of training epochs

LR=2e-5                                                                                                                             # learning rate
MODEL_MAX_LEN=1024                                                                                                                   # maximum input length of the model

torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder  $VIDEO_FOLDER \
    --num_frames $NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 12 \
    --train_vision_encoder $TRAIN_VISION_ENCODER \
    --use_vision_lora $USE_VISION_LORA \
    --train_vision_projector $TRAIN_VISION_PROJECTOR \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA

Zero2 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}
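
For completeness, one variant of this config I have not tried yet: DeepSpeed ZeRO-2 can also offload the optimizer state to CPU, trading training speed for GPU memory. The sketch below only adds the offload_optimizer block to the config above; I have not verified that it avoids the OOM, since the spike may come from activations rather than optimizer state.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": false,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}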

@zjysteven (Owner)

Multiple images per sample will definitely consume more memory, so I wouldn't be surprised to see OOM. I'm afraid there isn't an easy fix like tuning the configuration here.
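
That said, if you want to experiment, the biggest memory lever for multi-image samples is usually how many visual tokens each image is turned into. In transformers, Qwen2-VL's processor accepts min_pixels/max_pixels arguments that cap the resized image area (and therefore the number of image tokens). This is not a flag exposed by the bash script here, so treat the snippet below as a sketch of the processor-level knob only; the exact pixel budgets are illustrative.

from transformers import AutoProcessor

# Sketch: shrink the per-image pixel budget so each image maps to fewer visual
# tokens. Qwen2-VL resizes images so their area falls between min_pixels and
# max_pixels; each 28x28 pixel region corresponds to one visual token, so a
# smaller max_pixels directly reduces sequence length and memory.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    min_pixels=256 * 28 * 28,   # lower bound on image area after resizing
    max_pixels=512 * 28 * 28,   # upper bound; well below the default cap
)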

@rsr-droid (Author)

Thank you for your help. If I find a solution, I'll post it here.
