Qwen2-VL-Instruct CUDA out of memory #66

Open
rsr-droid opened this issue Jan 18, 2025 · 4 comments

rsr-droid commented Jan 18, 2025

I am trying to fine-tune Qwen2-VL-Instruct on an image-text dataset using 2 A5000 GPUs, but I am running into CUDA out-of-memory issues. Could you advise on how to set up the bash file? Here is my current file:

#!/bin/bash

export CUDA_VISIBLE_DEVICES=0,2

NUM_GPUS=2
DISTRIBUTED_ARGS="
--nnodes=1
--nproc_per_node ${NUM_GPUS}
--rdzv_backend c10d
--rdzv_endpoint localhost:0
"

MODEL_ID=qwen2-vl-2b-instruct # model id; pick one by running python supported_models.py
TRAIN_DATA_PATH=./dataset/data/json/multi/train_findings_labels.json # path to the training data json file
EVAL_DATA_PATH=./dataset/data/json/multi/multi_eval_findings_labels.json # path to the evaluation data json file (optional)
IMAGE_FOLDER='None' # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER='None' # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8 # how many frames are sampled from each video

TRAIN_VISION_ENCODER=False # whether train the vision encoder
USE_VISION_LORA=False # whether use lora for vision encoder (only effective when TRAIN_VISION_ENCODER is True)
TRAIN_VISION_PROJECTOR=False # whether train the vision projector (only full finetuning is supported)

USE_LORA=True # whether use lora for llm
Q_LORA=False # whether use q-lora for llm; only effective when USE_LORA is True
LORA_R=8 # the lora rank (both llm and vision encoder)
LORA_ALPHA=8 # the lora alpha (both llm and vision encoder)

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}_multi_findings_generation_classification

DS_STAGE=zero3 # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1 # batch size per GPU
GRAD_ACCUM=1 # gradient accumulation steps
NUM_EPOCHS=3 # number of training epochs

LR=2e-5 # learning rate
MODEL_MAX_LEN=256 # maximum input length of the model

torchrun $DISTRIBUTED_ARGS train.py \
--model_id $MODEL_ID \
--data_path $TRAIN_DATA_PATH \
--eval_data_path $EVAL_DATA_PATH \
--image_folder $IMAGE_FOLDER \
--video_folder $VIDEO_FOLDER \
--num_frames $NUM_FRAMES \
--output_dir ./checkpoints/$RUN_ID \
--report_to wandb \
--run_name $RUN_ID \
--deepspeed ./ds_configs/${DS_STAGE}.json \
--bf16 True \
--num_train_epochs $NUM_EPOCHS \
--per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
--per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
--gradient_accumulation_steps $GRAD_ACCUM \
--eval_strategy "epoch" \
--save_strategy "epoch" \
--save_total_limit 1 \
--learning_rate ${LR} \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length $MODEL_MAX_LEN \
--gradient_checkpointing True \
--dataloader_num_workers 2 \
--train_vision_encoder $TRAIN_VISION_ENCODER \
--use_vision_lora $USE_VISION_LORA \
--train_vision_projector $TRAIN_VISION_PROJECTOR \
--use_lora $USE_LORA \
--q_lora $Q_LORA \
--lora_r $LORA_R \
--lora_alpha $LORA_ALPHA

zjysteven (Owner) commented Jan 18, 2025

I happen to have access to A5000 GPUs and can run this successfully on my end without OOM. Here's my script, which should be the same as yours except that it uses the example data and an even longer MODEL_MAX_LEN.

export CUDA_VISIBLE_DEVICES=6,7

NUM_GPUS=2
DISTRIBUTED_ARGS="
    --nnodes=1 \
    --nproc_per_node ${NUM_GPUS} \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:0
"

# arguments that are very likely to be changed
# according to your own case
MODEL_ID=qwen2-vl-2b-instruct                                   # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=./example_data/celeba_image_train.json  # path to the training data json file
EVAL_DATA_PATH=./example_data/celeba_image_eval.json    # path to the evaluation data json file (optional)
IMAGE_FOLDER=./example_data/images                      # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER=./example_data/videos                      # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8                                            # how many frames are sampled from each video

TRAIN_VISION_ENCODER=False                              # whether train the vision encoder
USE_VISION_LORA=False                                   # whether use lora for vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False                            # whether train the vision projector (only full finetuning is supported)

USE_LORA=True                                           # whether use lora for llm
Q_LORA=False                                            # whether use q-lora for llm; only effective when `USE_LORA` is True
LORA_R=8                                                # the lora rank (both llm and vision encoder)
LORA_ALPHA=8                                            # the lora alpha (both llm and vision encoder)

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}     # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero3                                          # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1                                 # batch size per GPU
GRAD_ACCUM=1                                            # gradient accumulation steps
NUM_EPOCHS=1                                            # number of training epochs

LR=2e-5                                                 # learning rate
MODEL_MAX_LEN=1024                                       # maximum input length of the model


torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder $VIDEO_FOLDER \
    --num_frames $NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --train_vision_encoder $TRAIN_VISION_ENCODER \
    --use_vision_lora $USE_VISION_LORA \
    --train_vision_projector $TRAIN_VISION_PROJECTOR \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA


@rsr-droid (Author)

@zjysteven, thank you for your comments. I have tried what you suggested and am still running into OOM issues. I also tried the zero2 config, which lets training run for longer, but I eventually hit OOM again. For reference, my dataset can contain multiple images (at most 3) per question in the JSON file, though I am not sure that is the cause, since the data points containing multiple images seem to load fine before the OOM occurs.

#!/bin/bash

# Use GPUs 0, 1, and 2
export CUDA_VISIBLE_DEVICES=0,1,2
export HF_HOME="/working/rajan/multiview-llm/.cache"
export HF_HUB_DOWNLOAD_TIMEOUT=10000

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512,garbage_collection_threshold:0.6,expandable_segments:True


NUM_GPUS=3
DISTRIBUTED_ARGS="
    --nnodes=1 \
    --nproc_per_node ${NUM_GPUS} \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:0
"

# arguments that are very likely to be changed
# according to your own case
MODEL_ID=qwen2-vl-2b-instruct                                                                                                   # model id; pick one by running `python supported_models.py`
TRAIN_DATA_PATH=/working/rajan/multiview-llm/Models/Finetune/dataset/data/json/multi/mimic_cxr_multi_train_findings_labels.json     # path to the training data json file
EVAL_DATA_PATH=/working/rajan/multiview-llm/Models/Finetune/dataset/data/json/multi/mimic_cxr_multi_eval_findings_labels.json       # path to the evaluation data json file (optional)
IMAGE_FOLDER='None'                                                                                                                 # path to the image root folder; if provided, the image paths in the json should be relative
VIDEO_FOLDER='None'                                                                                                                 # path to the video root folder; if provided, the video paths in the json should be relative
NUM_FRAMES=8                                                                                                                        # how many frames are sampled from each video

TRAIN_VISION_ENCODER=False                                                                                                          # whether train the vision encoder
USE_VISION_LORA=False                                                                                                               # whether use lora for vision encoder (only effective when `TRAIN_VISION_ENCODER` is True)
TRAIN_VISION_PROJECTOR=False                                                                                                        # whether train the vision projector (only full finetuning is supported)

USE_LORA=True                                                                                                                       # whether use lora for llm
Q_LORA=False                                                                                                                         # whether use q-lora for llm; only effective when `USE_LORA` is True
LORA_R=8                                                                                                                           # the lora rank (both llm and vision encoder)
LORA_ALPHA=8                                                                                                                      # the lora alpha (both llm and vision encoder)

RUN_ID=${MODEL_ID}_lora-${USE_LORA}_qlora-${Q_LORA}_multi_findings_generation_classification                      # a custom run id that determines the checkpoint folder and wandb run name

DS_STAGE=zero2                                                                                                               # deepspeed stage; < zero2 | zero3 >
PER_DEVICE_BATCH_SIZE=1                                                                                                             # batch size per GPU
GRAD_ACCUM=1                                                                                                                      # gradient accumulation steps
NUM_EPOCHS=3                                                                                                                        # number of training epochs

LR=2e-5                                                                                                                             # learning rate
MODEL_MAX_LEN=1024                                                                                                                   # maximum input length of the model

torchrun $DISTRIBUTED_ARGS train.py \
    --model_id $MODEL_ID \
    --data_path $TRAIN_DATA_PATH \
    --eval_data_path $EVAL_DATA_PATH \
    --image_folder $IMAGE_FOLDER \
    --video_folder  $VIDEO_FOLDER \
    --num_frames $NUM_FRAMES \
    --output_dir ./checkpoints/$RUN_ID \
    --report_to wandb \
    --run_name $RUN_ID \
    --deepspeed ./ds_configs/${DS_STAGE}.json \
    --bf16 True \
    --num_train_epochs $NUM_EPOCHS \
    --per_device_train_batch_size $PER_DEVICE_BATCH_SIZE \
    --per_device_eval_batch_size $PER_DEVICE_BATCH_SIZE \
    --gradient_accumulation_steps $GRAD_ACCUM \
    --eval_strategy "epoch" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate ${LR} \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length $MODEL_MAX_LEN \
    --gradient_checkpointing True \
    --dataloader_num_workers 12 \
    --train_vision_encoder $TRAIN_VISION_ENCODER \
    --use_vision_lora $USE_VISION_LORA \
    --train_vision_projector $TRAIN_VISION_PROJECTOR \
    --use_lora $USE_LORA \
    --q_lora $Q_LORA \
    --lora_r $LORA_R \
    --lora_alpha $LORA_ALPHA

Zero2 config:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}
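
For completeness, one variant of this config I have not tried yet: DeepSpeed ZeRO-2 can also offload the optimizer state to CPU, trading training speed for GPU memory. The sketch below only adds the offload_optimizer block to the config above; I have not verified that it avoids the OOM, since the spike may come from activations rather than optimizer state.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": false,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto"
    }
}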

@zjysteven (Owner)

Multiple images per sample will definitely consume more memory, so I wouldn't be surprised to see OOM. I'm afraid there isn't an easy fix like tuning the configuration here.
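
That said, if you want to experiment, the biggest memory lever for multi-image samples is usually how many visual tokens each image is turned into. In transformers, Qwen2-VL's processor accepts min_pixels/max_pixels arguments that cap the resized image area (and therefore the number of image tokens). This is not a flag exposed by the bash script here, so treat the snippet below as a sketch of the processor-level knob only; the exact pixel budgets are illustrative.

from transformers import AutoProcessor

# Sketch: shrink the per-image pixel budget so each image maps to fewer visual
# tokens. Qwen2-VL resizes images so their area falls between min_pixels and
# max_pixels; each 28x28 pixel region corresponds to one visual token, so a
# smaller max_pixels directly reduces sequence length and memory.
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    min_pixels=256 * 28 * 28,   # lower bound on image area after resizing
    max_pixels=512 * 28 * 28,   # upper bound; well below the default cap
)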

@rsr-droid (Author)

Thank you for your help. If I find a solution, I'll post it here.
