Merged
20 changes: 11 additions & 9 deletions examples/models/vlm/ministral3/README.md
@@ -85,9 +85,12 @@ The current date is {today}.

- See: [bridge.recipes.ministral3](../../apidocs/bridge/bridge.recipes.ministral3.md)
- Available recipes:
- `ministral3_3b_finetune_config`: Finetuning for 3B VL model with PEFT support
- `ministral3_8b_finetune_config`: Finetuning for 8B VL model with PEFT support
- `ministral3_14b_finetune_config`: Finetuning for 14B VL model with PEFT support
- `ministral3_3b_sft_config`: Finetuning for 3B VL model
- `ministral3_8b_sft_config`: Finetuning for 8B VL model
- `ministral3_14b_sft_config`: Finetuning for 14B VL model
- `ministral3_3b_peft_config`: Finetuning for 3B VL model with PEFT support
- `ministral3_8b_peft_config`: Finetuning for 8B VL model with PEFT support
- `ministral3_14b_peft_config`: Finetuning for 14B VL model with PEFT support

Before training, ensure the following environment variables are set:
1. `SAVE_DIR`: checkpoint and log saving directory
@@ -101,15 +104,11 @@ Pretraining is not verified for this model.

### Supervised Fine-Tuning (SFT)

See the [sft.sh](sft.sh) script for full parameter fine-tuning with configurable model parallelisms.

W&B report coming soon.
See the [sft_unpacked.sh](sft_unpacked.sh) script for full parameter fine-tuning with configurable model parallelisms.

### Parameter-Efficient Fine-Tuning (PEFT) with LoRA

See the [peft.sh](peft.sh) script for LoRA fine-tuning with configurable tensor and pipeline parallelism.

W&B report coming soon.
See the [peft_unpacked.sh](peft_unpacked.sh) script for LoRA fine-tuning with configurable tensor and pipeline parallelism.

### Recommended Configurations

@@ -124,6 +123,9 @@ W&B report coming soon.

**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs.

### Expected Training Dynamics
We provide a [Weights & Biases report](https://api.wandb.ai/links/nvidia-nemo-fw-public/h32cflfn) for the expected loss curves and grad norms.

## Evaluation

Coming soon.
@@ -29,12 +29,12 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16
MODEL_NAME=ministral3_3b
DATASET_NAME=cord_v2
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
LR=0.0002
MIN_LR=0.00002
EVAL_ITERS=20
LR=0.00005
MIN_LR=0.000005
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
@@ -47,15 +47,15 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do

echo "Running LoRA finetuning with TP=$TP, PP=$PP"
uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--recipe ${MODEL_NAME}_peft_config \
--step_func vlm_step \
--peft_scheme lora \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
@@ -65,6 +65,7 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do
logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_lora_tp${TP}_pp${PP} \
dataset.maker_name=make_${DATASET_NAME}_dataset \
dataset.seq_length=$SEQ_LENGTH \
dataset.pack_sequences_in_batch=False \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP
done
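As an aside, the `IFS=',' read -r … <<< "$config"` pattern used in these scripts is what splits each parallelism string into its components. A minimal standalone sketch (the `"2,1"` value is illustrative, not taken from this diff):

```shell
#!/usr/bin/env bash
# Sketch of the parallelism-string parsing used in the training scripts.
# "2,1" is a hypothetical TP,PP pair; the real scripts loop over several such strings.
config="2,1"
IFS=',' read -r TP PP <<< "$config"   # here-string feeds the value to `read`
echo "TP=$TP PP=$PP"                  # prints: TP=2 PP=1
```

Setting `IFS` only on the `read` invocation keeps the comma-splitting local, so the rest of the script's word splitting is unaffected.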
@@ -29,12 +29,12 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16
MODEL_NAME=ministral3_3b
DATASET_NAME=cord_v2
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
LR=0.00005
MIN_LR=0.000005
EVAL_ITERS=20
LR=0.00001
MIN_LR=0.000001
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
@@ -47,14 +47,14 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do

echo "Running full finetuning with TP=$TP, PP=$PP"
uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--recipe ${MODEL_NAME}_sft_config \
--step_func vlm_step \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
@@ -64,6 +64,7 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do
logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_sft_tp${TP}_pp${PP} \
dataset.maker_name=make_${DATASET_NAME}_dataset \
dataset.seq_length=$SEQ_LENGTH \
dataset.pack_sequences_in_batch=False \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP
done
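The batch-size knobs changed above interact under the usual Megatron semantics (assumed here): the number of gradient-accumulation micro-steps per iteration is `GLOBAL_BATCH_SIZE / (MICRO_BATCH_SIZE × DP)`, where `DP` is the data-parallel size. A quick arithmetic check with illustrative values — the 8-GPU, TP=2, PP=1 layout is an assumption, since the actual parallelism configs are collapsed in this diff:

```shell
#!/usr/bin/env bash
# Sanity check of gradient accumulation under standard Megatron semantics (assumed).
# NPROC, TP, and PP are illustrative values, not read from the hidden parts of the diff.
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=1
NPROC=8; TP=2; PP=1
DP=$(( NPROC / (TP * PP) ))                               # data-parallel size: 8 / 2 = 4
ACC=$(( GLOBAL_BATCH_SIZE / (MICRO_BATCH_SIZE * DP) ))    # accumulation micro-steps: 4
echo "DP=$DP ACC=$ACC"
```

This is why halving `GLOBAL_BATCH_SIZE` (32 → 16) at fixed `MICRO_BATCH_SIZE` and GPU count halves the accumulation steps per iteration rather than the per-GPU memory footprint.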
10 changes: 5 additions & 5 deletions examples/models/vlm/qwen3_vl/README.md
@@ -106,15 +106,11 @@ Before training, ensure the following environment variables are set:
See the [sft_unpacked.sh](sft_unpacked.sh) script for full parameter fine-tuning with configurable model parallelisms, with unpacked sequences.
See the [sft.sh](sft.sh) script for full parameter fine-tuning with sequence-packing.

W&B report coming soon.

### Parameter-Efficient Fine-Tuning (PEFT) with LoRA

See the [peft_unpacked.sh](peft_unpacked.sh) script for LoRA fine-tuning with configurable tensor and pipeline parallelism, with unpacked sequences.
See the [peft.sh](peft.sh) script for LoRA fine-tuning with sequence-packing.

W&B report coming soon.

**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs.

## Finetuning with Energon Dataset
@@ -129,7 +125,11 @@ field_map:
conversation: json
```

Then, update the dataset path (`dataset.path=/path/to/energon/dataset`) in [sft_energon.sh](sft_energon.sh) and run the script.
Then, update the dataset path (`dataset.path=/path/to/energon/dataset`) in [peft_energon.sh](peft_energon.sh) and run the script.


### Expected Training Dynamics
We provide a [Weights & Biases report](https://api.wandb.ai/links/nvidia-nemo-fw-public/lczz4ixx) for the expected loss curves and grad norms.

## Evaluation

24 changes: 14 additions & 10 deletions examples/models/vlm/qwen3_vl/peft.sh
@@ -25,10 +25,11 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-8B-Instruct
MODEL_NAME=qwen3_vl_8b
DATASET_NAME=cord_v2
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=2
EVAL_ITERS=10
EVAL_ITERS=20
EVAL_INTERVAL=20
LR=0.00005
MIN_LR=0.000005
LR_WARMUP_ITERS=10
@@ -45,15 +46,16 @@ for pack_config in "${SEQ_PACKING_CONFIGS[@]}"; do
IFS=',' read -r EP TP PP CP <<< "$par_config"
echo "Running LoRA finetuning pack_sequences_in_batch=$pack_config with EP=$EP TP=$TP PP=$PP CP=$CP"
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--recipe ${MODEL_NAME}_peft_config \
--step_func qwen3_vl_step \
--peft_scheme lora \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
validation.eval_interval=$EVAL_INTERVAL \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
@@ -80,10 +82,11 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct
MODEL_NAME=qwen3_vl_30b_a3b
DATASET_NAME=cord_v2
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=2
EVAL_ITERS=10
EVAL_ITERS=20
EVAL_INTERVAL=20
LR=0.00005
MIN_LR=0.000005
LR_WARMUP_ITERS=10
@@ -100,15 +103,16 @@ for pack_config in "${SEQ_PACKING_CONFIGS[@]}"; do
IFS=',' read -r EP TP PP CP <<< "$par_config"
echo "Running LoRA finetuning pack_sequences_in_batch=$pack_config with EP=$EP TP=$TP PP=$PP CP=$CP"
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--recipe ${MODEL_NAME}_peft_config \
--step_func qwen3_vl_step \
--peft_scheme lora \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
validation.eval_interval=$EVAL_INTERVAL \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
@@ -25,10 +25,11 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-8B-Instruct
MODEL_NAME=qwen3_vl_8b
DATASET_NAME=energon
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=2
EVAL_ITERS=10
EVAL_ITERS=20
EVAL_INTERVAL=20
LR=0.00005
MIN_LR=0.000005
LR_WARMUP_ITERS=10
@@ -47,20 +48,20 @@ for pack_config in "${SEQ_PACKING_CONFIGS[@]}"; do
IFS=',' read -r EP TP PP CP N_PROC <<< "$par_config"
echo "Running LoRA finetuning pack_sequences_in_batch=$pack_config with EP=$EP TP=$TP PP=$PP CP=$CP N_PROC=$N_PROC"
uv run python -m torch.distributed.run --nproc_per_node=$N_PROC scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--recipe ${MODEL_NAME}_peft_energon_config \
--step_func qwen3_vl_step \
--peft_scheme lora \
--dataset_type energon \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
validation.eval_interval=$EVAL_INTERVAL \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_lora_seq_pack_${pack_config}_cp${CP} \
checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_energon_lora_seq_pack_${pack_config}_cp${CP} \
logger.log_interval=$LOG_INTERVAL \
logger.wandb_project=$WANDB_PROJECT \
logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_lora_seq_pack_${pack_config}_cp${CP} \
34 changes: 19 additions & 15 deletions examples/models/vlm/qwen3_vl/peft_unpacked.sh
@@ -25,33 +25,35 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-8B-Instruct
MODEL_NAME=qwen3_vl_8b
DATASET_NAME=cord_v2
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=2
EVAL_ITERS=20
EVAL_INTERVAL=20
LR=0.00005
MIN_LR=0.000005
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}

# TP/PP combinations: "TP,PP"
PARALLELISM_CONFIGS=("2,1" "1,2")
PARALLELISM_CONFIGS=("4,1" "2,1")

for config in "${PARALLELISM_CONFIGS[@]}"; do
IFS=',' read -r TP PP <<< "$config"

echo "Running LoRA finetuning with TP=$TP, PP=$PP"
uv run python -m torch.distributed.run --nproc_per_node=2 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_peft_config \
--step_func qwen3_vl_step \
--peft_scheme lora \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
validation.eval_interval=$EVAL_INTERVAL \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
@@ -71,33 +73,35 @@ PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct
MODEL_NAME=qwen3_vl_30b_a3b
DATASET_NAME=cord_v2
SEQ_LENGTH=4096
TRAIN_ITERS=50
GLOBAL_BATCH_SIZE=32
MICRO_BATCH_SIZE=1
EVAL_ITERS=10
TRAIN_ITERS=100
GLOBAL_BATCH_SIZE=16
MICRO_BATCH_SIZE=2
EVAL_ITERS=20
EVAL_INTERVAL=20
LR=0.00005
MIN_LR=0.000005
LR_WARMUP_ITERS=10
LOG_INTERVAL=1
WANDB_PROJECT=megatron-bridge-${DATASET_NAME}

# EP/TP/PP combinations: "EP,TP,PP" configurations
PARALLELISM_CONFIGS=("8,1,1" "4,1,1" "2,1,1")
PARALLELISM_CONFIGS=("8,1,1" "4,1,1")

for config in "${PARALLELISM_CONFIGS[@]}"; do
IFS=',' read -r EP TP PP <<< "$config"

echo "Running LoRA finetuning with EP=$EP, TP=$TP, PP=$PP"
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--recipe ${MODEL_NAME}_peft_config \
--step_func qwen3_vl_step \
--peft_scheme lora \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
model.seq_length=$SEQ_LENGTH \
train.train_iters=$TRAIN_ITERS \
train.global_batch_size=$GLOBAL_BATCH_SIZE \
train.micro_batch_size=$MICRO_BATCH_SIZE \
train.eval_iters=$EVAL_ITERS \
validation.eval_iters=$EVAL_ITERS \
validation.eval_interval=$EVAL_INTERVAL \
optimizer.lr=$LR \
optimizer.min_lr=$MIN_LR \
scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \