Merged
102 changes: 2 additions & 100 deletions docs/models/vlm/qwen3-vl.md
@@ -13,107 +13,9 @@ Unless explicitly stated, any megatron model path in the commands below should N
[here](https://docs.nvidia.com/nemo/megatron-bridge/latest/training/checkpointing.html#checkpoint-contents)
```

## Conversion with 🤗 Hugging Face
## Examples

### Import HF → Megatron
To import the HF model to your desired `$MEGATRON_MODEL_PATH`, run the following command.
```bash
python examples/conversion/convert_checkpoints.py import \
--hf-model $HF_MODEL_PATH \
--megatron-path $MEGATRON_MODEL_PATH
```

### Export Megatron → HF
You can export a trained model with the following command.
```bash
python examples/conversion/convert_checkpoints.py export \
--hf-model $HF_MODEL_PATH \
--megatron-path <trained megatron model path> \
--hf-path <output hf model path>
```

### Run In-Framework Inference on Converted Checkpoint
You can run a quick sanity check on the converted checkpoint with the following command.
```bash
python examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path $HF_MODEL_PATH \
--megatron_model_path $MEGATRON_MODEL_PATH \
--image_path <example image path> \
--prompt "Describe this image." \
--max_new_tokens 100
```

## Finetuning Recipes
Before training, ensure the following environment variables are set:
1. `SAVE_DIR`: to specify a checkpoint and log saving directory
2. `HF_TOKEN`: to download models from HF Hub (if required)
3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
4. `WANDB_API_KEY`: (optional) to enable WandB logging
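
As a concrete starting point, a minimal setup might look like this (all values below are placeholders, not required names):

```shell
# Placeholder values -- substitute your own paths and keys.
export SAVE_DIR=/workspace/experiments/qwen3-vl
export HF_TOKEN=hf_your_token_here        # only needed for gated models
export HF_HOME=/workspace/hf_cache        # optional: reuse downloaded models/datasets
export WANDB_API_KEY=your_wandb_key_here  # optional: enable WandB logging
```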

### Full Finetuning

Example usage for full parameter finetuning:

```bash
torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen3_vl_8b_finetune_config \
--dataset-type hf \
dataset.maker_name=make_cord_v2_dataset \
train.global_batch_size=<batch size> \
train.train_iters=<number of iterations> \
logger.wandb_project=<optional wandb project name> \
logger.wandb_save_dir=$SAVE_DIR \
checkpoint.save=$SAVE_DIR/<experiment name>
```

For MoE models with expert parallelism:
```bash
torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen3_vl_30b_a3b_finetune_config \
--dataset-type hf \
dataset.maker_name=make_cord_v2_dataset \
train.global_batch_size=<batch size> \
train.train_iters=<number of iterations> \
checkpoint.save=$SAVE_DIR/<experiment name>
```

Note:
- The `--recipe` parameter selects the model configuration:
- `qwen3_vl_8b_finetune_config` - for 8B dense model
- `qwen3_vl_30b_a3b_finetune_config` - for 30B MoE model
- For dataset formats and additional information, refer to the [Qwen2.5-VL documentation]
- See the full script with examples at [`examples/models/vlm/qwen_vl/finetune_qwen_vl.py`](../../../examples/models/vlm/qwen_vl/finetune_qwen_vl.py)

### PEFT (Parameter-Efficient Fine-Tuning)

Qwen3-VL supports PEFT methods such as LoRA and DoRA for memory-efficient training. PEFT trains only the adapter parameters (roughly 1-2% of the model), significantly reducing memory requirements and speeding up training.
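
As a back-of-the-envelope check on that 1-2% figure, here is a small standalone sketch (plain Python, not Megatron-Bridge code; the dimensions and rank are illustrative, not Qwen3-VL's actual shapes):

```python
# Rough parameter count for a LoRA adapter on a single linear layer.
def lora_param_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Fraction of a d_in x d_out linear layer's weights that a
    rank-r LoRA adapter (A: d_in x r, B: r x d_out) adds as trainable."""
    full_params = d_in * d_out          # frozen base weight
    adapter_params = rank * (d_in + d_out)  # trainable A and B factors
    return adapter_params / full_params

# e.g. a 4096x4096 projection with a rank-16 adapter:
frac = lora_param_fraction(4096, 4096, 16)
print(f"trainable fraction: {frac:.2%}")  # -> trainable fraction: 0.78%
```

Summed over all adapted layers (with embeddings and most weights left frozen), this is roughly where the quoted 1-2% comes from.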

**LoRA with 8B Dense Model (1 GPU):**
```bash
torchrun --nproc-per-node=1 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen3_vl_8b_finetune_config \
--dataset-type hf \
--peft lora \
checkpoint.save=$SAVE_DIR/<experiment name>
```

**LoRA with 30B MoE Model (8 GPUs with Expert Parallelism):**
```bash
torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen3_vl_30b_a3b_finetune_config \
--dataset-type hf \
--peft lora \
checkpoint.save=$SAVE_DIR/<experiment name>
```

**DoRA Training:**

To use DoRA instead of LoRA, replace `--peft lora` in the commands above with:
```bash
--peft dora
```
For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Qwen3-VL Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/qwen3_vl/README.md).

## Hugging Face Model Cards
- Qwen3-VL-8B: `https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct`
14 changes: 8 additions & 6 deletions docs/training/multi-token-prediction.md
@@ -66,12 +66,13 @@ where:
Here's a minimal example using the Qwen3 30B-A3B recipe with MTP enabled:

```python
from megatron.bridge.recipes.qwen import qwen3_30b_a3b_pretrain
from megatron.bridge.recipes.qwen.qwen3_moe import qwen3_30b_a3b_pretrain_config
from megatron.bridge.training.pretrain import pretrain
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.config import ConfigContainer

log_dir = f"/path/to/log/dir"
log_dir = "/path/to/log/dir"
cfg: ConfigContainer = qwen3_30b_a3b_pretrain_config()
cfg.logger.log_dir = log_dir
cfg.logger.tensorboard_dir = log_dir + "/tb_logs"
cfg.checkpoint.save = log_dir + "/checkpoints"
cfg.checkpoint.load = log_dir + "/checkpoints"
@@ -82,10 +83,11 @@ cfg.dataset.blend=[[
], None]
cfg.dataset.split="9999,8,2"
cfg.dataset.path_to_cache = "/path/to/cache"
# cfg.model.num_layers = 8 # train a smaller model if OOM
# MTP Configuration
cfg.mtp_num_layers = 1
cfg.mtp_loss_scaling_factor = 0.1
pretrain(cfg)
cfg.model.mtp_num_layers = 1
cfg.model.mtp_loss_scaling_factor = 0.1
pretrain(cfg, forward_step)
```
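
To make the role of `mtp_loss_scaling_factor` concrete, here is a toy sketch of how such a knob typically enters the objective (illustrative semantics only, not the actual Megatron-Core loss code):

```python
# Toy combination of a main next-token loss with per-MTP-layer
# auxiliary losses, each scaled by the configured factor.
def combined_loss(main_loss: float, mtp_losses: list,
                  mtp_loss_scaling_factor: float = 0.1) -> float:
    return main_loss + mtp_loss_scaling_factor * sum(mtp_losses)

# With one MTP layer (mtp_num_layers = 1), as in the recipe above:
print(combined_loss(2.0, [1.5]))  # 2.0 + 0.1 * 1.5 = 2.15
```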
Follow the [DCLM Tutorial](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/tutorials/data/dclm) to prepare the training data.

4 changes: 4 additions & 0 deletions examples/models/vlm/gemma3_vl/peft.sh
@@ -16,6 +16,10 @@
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Before training, make sure to set WANDB_API_KEY or disable wandb logging
# export WANDB_API_KEY=<your_wandb_api_key>
# export WANDB_MODE=disabled

# Common configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/gemma-3-4b-it
MODEL_NAME=gemma3_vl_4b
4 changes: 4 additions & 0 deletions examples/models/vlm/gemma3_vl/sft.sh
@@ -16,6 +16,10 @@
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Before training, make sure to set WANDB_API_KEY or disable wandb logging
# export WANDB_API_KEY=<your_wandb_api_key>
# export WANDB_MODE=disabled

# Common configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/gemma-3-4b-it
MODEL_NAME=gemma3_vl_4b
2 changes: 2 additions & 0 deletions examples/models/vlm/glm_45v/slurm_peft.sh
@@ -90,6 +90,8 @@ export NCCL_NVLS_ENABLE=0
# Authentication tokens (set these for your environment)
# export HF_TOKEN="hf_your_token_here"
# export WANDB_API_KEY="your_wandb_key_here"
# or disable wandb logging
# export WANDB_MODE=disabled

# ==============================================================================
# Job Execution
2 changes: 2 additions & 0 deletions examples/models/vlm/glm_45v/slurm_sft.sh
@@ -90,6 +90,8 @@ export NCCL_NVLS_ENABLE=0
# Authentication tokens (set these for your environment)
# export HF_TOKEN="hf_your_token_here"
# export WANDB_API_KEY="your_wandb_key_here"
# or disable wandb logging
# export WANDB_MODE=disabled

# ==============================================================================
# Job Execution
10 changes: 7 additions & 3 deletions examples/models/vlm/ministral3/conversion.sh
@@ -16,17 +16,21 @@
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Note: Ministral 3 requires transformers version 5
# uv pip install --upgrade transformers
# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.

# Import HF → Megatron
uv run python examples/conversion/convert_checkpoints.py import \
uv run --no-sync python examples/conversion/convert_checkpoints.py import \
--hf-model mistralai/Ministral-3-3B-Instruct-2512-BF16 \
--megatron-path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16

# Export Megatron → HF
uv run python examples/conversion/convert_checkpoints.py export \
uv run --no-sync python examples/conversion/convert_checkpoints.py export \
--hf-model mistralai/Ministral-3-3B-Instruct-2512-BF16 \
--megatron-path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16/iter_0000000 \
--hf-path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16-hf-export

# Round-trip validation
uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
--hf-model-id mistralai/Ministral-3-3B-Instruct-2512-BF16 --tp 2 --pp 2
10 changes: 7 additions & 3 deletions examples/models/vlm/ministral3/inference.sh
@@ -16,8 +16,12 @@
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Note: Ministral 3 requires transformers version 5
# uv pip install --upgrade transformers
# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.

# Inference with Hugging Face checkpoints
uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
uv run --no-sync python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path mistralai/Ministral-3-3B-Instruct-2512-BF16 \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
@@ -26,7 +30,7 @@ uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf
--pp 2

# Inference with imported Megatron checkpoints
uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
uv run --no-sync python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path mistralai/Ministral-3-3B-Instruct-2512-BF16 \
--megatron_model_path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16/iter_0000000 \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
@@ -36,7 +40,7 @@ uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf
--pp 2

# Inference with exported HF checkpoints
uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
uv run --no-sync python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16-hf-export \
--image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
--prompt "Describe this image." \
10 changes: 9 additions & 1 deletion examples/models/vlm/ministral3/peft.sh
@@ -16,6 +16,14 @@
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Note: Ministral 3 requires transformers version 5
# uv pip install --upgrade transformers
# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.

# Before training, make sure to set WANDB_API_KEY or disable wandb logging
# export WANDB_API_KEY=<your_wandb_api_key>
# export WANDB_MODE=disabled

# Common configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16
MODEL_NAME=ministral3_3b
@@ -38,7 +46,7 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do
IFS=',' read -r TP PP <<< "$config"

echo "Running LoRA finetuning with TP=$TP, PP=$PP"
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--step_func vlm_step \
--peft_scheme lora \
10 changes: 9 additions & 1 deletion examples/models/vlm/ministral3/sft.sh
@@ -16,6 +16,14 @@
# Workspace directory for checkpoints and results
WORKSPACE=${WORKSPACE:-/workspace}

# Note: Ministral 3 requires transformers version 5
# uv pip install --upgrade transformers
# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.

# Before training, make sure to set WANDB_API_KEY or disable wandb logging
# export WANDB_API_KEY=<your_wandb_api_key>
# export WANDB_MODE=disabled

# Common configurations
PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16
MODEL_NAME=ministral3_3b
@@ -38,7 +46,7 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do
IFS=',' read -r TP PP <<< "$config"

echo "Running full finetuning with TP=$TP, PP=$PP"
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
--recipe ${MODEL_NAME}_finetune_config \
--step_func vlm_step \
checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \