diff --git a/docs/models/vlm/qwen3-vl.md b/docs/models/vlm/qwen3-vl.md
index ae8ed510b8..f87cfcf902 100644
--- a/docs/models/vlm/qwen3-vl.md
+++ b/docs/models/vlm/qwen3-vl.md
@@ -13,107 +13,9 @@ Unless explicitly stated, any megatron model path in the commands below should N
 [here](https://docs.nvidia.com/nemo/megatron-bridge/latest/training/checkpointing.html#checkpoint-contents)
 ```

-## Conversion with 🤗 Hugging Face
+## Examples

-### Import HF → Megatron
-To import the HF model to your desired `$MEGATRON_MODEL_PATH`, run the following command.
-```bash
-python examples/conversion/convert_checkpoints.py import \
---hf-model $HF_MODEL_PATH \
---megatron-path $MEGATRON_MODEL_PATH
-```
-
-### Export Megatron → HF
-You can export a trained model with the following command.
-```bash
-python examples/conversion/convert_checkpoints.py export \
---hf-model $HF_MODEL_PATH \
---megatron-path \
---hf-path
-```
-
-### Run In-Framework Inference on Converted Checkpoint
-You can run a quick sanity check on the converted checkpoint with the following command.
-```bash
-python examples/conversion/hf_to_megatron_generate_vlm.py \
---hf_model_path $HF_MODEL_PATH \
---megatron_model_path $MEGATRON_MODEL_PATH \
---image_path \
---prompt "Describe this image." \
---max_new_tokens 100
-```
-
-## Finetuning Recipes
-Before training, ensure the following environment variables are set:
-1. `SAVE_DIR`: to specify a checkpoint and log saving directory
-2. `HF_TOKEN`: to download models from HF Hub (if required)
-3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
-4. `WANDB_API_KEY`: (optional) to enable WandB logging
-
-### Full Finetuning
-
-Example usage for full parameter finetuning:
-
-```bash
-torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
---pretrained-checkpoint $MEGATRON_MODEL_PATH \
---recipe qwen3_vl_8b_finetune_config \
---dataset-type hf \
-dataset.maker_name=make_cord_v2_dataset \
-train.global_batch_size= \
-train.train_iters= \
-logger.wandb_project= \
-logger.wandb_save_dir=$SAVE_DIR \
-checkpoint.save=$SAVE_DIR/
-```
-
-For MoE models with expert parallelism:
-```bash
-torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
---pretrained-checkpoint $MEGATRON_MODEL_PATH \
---recipe qwen3_vl_30b_a3b_finetune_config \
---dataset-type hf \
-dataset.maker_name=make_cord_v2_dataset \
-train.global_batch_size= \
-train.train_iters= \
-checkpoint.save=$SAVE_DIR/
-```
-
-Note:
-- The `--recipe` parameter selects the model configuration:
-  - `qwen3_vl_8b_finetune_config` - for 8B dense model
-  - `qwen3_vl_30b_a3b_finetune_config` - for 30B MoE model
-- For dataset formats and additional information, refer to the [Qwen2.5-VL documentation]
-- See the full script with examples at [`examples/models/vlm/qwen_vl/finetune_qwen_vl.py`](../../../examples/models/vlm/qwen_vl/finetune_qwen_vl.py)
-
-### PEFT (Parameter-Efficient Fine-Tuning)
-
-Qwen3-VL supports PEFT methods including LoRA and DoRA for memory-efficient training. PEFT trains only adapter parameters (~1-2% of model), significantly reducing memory requirements and enabling faster training.
-
-**LoRA with 8B Dense Model (1 GPU):**
-```bash
-torchrun --nproc-per-node=1 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
---pretrained-checkpoint $MEGATRON_MODEL_PATH \
---recipe qwen3_vl_8b_finetune_config \
---dataset-type hf \
---peft lora \
-checkpoint.save=$SAVE_DIR/
-```
-
-**LoRA with 30B MoE Model (8 GPUs with Expert Parallelism):**
-```bash
-torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
---pretrained-checkpoint $MEGATRON_MODEL_PATH \
---recipe qwen3_vl_30b_a3b_finetune_config \
---dataset-type hf \
---peft lora \
-checkpoint.save=$SAVE_DIR/
-```
-
-**DoRA Training:**
-```bash
---peft dora
-```
+For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Qwen3-VL Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/qwen3_vl/README.md).

 ## Hugging Face Model Cards
 - Qwen3-VL-8B: `https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct`
diff --git a/docs/training/multi-token-prediction.md b/docs/training/multi-token-prediction.md
index d7c3f63b2b..3cfbd5a149 100644
--- a/docs/training/multi-token-prediction.md
+++ b/docs/training/multi-token-prediction.md
@@ -66,12 +66,13 @@ where:
 Here's a minimal example using the Qwen3 30B-A3B recipe with MTP enabled:

 ```python
-from megatron.bridge.recipes.qwen import qwen3_30b_a3b_pretrain
+from megatron.bridge.recipes.qwen.qwen3_moe import qwen3_30b_a3b_pretrain_config
 from megatron.bridge.training.pretrain import pretrain
+from megatron.bridge.training.gpt_step import forward_step
+from megatron.bridge.training.config import ConfigContainer

-log_dir = f"/path/to/log/dir"
+log_dir = "/path/to/log/dir"
 cfg: ConfigContainer = qwen3_30b_a3b_pretrain_config()
-cfg.logger.log_dir = log_dir
 cfg.logger.tensorboard_dir = log_dir + "/tb_logs"
 cfg.checkpoint.save = log_dir + "/checkpoints"
 cfg.checkpoint.load = log_dir + "/checkpoints"
@@ -82,10 +83,11 @@ cfg.dataset.blend=[[
 ], None]
 cfg.dataset.split="9999,8,2"
 cfg.dataset.path_to_cache = "/path/to/cache"
+# cfg.model.num_layers = 8 # train a smaller model if OOM

 # MTP Configuration
-cfg.mtp_num_layers = 1
-cfg.mtp_loss_scaling_factor = 0.1
-pretrain(cfg)
+cfg.model.mtp_num_layers = 1
+cfg.model.mtp_loss_scaling_factor = 0.1
+pretrain(cfg, forward_step)
 ```

 Follow the [DCLM Tutorial](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/tutorials/data/dclm) to prepare the training data
diff --git a/examples/models/vlm/gemma3_vl/peft.sh b/examples/models/vlm/gemma3_vl/peft.sh
index a4786e14d3..c966900ee4 100644
--- a/examples/models/vlm/gemma3_vl/peft.sh
+++ b/examples/models/vlm/gemma3_vl/peft.sh
@@ -16,6 +16,10 @@
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

+# Before training, make sure to set WANDB_API_KEY or disable wandb logging
+# export WANDB_API_KEY=
+# export WANDB_MODE=disabled
+
 # Common configurations
 PRETRAINED_CHECKPOINT=${WORKSPACE}/models/gemma-3-4b-it
 MODEL_NAME=gemma3_vl_4b
diff --git a/examples/models/vlm/gemma3_vl/sft.sh b/examples/models/vlm/gemma3_vl/sft.sh
index b7715c4eaf..820c9c3298 100755
--- a/examples/models/vlm/gemma3_vl/sft.sh
+++ b/examples/models/vlm/gemma3_vl/sft.sh
@@ -16,6 +16,10 @@
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

+# Before training, make sure to set WANDB_API_KEY or disable wandb logging
+# export WANDB_API_KEY=
+# export WANDB_MODE=disabled
+
 # Common configurations
 PRETRAINED_CHECKPOINT=${WORKSPACE}/models/gemma-3-4b-it
 MODEL_NAME=gemma3_vl_4b
diff --git a/examples/models/vlm/glm_45v/slurm_peft.sh b/examples/models/vlm/glm_45v/slurm_peft.sh
index 017da6a74c..e876089c34 100755
--- a/examples/models/vlm/glm_45v/slurm_peft.sh
+++ b/examples/models/vlm/glm_45v/slurm_peft.sh
@@ -90,6 +90,8 @@ export NCCL_NVLS_ENABLE=0
 # Authentication tokens (set these for your environment)
 # export HF_TOKEN="hf_your_token_here"
 # export WANDB_API_KEY="your_wandb_key_here"
+# or disable wandb logging
+# export WANDB_MODE=disabled

 # ==============================================================================
 # Job Execution
diff --git a/examples/models/vlm/glm_45v/slurm_sft.sh b/examples/models/vlm/glm_45v/slurm_sft.sh
index f23dee3c43..e76e6968a1 100644
--- a/examples/models/vlm/glm_45v/slurm_sft.sh
+++ b/examples/models/vlm/glm_45v/slurm_sft.sh
@@ -90,6 +90,8 @@ export NCCL_NVLS_ENABLE=0
 # Authentication tokens (set these for your environment)
 # export HF_TOKEN="hf_your_token_here"
 # export WANDB_API_KEY="your_wandb_key_here"
+# or disable wandb logging
+# export WANDB_MODE=disabled

 # ==============================================================================
 # Job Execution
diff --git a/examples/models/vlm/ministral3/conversion.sh b/examples/models/vlm/ministral3/conversion.sh
index 296af05d3c..7b0bbad008 100755
--- a/examples/models/vlm/ministral3/conversion.sh
+++ b/examples/models/vlm/ministral3/conversion.sh
@@ -16,17 +16,21 @@
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

+# Note: Ministral 3 requires transformers version 5
+# uv pip install --upgrade transformers
+# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.
+
 # Import HF → Megatron
-uv run python examples/conversion/convert_checkpoints.py import \
+uv run --no-sync python examples/conversion/convert_checkpoints.py import \
   --hf-model mistralai/Ministral-3-3B-Instruct-2512-BF16 \
   --megatron-path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16

 # Export Megatron → HF
-uv run python examples/conversion/convert_checkpoints.py export \
+uv run --no-sync python examples/conversion/convert_checkpoints.py export \
   --hf-model mistralai/Ministral-3-3B-Instruct-2512-BF16 \
   --megatron-path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16/iter_0000000 \
   --hf-path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16-hf-export

 # Round-trip validation
-uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
+uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
   --hf-model-id mistralai/Ministral-3-3B-Instruct-2512-BF16 --tp 2 --pp 2
diff --git a/examples/models/vlm/ministral3/inference.sh b/examples/models/vlm/ministral3/inference.sh
index 98e20c2050..de0b8bee29 100755
--- a/examples/models/vlm/ministral3/inference.sh
+++ b/examples/models/vlm/ministral3/inference.sh
@@ -16,8 +16,12 @@
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

+# Note: Ministral 3 requires transformers version 5
+# uv pip install --upgrade transformers
+# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.
+
 # Inference with Hugging Face checkpoints
-uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+uv run --no-sync python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
   --hf_model_path mistralai/Ministral-3-3B-Instruct-2512-BF16 \
   --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
   --prompt "Describe this image." \
@@ -26,7 +30,7 @@ uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf
   --pp 2

 # Inference with imported Megatron checkpoints
-uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+uv run --no-sync python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
   --hf_model_path mistralai/Ministral-3-3B-Instruct-2512-BF16 \
   --megatron_model_path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16/iter_0000000 \
   --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
   --prompt "Describe this image." \
@@ -36,7 +40,7 @@ uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf
   --pp 2

 # Inference with exported HF checkpoints
-uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+uv run --no-sync python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
   --hf_model_path ${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16-hf-export \
   --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
   --prompt "Describe this image." \
diff --git a/examples/models/vlm/ministral3/peft.sh b/examples/models/vlm/ministral3/peft.sh
index 0fb8e1b38e..b3c44a2f86 100755
--- a/examples/models/vlm/ministral3/peft.sh
+++ b/examples/models/vlm/ministral3/peft.sh
@@ -16,6 +16,14 @@
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

+# Note: Ministral 3 requires transformers version 5
+# uv pip install --upgrade transformers
+# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.
+
+# Before training, make sure to set WANDB_API_KEY or disable wandb logging
+# export WANDB_API_KEY=
+# export WANDB_MODE=disabled
+
 # Common configurations
 PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16
 MODEL_NAME=ministral3_3b
@@ -38,7 +46,7 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do
   IFS=',' read -r TP PP <<< "$config"

   echo "Running LoRA finetuning with TP=$TP, PP=$PP"
-  uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
+  uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
     --recipe ${MODEL_NAME}_finetune_config \
     --step_func vlm_step \
     --peft_scheme lora \
diff --git a/examples/models/vlm/ministral3/sft.sh b/examples/models/vlm/ministral3/sft.sh
index 193afaf10e..a22eebbb03 100755
--- a/examples/models/vlm/ministral3/sft.sh
+++ b/examples/models/vlm/ministral3/sft.sh
@@ -16,6 +16,14 @@
 # Workspace directory for checkpoints and results
 WORKSPACE=${WORKSPACE:-/workspace}

+# Note: Ministral 3 requires transformers version 5
+# uv pip install --upgrade transformers
+# Commands below use uv run --no-sync to avoid conflicts with the virtual environment.
+
+# Before training, make sure to set WANDB_API_KEY or disable wandb logging
+# export WANDB_API_KEY=
+# export WANDB_MODE=disabled
+
 # Common configurations
 PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Ministral-3-3B-Instruct-2512-BF16
 MODEL_NAME=ministral3_3b
@@ -38,7 +46,7 @@ for config in "${PARALLELISM_CONFIGS[@]}"; do
   IFS=',' read -r TP PP <<< "$config"

   echo "Running full finetuning with TP=$TP, PP=$PP"
-  uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
+  uv run --no-sync python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
     --recipe ${MODEL_NAME}_finetune_config \
     --step_func vlm_step \
     checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
diff --git a/examples/models/vlm/qwen3_vl/README.md b/examples/models/vlm/qwen3_vl/README.md
new file mode 100644
index 0000000000..f62008f601
--- /dev/null
+++ b/examples/models/vlm/qwen3_vl/README.md
@@ -0,0 +1,120 @@
+# Qwen 3 VL - Vision Language Model
+
+This directory contains example scripts for Qwen 3 vision-language models.
+
+For model introduction and architecture details, see the [Qwen 3 - VL documentation](../../../../docs/models/vlm/qwen3-vl.md).
+
+## Workspace Configuration
+
+All scripts use a `WORKSPACE` environment variable to define the base directory for checkpoints and results. By default, this is set to `/workspace`. You can override it:
+
+```bash
+export WORKSPACE=/your/custom/path
+```
+
+Directory structure:
+- `${WORKSPACE}/models/` - Converted checkpoints
+- `${WORKSPACE}/results/` - Training outputs and experiment results
+
+## Checkpoint Conversion
+
+### Import HF → Megatron
+To import the HF VL model to your desired Megatron path:
+```bash
+python examples/conversion/convert_checkpoints.py import \
+  --hf-model Qwen/Qwen3-VL-8B-Instruct \
+  --megatron-path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct
+```
+
+### Export Megatron → HF
+```bash
+python examples/conversion/convert_checkpoints.py export \
+  --hf-model Qwen/Qwen3-VL-8B-Instruct \
+  --megatron-path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct/iter_0000000 \
+  --hf-path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct-hf-export
+```
+
+## Inference
+
+### Run Inference on Converted Checkpoint
+
+```bash
+python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path Qwen/Qwen3-VL-8B-Instruct \
+  --megatron_model_path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct/iter_0000000 \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100
+```
+
+Note:
+- `--megatron_model_path` is optional. If not specified, the script will convert the model and then run forward.
+- You can also use image URLs: `--image_path="https://example.com/image.jpg"`
+
+See the [inference.sh](inference.sh) script for commands to:
+- Run inference with Hugging Face checkpoints
+- Run inference with imported Megatron checkpoints
+- Run inference with exported Hugging Face checkpoints
+
+**Expected output:**
+```
+...
+Generation step 46
+Generation step 47
+Generation step 48
+Generation step 49
+======== GENERATED TEXT OUTPUT ========
+Image: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png
+Prompt: Describe this image.
+Generated: <|im_start|>user
+<|vision_start|><|image_pad|><|image_pad|>
+...
+<|image_pad|><|vision_end|>Describe this image.<|im_end|>
+<|im_start|>assistant
+This image displays a **technical specifications table** comparing two variants of NVIDIA's H100 GPU: the **H100 SXM** and the **H100 NVL**.
+
+The table is organized into rows, each detailing a specific performance or hardware characteristic, with columns showing the corresponding value for each GPU variant.
+
+Here is a breakdown of the key specifications:
+
+**Performance (FLOPS & TOPS):**
+* **FP64 (Double Precision):** The
+=======================================
+```
+
+## Finetune Recipes
+
+- Available recipes:
+  - `qwen3_vl_8b_finetune_config`: Finetuning for 8B VL model with PEFT support
+  - `qwen3_vl_30b_a3b_finetune_config`: Finetuning for 30B-A3B VL model with PEFT support
+  - `qwen3_vl_235b_a22b_finetune_config`: Finetuning for 235B-A22B VL model with PEFT support
+
+Before training, ensure the following environment variables are set:
+1. `HF_TOKEN`: to download models from HF Hub (if required)
+2. `HF_HOME`: (optional) to avoid re-downloading models and datasets
+3. `WANDB_API_KEY`: (optional) to enable WandB logging
+
+### Pretrain
+
+- Available recipes:
+  - `qwen3_vl_8b_pretrain_config`: Pretraining for 8B VL model with PEFT support
+  - `qwen3_vl_30b_a3b_pretrain_config`: Pretraining for 30B-A3B VL model with PEFT support
+  - `qwen3_vl_235b_a22b_pretrain_config`: Pretraining for 235B-A22B VL model with PEFT support
+
+### Supervised Fine-Tuning (SFT)
+
+See the [sft.sh](sft.sh) script for full parameter fine-tuning with configurable model parallelisms.
+
+W&B report coming soon.
+
+### Parameter-Efficient Fine-Tuning (PEFT) with LoRA
+
+See the [peft.sh](peft.sh) script for LoRA fine-tuning with configurable tensor and pipeline parallelism.
+
+W&B report coming soon.
+
+**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs.
+
+## Evaluation
+
+Coming soon.
diff --git a/examples/models/vlm/qwen3_vl/conversion.sh b/examples/models/vlm/qwen3_vl/conversion.sh
new file mode 100755
index 0000000000..1a9a20a798
--- /dev/null
+++ b/examples/models/vlm/qwen3_vl/conversion.sh
@@ -0,0 +1,47 @@
+#!/usr/bin/env bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Import HF → Megatron for dense model
+uv run python examples/conversion/convert_checkpoints.py import \
+  --hf-model Qwen/Qwen3-VL-8B-Instruct \
+  --megatron-path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct
+
+# Export Megatron → HF for dense model
+uv run python examples/conversion/convert_checkpoints.py export \
+  --hf-model Qwen/Qwen3-VL-8B-Instruct \
+  --megatron-path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct/iter_0000000 \
+  --hf-path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct-hf-export
+
+# Round-trip validation for dense model
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
+  --hf-model-id Qwen/Qwen3-VL-8B-Instruct --tp 2 --pp 2
+
+# Import HF → Megatron for MoE model
+uv run python examples/conversion/convert_checkpoints.py import \
+  --hf-model Qwen/Qwen3-VL-30B-A3B-Instruct \
+  --megatron-path ${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct
+
+# Export Megatron → HF for MoE model
+uv run python examples/conversion/convert_checkpoints.py export \
+  --hf-model Qwen/Qwen3-VL-30B-A3B-Instruct \
+  --megatron-path ${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct/iter_0000000 \
+  --hf-path ${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct-hf-export
+
+# Round-trip validation for MoE model
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
+  --hf-model-id Qwen/Qwen3-VL-30B-A3B-Instruct --ep 8
diff --git a/examples/models/vlm/qwen3_vl/inference.sh b/examples/models/vlm/qwen3_vl/inference.sh
new file mode 100755
index 0000000000..f4e9a8483f
--- /dev/null
+++ b/examples/models/vlm/qwen3_vl/inference.sh
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Inference with Hugging Face checkpoints - Dense model
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path Qwen/Qwen3-VL-8B-Instruct \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100 \
+  --tp 2 \
+  --pp 2
+
+# Inference with imported Megatron checkpoints - Dense model
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path Qwen/Qwen3-VL-8B-Instruct \
+  --megatron_model_path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct/iter_0000000 \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100 \
+  --tp 2 \
+  --pp 2
+
+# Inference with exported HF checkpoints - Dense model
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path ${WORKSPACE}/models/Qwen3-VL-8B-Instruct-hf-export \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100 \
+  --tp 2 \
+  --pp 2
+
+# Inference with Hugging Face checkpoints - MoE model
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path Qwen/Qwen3-VL-30B-A3B-Instruct \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100 \
+  --ep 8
+
+# Inference with imported Megatron checkpoints - MoE model
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path Qwen/Qwen3-VL-30B-A3B-Instruct \
+  --megatron_model_path ${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct/iter_0000000 \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100 \
+  --ep 8
+
+# Inference with exported HF checkpoints - MoE model
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
+  --hf_model_path ${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct-hf-export \
+  --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+  --prompt "Describe this image." \
+  --max_new_tokens 100 \
+  --ep 8
diff --git a/examples/models/vlm/qwen3_vl/peft.sh b/examples/models/vlm/qwen3_vl/peft.sh
new file mode 100644
index 0000000000..6cddb470a0
--- /dev/null
+++ b/examples/models/vlm/qwen3_vl/peft.sh
@@ -0,0 +1,114 @@
+#!/usr/bin/env bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Before training, make sure to set WANDB_API_KEY or disable wandb logging
+# export WANDB_API_KEY=
+# export WANDB_MODE=disabled
+
+# Common configurations for dense model finetuning
+PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-8B-Instruct
+MODEL_NAME=qwen3_vl_8b
+DATASET_NAME=cord_v2
+SEQ_LENGTH=4096
+TRAIN_ITERS=50
+GLOBAL_BATCH_SIZE=32
+MICRO_BATCH_SIZE=1
+EVAL_ITERS=10
+LR=0.00005
+MIN_LR=0.000005
+LR_WARMUP_ITERS=10
+LOG_INTERVAL=1
+WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
+
+# TP/PP combinations: "TP,PP"
+PARALLELISM_CONFIGS=("2,1" "1,2")
+
+for config in "${PARALLELISM_CONFIGS[@]}"; do
+  IFS=',' read -r TP PP <<< "$config"
+
+  echo "Running LoRA finetuning with TP=$TP, PP=$PP"
+  uv run python -m torch.distributed.run --nproc_per_node=2 scripts/training/run_recipe.py \
+    --recipe ${MODEL_NAME}_finetune_config \
+    --step_func vlm_step \
+    --peft_scheme lora \
+    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    model.seq_length=$SEQ_LENGTH \
+    train.train_iters=$TRAIN_ITERS \
+    train.global_batch_size=$GLOBAL_BATCH_SIZE \
+    train.micro_batch_size=$MICRO_BATCH_SIZE \
+    train.eval_iters=$EVAL_ITERS \
+    optimizer.lr=$LR \
+    optimizer.min_lr=$MIN_LR \
+    scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
+    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_lora_tp${TP}_pp${PP} \
+    logger.log_interval=$LOG_INTERVAL \
+    logger.wandb_project=$WANDB_PROJECT \
+    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_lora_tp${TP}_pp${PP} \
+    dataset.maker_name=make_${DATASET_NAME}_dataset \
+    dataset.seq_length=$SEQ_LENGTH \
+    model.tensor_model_parallel_size=$TP \
+    model.pipeline_model_parallel_size=$PP
+done
+
+
+# Common configurations for MoE model finetuning
+PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct
+MODEL_NAME=qwen3_vl_30b_a3b
+DATASET_NAME=cord_v2
+SEQ_LENGTH=4096
+TRAIN_ITERS=50
+GLOBAL_BATCH_SIZE=32
+MICRO_BATCH_SIZE=1
+EVAL_ITERS=10
+LR=0.00005
+MIN_LR=0.000005
+LR_WARMUP_ITERS=10
+LOG_INTERVAL=1
+WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
+
+# EP/TP/PP combinations: "EP,TP,PP" configurations
+PARALLELISM_CONFIGS=("8,1,1" "1,4,2")
+
+for config in "${PARALLELISM_CONFIGS[@]}"; do
+  IFS=',' read -r EP TP PP <<< "$config"
+
+  echo "Running LoRA finetuning with EP=$EP, TP=$TP, PP=$PP"
+  uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
+    --recipe ${MODEL_NAME}_finetune_config \
+    --step_func vlm_step \
+    --peft_scheme lora \
+    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    model.seq_length=$SEQ_LENGTH \
+    train.train_iters=$TRAIN_ITERS \
+    train.global_batch_size=$GLOBAL_BATCH_SIZE \
+    train.micro_batch_size=$MICRO_BATCH_SIZE \
+    train.eval_iters=$EVAL_ITERS \
+    optimizer.lr=$LR \
+    optimizer.min_lr=$MIN_LR \
+    scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
+    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_lora_ep${EP}_tp${TP}_pp${PP} \
+    logger.log_interval=$LOG_INTERVAL \
+    logger.wandb_project=$WANDB_PROJECT \
+    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_lora_ep${EP}_tp${TP}_pp${PP} \
+    dataset.maker_name=make_${DATASET_NAME}_dataset \
+    dataset.seq_length=$SEQ_LENGTH \
+    model.expert_model_parallel_size=$EP \
+    model.tensor_model_parallel_size=$TP \
+    model.pipeline_model_parallel_size=$PP
+done
diff --git a/examples/models/vlm/qwen3_vl/sft.sh b/examples/models/vlm/qwen3_vl/sft.sh
new file mode 100755
index 0000000000..0a26786273
--- /dev/null
+++ b/examples/models/vlm/qwen3_vl/sft.sh
@@ -0,0 +1,112 @@
+#!/usr/bin/env bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Before training, make sure to set WANDB_API_KEY or disable wandb logging
+# export WANDB_API_KEY=
+# export WANDB_MODE=disabled
+
+# Common configurations for dense model finetuning
+PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-8B-Instruct
+MODEL_NAME=qwen3_vl_8b
+DATASET_NAME=cord_v2
+SEQ_LENGTH=4096
+TRAIN_ITERS=50
+GLOBAL_BATCH_SIZE=32
+MICRO_BATCH_SIZE=1
+EVAL_ITERS=10
+LR=0.00005
+MIN_LR=0.000005
+LR_WARMUP_ITERS=10
+LOG_INTERVAL=1
+WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
+
+# TP/PP combinations: "TP,PP"
+PARALLELISM_CONFIGS=("2,1" "1,2")
+
+for config in "${PARALLELISM_CONFIGS[@]}"; do
+  IFS=',' read -r TP PP <<< "$config"
+
+  echo "Running full finetuning with TP=$TP, PP=$PP"
+  uv run python -m torch.distributed.run --nproc_per_node=2 scripts/training/run_recipe.py \
+    --recipe ${MODEL_NAME}_finetune_config \
+    --step_func vlm_step \
+    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    model.seq_length=$SEQ_LENGTH \
+    train.train_iters=$TRAIN_ITERS \
+    train.global_batch_size=$GLOBAL_BATCH_SIZE \
+    train.micro_batch_size=$MICRO_BATCH_SIZE \
+    train.eval_iters=$EVAL_ITERS \
+    optimizer.lr=$LR \
+    optimizer.min_lr=$MIN_LR \
+    scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
+    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_sft_tp${TP}_pp${PP} \
+    logger.log_interval=$LOG_INTERVAL \
+    logger.wandb_project=$WANDB_PROJECT \
+    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_sft_tp${TP}_pp${PP} \
+    dataset.maker_name=make_${DATASET_NAME}_dataset \
+    dataset.seq_length=$SEQ_LENGTH \
+    model.tensor_model_parallel_size=$TP \
+    model.pipeline_model_parallel_size=$PP
+done
+
+
+# Common configurations for MoE model finetuning
+PRETRAINED_CHECKPOINT=${WORKSPACE}/models/Qwen3-VL-30B-A3B-Instruct
+MODEL_NAME=qwen3_vl_30b_a3b
+DATASET_NAME=cord_v2
+SEQ_LENGTH=4096
+TRAIN_ITERS=50
+GLOBAL_BATCH_SIZE=32
+MICRO_BATCH_SIZE=1
+EVAL_ITERS=10
+LR=0.00005
+MIN_LR=0.000005
+LR_WARMUP_ITERS=10
+LOG_INTERVAL=1
+WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
+
+# EP/TP/PP/SP combinations: "EP,TP,PP,SP" configurations
+PARALLELISM_CONFIGS=("8,1,1,False" "1,4,2,False" "2,2,2,True")
+
+for config in "${PARALLELISM_CONFIGS[@]}"; do
+  IFS=',' read -r EP TP PP SP <<< "$config"
+
+  echo "Running full finetuning with EP=$EP, TP=$TP, PP=$PP, SP=$SP"
+  uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
+    --recipe ${MODEL_NAME}_finetune_config \
+    --step_func vlm_step \
+    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    model.seq_length=$SEQ_LENGTH \
+    train.train_iters=$TRAIN_ITERS \
+    train.global_batch_size=$GLOBAL_BATCH_SIZE \
+    train.micro_batch_size=$MICRO_BATCH_SIZE \
+    train.eval_iters=$EVAL_ITERS \
+    optimizer.lr=$LR \
+    optimizer.min_lr=$MIN_LR \
+    scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
+    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_sft_ep${EP}_tp${TP}_pp${PP}_sp_${SP} \
+    logger.log_interval=$LOG_INTERVAL \
+    logger.wandb_project=$WANDB_PROJECT \
+    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_sft_ep${EP}_tp${TP}_pp${PP}_sp_${SP} \
+    dataset.maker_name=make_${DATASET_NAME}_dataset \
+    dataset.seq_length=$SEQ_LENGTH \
+    model.expert_model_parallel_size=$EP \
+    model.tensor_model_parallel_size=$TP \
+    model.pipeline_model_parallel_size=$PP \
+    model.sequence_parallel=$SP
+done
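The sweep loops in `peft.sh` and `sft.sh` above all rely on the same pattern: each entry of `PARALLELISM_CONFIGS` packs a parallelism combination into a comma-separated string, which `read` splits into named variables via a one-off `IFS`. A minimal standalone sketch of that pattern (the variable names mirror the scripts; the `echo` stands in for the `torch.distributed.run` invocation):

```shell
#!/usr/bin/env bash
# Each entry encodes one tensor/pipeline parallel combination as "TP,PP".
PARALLELISM_CONFIGS=("2,1" "1,2")

for config in "${PARALLELISM_CONFIGS[@]}"; do
  # IFS=',' applies only to this `read`, so the global IFS is untouched;
  # -r prevents backslash interpretation in the input.
  IFS=',' read -r TP PP <<< "$config"
  echo "TP=$TP PP=$PP"
done
```

Because `IFS` is set only on the `read` command itself, word splitting elsewhere in the script is unaffected; adding a field (e.g. `EP` or `SP`, as the MoE loops do) only requires extending both the string entries and the variable list passed to `read`.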