Merged
2 changes: 1 addition & 1 deletion README.md
@@ -121,7 +121,7 @@ For a deeper dive into conversion design and advanced usage, see the [models REA
- Optimized paths when Transformer Engine is available
- **Flexible to Customize**: Lightweight custom training loop making it easy to configure custom logic in data loading, distributed training, checkpointing, evaluation and logging ([training framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/training), [training utilities](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/training/utils))
- **Supervised & Parameter-Efficient Finetuning**: SFT & PEFT implementation tailored for Megatron-based models that supports LoRA, DoRA, and user-defined PEFT methods ([PEFT implementations](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/peft), [finetune module](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/training/finetune.py), [SFT dataset](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/data/datasets/sft.py))
-- **SOTA Training Recipes**: Pre-configured production-ready training recipes for popular models like Llama 3, with optimized hyperparameters and distributed training configuration ([Llama recipes](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/recipes/llama), [recipe examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/recipes))
+- **SOTA Training Recipes**: Pre-configured production-ready training recipes for popular models like Llama 3, with optimized hyperparameters and distributed training configuration ([Llama recipes](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/src/megatron/bridge/recipes/llama), [recipe examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models))
- **Performance Optimization**: Built-in support for FP8 training, model parallelism, and memory-efficient techniques to offer high utilization and near-linear scalability to thousands of nodes. ([mixed precision](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/training/mixed_precision.py), [communication overlap](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/training/comm_overlap.py), [optimizer utilities](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/recipes/utils/optimizer_utils.py))

## Supported Models
2 changes: 1 addition & 1 deletion docs/megatron-lm-to-megatron-bridge.md
@@ -7,7 +7,7 @@ Megatron Bridge is Python-first: configure models, data, and training via typed
Run your example training entrypoint and override config keys directly:

```bash
-python examples/recipes/llama/pretrain_llama3_8b.py \
+python examples/models/llama/pretrain_llama3_8b.py \
train.micro_batch_size=2 \
train.global_batch_size=128 \
model.num_layers=32 model.hidden_size=4096 model.num_attention_heads=32 \
6 changes: 3 additions & 3 deletions docs/models/llm/nemotron3.md
@@ -40,7 +40,7 @@ python examples/conversion/convert_checkpoints.py export \
BLEND_PATH=/path/to/dataset/blend
TOKENIZER_MODEL=/path/to/tiktok/tokenizer/model

-torchrun --nproc-per-node=8 examples/recipes/nemotron_3/pretrain_nemotron_3_nano.py \
+torchrun --nproc-per-node=8 examples/models/nemotron_3/pretrain_nemotron_3_nano.py \
--per-split-data-args-path=${BLEND_PATH} \
--tokenizer-model=${TOKENIZER_MODEL} \
train.global_batch_size=3072 \
@@ -58,7 +58,7 @@ Notes:

### Full Parameter Fine-Tuning
```bash
-torchrun --nproc-per-node=8 examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
+torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_nano.py \
train.global_batch_size=128 \
train.train_iters=100 \
scheduler.lr_warmup_iters=10 \
@@ -74,7 +74,7 @@ Notes:
### LoRA Fine-Tuning
To enable LoRA fine-tuning, pass `--peft lora` to the script:
```bash
-torchrun --nproc-per-node=8 examples/recipes/nemotron_3/finetune_nemotron_3_nano.py \
+torchrun --nproc-per-node=8 examples/models/nemotron_3/finetune_nemotron_3_nano.py \
--peft lora \
train.global_batch_size=128 \
train.train_iters=100 \
2 changes: 1 addition & 1 deletion docs/models/llm/nemotronh.md
@@ -184,7 +184,7 @@ bridge.export_ckpt(
## Examples

- Checkpoint conversion: [examples/conversion/convert_checkpoints.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/convert_checkpoints.py)
-- Training scripts: [examples/recipes/train_any_basic.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/recipes/train_any_basic.py)
+- Training scripts: [examples/models/train_any_basic.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/train_any_basic.py)

## Finetuning Recipes

8 changes: 4 additions & 4 deletions docs/models/vlm/ministral3.md
@@ -99,7 +99,7 @@ Before training, ensure the following environment variables are set:
### Full Finetuning

```bash
-torchrun --nproc-per-node=8 examples/recipes/ministral3/finetune_ministral3_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/ministral3/finetune_ministral3_vl.py \
> **Review comment (Contributor):** just realized there's no finetune_ministral3_vl.py in the repo
>
> **Reply (Contributor Author):** let's fix later
--pretrained-checkpoint /models/ministral3-3b \
--dataset-type hf \
train.global_batch_size=32 \
@@ -124,7 +124,7 @@ config = ministral3_3b_finetune_config(
### Parameter-Efficient Finetuning (PEFT) with LoRA

```bash
-torchrun --nproc-per-node=8 examples/recipes/ministral3/finetune_ministral3_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/ministral3/finetune_ministral3_vl.py \
--pretrained-checkpoint /models/ministral3-3b \
--peft-scheme lora \
--dataset-type hf \
@@ -142,7 +142,7 @@ You can also combine PEFT with freeze options:

Example with freeze options:
```bash
-torchrun --nproc-per-node=8 examples/recipes/ministral3/finetune_ministral3_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/ministral3/finetune_ministral3_vl.py \
--pretrained-checkpoint /models/ministral3-3b \
--peft-scheme lora \
--freeze-vision-model \
@@ -199,7 +199,7 @@ To change the dataset, specify `dataset.maker_name=<maker_name>` in your command
## Examples
- Checkpoint import/export: [examples/conversion/convert_checkpoints.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/convert_checkpoints.py)
- Generate with VLM (HF→Megatron): [examples/conversion/hf_to_megatron_generate_vlm.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/hf_to_megatron_generate_vlm.py)
-- Finetuning script: [examples/recipes/ministral3/finetune_ministral3_vl.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/recipes/ministral3/finetune_ministral3_vl.py)
+- Finetuning script: [examples/models/vlm/ministral3/finetune_ministral3_vl.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/ministral3/finetune_ministral3_vl.py)

## Hugging Face Model Cards

14 changes: 7 additions & 7 deletions docs/models/vlm/nemotron-nano-v2-vl.md
@@ -85,7 +85,7 @@ Example usage for full parameter finetuning using the
[Raven dataset](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron/viewer/raven):

```bash
-torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint <megatron model path> \
dataset.maker_name=make_raven_dataset \
@@ -95,7 +95,7 @@ checkpoint.save=$SAVE_DIR/<experiment name>
```

Note:
-- The config file `examples/recipes/nemotron_vl/conf/nemotron_nano_v2_vl_override_example.yaml` contains a list of arguments
+- The config file `examples/models/vlm/nemotron_vl/conf/nemotron_nano_v2_vl_override_example.yaml` contains a list of arguments
that can be overridden in the command. For example, you can set `train.global_batch_size=<batch size>` in the command.
- To change the dataset, you only need to change `dataset.maker_name`. See the dataset section below for details.
- After training, you can run inference with `hf_to_megatron_generate_vlm.py` by supplying the trained megatron checkpoint.
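The `dataset.maker_name` switch above suggests a registry of dataset "maker" functions keyed by name. As a self-contained illustration of that pattern only — the registry, decorator, and kwargs below are stand-ins, not Megatron Bridge's actual implementation:

```python
# Hypothetical maker registry: dataset.maker_name selects a builder function
# by its registered name. Illustrative sketch, not the library's real code.
MAKERS = {}


def register_maker(fn):
    """Register a dataset maker under its function name."""
    MAKERS[fn.__name__] = fn
    return fn


@register_maker
def make_raven_dataset(**maker_kwargs):
    # A real maker would build and return the dataset object.
    return {"name": "raven", **maker_kwargs}


# dataset.maker_name=make_raven_dataset resolves to a lookup like this:
dataset = MAKERS["make_raven_dataset"]()
```

Swapping datasets then reduces to changing the lookup key, which is why only `dataset.maker_name` needs to change in the command.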
@@ -110,7 +110,7 @@ settings out of the box in the example script:
distribution is substantially different from pretrained.)

```bash
-torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--lora-on-language-model \
@@ -126,7 +126,7 @@ model.freeze_vision_projection=False
2. Apply LoRA to all linear layers in attention and MLP modules of the vision model, vision projection, and the language model.

```bash
-torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--lora-on-language-model \
@@ -169,7 +169,7 @@ Megatron Bridge supports various vision-language dataset examples which can be u

Note on video training example:
- We provide a video config yaml file instead of the default config yaml file that overrides a few settings. Please
-pass in `--config-file "examples/recipes/nemotron_vl/conf/nemotron_nano_v2_vl_video.yaml"`.
+pass in `--config-file "examples/models/vlm/nemotron_vl/conf/nemotron_nano_v2_vl_video.yaml"`.
- The LLaVA video dataset requires manual download beforehand. Please place the downloaded and extracted video files
in a folder `VIDEO_ROOT` and pass it in to the maker with `dataset.maker_kwargs={"video_root_path":$VIDEO_ROOT}`.
In the nextqa subset example, `VIDEO_ROOT` should look like
@@ -186,10 +186,10 @@ Note on video training example:

Full video training example command:
```bash
-torchrun --nproc-per-node=8 examples/recipes/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/nemotron_vl/finetune_nemotron_nano_v2_vl.py \
--hf-model-path $HF_MODEL_PATH \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
---config-file "examples/recipes/nemotron_vl/conf/nemotron_nano_v2_vl_video.yaml" \
+--config-file "examples/models/vlm/nemotron_vl/conf/nemotron_nano_v2_vl_video.yaml" \
logger.wandb_project=<optional wandb project name> \
logger.wandb_save_dir=$SAVE_DIR \
checkpoint.save=$SAVE_DIR/<experiment name> \
8 changes: 4 additions & 4 deletions docs/models/vlm/qwen2.5-vl.md
@@ -64,7 +64,7 @@ Before training, ensure the following environment variables are set.
Example usage for full parameter finetuning:

```bash
-torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--dataset-type hf \
@@ -82,7 +82,7 @@ Note:
- `qwen25_vl_7b_finetune_config` - for 7B model
- `qwen25_vl_32b_finetune_config` - for 32B model
- `qwen25_vl_72b_finetune_config` - for 72B model
-- The config file `examples/recipes/qwen_vl/conf/qwen25_vl_pretrain_override_example.yaml` contains a list of arguments
+- The config file `examples/models/vlm/qwen_vl/conf/qwen25_vl_pretrain_override_example.yaml` contains a list of arguments
that can be overridden in the command. For example, you can set `train.global_batch_size=<batch size>` in the command.
- The dataset format should be JSONL with conversation format (see dataset section below).
- After training, you can run inference with `hf_to_megatron_generate_vlm.py` by supplying the trained megatron checkpoint.
@@ -92,7 +92,7 @@ Note:
Parameter-efficient finetuning (PEFT) using LoRA or DoRA is supported. You can use the `--peft_scheme` argument to enable PEFT training:

```bash
-torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--peft_scheme lora \
@@ -112,7 +112,7 @@ You can also combine PEFT with freeze options to control which components are tr

Example with LoRA and freeze options:
```bash
-torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen25_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen25_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen25_vl_3b_finetune_config \
--peft_scheme lora \
6 changes: 3 additions & 3 deletions docs/models/vlm/qwen3-vl.md
@@ -55,7 +55,7 @@ Before training, ensure the following environment variables are set:
Example usage for full parameter finetuning:

```bash
-torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen3_vl_8b_finetune_config \
--dataset-type hf \
@@ -69,7 +69,7 @@ checkpoint.save=$SAVE_DIR/<experiment name>

For MoE models with expert parallelism:
```bash
-torchrun --nproc-per-node=8 examples/recipes/qwen_vl/finetune_qwen_vl.py \
+torchrun --nproc-per-node=8 examples/models/vlm/qwen_vl/finetune_qwen_vl.py \
--pretrained-checkpoint $MEGATRON_MODEL_PATH \
--recipe qwen3_vl_30b_a3b_finetune_config \
--dataset-type hf \
@@ -84,7 +84,7 @@ Note:
- `qwen3_vl_8b_finetune_config` - for 8B dense model
- `qwen3_vl_30b_a3b_finetune_config` - for 30B MoE model
- For dataset formats and additional information, refer to the [Qwen2.5-VL documentation]
-- See the full script with examples at [`examples/recipes/qwen_vl/finetune_qwen_vl.py`](../../../examples/recipes/qwen_vl/finetune_qwen_vl.py)
+- See the full script with examples at [`examples/models/vlm/qwen_vl/finetune_qwen_vl.py`](../../../examples/models/vlm/qwen_vl/finetune_qwen_vl.py)

## Hugging Face Model Cards
- Qwen3-VL-8B: `https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct`
4 changes: 2 additions & 2 deletions docs/recipe-usage.md
@@ -15,7 +15,7 @@ This guide will cover the next steps to make use of a training recipe, including
Recipes are provided through a {py:class}`~bridge.training.config.ConfigContainer` object. This is a dataclass that holds all configuration objects needed for training. You can find a more detailed overview of the `ConfigContainer` [here](training/config-container-overview.md).
The benefit of providing the full recipe through a pythonic structure is that it is agnostic to any configuration approach that a user may prefer, whether that's YAML, `argparse` or something else. In other words, the user may override the recipe however they see fit.

-The following sections detail a few different ways to override the configuration recipe. For a complete training script, please see [this example](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/recipes/llama/pretrain_llama3_8b.py).
+The following sections detail a few different ways to override the configuration recipe. For a complete training script, please see [this example](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/llama/pretrain_llama3_8b.py).
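Because the recipe is a tree of typed dataclasses, overriding it in Python is plain attribute assignment. A minimal self-contained sketch of the pattern — the dataclasses and the `pretrain_config()` name here are stand-ins for illustration, not the real `ConfigContainer` schema:

```python
from dataclasses import dataclass, field


@dataclass
class TrainConfig:
    # Stand-in fields; the real ConfigContainer holds far more configuration.
    micro_batch_size: int = 1
    global_batch_size: int = 256


@dataclass
class ConfigContainer:
    train: TrainConfig = field(default_factory=TrainConfig)


def pretrain_config() -> ConfigContainer:
    # A recipe function returns a fully populated config object.
    return ConfigContainer()


cfg = pretrain_config()
# Override whatever you like before handing the config to the trainer.
cfg.train.micro_batch_size = 2
cfg.train.global_batch_size = 128
```

Since the overrides are ordinary Python, they compose with any configuration front end (YAML loaders, `argparse`, etc.) the user prefers.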


### Python
@@ -184,7 +184,7 @@ if __name__ == "__main__":
train_script = run.Script(..., args=args_to_fwd)
```

-For a complete example of the `run.Script` API, including argument forwarding, please see [this script](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/recipes/llama/pretrain_llama3_8b_nemo_run_script.py).
+For a complete example of the `run.Script` API, including argument forwarding, please see [this script](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/llama/pretrain_llama3_8b_nemo_run_script.py).
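One way to build an `args_to_fwd` list is to treat anything shaped like `key=value` as a config override and forward it untouched, keeping launcher-specific flags local. A hedged sketch of that split — the flag names are illustrative and not taken from the actual example script:

```python
def split_args(argv: list[str]) -> tuple[list[str], list[str]]:
    """Separate launcher-local flags from Hydra-style key=value overrides."""
    overrides = [a for a in argv if "=" in a and not a.startswith("-")]
    local = [a for a in argv if a not in overrides]
    return local, overrides


# e.g. `python launch.py --nodes 2 train.train_iters=10`
local_args, args_to_fwd = split_args(["--nodes", "2", "train.train_iters=10"])
# local_args stays with the launcher; args_to_fwd would be passed on,
# e.g. run.Script(..., args=args_to_fwd) in the pattern shown above.
```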

#### Plugins

8 changes: 4 additions & 4 deletions docs/training/distillation.md
@@ -49,15 +49,15 @@ logit_kl_temperature: 2.0
The simplest way to run knowledge distillation is to use or adapt one of the provided recipe scripts. Here's an example for distilling Llama3.2-3B into Llama3.2-1B:

```bash
-torchrun --nproc_per_node=1 examples/recipes/llama/distill_llama32_3b-1b.py
+torchrun --nproc_per_node=1 examples/distillation/llama/distill_llama32_3b-1b.py
```

### Using a Custom YAML Config File

You can provide a custom YAML configuration file to override default settings:

```bash
-torchrun --nproc_per_node=1 examples/recipes/llama/distill_llama32_3b-1b.py \
+torchrun --nproc_per_node=1 examples/distillation/llama/distill_llama32_3b-1b.py \
--config-file my_custom_config.yaml
```

@@ -66,7 +66,7 @@ torchrun --nproc_per_node=1 examples/recipes/llama/distill_llama32_3b-1b.py \
Megatron Bridge supports Hydra-style CLI overrides for flexible configuration:

```bash
-torchrun --nproc_per_node=2 examples/recipes/llama/distill_llama32_3b-1b.py \
+torchrun --nproc_per_node=2 examples/distillation/llama/distill_llama32_3b-1b.py \
model.tensor_model_parallel_size=2 \
model.teacher.tensor_model_parallel_size=2
```
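Under the hood, a dotted override such as `model.tensor_model_parallel_size=2` just walks a nested config and sets a leaf. An illustrative parser of that idea — a sketch, not Megatron Bridge's actual override implementation:

```python
def apply_override(cfg: dict, override: str) -> None:
    """Apply one dotted 'a.b.c=value' override to a nested dict config."""
    path, _, raw = override.partition("=")
    keys = path.split(".")
    node = cfg
    for key in keys[:-1]:
        # Descend, creating intermediate sections as needed.
        node = node.setdefault(key, {})
    # Coerce simple integers; a real parser handles more value types.
    node[keys[-1]] = int(raw) if raw.lstrip("-").isdigit() else raw


cfg: dict = {}
apply_override(cfg, "model.tensor_model_parallel_size=2")
apply_override(cfg, "model.teacher.tensor_model_parallel_size=2")
```

This is why the teacher's parallelism can be set independently: `model.teacher.*` is simply a deeper path in the same tree.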
@@ -76,7 +76,7 @@ torchrun --nproc_per_node=2 examples/recipes/llama/distill_llama32_3b-1b.py \
CLI overrides take precedence over YAML configuration:

```bash
-torchrun --nproc_per_node=2 examples/recipes/llama/distill_llama32_3b-1b.py \
+torchrun --nproc_per_node=2 examples/distillation/llama/distill_llama32_3b-1b.py \
--config-file conf/my_config.yaml \
train.global_batch_size=512
```
@@ -25,10 +25,10 @@ Just use an existing recipe and enable decentralized process groups:

```bash
# 8 GPUs: TP2 x PP2 x DP2
-uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/decentralized_pg/pretrain_qwen3_simple.py

# 4 GPUs: TP2 x PP2 x DP1
-uv run python -m torch.distributed.run --nproc_per_node=4 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/decentralized_pg/pretrain_qwen3_simple.py
```

The key is just two lines:
@@ -53,14 +53,14 @@ For full control over process groups:

```bash
# 8 GPUs: TP2 x PP2 x DP2
-uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py

# 4 GPUs: TP2 x PP2 x DP1
-uv run python -m torch.distributed.run --nproc_per_node=4 examples/recipes/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py \
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py \
--tp-size 2 --pp-size 2

# 2 GPUs: TP2 x PP1 x DP1
-uv run python -m torch.distributed.run --nproc_per_node=2 examples/recipes/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py \
+uv run python -m torch.distributed.run --nproc_per_node=2 examples/decentralized_pg/pretrain_qwen3_with_decentralized_pg.py \
--tp-size 2 --pp-size 1
```

@@ -27,10 +27,10 @@
How to Run
----------
# 8 GPUs: TP2 x PP2 x DP2
-uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/decentralized_pg/pretrain_qwen3_simple.py

# 4 GPUs: TP2 x PP2 x DP1
-uv run python -m torch.distributed.run --nproc_per_node=4 examples/recipes/decentralized_pg/pretrain_qwen3_simple.py
+uv run python -m torch.distributed.run --nproc_per_node=4 examples/decentralized_pg/pretrain_qwen3_simple.py
"""

import torch
@@ -27,7 +27,7 @@
How to Run
----------
# 8 GPUs: EP8
-uv run python -m torch.distributed.run --nproc_per_node=8 examples/recipes/decentralized_pg/pretrain_qwen3_vl_simple.py
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/decentralized_pg/pretrain_qwen3_vl_simple.py
"""

import torch