diff --git a/docs/models/vlm/ministral3.md b/docs/models/vlm/ministral3.md
index fb7c5e0f95..dbaae35e4a 100644
--- a/docs/models/vlm/ministral3.md
+++ b/docs/models/vlm/ministral3.md
@@ -49,157 +49,9 @@ Ministral 3 combines efficient language modeling with multimodal capabilities:
 - **Multimodal Projector**: Projects vision features to language model space
 - **Flexible Image Handling**: Supports variable resolution images and multiple images per conversation
 
-## Conversion with 🤗 Hugging Face
-
-### Import HF → Megatron
-To import the HF VL model to your desired Megatron path:
-```bash
-python examples/conversion/convert_checkpoints.py import \
---hf-model mistralai/Ministral-3-3B-Base-2512 \
---megatron-path /models/ministral3-3b
-```
-
-### Export Megatron → HF
-```bash
-python examples/conversion/convert_checkpoints.py export \
---hf-model mistralai/Ministral-3-3B-Base-2512 \
---megatron-path /results/ministral3_3b/checkpoints/iter_0001000 \
---hf-path ./ministral3-hf-export
-```
-
-### Run Inference on Converted Checkpoint
-
-```bash
-python examples/conversion/hf_to_megatron_generate_vlm.py \
---hf_model_path mistralai/Ministral-3-3B-Base-2512 \
---megatron_model_path /models/ministral3-3b \
---image_path <path_to_image> \
---prompt "Describe this image." \
---max_new_tokens 100
-```
-
-Note:
-- `--megatron_model_path` is optional. If not specified, the script will convert the model and then run forward.
-- You can also use image URLs: `--image_path="https://example.com/image.jpg"`
-
-## Finetune Recipes
-
-- See: [bridge.recipes.ministral3](../../apidocs/bridge/bridge.recipes.ministral3.md)
-- Available recipes:
-  - `ministral3_3b_finetune_config`: Finetuning for 3B VL model with PEFT support
-  - `ministral3_8b_finetune_config`: Finetuning for 8B VL model with PEFT support
-  - `ministral3_14b_finetune_config`: Finetuning for 14B VL model with PEFT support
-
-Before training, ensure the following environment variables are set:
-1. `SAVE_DIR`: checkpoint and log saving directory
-2. `HF_TOKEN`: to download models from HF Hub (if required)
-3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
-4. `WANDB_API_KEY`: (optional) to enable WandB logging
-
-### Full Finetuning
-
-```bash
-torchrun --nproc-per-node=8 examples/models/vlm/ministral3/finetune_ministral3_vl.py \
---pretrained-checkpoint /models/ministral3-3b \
---dataset-type hf \
-train.global_batch_size=32 \
-train.train_iters=1000
-```
-
-Or programmatically:
-```python
-from megatron.bridge.recipes.ministral3 import ministral3_3b_finetune_config
-
-# Full finetuning
-config = ministral3_3b_finetune_config(
-    name="ministral3_3b_full_finetune",
-    pretrained_checkpoint="/models/ministral3-3b",
-    dataset_type="hf",
-    peft=None,
-    train_iters=1000,
-    global_batch_size=32,
-)
-```
-
-### Parameter-Efficient Finetuning (PEFT) with LoRA
-
-```bash
-torchrun --nproc-per-node=8 examples/models/vlm/ministral3/finetune_ministral3_vl.py \
---pretrained-checkpoint /models/ministral3-3b \
---peft-scheme lora \
---dataset-type hf \
-train.global_batch_size=64 \
-train.train_iters=1000
-```
-
-PEFT options:
-- `--peft-scheme`: Set to `lora` for LoRA or `dora` for DoRA. Omit for full finetuning.
-
-You can also combine PEFT with freeze options:
-- `--freeze-language-model`: Freeze the language model
-- `--freeze-vision-model`: Freeze the vision encoder
-- `--freeze-vision-projection`: Freeze the vision projection layer
-
-Example with freeze options:
-```bash
-torchrun --nproc-per-node=8 examples/models/vlm/ministral3/finetune_ministral3_vl.py \
---pretrained-checkpoint /models/ministral3-3b \
---peft-scheme lora \
---freeze-vision-model \
-train.global_batch_size=64
-```
-
-Programmatic configuration:
-```python
-from megatron.bridge.recipes.ministral3 import ministral3_3b_finetune_config
-
-# LoRA finetuning
-config = ministral3_3b_finetune_config(
-    name="ministral3_3b_lora_finetune",
-    pretrained_checkpoint="/models/ministral3-3b",
-    dataset_type="hf",
-    peft="lora",  # or "dora"
-    train_iters=1000,
-    global_batch_size=64,
-)
-
-# LoRA with vision model frozen
-config = ministral3_3b_finetune_config(
-    name="ministral3_3b_lora_language_only",
-    pretrained_checkpoint="/models/ministral3-3b",
-    peft="lora",
-    freeze_vision_model=True,
-    freeze_vision_projection=True,
-)
-```
-
-### Recommended Configurations
-
-| Model | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |
-|-------|------|----|----|-------------------|---------------|----------|
-| Ministral 3 3B | Full SFT | 1 | 1 | 32-64 | 5e-6 | 8 GPUs |
-| Ministral 3 3B | LoRA/DoRA | 1 | 1 | 64-128 | 1e-4 | 8 GPUs |
-| Ministral 3 8B | Full SFT | 2 | 1 | 32-64 | 5e-6 | 8 GPUs |
-| Ministral 3 8B | LoRA/DoRA | 1 | 1 | 64-128 | 1e-4 | 8 GPUs |
-| Ministral 3 14B | Full SFT | 4 | 1 | 16-32 | 5e-6 | 8 GPUs |
-| Ministral 3 14B | LoRA/DoRA | 2 | 1 | 32-64 | 1e-4 | 8 GPUs |
-
-**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs.
-
-## Example Datasets
-
-| Dataset | Maker Name | Description |
-|---------|------------|-------------|
-| [cord-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) | `make_cord_v2_dataset` | OCR receipts: Single-image-text dataset for receipt understanding |
-| [MedPix-VQA](https://huggingface.co/datasets/mmoukouba/MedPix-VQA) | `make_medpix_dataset` | Medical VQA: Single-image Q&A for clinical images |
-| [The Cauldron (Raven subset)](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) | `make_raven_dataset` | Visual reasoning: Multi-image analogical reasoning |
-
-To change the dataset, specify `dataset.maker_name=<maker_name>` in your command.
-
 ## Examples
 
-- Checkpoint import/export: [examples/conversion/convert_checkpoints.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/convert_checkpoints.py)
-- Generate with VLM (HF→Megatron): [examples/conversion/hf_to_megatron_generate_vlm.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/hf_to_megatron_generate_vlm.py)
-- Finetuning script: [examples/models/vlm/ministral3/finetune_ministral3_vl.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/ministral3/finetune_ministral3_vl.py)
+
+For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Ministral 3 Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/ministral3/README.md).
 
 ## Hugging Face Model Cards
diff --git a/examples/models/vlm/ministral3/README.md b/examples/models/vlm/ministral3/README.md
index 16b3b7dbe7..578dd8908b 100644
--- a/examples/models/vlm/ministral3/README.md
+++ b/examples/models/vlm/ministral3/README.md
@@ -1,53 +1,8 @@
 # Ministral 3 - Vision Language Model
 
-[Mistral AI's Ministral 3](https://huggingface.co/collections/mistralai/ministral-3) is a family of edge-optimized vision-language models designed for deployment across various hardware configurations. The Ministral 3 architecture combines a powerful language model with a vision encoder for multimodal understanding.
+This directory contains example scripts for Ministral 3 vision-language models.
 
-Ministral 3 models support multimodal tasks including image captioning, visual question answering, OCR, and general vision-language understanding. Despite their compact size, these models deliver strong performance for on-device and edge deployment scenarios.
-
-Ministral family models are supported via the Bridge system with auto-detected configuration and weight mapping.
-
-```{important}
-Please upgrade to `transformers` v5 and upgrade `mistral-common` in order to use the Ministral 3 models.
-```
-
-## Available Models
-
-### Vision-Language Models
-- **Ministral 3 3B** (`mistralai/Ministral-3-3B-Base-2512`): 3.4B parameter vision-language model
-  - 26 layers, 3072 hidden size
-  - 32 attention heads, 8 query groups (GQA)
-  - Vision encoder: ~0.4B parameters
-  - Recommended: 1 node, 8 GPUs
-
-- **Ministral 3 8B** (`mistralai/Ministral-3-8B-Base-2512`): 8.4B parameter vision-language model
-  - 34 layers, 4096 hidden size
-  - 32 attention heads, 8 query groups (GQA)
-  - Vision encoder: ~0.4B parameters
-  - Recommended: 1 node, 8 GPUs
-
-- **Ministral 3 14B** (`mistralai/Ministral-3-14B-Base-2512`): ~14B parameter vision-language model
-  - 40 layers, 5120 hidden size
-  - 32 attention heads, 8 query groups (GQA)
-  - Vision encoder: ~0.4B parameters
-  - Recommended: 1 node, 8 GPUs
-
-All models support extended context lengths up to 256K tokens using YaRN RoPE scaling.
-
-## Model Architecture Features
-
-Ministral 3 combines efficient language modeling with multimodal capabilities:
-
-**Language Model Features:**
-- **YaRN RoPE Scaling**: Advanced rope scaling for extended context lengths (up to 256K tokens)
-- **Grouped Query Attention (GQA)**: Memory-efficient attention mechanism with 8 query groups
-- **SwiGLU Activation**: Gated linear units with SiLU activation for improved performance
-- **RMSNorm**: Layer normalization without mean centering for faster computation
-- **Llama 4 Attention Scaling**: Position-dependent attention scaling for improved long-context handling
-
-**Vision-Language Features:**
-- **Vision Encoder**: Pre-trained vision encoder for robust visual understanding
-- **Multimodal Projector**: Projects vision features to language model space
-- **Flexible Image Handling**: Supports variable resolution images and multiple images per conversation
+
+For model introduction and architecture details, see the [Ministral 3 documentation](../../../../docs/models/vlm/ministral3.md).
 
 ## Workspace Configuration
 
@@ -172,19 +127,3 @@ W&B report coming soon.
 ## Evaluation
 
 Coming soon.
-
-## Hugging Face Model Cards
-
-- Ministral 3 3B Base: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
-- Ministral 3 3B Instruct: https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512
-- Ministral 3 8B Base: https://huggingface.co/mistralai/Ministral-3-8B-Base-2512
-- Ministral 3 8B Instruct: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
-- Ministral 3 14B Base: https://huggingface.co/mistralai/Ministral-3-14B-Base-2512
-- Ministral 3 14B Instruct: https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512
-
-## Related Docs
-- Related LLM: [Mistral](../llm/mistral.md)
-- Recipe usage: [Recipe usage](../../recipe-usage.md)
-- Customizing the training recipe configuration: [Configuration overview](../../training/config-container-overview.md)
-- Training entry points: [Entry points](../../training/entry-points.md)
-