diff --git a/docs/models/llm/gemma3.md b/docs/models/llm/gemma3.md
index 60e726b5c9..0d623920a4 100644
--- a/docs/models/llm/gemma3.md
+++ b/docs/models/llm/gemma3.md
@@ -180,7 +180,7 @@ torchrun --nproc-per-node=8 run/run_recipe.py \
 - Gemma 3 1B: https://huggingface.co/google/gemma-3-1b-it
 
 ## Related Docs
-- Gemma3 Vision-Language Models: [Gemma 3 VL](../vlm/gemma3-vl.md)
+- Gemma 3 Vision-Language Models: [Gemma 3 VL](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/gemma3_vl/README.md)
 - Recipe usage: [Recipe usage](../../recipe-usage.md)
 - Customizing the training recipe configuration: [Configuration overview](../../training/config-container-overview.md)
 - Training entry points: [Entry points](../../training/entry-points.md)
diff --git a/docs/models/vlm/gemma3-vl.md b/docs/models/vlm/gemma3-vl.md
index 4a04f59c08..d38488fd86 100644
--- a/docs/models/vlm/gemma3-vl.md
+++ b/docs/models/vlm/gemma3-vl.md
@@ -44,163 +44,9 @@ Gemma 3 VL builds on the Gemma 3 architecture with additional multimodal capabilities
 - **Multimodal Integration**: Seamless integration of visual and textual information through learned projection layers
 - **Flexible Image Handling**: Supports variable resolution images and multiple images per conversation
 
-## Conversion with 🤗 Hugging Face
-
-### Import HF → Megatron
-To import the HF VL model to your desired Megatron path:
-```bash
-python examples/conversion/convert_checkpoints.py import \
---hf-model google/gemma-3-4b-it \
---megatron-path /models/gemma-3-4b-it
-```
-
-### Export Megatron → HF
-```bash
-python examples/conversion/convert_checkpoints.py export \
---hf-model google/gemma-3-4b-it \
---megatron-path /results/gemma3_vl_4b/checkpoints/iter_00001000 \
---hf-path ./gemma3-vl-hf-export
-```
-
-### Run Inference on Converted Checkpoint
-
-```bash
-python examples/conversion/hf_to_megatron_generate_vlm.py \
---hf_model_path google/gemma-3-4b-it \
---megatron_model_path /models/gemma-3-4b-it \
---image_path \
---prompt "Describe this image." \
---max_new_tokens 100
-```
-
-Note:
-- `--megatron_model_path` is optional. If not specified, the script will convert the model and then run forward.
-- You can also use image URLs: `--image_path="https://example.com/image.jpg"`
-
-## Finetune Recipes
-
-- See: [bridge.recipes.gemma3_vl](../../apidocs/bridge/bridge.recipes.gemma3_vl.md)
-- Available recipes:
-  - `gemma3_vl_4b_finetune_config`: Finetuning for 4B VL model with PEFT support
-  - `gemma3_vl_12b_finetune_config`: Finetuning for 12B VL model with PEFT support
-  - `gemma3_vl_27b_finetune_config`: Finetuning for 27B VL model with PEFT support
-
-Before training, ensure the following environment variables are set:
-1. `SAVE_DIR`: checkpoint and log saving directory
-2. `HF_TOKEN`: to download models from HF Hub (if required)
-3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
-4. `WANDB_API_KEY`: (optional) to enable WandB logging
-
-### Full Finetuning
-
-```bash
-torchrun --nproc-per-node=8 run/run_vlm_recipe.py \
---pretrained-checkpoint /models/gemma-3-4b-it \
---recipe gemma3_vl_4b_finetune_config \
---dataset-type hf \
-dataset.maker_name=make_cord_v2_dataset \
-train.global_batch_size=64 \
-train.train_iters=1000 \
-checkpoint.save=$SAVE_DIR/gemma3_vl_4b_finetune
-```
-
-Or programmatically:
-```python
-from megatron.bridge.recipes.gemma3_vl import gemma3_vl_4b_finetune_config
-
-# Full finetuning
-config = gemma3_vl_4b_finetune_config(
-    name="gemma3_vl_4b_full_finetune",
-    pretrained_checkpoint="/models/gemma-3-4b-it",
-    dataset_type="hf",
-    peft=None,
-    train_iters=1000,
-    global_batch_size=64,
-)
-```
-
-### Parameter-Efficient Finetuning (PEFT) with LoRA
-
-```bash
-torchrun --nproc-per-node=8 run/run_vlm_recipe.py \
---pretrained-checkpoint /models/gemma-3-4b-it \
---recipe gemma3_vl_4b_finetune_config \
---peft_scheme lora \
---dataset-type hf \
-dataset.maker_name=make_cord_v2_dataset \
-train.global_batch_size=128 \
-checkpoint.save=$SAVE_DIR/gemma3_vl_4b_lora
-```
-
-PEFT options:
-- `--peft_scheme`: Set to `lora` for LoRA or `dora` for DoRA. Omit for full finetuning.
-
-You can also combine PEFT with freeze options:
-- `model.freeze_language_model=True`: Freeze the language model
-- `model.freeze_vision_model=True`: Freeze the vision encoder
-- `model.freeze_vision_projection=True`: Freeze the vision projection layer
-
-Example with freeze options:
-```bash
-torchrun --nproc-per-node=8 run/run_vlm_recipe.py \
---pretrained-checkpoint /models/gemma-3-4b-it \
---recipe gemma3_vl_4b_finetune_config \
---peft_scheme lora \
-model.freeze_language_model=True \
-model.freeze_vision_model=False \
-checkpoint.save=$SAVE_DIR/gemma3_vl_4b_lora_vision
-```
-
-Programmatic configuration:
-```python
-from megatron.bridge.recipes.gemma3_vl import gemma3_vl_4b_finetune_config
-
-# LoRA finetuning
-config = gemma3_vl_4b_finetune_config(
-    name="gemma3_vl_4b_lora_finetune",
-    pretrained_checkpoint="/models/gemma-3-4b-it",
-    dataset_type="hf",
-    peft="lora", # or "dora"
-    train_iters=1000,
-    global_batch_size=128,
-)
-
-# LoRA with vision model frozen
-config = gemma3_vl_4b_finetune_config(
-    name="gemma3_vl_4b_lora_language_only",
-    pretrained_checkpoint="/models/gemma-3-4b-it",
-    peft="lora",
-    freeze_vision_model=True,
-    freeze_vision_projection=True,
-)
-```
-
-### Recommended Configurations
-
-| Model | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |
-|-------|------|----|----|-------------------|---------------|----------|
-| Gemma 3 VL 4B | Full SFT | 1 | 1 | 32-64 | 5e-6 | 8 GPUs |
-| Gemma 3 VL 4B | LoRA/DoRA | 1 | 1 | 64-128 | 1e-4 | 8 GPUs |
-| Gemma 3 VL 12B | Full SFT | 4 | 1 | 32-64 | 5e-6 | 8 GPUs |
-| Gemma 3 VL 12B | LoRA/DoRA | 1 | 1 | 64-128 | 1e-4 | 8 GPUs |
-| Gemma 3 VL 27B | Full SFT | 8 | 2 | 16-32 | 5e-6 | 16 GPUs |
-| Gemma 3 VL 27B | LoRA/DoRA | 4 | 1 | 32-64 | 1e-4 | 16 GPUs |
-
-**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs.
-
-## Example Datasets
-
-| Dataset | Maker Name | Description |
-|---------|------------|-------------|
-| [cord-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) | `make_cord_v2_dataset` | OCR receipts: Single-image-text dataset for receipt understanding |
-| [MedPix-VQA](https://huggingface.co/datasets/mmoukouba/MedPix-VQA) | `make_medpix_dataset` | Medical VQA: Single-image Q&A for clinical images |
-| [The Cauldron (Raven subset)](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) | `make_raven_dataset` | Visual reasoning: Multi-image analogical reasoning |
-
-To change the dataset, specify `dataset.maker_name=` in your command.
-
 ## Examples
-- Checkpoint import/export: [examples/conversion/convert_checkpoints.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/convert_checkpoints.py)
-- Generate with VLM (HF→Megatron): [examples/conversion/hf_to_megatron_generate_vlm.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/hf_to_megatron_generate_vlm.py)
+
+For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [Gemma 3 VL Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/gemma3_vl/README.md).
 
 ## Hugging Face Model Cards
 
@@ -213,4 +59,3 @@ To change the dataset, specify `dataset.maker_name=` in your command
 - Recipe usage: [Recipe usage](../../recipe-usage.md)
 - Customizing the training recipe configuration: [Configuration overview](../../training/config-container-overview.md)
 - Training entry points: [Entry points](../../training/entry-points.md)
-
diff --git a/docs/models/vlm/glm-45v.md b/docs/models/vlm/glm-45v.md
index 5bf400870d..8879862739 100644
--- a/docs/models/vlm/glm-45v.md
+++ b/docs/models/vlm/glm-45v.md
@@ -19,7 +19,7 @@ Please update `transformers` version to 4.57.1 or higher in order to use the GLM
   - 128 MoE experts with shared experts
   - ~12B active parameters per token
   - Sequence length: 131,072 tokens
-  - Recommended: 4 nodes, 32 GPUs (LoRA/DoRA) or 16 nodes, 128 GPUs (Full SFT)
+  - Recommended: 8 nodes, 64 GPUs (LoRA/DoRA) or 64 nodes, 512 GPUs (Full SFT)
 
 ## Model Architecture Features
 
@@ -39,134 +39,9 @@ GLM-4.5V combines efficient sparse MoE language modeling with multimodal capabilities
 - **Image and Video Support**: Handles both static images and video inputs
 - **Flexible Image Handling**: Supports variable resolution images and multiple images per conversation
 
-## Conversion with 🤗 Hugging Face
-
-### Import HF → Megatron
-To import the HF VL model to your desired Megatron path:
-```bash
-python examples/conversion/convert_checkpoints.py import \
---hf-model zai-org/GLM-4.5V \
---megatron-path /models/glm-45v
-```
-
-### Export Megatron → HF
-```bash
-python examples/conversion/convert_checkpoints.py export \
---hf-model zai-org/GLM-4.5V \
---megatron-path /results/glm_45v/checkpoints/iter_0001000 \
---hf-path ./glm-45v-hf-export
-```
-
-### Run Inference on Converted Checkpoint
-
-```bash
-python examples/conversion/hf_to_megatron_generate_vlm.py \
---hf_model_path zai-org/GLM-4.5V \
---megatron_model_path /models/glm-45v \
---image_path \
---prompt "Describe this image." \
---max_new_tokens 100
-```
-
-Note:
-- `--megatron_model_path` is optional. If not specified, the script will convert the model and then run forward.
-- You can also use image URLs: `--image_path="https://example.com/image.jpg"`
-
-## Finetune Recipes
-
-- See: [bridge.recipes.glm_vl](../../apidocs/bridge/bridge.recipes.glm_vl.md)
-- Available recipes:
-  - `glm_45v_finetune_config`: Finetuning for GLM-4.5V model with PEFT support
-
-Before training, ensure the following environment variables are set:
-1. `SAVE_DIR`: checkpoint and log saving directory
-2. `HF_TOKEN`: to download models from HF Hub (if required)
-3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
-4. `WANDB_API_KEY`: (optional) to enable WandB logging
-
-### Full Finetuning
-
-```python
-from megatron.bridge.recipes.glm_vl import glm_45v_finetune_config
-
-# Full finetuning
-config = glm_45v_finetune_config(
-    name="glm_45v_full_finetune",
-    pretrained_checkpoint="/models/glm-45v",
-    dataset_type="hf",
-    peft=None,
-    train_iters=1000,
-    global_batch_size=32,
-)
-```
-
-### Parameter-Efficient Finetuning (PEFT) with LoRA
-
-```python
-config = glm_45v_finetune_config(
-    name="glm_45v_full_finetune",
-    pretrained_checkpoint="/models/glm-45v",
-    dataset_type="hf",
-    peft='lora',
-    train_iters=1000,
-    global_batch_size=32,
-)
-```
-
-PEFT options:
-- `--peft-scheme`: Set to `lora` for LoRA or `dora` for DoRA. Omit for full finetuning.
-
-You can also combine PEFT with freeze options:
-- `--freeze-language-model`: Freeze the language model
-- `--freeze-vision-model`: Freeze the vision encoder
-- `--freeze-vision-projection`: Freeze the vision projection layer
-
-Example with freeze options:
-```python
-from megatron.bridge.recipes.glm_vl import glm_45v_finetune_config
-
-# LoRA finetuning
-config = glm_45v_finetune_config(
-    name="glm_45v_lora_finetune",
-    pretrained_checkpoint="/models/glm-45v",
-    dataset_type="hf",
-    peft="lora", # or "dora"
-    train_iters=1000,
-    global_batch_size=64,
-)
-
-# LoRA with vision model frozen
-config = glm_45v_finetune_config(
-    name="glm_45v_lora_language_only",
-    pretrained_checkpoint="/models/glm-45v",
-    peft="lora",
-    freeze_vision_model=True,
-    freeze_vision_projection=True,
-)
-```
-
-### Recommended Configurations
-
-| Model | Mode | TP | PP | EP | Global Batch Size | Learning Rate | Hardware |
-|-------|------|----|----|-----|-------------------|---------------|----------|
-| GLM-4.5V | Full SFT | 1 | 8 | 16 | 16-32 | 5e-6 | 128 GPUs (16 nodes) |
-| GLM-4.5V | LoRA/DoRA | 1 | 8 | 4 | 32-64 | 1e-4 | 32 GPUs (4 nodes) |
-
-**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs. The sparse MoE architecture requires Expert Parallelism (EP) for efficient training.
-
-## Example Datasets
-
-| Dataset | Maker Name | Description |
-|---------|------------|-------------|
-| [cord-v2](https://huggingface.co/datasets/naver-clova-ix/cord-v2) | `make_cord_v2_dataset` | OCR receipts: Single-image-text dataset for receipt understanding |
-| [MedPix-VQA](https://huggingface.co/datasets/mmoukouba/MedPix-VQA) | `make_medpix_dataset` | Medical VQA: Single-image Q&A for clinical images |
-| [The Cauldron (Raven subset)](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) | `make_raven_dataset` | Visual reasoning: Multi-image analogical reasoning |
-
-To change the dataset, specify `dataset.maker_name=` in your command.
-
 ## Examples
-- Checkpoint import/export: [examples/conversion/convert_checkpoints.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/convert_checkpoints.py)
-- Generate with VLM (HF→Megatron): [examples/conversion/hf_to_megatron_generate_vlm.py](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/conversion/hf_to_megatron_generate_vlm.py)
+
+For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the [GLM-4.5V Examples](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/examples/models/vlm/glm_45v/README.md).
 
 ## Hugging Face Model Cards
 
diff --git a/examples/models/vlm/gemma3_vl/README.md b/examples/models/vlm/gemma3_vl/README.md
index 285688e4c6..b389ecbd1d 100644
--- a/examples/models/vlm/gemma3_vl/README.md
+++ b/examples/models/vlm/gemma3_vl/README.md
@@ -1,6 +1,8 @@
-# Gemma 3 VL - Vision Language Model
+# Gemma 3 VL Examples
 
-This directory contains examples for Gemma 3 Vision Language Model, including checkpoint conversion, inference, and fine-tuning.
+This directory contains example scripts for Gemma 3 VL vision-language models.
+
+For model introduction and architecture details, see the [Gemma 3 VL documentation](../../../../docs/models/vlm/gemma3-vl.md).
 
 ## Workspace Configuration
 
@@ -16,15 +18,43 @@ Directory structure:
 
 ## Checkpoint Conversion
 
-See the [conversion.sh](conversion.sh) script for commands to:
-- Import Hugging Face checkpoints to Megatron format
-- Export Megatron checkpoints back to Hugging Face format
-- Run multi-GPU round-trip validation between formats
+### Import HF → Megatron
+To import the HF VL model to your desired Megatron path:
+```bash
+python examples/conversion/convert_checkpoints.py import \
+--hf-model google/gemma-3-4b-it \
+--megatron-path /models/gemma-3-4b-it
+```
+### Export Megatron → HF
+```bash
+python examples/conversion/convert_checkpoints.py export \
+--hf-model google/gemma-3-4b-it \
+--megatron-path /results/gemma3_vl_4b/checkpoints/iter_00001000 \
+--hf-path ./gemma3-vl-hf-export
+```
+
+See the [conversion.sh](conversion.sh) script for more examples, including:
+- Multi-GPU round-trip validation between formats
 
 ## Inference
 
-**See the [inference.sh](inference.sh) script for commands to:
+### Run Inference on Converted Checkpoint
+
+```bash
+python examples/conversion/hf_to_megatron_generate_vlm.py \
+--hf_model_path google/gemma-3-4b-it \
+--megatron_model_path /models/gemma-3-4b-it \
+--image_path \
+--prompt "Describe this image." \
+--max_new_tokens 100
+```
+
+Note:
+- `--megatron_model_path` is optional. If not specified, the script will convert the model on the fly and then run the forward pass.
+- You can also use image URLs: `--image_path="https://example.com/image.jpg"`
+
+See the [inference.sh](inference.sh) script for commands to:
 - Run inference with Hugging Face checkpoints
 - Run inference with imported Megatron checkpoints
 - Run inference with exported Hugging Face checkpoints
 
@@ -51,22 +81,49 @@ The image is a table comparing the technical specifications of two
 =======================================
 ```
 
-## Pretrain
+## Finetune Recipes
+
+- See: [bridge.recipes.gemma3_vl](../../../../docs/apidocs/bridge/bridge.recipes.gemma3_vl.md)
+- Available recipes:
+  - `gemma3_vl_4b_finetune_config`: Finetuning for 4B VL model with PEFT support
+  - `gemma3_vl_12b_finetune_config`: Finetuning for 12B VL model with PEFT support
+  - `gemma3_vl_27b_finetune_config`: Finetuning for 27B VL model with PEFT support
+
+Before training, ensure the following environment variables are set:
+1. `SAVE_DIR`: checkpoint and log saving directory
+2. `HF_TOKEN`: to download models from HF Hub (if required)
+3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
+4. `WANDB_API_KEY`: (optional) to enable WandB logging
+
+### Pretrain
 
 Pretraining is not verified for this model.
 
-## Supervised Fine-Tuning (SFT)
+### Supervised Fine-Tuning (SFT)
 
 See the [sft.sh](sft.sh) script for full parameter fine-tuning with configurable model parallelisms.
 
-[W&B Report](TODO)
+W&B report coming soon.
 
-## Parameter-Efficient Fine-Tuning (PEFT)
+### Parameter-Efficient Fine-Tuning (PEFT) with LoRA
 
 See the [peft.sh](peft.sh) script for LoRA fine-tuning with configurable tensor and pipeline parallelism.
 
-[W&B Report](TODO)
+W&B report coming soon.
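+
+The same recipes can also be configured programmatically. A minimal sketch for the 4B recipe (see the recipe docs above for the full argument list):
+
+```python
+from megatron.bridge.recipes.gemma3_vl import gemma3_vl_4b_finetune_config
+
+# Full finetuning; set peft="lora" or "dora" for PEFT instead
+config = gemma3_vl_4b_finetune_config(
+    name="gemma3_vl_4b_full_finetune",
+    pretrained_checkpoint="/models/gemma-3-4b-it",
+    dataset_type="hf",
+    peft=None,
+    train_iters=1000,
+    global_batch_size=64,
+)
+```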
+
+### Recommended Configurations
+
+| Model | Mode | TP | PP | Global Batch Size | Learning Rate | Hardware |
+|-------|------|----|----|-------------------|---------------|----------|
+| Gemma 3 VL 4B | Full SFT | 2 | 1 | 32 | 5e-5 | 8 GPUs |
+| Gemma 3 VL 4B | LoRA/DoRA | 2 | 1 | 32 | 2e-4 | 8 GPUs |
+| Gemma 3 VL 12B | Full SFT | 4 | 1 | 32 | 5e-5 | 8 GPUs |
+| Gemma 3 VL 12B | LoRA/DoRA | 2 | 1 | 32 | 2e-4 | 8 GPUs |
+| Gemma 3 VL 27B | Full SFT | 8 | 2 | 32 | 5e-5 | 16 GPUs |
+| Gemma 3 VL 27B | LoRA/DoRA | 4 | 1 | 32 | 2e-4 | 8 GPUs |
+
+**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for larger batch sizes and fewer GPUs.
 
 ## Evaluation
 
-TBD
\ No newline at end of file
+Coming soon.
diff --git a/examples/models/vlm/glm_45v/README.md b/examples/models/vlm/glm_45v/README.md
new file mode 100644
index 0000000000..615550ac4b
--- /dev/null
+++ b/examples/models/vlm/glm_45v/README.md
@@ -0,0 +1,177 @@
+# GLM-4.5V Examples
+
+This directory contains example scripts for the GLM-4.5V vision-language model.
+
+For model introduction and architecture details, see the [GLM-4.5V documentation](../../../../docs/models/vlm/glm-45v.md).
+
+## Workspace Configuration
+
+All scripts use a `WORKSPACE` environment variable to define the base directory for checkpoints and results. By default, this is set to `/workspace`. You can override it:
+
+```bash
+export WORKSPACE=/your/custom/path
+```
+
+Directory structure:
+- `${WORKSPACE}/models/` - Converted checkpoints
+- `${WORKSPACE}/results/` - Training outputs and experiment results
+
+## Checkpoint Conversion
+
+### Import HF → Megatron
+To import the HF VL model to your desired Megatron path:
+```bash
+python examples/conversion/convert_checkpoints.py import \
+--hf-model zai-org/GLM-4.5V \
+--megatron-path /models/GLM-4.5V
+```
+
+### Export Megatron → HF
+```bash
+python examples/conversion/convert_checkpoints.py export \
+--hf-model zai-org/GLM-4.5V \
+--megatron-path /results/glm_45v/checkpoints/iter_00001000 \
+--hf-path ./glm-45v-hf-export
+```
+
+See the [conversion.sh](conversion.sh) script for more examples, including:
+- Multi-GPU round-trip validation between formats
+
+## Inference
+
+### Run Inference on Converted Checkpoint
+
+```bash
+python examples/conversion/hf_to_megatron_generate_vlm.py \
+--hf_model_path zai-org/GLM-4.5V \
+--megatron_model_path /models/GLM-4.5V \
+--image_path \
+--prompt "Describe this image." \
+--max_new_tokens 100 \
+--trust_remote_code
+```
+
+Note:
+- `--megatron_model_path` is optional. If not specified, the script will convert the model on the fly and then run the forward pass.
+- You can also use image URLs: `--image_path="https://example.com/image.jpg"`
+- GLM-4.5V requires the `--trust_remote_code` flag
+
+See the [inference.sh](inference.sh) script for commands to:
+- Run inference with Hugging Face checkpoints
+- Run inference with imported Megatron checkpoints
+- Run inference with exported Hugging Face checkpoints
+
+**Expected output:**
+```text
+...
+Generation step 46
+Generation step 47
+Generation step 48
+Generation step 49
+======== GENERATED TEXT OUTPUT ========
+Image: https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png
+Prompt: Describe this image.
+Generated: [gMASK]<|user|>
+<|begin_of_image|><|image|>...<|end_of_image|>Describe this image.<|assistant|>
+The image shows a technical specifications table comparing two NVIDIA GPU models: H100 SXM and H100 NVL. The table is organized with rows representing different technical specifications and columns for each GPU model.
+
+Here's a breakdown of the information presented:
+
+=======================================
+```
+
+## Finetune Recipes
+
+- See: [bridge.recipes.glm_vl](../../../../docs/apidocs/bridge/bridge.recipes.glm_vl.md)
+- Available recipes:
+  - `glm_45v_finetune_config`: Finetuning for GLM-4.5V model with PEFT support
+
+Before training, ensure the following environment variables are set:
+1. `SAVE_DIR`: checkpoint and log saving directory
+2. `HF_TOKEN`: to download models from HF Hub (if required)
+3. `HF_HOME`: (optional) to avoid re-downloading models and datasets
+4. `WANDB_API_KEY`: (optional) to enable WandB logging
+
+### Pretraining
+
+Pretraining is not verified for this model.
+
+### Supervised Fine-Tuning (SFT)
+
+Full parameter fine-tuning requires 64 nodes (512 GPUs) with TP=1, PP=8, EP=16.
+
+**Usage:**
+```bash
+# 1. Edit slurm_sft.sh to configure:
+#    - #SBATCH directives (partition, account, etc.)
+#    - CONTAINER_IMAGE path
+
+# 2. Submit the job:
+sbatch slurm_sft.sh
+```
+
+See [slurm_sft.sh](slurm_sft.sh) for the full Slurm job script.
+
+W&B report coming soon.
+
+### Parameter-Efficient Fine-Tuning (PEFT) with LoRA
+
+LoRA fine-tuning requires 8 nodes (64 GPUs) with TP=1, PP=8, EP=4.
+
+**Usage:**
+```bash
+# 1. Edit slurm_peft.sh to configure:
+#    - #SBATCH directives (partition, account, etc.)
+#    - CONTAINER_IMAGE path
+
+# 2. Submit the job:
+sbatch slurm_peft.sh
+```
+
+See [slurm_peft.sh](slurm_peft.sh) for the full Slurm job script.
+
+W&B report coming soon.
+
+**Note:** LoRA/DoRA significantly reduces memory requirements, allowing for fewer GPUs. Expert parallelism (EP) is essential for efficient training of this MoE model.
+
+### Recommended Configurations
+
+| Model | Mode | TP | PP | EP | Global Batch Size | Learning Rate | Hardware |
+|-------|------|----|----|-----|-------------------|---------------|----------|
+| GLM-4.5V | Full SFT | 1 | 8 | 16 | 32 | 5e-6 | 512 GPUs (64 nodes) |
+| GLM-4.5V | LoRA/DoRA | 1 | 8 | 4 | 32 | 1e-4 | 64 GPUs (8 nodes) |
+
+### Multi-Node Setup with Local Repository
+
+If you are mounting a local Megatron Bridge repository, you must pre-sync the uv cache to avoid race conditions when multiple nodes attempt to sync simultaneously. Follow these steps:
+
+1. **Start a container with your mounts and run `uv sync`:**
+   ```bash
+   # Start an interactive container with the same mounts you'll use in Slurm
+   srun --nodes=1 --ntasks=1 --container-image=/path/to/container.sqsh \
+     --container-mounts=/path/to/Megatron-Bridge:/opt/Megatron-Bridge,/shared/uv_cache:/shared/uv_cache \
+     --pty bash
+
+   # Inside the container, pre-sync to the shared cache
+   cd /opt/Megatron-Bridge
+   UV_CACHE_DIR=/shared/uv_cache uv sync
+   ```
+
+2. **Update the Slurm script with UV_CACHE_DIR and mounts:**
+   ```bash
+   # In slurm_sft.sh or slurm_peft.sh, set:
+   export UV_CACHE_DIR="/shared/uv_cache"
+
+   # And configure container mounts:
+   CONTAINER_MOUNTS="/path/to/Megatron-Bridge:/opt/Megatron-Bridge,/shared/uv_cache:/shared/uv_cache"
+   ```
+
+3. **Submit the job:**
+   ```bash
+   sbatch slurm_sft.sh  # or slurm_peft.sh
+   ```
+
+## Evaluation
+
+Coming soon.
diff --git a/examples/models/vlm/glm_45v/conversion.sh b/examples/models/vlm/glm_45v/conversion.sh
new file mode 100755
index 0000000000..3947834215
--- /dev/null
+++ b/examples/models/vlm/glm_45v/conversion.sh
@@ -0,0 +1,33 @@
+#!/usr/bin/env bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Import HF → Megatron
+uv run python examples/conversion/convert_checkpoints.py import \
+    --hf-model zai-org/GLM-4.5V \
+    --megatron-path ${WORKSPACE}/models/GLM-4.5V
+
+# Export Megatron → HF
+uv run python examples/conversion/convert_checkpoints.py export \
+    --hf-model zai-org/GLM-4.5V \
+    --megatron-path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
+    --hf-path ${WORKSPACE}/models/GLM-4.5V-hf-export
+
+# Round-trip validation
+# Note: GLM-4.5V is a large MoE model, adjust parallelism as needed
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_megatron_roundtrip_multi_gpu.py \
+    --hf-model-id zai-org/GLM-4.5V --tp 1 --pp 2 --ep 4 --trust-remote-code
diff --git a/examples/models/vlm/glm_45v/inference.sh b/examples/models/vlm/glm_45v/inference.sh
new file mode 100755
index 0000000000..497c18134a
--- /dev/null
+++ b/examples/models/vlm/glm_45v/inference.sh
@@ -0,0 +1,54 @@
+#!/usr/bin/env bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# GLM-4.5V is a large MoE model (106B parameters)
+# The examples below run on 8 GPUs, using TP=1 with either PP=4, EP=2 or PP=2, EP=4
+
+# Inference with Hugging Face checkpoints
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
+    --hf_model_path zai-org/GLM-4.5V \
+    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+    --prompt "Describe this image." \
+    --max_new_tokens 50 \
+    --tp 1 \
+    --pp 4 \
+    --ep 2 \
+    --trust_remote_code
+
+# Inference with imported Megatron checkpoints
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
+    --hf_model_path zai-org/GLM-4.5V \
+    --megatron_model_path ${WORKSPACE}/models/GLM-4.5V/iter_0000000 \
+    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+    --prompt "Describe this image." \
+    --max_new_tokens 50 \
+    --tp 1 \
+    --pp 2 \
+    --ep 4 \
+    --trust_remote_code
+
+# Inference with exported HF checkpoints
+uv run python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_to_megatron_generate_vlm.py \
+    --hf_model_path ${WORKSPACE}/models/GLM-4.5V-hf-export \
+    --image_path "https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16/resolve/main/images/table.png" \
+    --prompt "Describe this image." \
+    --max_new_tokens 50 \
+    --tp 1 \
+    --pp 2 \
+    --ep 4 \
+    --trust_remote_code
diff --git a/examples/models/vlm/glm_45v/slurm_peft.sh b/examples/models/vlm/glm_45v/slurm_peft.sh
new file mode 100755
index 0000000000..017da6a74c
--- /dev/null
+++ b/examples/models/vlm/glm_45v/slurm_peft.sh
@@ -0,0 +1,166 @@
+#!/bin/bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ==============================================================================
+# GLM-4.5V Parameter-Efficient Fine-Tuning (PEFT) with LoRA
+#
+# GLM-4.5V is a large MoE model (106B parameters)
+# LoRA/DoRA significantly reduces memory requirements
+# Recommended: TP=1, PP=8, EP=4 for LoRA (64 GPUs, 8 nodes)
+#
+# Usage:
+#   1. Modify the #SBATCH directives below for your cluster
+#   2. Set CONTAINER_IMAGE to your container path
+#   3. Submit: sbatch slurm_peft.sh
+# ==============================================================================
+
+#SBATCH --job-name=glm45v-lora
+#SBATCH --nodes=8
+#SBATCH --ntasks-per-node=8
+#SBATCH --gpus-per-node=8
+#SBATCH --time=08:00:00
+#SBATCH --partition=gpu
+#SBATCH --account=my_account
+#SBATCH --output=logs/glm45v_lora_%j.out
+#SBATCH --error=logs/glm45v_lora_%j.err
+#SBATCH --exclusive
+
+# ==============================================================================
+# CONFIGURATION
+# ==============================================================================
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Model and training configurations
+PRETRAINED_CHECKPOINT=${WORKSPACE}/models/GLM-4.5V
+MODEL_NAME=glm_45v
+DATASET_NAME=cord_v2
+SEQ_LENGTH=8192
+TRAIN_ITERS=50
+GLOBAL_BATCH_SIZE=32
+MICRO_BATCH_SIZE=1
+EVAL_ITERS=10
+LR=0.0001
+MIN_LR=0.00001
+LR_WARMUP_ITERS=10
+LOG_INTERVAL=1
+WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
+
+# Parallelism configuration
+TP=1
+PP=8
+EP=4
+
+# Container image (required)
+CONTAINER_IMAGE=""
+# CONTAINER_IMAGE="/path/to/container.sqsh"
+
+# Container mounts (optional, space-separated)
+CONTAINER_MOUNTS=""
+# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
+
+# ==============================================================================
+# Environment Setup
+# ==============================================================================
+
+# NCCL optimizations for large-scale training
+export TORCH_NCCL_AVOID_RECORD_STREAMS=1
+export NCCL_NVLS_ENABLE=0
+
+# UV cache on shared filesystem (recommended for multi-node setups)
+# Pre-sync once before submitting jobs: UV_CACHE_DIR=/path/to/cache uv sync
+# export UV_CACHE_DIR="/path/to/shared/uv_cache"
+
+# HuggingFace cache directory (recommended for shared filesystem)
+# export HF_HOME="/path/to/shared/HF_HOME"
+
+# Authentication tokens (set these for your environment)
+# export HF_TOKEN="hf_your_token_here"
+# export WANDB_API_KEY="your_wandb_key_here"
+
+# ==============================================================================
+# Job Execution
+# ==============================================================================
+
+echo "======================================"
+echo "GLM-4.5V LoRA Fine-Tuning Job"
+echo "======================================"
+echo "Job ID: $SLURM_JOB_ID"
+echo "Nodes: $SLURM_JOB_NUM_NODES"
+echo "GPUs per node: $SLURM_GPUS_PER_NODE"
+echo "Model: $MODEL_NAME"
+echo "Parallelism: TP=$TP, PP=$PP, EP=$EP"
+echo "PEFT: LoRA"
+echo "======================================"
+
+# Create logs directory if it doesn't exist
+mkdir -p logs
+
+# Build CLI overrides
+CLI_OVERRIDES="
+    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    model.seq_length=$SEQ_LENGTH \
+    train.train_iters=$TRAIN_ITERS \
+    train.global_batch_size=$GLOBAL_BATCH_SIZE \
+    train.micro_batch_size=$MICRO_BATCH_SIZE \
+    train.eval_iters=$EVAL_ITERS \
+    optimizer.lr=$LR \
+    optimizer.min_lr=$MIN_LR \
+    scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
+    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_lora_tp${TP}_pp${PP}_ep${EP} \
+    logger.log_interval=$LOG_INTERVAL \
+    logger.wandb_project=$WANDB_PROJECT \
+    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_lora_tp${TP}_pp${PP}_ep${EP} \
+    dataset.maker_name=make_${DATASET_NAME}_dataset \
+    dataset.seq_length=$SEQ_LENGTH \
+    model.tensor_model_parallel_size=$TP \
+    model.pipeline_model_parallel_size=$PP \
+    model.expert_model_parallel_size=$EP
+"
+
+# Build command
+# Only local rank 0 on each node runs uv sync, then all ranks run with --no-sync
+CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 2; fi && "
+CMD="$CMD uv run --no-sync python scripts/training/run_recipe.py"
+CMD="$CMD --recipe ${MODEL_NAME}_finetune_config"
+CMD="$CMD --step_func vlm_step"
+CMD="$CMD --peft_scheme lora"
+CMD="$CMD $CLI_OVERRIDES"
+
+echo "Executing command..."
+echo "======================================"
+
+# Require container image
+if [ -z "$CONTAINER_IMAGE" ]; then
+    echo "ERROR: CONTAINER_IMAGE must be set. Please specify a valid container image."
+    exit 1
+fi
+
+# Build srun command
+SRUN_CMD="srun --mpi=pmix --container-image=$CONTAINER_IMAGE"
+
+# Add container mounts
+if [ -n "$CONTAINER_MOUNTS" ]; then
+    for mount in $CONTAINER_MOUNTS; do
+        SRUN_CMD="$SRUN_CMD --container-mounts=$mount"
+    done
+fi
+
+$SRUN_CMD bash -c "$CMD"
+
+echo "======================================"
+echo "Job completed"
+echo "======================================"
diff --git a/examples/models/vlm/glm_45v/slurm_sft.sh b/examples/models/vlm/glm_45v/slurm_sft.sh
new file mode 100644
index 0000000000..f23dee3c43
--- /dev/null
+++ b/examples/models/vlm/glm_45v/slurm_sft.sh
@@ -0,0 +1,164 @@
+#!/bin/bash
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# ==============================================================================
+# GLM-4.5V Full Supervised Fine-Tuning (SFT)
+#
+# GLM-4.5V is a large MoE model (106B parameters)
+# Recommended: TP=1, PP=8, EP=16 for full SFT (512 GPUs, 64 nodes)
+# For smaller setups, use LoRA/DoRA instead (see slurm_peft.sh)
+#
+# Usage:
+# 1. Modify the #SBATCH directives below for your cluster
+# 2. Set CONTAINER_IMAGE to your container path
+# 3. Submit: sbatch slurm_sft.sh
+# ==============================================================================
+
+#SBATCH --job-name=glm45v-sft
+#SBATCH --nodes=64
+#SBATCH --ntasks-per-node=8
+#SBATCH --gpus-per-node=8
+#SBATCH --time=24:00:00
+#SBATCH --partition=gpu
+#SBATCH --account=my_account
+#SBATCH --output=logs/glm45v_sft_%j.out
+#SBATCH --error=logs/glm45v_sft_%j.err
+#SBATCH --exclusive
+
+# ==============================================================================
+# CONFIGURATION
+# ==============================================================================
+
+# Workspace directory for checkpoints and results
+WORKSPACE=${WORKSPACE:-/workspace}
+
+# Model and training configurations
+PRETRAINED_CHECKPOINT=${WORKSPACE}/models/GLM-4.5V
+MODEL_NAME=glm_45v
+DATASET_NAME=cord_v2
+SEQ_LENGTH=8192
+TRAIN_ITERS=50
+GLOBAL_BATCH_SIZE=64
+MICRO_BATCH_SIZE=1
+EVAL_ITERS=10
+LR=0.000005
+MIN_LR=0.0000005
+LR_WARMUP_ITERS=10
+LOG_INTERVAL=1
+WANDB_PROJECT=megatron-bridge-${DATASET_NAME}
+
+# Parallelism configuration
+TP=1
+PP=8
+EP=16
+
+# Container image (required)
+CONTAINER_IMAGE=""
+# CONTAINER_IMAGE="/path/to/container.sqsh"
+
+# Container mounts (optional, space-separated)
+CONTAINER_MOUNTS=""
+# CONTAINER_MOUNTS="/data:/data /workspace:/workspace"
+
+# ==============================================================================
+# Environment Setup
+# ==============================================================================
+
+# NCCL optimizations for large-scale training
+export TORCH_NCCL_AVOID_RECORD_STREAMS=1
+export NCCL_NVLS_ENABLE=0
+
+# UV cache on shared filesystem (recommended for multi-node setups)
+# Pre-sync once before submitting jobs: UV_CACHE_DIR=/path/to/cache uv sync
+# export UV_CACHE_DIR="/path/to/shared/uv_cache"
+
+# HuggingFace cache directory (recommended for shared filesystem)
+# export HF_HOME="/path/to/shared/HF_HOME"
+
+# Authentication tokens (set these for your environment)
+# export HF_TOKEN="hf_your_token_here"
+# export WANDB_API_KEY="your_wandb_key_here"
+
+# ==============================================================================
+# Job Execution
+# ==============================================================================
+
+echo "======================================"
+echo "GLM-4.5V Full SFT Training Job"
+echo "======================================"
+echo "Job ID: $SLURM_JOB_ID"
+echo "Nodes: $SLURM_JOB_NUM_NODES"
+echo "GPUs per node: $SLURM_GPUS_PER_NODE"
+echo "Model: $MODEL_NAME"
+echo "Parallelism: TP=$TP, PP=$PP, EP=$EP"
+echo "======================================"
+
+# Create logs directory if it doesn't exist
+mkdir -p logs
+
+# Build CLI overrides
+CLI_OVERRIDES="
+    checkpoint.pretrained_checkpoint=$PRETRAINED_CHECKPOINT \
+    model.seq_length=$SEQ_LENGTH \
+    train.train_iters=$TRAIN_ITERS \
+    train.global_batch_size=$GLOBAL_BATCH_SIZE \
+    train.micro_batch_size=$MICRO_BATCH_SIZE \
+    train.eval_iters=$EVAL_ITERS \
+    optimizer.lr=$LR \
+    optimizer.min_lr=$MIN_LR \
+    scheduler.lr_warmup_iters=$LR_WARMUP_ITERS \
+    checkpoint.save=${WORKSPACE}/results/${MODEL_NAME}_sft_tp${TP}_pp${PP}_ep${EP} \
+    logger.log_interval=$LOG_INTERVAL \
+    logger.wandb_project=$WANDB_PROJECT \
+    logger.wandb_exp_name=${MODEL_NAME}_${DATASET_NAME}_sft_tp${TP}_pp${PP}_ep${EP} \
+    dataset.maker_name=make_${DATASET_NAME}_dataset \
+    dataset.seq_length=$SEQ_LENGTH \
+    model.tensor_model_parallel_size=$TP \
+    model.pipeline_model_parallel_size=$PP \
+    model.expert_model_parallel_size=$EP
+"
+
+# Build command
+# Only local rank 0 on each node runs uv sync, then all ranks run with --no-sync
+CMD="if [ \"\$SLURM_LOCALID\" -eq 0 ]; then uv sync; else sleep 2; fi && "
+CMD="$CMD uv run --no-sync python scripts/training/run_recipe.py"
+CMD="$CMD --recipe ${MODEL_NAME}_finetune_config"
+CMD="$CMD --step_func vlm_step"
+CMD="$CMD $CLI_OVERRIDES"
+
+echo "Executing command..."
+echo "======================================"
+
+# Require container image
+if [ -z "$CONTAINER_IMAGE" ]; then
+    echo "ERROR: CONTAINER_IMAGE must be set. Please specify a valid container image."
+    exit 1
+fi
+
+# Build srun command
+SRUN_CMD="srun --mpi=pmix --container-image=$CONTAINER_IMAGE"
+
+# Add container mounts
+if [ -n "$CONTAINER_MOUNTS" ]; then
+    for mount in $CONTAINER_MOUNTS; do
+        SRUN_CMD="$SRUN_CMD --container-mounts=$mount"
+    done
+fi
+
+$SRUN_CMD bash -c "$CMD"
+
+echo "======================================"
+echo "Job completed"
+echo "======================================"
diff --git a/src/megatron/bridge/recipes/__init__.py b/src/megatron/bridge/recipes/__init__.py
index 10618ae372..746f988560 100644
--- a/src/megatron/bridge/recipes/__init__.py
+++ b/src/megatron/bridge/recipes/__init__.py
@@ -21,6 +21,8 @@
 from megatron.bridge.recipes.deepseek import *
 from megatron.bridge.recipes.gemma import *
 from megatron.bridge.recipes.gemma3_vl import *
+from megatron.bridge.recipes.glm import *
+from megatron.bridge.recipes.glm_vl import *
 from megatron.bridge.recipes.gpt import *
 from megatron.bridge.recipes.gpt_oss import *
 from megatron.bridge.recipes.llama import *
diff --git a/src/megatron/bridge/recipes/glm_vl/glm_45v.py b/src/megatron/bridge/recipes/glm_vl/glm_45v.py
index 69bb78261c..63a8cb2bf7 100644
--- a/src/megatron/bridge/recipes/glm_vl/glm_45v.py
+++ b/src/megatron/bridge/recipes/glm_vl/glm_45v.py
@@ -44,7 +44,7 @@
 def set_glm_45v_pipeline_model_parallel_layout(
-    model_cfg: GPTModelProvider, layout: Optional[Union[str, List[List[str]]]] = None
+    model_cfg: GPTModelProvider, layout: Optional[Union[str, List[List[str]]]] = None, is_peft: bool = False
 ) -> None:
     """Set the GLM-4.5V pipeline model parallel layout.
@@ -54,6 +54,7 @@ def set_glm_45v_pipeline_model_parallel_layout(
     Args:
         model_cfg: The model provider configuration to modify.
         layout: Optional custom layout. If None, uses predefined layouts based on PP/VP sizes.
+        is_peft: Whether the model is trained with PEFT.
""" # GLM-4.5V has no MTP layers last_layer = ["loss"] @@ -61,14 +62,31 @@ def set_glm_45v_pipeline_model_parallel_layout( vp_size = model_cfg.virtual_pipeline_model_parallel_size or 1 # GLM-4.5 Air has 46 decoder layers + # GLM-4.5 Vision Encoder is huge, we need to balance the first stage with the least number of layers # Layout maps for common PP/VP combinations - layout_map = { - (1, 1): None, - (2, 1): [["embedding"] + ["decoder"] * 23, ["decoder"] * 23 + last_layer], - (4, 1): [["embedding"] + ["decoder"] * 11, ["decoder"] * 12, ["decoder"] * 12, ["decoder"] * 11 + last_layer], - (8, 1): [["embedding"] + ["decoder"] * 5] + [["decoder"] * 6] * 6 + [["decoder"] * 5 + last_layer], - (16, 1): [["embedding"] + ["decoder"] * 2] + [["decoder"] * 3] * 14 + [["decoder"] * 2 + last_layer], - } + # We use different layouts for PEFT and full SFT. + if is_peft: + layout_map = { + (4, 1): [ + ["embedding"] + ["decoder"] * 11, + ["decoder"] * 12, + ["decoder"] * 12, + ["decoder"] * 11 + last_layer, + ], + (8, 1): [["embedding"] + ["decoder"] * 5] + [["decoder"] * 6] * 6 + [["decoder"] * 5 + last_layer], + (16, 1): [["embedding"] + ["decoder"] * 2] + [["decoder"] * 3] * 14 + [["decoder"] * 2 + last_layer], + } + else: + layout_map = { + (4, 1): [ + ["embedding"] + ["decoder"] * 11, + ["decoder"] * 12, + ["decoder"] * 12, + ["decoder"] * 11 + last_layer, + ], + (8, 1): [["embedding"] + ["decoder"]] + [["decoder"] * 7] * 6 + [["decoder"] * 3 + last_layer], + (16, 1): [["embedding"]] + [["decoder"] * 3] * 14 + [["decoder"] * 3 + last_layer], + } if layout is not None: model_cfg.pipeline_model_parallel_layout = layout @@ -133,9 +151,9 @@ class GLM45VCommonKwargs(TypedDict, total=False): def glm_45v_finetune_config(**user_kwargs: Unpack[GLM45VCommonKwargs]) -> ConfigContainer: """Return a fine-tuning config for GLM-4.5V (based on GLM-4.5 Air 106B). 
-    Default configuration: 4 nodes, 32 GPUs total
-    - LoRA/DoRA: TP=1, PP=8, EP=4 (32 GPUs, 4 nodes), LR=1e-4
-    - Full SFT: TP=1, PP=8, EP=16 (128 GPUs, 16 nodes), LR=5e-6
+    Default configuration:
+    - LoRA/DoRA: TP=1, PP=8, EP=4 (64 GPUs, 8 nodes), LR=1e-4
+    - Full SFT: TP=1, PP=8, EP=16 (512 GPUs, 64 nodes), LR=5e-6
 
     GLM-4.5V is a Vision-Language model with:
     - 106B total parameters (based on GLM-4.5 Air)
@@ -151,9 +169,10 @@ def glm_45v_finetune_config(**user_kwargs: Unpack[GLM45VCommonKwargs]) -> Config
     recommended_kwargs: GLM45VCommonKwargs = {
         "hf_path": "zai-org/GLM-4.5V",
         "tensor_model_parallel_size": 1,
-        "pipeline_model_parallel_size": 4,
+        "pipeline_model_parallel_size": 8,
         "pipeline_dtype": torch.bfloat16,
-        "expert_model_parallel_size": 16 if is_full_sft else 2,
+        "expert_model_parallel_size": 16 if is_full_sft else 4,
+        "global_batch_size": 64 if is_full_sft else 32,
         "peft": peft_value,
         "finetune_lr": 5e-6 if is_full_sft else 1e-4,
     }
@@ -186,7 +205,7 @@
     train_iters: int = 300000,
     global_batch_size: int = 32,
     micro_batch_size: int = 1,
-    seq_length: int = 4096,
+    seq_length: int = 8192,
     lr: float = 3e-4,
     min_lr: float = 3e-5,
     lr_warmup_iters: int = 500,
@@ -242,7 +261,7 @@
     model_cfg.seq_length = seq_length
 
     # Set pipeline model parallel layout for asymmetric stages
-    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout)
+    set_glm_45v_pipeline_model_parallel_layout(model_cfg, layout, is_peft=peft is not None)
 
     # Pipeline split for asymmetric stages are specified with the layout above
     model_cfg.account_for_embedding_in_pipeline_split = False
diff --git a/tests/unit_tests/recipes/test_glm_45v_recipes.py b/tests/unit_tests/recipes/test_glm_45v_recipes.py
index f4614b3fbc..d60d6636c2 100644
--- a/tests/unit_tests/recipes/test_glm_45v_recipes.py
+++ b/tests/unit_tests/recipes/test_glm_45v_recipes.py
@@ -252,10 +252,10 @@
     _assert_basic_config(cfg)
 
-    # For LoRA, GLM-4.5V should use TP=1, PP=4, EP=2
+    # For LoRA, GLM-4.5V should use TP=1, PP=8, EP=4
     assert cfg.model.tensor_model_parallel_size == 1
-    assert cfg.model.pipeline_model_parallel_size == 4
-    assert cfg.model.expert_model_parallel_size == 2
+    assert cfg.model.pipeline_model_parallel_size == 8
+    assert cfg.model.expert_model_parallel_size == 4
 
     # Check PEFT config
     assert cfg.peft is not None
@@ -284,8 +284,8 @@
     # For DoRA, GLM-4.5V should use same parallelism as LoRA
     assert cfg.model.tensor_model_parallel_size == 1
-    assert cfg.model.pipeline_model_parallel_size == 4
-    assert cfg.model.expert_model_parallel_size == 2
+    assert cfg.model.pipeline_model_parallel_size == 8
+    assert cfg.model.expert_model_parallel_size == 4
 
     # Check PEFT config (DoRA has alpha=64 by default, unlike LoRA's alpha=32)
     assert cfg.peft is not None
@@ -311,9 +311,9 @@
     _assert_basic_config(cfg)
 
-    # For full SFT, GLM-4.5V should use TP=1, PP=4, EP=16
+    # For full SFT, GLM-4.5V should use TP=1, PP=8, EP=16
     assert cfg.model.tensor_model_parallel_size == 1
-    assert cfg.model.pipeline_model_parallel_size == 4
+    assert cfg.model.pipeline_model_parallel_size == 8
     assert cfg.model.expert_model_parallel_size == 16
 
     assert cfg.peft is None
@@ -363,80 +363,88 @@ def test_glm_45v_peft_with_freeze_options(monkeypatch: pytest.MonkeyPatch):
 # Pipeline layout tests
-def test_glm_45v_pipeline_layout_pp1():
-    """Test pipeline layout for PP=1."""
+
+
+def test_glm_45v_pipeline_layout_pp4():
+    """Test pipeline layout for PP=4."""
     model_cfg = _FakeModelCfg()
-    model_cfg.pipeline_model_parallel_size = 1
+    model_cfg.pipeline_model_parallel_size = 4
     model_cfg.virtual_pipeline_model_parallel_size = 1
 
     _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg)
 
-    # PP=1 should have no layout (None)
-    assert model_cfg.pipeline_model_parallel_layout is None
+    # PP=4 should have 4 stages
+    assert model_cfg.pipeline_model_parallel_layout is not None
+    assert len(model_cfg.pipeline_model_parallel_layout) == 4
+    # First stage: embedding + 11 decoder layers
+    assert model_cfg.pipeline_model_parallel_layout[0][0] == "embedding"
+    # Last stage should have loss
+    assert "loss" in model_cfg.pipeline_model_parallel_layout[-1]
 
 
-def test_glm_45v_pipeline_layout_pp2():
-    """Test pipeline layout for PP=2."""
+def test_glm_45v_pipeline_layout_pp8():
+    """Test pipeline layout for PP=8."""
     model_cfg = _FakeModelCfg()
-    model_cfg.pipeline_model_parallel_size = 2
+    model_cfg.pipeline_model_parallel_size = 8
     model_cfg.virtual_pipeline_model_parallel_size = 1
 
     _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg)
 
-    # PP=2 should split 46 layers: first stage 1+23=24, second stage 23
+    # PP=8 should have 8 stages (full SFT layout: embedding+1, then 7*6, then 3+loss)
     assert model_cfg.pipeline_model_parallel_layout is not None
-    assert len(model_cfg.pipeline_model_parallel_layout) == 2
-    # First stage: embedding + 23 decoder layers
+    assert len(model_cfg.pipeline_model_parallel_layout) == 8
+    # First stage: embedding + 1 decoder layer
     assert model_cfg.pipeline_model_parallel_layout[0][0] == "embedding"
-    assert model_cfg.pipeline_model_parallel_layout[0].count("decoder") == 23
-    # Last stage: 23 decoder layers + loss
-    assert model_cfg.pipeline_model_parallel_layout[1].count("decoder") == 23
-    assert "loss" in model_cfg.pipeline_model_parallel_layout[1]
+    assert model_cfg.pipeline_model_parallel_layout[0].count("decoder") == 1
+    # Last stage should have loss
+    assert "loss" in model_cfg.pipeline_model_parallel_layout[-1]
 
 
-def test_glm_45v_pipeline_layout_pp4():
-    """Test pipeline layout for PP=4."""
+def test_glm_45v_pipeline_layout_pp16():
+    """Test pipeline layout for PP=16."""
     model_cfg = _FakeModelCfg()
-    model_cfg.pipeline_model_parallel_size = 4
+    model_cfg.pipeline_model_parallel_size = 16
    model_cfg.virtual_pipeline_model_parallel_size = 1
 
     _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg)
 
-    # PP=4 should have 4 stages
+    # PP=16 should have 16 stages (full SFT layout: embedding alone, then 14 decoder stages, then the final stage with loss)
     assert model_cfg.pipeline_model_parallel_layout is not None
-    assert len(model_cfg.pipeline_model_parallel_layout) == 4
-    # First stage: embedding + 11 decoder layers
+    assert len(model_cfg.pipeline_model_parallel_layout) == 16
+    # First stage: embedding only (no decoder layers, to balance vision encoder cost)
     assert model_cfg.pipeline_model_parallel_layout[0][0] == "embedding"
+    assert model_cfg.pipeline_model_parallel_layout[0].count("decoder") == 0
     # Last stage should have loss
     assert "loss" in model_cfg.pipeline_model_parallel_layout[-1]
 
 
-def test_glm_45v_pipeline_layout_pp8():
-    """Test pipeline layout for PP=8."""
+def test_glm_45v_pipeline_layout_pp8_peft():
+    """Test pipeline layout for PP=8 with PEFT."""
     model_cfg = _FakeModelCfg()
     model_cfg.pipeline_model_parallel_size = 8
     model_cfg.virtual_pipeline_model_parallel_size = 1
 
-    _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg)
+    _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg, is_peft=True)
 
-    # PP=8 should have 8 stages
+    # PP=8 PEFT layout: embedding+5, then 6*6, then 5+loss
     assert model_cfg.pipeline_model_parallel_layout is not None
     assert len(model_cfg.pipeline_model_parallel_layout) == 8
     # First stage: embedding + 5 decoder layers
     assert model_cfg.pipeline_model_parallel_layout[0][0] == "embedding"
+    assert model_cfg.pipeline_model_parallel_layout[0].count("decoder") == 5
     # Last stage should have loss
     assert "loss" in model_cfg.pipeline_model_parallel_layout[-1]
 
 
-def test_glm_45v_pipeline_layout_pp16():
-    """Test pipeline layout for PP=16."""
+def test_glm_45v_pipeline_layout_pp16_peft():
+    """Test pipeline layout for PP=16 with PEFT."""
     model_cfg = _FakeModelCfg()
     model_cfg.pipeline_model_parallel_size = 16
     model_cfg.virtual_pipeline_model_parallel_size = 1
 
-    _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg)
+    _glm_45v_module.set_glm_45v_pipeline_model_parallel_layout(model_cfg, is_peft=True)
 
-    # PP=16 should have 16 stages
+    # PP=16 PEFT layout: embedding+2, then 3*14, then 2+loss
     assert model_cfg.pipeline_model_parallel_layout is not None
     assert len(model_cfg.pipeline_model_parallel_layout) == 16
     # First stage: embedding + 2 decoder layers
@@ -465,7 +473,7 @@ def test_glm_45v_pipeline_layout_in_config(monkeypatch: pytest.MonkeyPatch):
     monkeypatch.setattr(_glm_45v_module, "AutoBridge", _FakeAutoBridge)
 
     overrides = _safe_overrides_for("glm_45v_finetune_config")
-    overrides["pipeline_model_parallel_size"] = 2
+    overrides["pipeline_model_parallel_size"] = 8
 
     cfg = _glm_45v_module.glm_45v_finetune_config(**overrides)