roboflow · AlexBodner · May 10, 2025
diff --git a/docs/index.md b/docs/index.md
@@ -69,10 +69,16 @@ we recommend creating a dedicated Python environment for each model.
     pip install "maestro[qwen_2_5_vl]"
     ```
 
+=== "SmolVLM2"
+
+    ```bash
+    pip install "maestro[smolvlm_2]"
+    ```
+
 ### CLI
 
 Kick off fine-tuning with our command-line interface, which leverages the configuration
-and training routines defined in each model’s core module. Simply specify key parameters such as
+and training routines defined in each model's core module. Simply specify key parameters such as
 the dataset location, number of epochs, batch size, optimization strategy, and metrics.
 
 === "Florence-2"
@@ -108,6 +114,17 @@ the dataset location, number of epochs, batch size, optimization strategy, and m
       --metrics "edit_distance"
     ```
 
+=== "SmolVLM2"
+
+    ```bash
+    maestro smolvlm_2 train \
+      --dataset "dataset/location" \
+      --epochs 10 \
+      --batch-size 4 \
+      --optimization_strategy "lora" \
+      --metrics "edit_distance"
+    ```
+
 ### Python
 
 For greater control, use the Python API to fine-tune your models.
@@ -148,7 +165,6 @@ and training setup.
     ```
 
 === "Qwen2.5-VL"
-
     ```python
     from maestro.trainer.models.qwen_2_5_vl.core import train
 
@@ -162,3 +178,18 @@ and training setup.
 
     train(config)
     ```
+
+=== "SmolVLM2"
+    ```python
+    from maestro.trainer.models.smolvlm_2.core import train
+
+    config = {
+        "dataset": "dataset/location",
+        "epochs": 10,
+        "batch_size": 4,
+        "optimization_strategy": "lora",
+        "metrics": ["edit_distance"],
+    }
+
+    train(config)
+    ```
diff --git a/docs/models/smolvlm_2.md b/docs/models/smolvlm_2.md
@@ -0,0 +1,91 @@
+---
+comments: true
+---
+
+## Overview
+
+SmolVLM2 is a lightweight vision-language model developed by Hugging Face. It handles multimodal tasks such as image captioning, visual question answering, and object detection while staying much smaller than most VLMs, which makes it practical for applications with limited computational resources.
+
+SmolVLM2 suits developers who need vision-language capabilities without the overhead of larger models. The 500M-parameter variant delivers practical results while using far fewer resources than multi-billion-parameter alternatives.
+
+## Install
+
+```bash
+pip install "maestro[smolvlm_2]"
+```
+
+## Train
+
+The training routines support various optimization strategies such as LoRA, QLoRA, and freezing the vision encoder. Customize your fine-tuning process via CLI or Python to align with your dataset and task requirements.
+
+### CLI
+
+Kick off training from the command line by running the command below. Be sure to replace the dataset path and adjust the hyperparameters (such as epochs and batch size) to suit your needs.
+
+```bash
+maestro smolvlm_2 train \
+  --model_id "HuggingFaceTB/SmolVLM-500M-Instruct" \
+  --dataset "dataset/location" \
+  --epochs 10 \
+  --batch-size 4 \
+  --accumulate_grad_batches 4 \
+  --optimization_strategy "lora" \
+  --metrics "edit_distance"
+```
+
+
+
+### Python
+```python
+from maestro.trainer.models.smolvlm_2.core import train
+
+config = {
+    "model_id": "HuggingFaceTB/SmolVLM-500M-Instruct",
+    "dataset": "dataset/location",
+    "lr": 2e-5,
+    "epochs": 10,
+    "batch_size": 4,
+    "accumulate_grad_batches": 4,
+    "num_workers": 0,
+    "optimization_strategy": "lora",
+    "metrics": ["edit_distance"],
+    "device": "cuda"
+}
+
+
+train(config)
+```
+
+
+## Load
+
+Load a pre-trained or fine-tuned SmolVLM2 model along with its processor using the `load_model` function. Specify your model's path and the desired optimization strategy.
+
+```python
+from maestro.trainer.models.smolvlm_2.checkpoints import (
+    OptimizationStrategy, load_model
+)
+
+processor, model = load_model(
+    model_id_or_path="model/location",
+    optimization_strategy=OptimizationStrategy.NONE
+)
+```
+## Predict
+
+Perform inference with SmolVLM2 using the `predict` function. Supply an image and a text prefix to obtain predictions, such as object detection outputs or captions.
+
+```python
+from maestro.trainer.common.datasets.jsonl import JSONLDataset
+from maestro.trainer.models.smolvlm_2.inference import predict
+
+ds = JSONLDataset(
+    jsonl_file_path="dataset/location/test/annotations.jsonl",
+    image_directory_path="dataset/location/test",
+)
+
+image, entry = ds[0]
+
+predict(model=model, processor=processor, image=image, prefix=entry["prefix"])
+```
+
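Note on the training configuration in `docs/models/smolvlm_2.md`: `batch_size` and `accumulate_grad_batches` together determine how many samples contribute to each optimizer step. The sketch below works through that arithmetic. It assumes single-device training, that `accumulate_grad_batches` has the usual gradient-accumulation meaning, and that a final partial accumulation window still triggers a step. The helper names and the dataset size of 1,000 samples are hypothetical and are not part of maestro.

```python
import math


def effective_batch_size(batch_size: int, accumulate_grad_batches: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to one optimizer step."""
    return batch_size * accumulate_grad_batches * num_devices


def optimizer_steps_per_epoch(num_samples: int, batch_size: int, accumulate_grad_batches: int) -> int:
    """Optimizer steps in one epoch, counting a final partial accumulation window."""
    batches = math.ceil(num_samples / batch_size)
    return math.ceil(batches / accumulate_grad_batches)


# Values from the documented config: batch_size=4, accumulate_grad_batches=4.
print(effective_batch_size(4, 4))             # 16
print(optimizer_steps_per_epoch(1000, 4, 4))  # 63 (hypothetical 1,000-sample dataset)
```

With the documented values, each optimizer step sees 16 samples, so a learning rate tuned for a batch of 4 without accumulation may need revisiting.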
diff --git a/maestro/cli/introspection.py b/maestro/cli/introspection.py
@@ -28,6 +28,13 @@ def find_training_recipes(app: typer.Typer) -> None:
     except Exception:
         _warn_about_recipe_import_error(model_name="Qwen2.5-VL")
 
+    try:
+        from maestro.trainer.models.smolvlm_2.entrypoint import smolvlm_2_app
+
+        app.add_typer(smolvlm_2_app, name="smolvlm_2")
+    except Exception:
+        _warn_about_recipe_import_error(model_name="SmolVLM2")
+
 
 def _warn_about_recipe_import_error(model_name: str) -> None:
     disable_warnings = str2bool(
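
Note on `find_training_recipes`: each recipe import is wrapped in a broad `try`/`except Exception`, so a model whose optional dependencies are missing is skipped with a warning instead of crashing the CLI. The same guard also catches any other error raised inside the block, such as a `NameError`, which then surfaces only as that warning. Below is a stdlib-only sketch of the behavior. `register_recipe`, `registry`, and the loader functions are hypothetical stand-ins, not maestro or Typer APIs.

```python
from typing import Callable

registry: dict[str, object] = {}
warnings_emitted: list[str] = []


def register_recipe(name: str, loader: Callable[[], object]) -> None:
    """Register a recipe if its loader succeeds; otherwise record a warning."""
    try:
        registry[name] = loader()
    except Exception as error:  # broad on purpose, mirroring the guard in the diff
        warnings_emitted.append(f"{name}: {type(error).__name__}")


def load_ok() -> object:
    return object()


def load_missing_dependency() -> object:
    raise ImportError("optional dependency not installed")


def load_with_typo() -> object:
    return undefined_app  # noqa: F821 - deliberate NameError for illustration


register_recipe("ok", load_ok)
register_recipe("missing", load_missing_dependency)
register_recipe("typo", load_with_typo)

print(sorted(registry))   # ['ok']
print(warnings_emitted)   # ['missing: ImportError', 'typo: NameError']
```

Because both failures look the same to the user, a test that asserts the subcommand is actually registered is a cheap way to tell a missing extra apart from a coding mistake.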

diff --git a/maestro/trainer/models/smolvlm_2/__init__.py b/maestro/trainer/models/smolvlm_2/__init__.py
diff --git a/maestro/trainer/models/smolvlm_2/checkpoints.py b/maestro/trainer/models/smolvlm_2/checkpoints.py
@@ -0,0 +1,158 @@
+import os
+from enum import Enum
+from typing import Optional
+
+import torch
+from peft import LoraConfig, get_peft_model
+from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig
+
+from maestro.trainer.common.utils.device import parse_device_spec
+from maestro.trainer.logger import get_maestro_logger
+
+DEFAULT_SMOLVLM_2_MODEL_ID = "HuggingFaceTB/SmolVLM-500M-Instruct"  # "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
+DEFAULT_SMOLVLM_2_MODEL_REVISION = "refs/heads/main"
+DEFAULT_SMOLVLM_2_LORA_PARAMS = {
+    "r": 8,
+    "lora_alpha": 8,
+    "lora_dropout": 0.1,
+    "bias": "none",
+    "target_modules": ["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
+    "init_lora_weights": "gaussian",
+    "use_dora": True,
+}
+DEFAULT_SMOLVLM_2_QLORA_PARAMS = {
+    "r": 8,
+    "lora_alpha": 8,
+    "lora_dropout": 0.1,
+    "bias": "none",
+    "target_modules": ["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
+    "init_lora_weights": "gaussian",
+    "use_dora": False,
+}
+logger = get_maestro_logger()
+
+
+def save_checkpoint(
+    model: AutoModelForImageTextToText, processor: AutoProcessor, path: str, metadata: Optional[dict] = None
+) -> None:
+    """
+    Save model checkpoint.
+
+    Args:
+        model: Model to save
+        processor: Processor to save
+        path: Path to save checkpoint
+        metadata: Optional metadata to save
+    """
+    os.makedirs(path, exist_ok=True)
+
+    # Save model
+    model.save_pretrained(path)
+
+    # Save processor
+    processor.save_pretrained(path)
+
+    # Save metadata if provided
+    if metadata is not None:
+        torch.save(metadata, os.path.join(path, "metadata.pt"))
+
+
+def save_model(
+    target_dir: str,
+    processor: AutoProcessor,
+    model: AutoModelForImageTextToText,
+) -> None:
+    """
+    Save a SmolVLM 2 model and its processor to disk.
+
+    Args:
+        target_dir: Directory path where the model and processor will be saved.
+            Will be created if it doesn't exist.
+        processor: The SmolVLM 2 processor to save.
+        model: The SmolVLM 2 model to save.
+    """
+    os.makedirs(target_dir, exist_ok=True)
+    processor.save_pretrained(target_dir)
+    model.save_pretrained(target_dir)
+
+
+class OptimizationStrategy(Enum):
+    """Enumeration for optimization strategies."""
+
+    LORA = "lora"
+    QLORA = "qlora"
+    FREEZE = "freeze"
+    NONE = "none"
+
+
+def load_model(
+    model_id_or_path: str = DEFAULT_SMOLVLM_2_MODEL_ID,
+    revision: str = DEFAULT_SMOLVLM_2_MODEL_REVISION,
+    device: str | torch.device = "auto",
+    optimization_strategy: OptimizationStrategy = OptimizationStrategy.NONE,
+    peft_advanced_params: Optional[dict] = None,
+    cache_dir: Optional[str] = None,
+    longest_edge: int = 512,
+) -> tuple[AutoProcessor, AutoModelForImageTextToText]:
+    device = parse_device_spec(device)
+    processor = AutoProcessor.from_pretrained(
+        model_id_or_path, do_resize=True, size={"longest_edge": longest_edge}, trust_remote_code=True, revision=revision
+    )
+
+    if optimization_strategy in {OptimizationStrategy.LORA, OptimizationStrategy.QLORA}:
+        default_params = dict(  # copy so overrides never mutate the module-level defaults
+            DEFAULT_SMOLVLM_2_QLORA_PARAMS
+            if optimization_strategy == OptimizationStrategy.QLORA
+            else DEFAULT_SMOLVLM_2_LORA_PARAMS
+        )
+        if peft_advanced_params is not None:
+            default_params.update(peft_advanced_params)
+            try:
+                lora_config = LoraConfig(**default_params)
+                logger.info("Successfully created LoraConfig")
+            except TypeError:
+                logger.exception("Invalid parameters for LoraConfig")
+                raise
+        else:
+            logger.info("No additional LoRA parameters provided. Using default configuration.")
+            lora_config = LoraConfig(**default_params)
+
+        bnb_config = (
+            BitsAndBytesConfig(
+                load_in_4bit=True,
+                bnb_4bit_use_double_quant=True,
+                bnb_4bit_quant_type="nf4",
+                bnb_4bit_compute_dtype=torch.bfloat16,
+            )
+            if optimization_strategy == OptimizationStrategy.QLORA
+            else None
+        )
+
+        model = AutoModelForImageTextToText.from_pretrained(
+            pretrained_model_name_or_path=model_id_or_path,
+            revision=revision,
+            trust_remote_code=True,
+            device_map="auto",
+            quantization_config=bnb_config,
+            torch_dtype=torch.bfloat16,
+            cache_dir=cache_dir,
+            # _attn_implementation="flash_attention_2",
+        )
+        model = get_peft_model(model, lora_config)
+        model.print_trainable_parameters()
+    else:
+        model = AutoModelForImageTextToText.from_pretrained(
+            pretrained_model_name_or_path=model_id_or_path,
+            revision=revision,
+            trust_remote_code=True,
+            device_map="auto",
+            cache_dir=cache_dir,
+            torch_dtype=torch.bfloat16,
+            # _attn_implementation="flash_attention_2"
+        ).to(device)
+
+        if optimization_strategy == OptimizationStrategy.FREEZE:
+            for param in model.model.vision_model.parameters():
+                param.requires_grad = False
+
+    return processor, model
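
Note on `peft_advanced_params`: `load_model` layers user-supplied overrides on top of the default LoRA or QLoRA parameters. Below is a minimal sketch of one way to do that merge without modifying shared module-level defaults. The helper `merge_peft_params` and the trimmed `DEFAULTS` dict are hypothetical and only illustrate the pattern.

```python
from typing import Optional

# Trimmed, hypothetical stand-in for DEFAULT_SMOLVLM_2_LORA_PARAMS.
DEFAULTS = {"r": 8, "lora_alpha": 8, "lora_dropout": 0.1, "use_dora": True}


def merge_peft_params(defaults: dict, overrides: Optional[dict] = None) -> dict:
    """Return a new dict with overrides applied; `defaults` is left untouched."""
    merged = dict(defaults)  # shallow copy; update() rebinds keys, it does not mutate values
    if overrides:
        merged.update(overrides)
    return merged


params = merge_peft_params(DEFAULTS, {"r": 16, "lora_alpha": 32})
print(params["r"], params["lora_alpha"])  # 16 32
print(DEFAULTS["r"])                      # 8
```

Keeping the defaults immutable in practice means that two `load_model` calls in the same process start from the same baseline regardless of what the first call overrode.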