35 changes: 33 additions & 2 deletions docs/index.md
@@ -69,10 +69,16 @@ we recommend creating a dedicated Python environment for each model.
pip install "maestro[qwen_2_5_vl]"
```

=== "SmolVLM2"

```bash
pip install "maestro[smolvlm2]"
```

### CLI

Kick off fine-tuning with our command-line interface, which leverages the configuration
and training routines defined in each model's core module. Simply specify key parameters such as
the dataset location, number of epochs, batch size, optimization strategy, and metrics.

=== "Florence-2"
@@ -108,6 +114,17 @@ the dataset location, number of epochs, batch size, optimization strategy, and metrics.
--metrics "edit_distance"
```

=== "SmolVLM2"

```bash
maestro smolvlm2 train \
--dataset "dataset/location" \
--epochs 10 \
--batch-size 4 \
--optimization_strategy "lora" \
--metrics "edit_distance"
```

### Python

For greater control, use the Python API to fine-tune your models.
Expand Down Expand Up @@ -148,7 +165,6 @@ and training setup.
```

=== "Qwen2.5-VL"

```python
from maestro.trainer.models.qwen_2_5_vl.core import train

@@ -162,3 +178,18 @@

train(config)
```

=== "SmolVLM2"
```python
from maestro.trainer.models.smolvlm2.core import train

config = {
"dataset": "dataset/location",
"epochs": 10,
"batch_size": 4,
"optimization_strategy": "lora",
"metrics": ["edit_distance"],
}

train(config)
```
99 changes: 99 additions & 0 deletions docs/models/smolvlm2.md
@@ -0,0 +1,99 @@
---
comments: true
---

## Overview

SmolVLM2 is a lightweight vision-language model developed by Hugging Face. It offers strong multimodal understanding while maintaining a compact size compared to larger VLMs. The model handles tasks such as image captioning, visual question answering, and object detection, making it accessible for applications with limited computational resources.

Built to balance performance and efficiency, SmolVLM2 provides a valuable option for developers seeking to implement vision-language capabilities without the overhead of larger models. The 500M parameter variant delivers practical results while being significantly more resource-friendly than multi-billion parameter alternatives.

## Install

```bash
pip install "maestro[smolvlm2]"
```

## Train

The training routines support various optimization strategies such as LoRA, QLoRA, and freezing the vision encoder. Customize your fine-tuning process via CLI or Python to align with your dataset and task requirements.

### CLI

Kick off training from the command line by running the command below. Be sure to replace the dataset path and adjust the hyperparameters (such as epochs and batch size) to suit your needs.

```bash
maestro smolvlm2 train \
--dataset "dataset/location" \
--epochs 10 \
--batch-size 4 \
--optimization_strategy "qlora" \
--metrics "edit_distance"
```

### Python

For more control, you can fine-tune SmolVLM2 using the Python API. Create a configuration dictionary with your training parameters and pass it to the train function to integrate the process into your custom workflow.

```python
from maestro.trainer.models.smolvlm2.core import train

config = {
"dataset": "dataset/location",
"epochs": 10,
"batch_size": 4,
"optimization_strategy": "qlora",
"metrics": ["edit_distance"],
}

results = train(config)
```

## Inference

Run SmolVLM2 inference on images with either the CLI or the Python API.

### CLI

```bash
maestro smolvlm2 predict \
--image "path/to/image.jpg" \
--prompt "Describe this image"
```

### Python

```python
from maestro.trainer.models.smolvlm2.entrypoint import SmolVLM2

model = SmolVLM2()
result = model.generate(
images="path/to/image.jpg",
prompt="Describe this image",
max_new_tokens=512
)

print(result["text"])
```

## Object Detection

SmolVLM2 can perform object detection on images, identifying and localizing objects with bounding boxes.

```python
from maestro.trainer.models.smolvlm2.entrypoint import SmolVLM2
from maestro.trainer.models.smolvlm2.detection import result_to_detections_formatter

model = SmolVLM2()
result = model.generate(
images="path/to/image.jpg",
prompt="Detect the following objects: person, car, dog"
)

# Convert text output to detections format
boxes, class_ids = result_to_detections_formatter(
text=result["text"],
resolution_wh=(640, 480),
classes=["person", "car", "dog"]
)
```
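
To visualize the detections, the returned boxes and class IDs can be handed to the supervision library. A minimal sketch, assuming the formatter returns absolute xyxy pixel coordinates matching `resolution_wh`; the output path is illustrative:

```python
import cv2
import numpy as np
import supervision as sv

# Load the same image that was passed to the model
image = cv2.imread("path/to/image.jpg")

# Wrap the formatter output in a supervision Detections object;
# xyxy is an (N, 4) float array of pixel coordinates (assumed absolute)
detections = sv.Detections(
    xyxy=np.array(boxes, dtype=np.float32).reshape(-1, 4),
    class_id=np.array(class_ids, dtype=int),
)

# Draw bounding boxes and save the annotated result
annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
cv2.imwrite("annotated.jpg", annotated)
```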
7 changes: 7 additions & 0 deletions maestro/cli/introspection.py
@@ -28,6 +28,13 @@ def find_training_recipes(app: typer.Typer) -> None:
except Exception:
_warn_about_recipe_import_error(model_name="Qwen2.5-VL")

try:
from maestro.trainer.models.smolvlm2.entrypoint import smolvlm2_app

app.add_typer(smolvlm2_app, name="smolvlm2")
except Exception:
_warn_about_recipe_import_error(model_name="SmolVLM2")


def _warn_about_recipe_import_error(model_name: str) -> None:
disable_warnings = str2bool(
55 changes: 55 additions & 0 deletions maestro/trainer/models/smolvlm2/checkpoints.py
@@ -0,0 +1,55 @@
import os
from typing import Optional

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor


def save_checkpoint(
model: AutoModelForVision2Seq, processor: AutoProcessor, path: str, metadata: Optional[dict] = None
) -> None:
"""
Save model checkpoint.

Args:
model: Model to save
processor: Processor to save
path: Path to save checkpoint
metadata: Optional metadata to save
"""
os.makedirs(path, exist_ok=True)

# Save model
model.save_pretrained(path)

# Save processor
processor.save_pretrained(path)

# Save metadata if provided
if metadata is not None:
torch.save(metadata, os.path.join(path, "metadata.pt"))


def load_checkpoint(path: str, device: str = "cuda" if torch.cuda.is_available() else "cpu") -> dict:
"""
Load model checkpoint.

Args:
path: Path to checkpoint
device: Device to load model on

Returns:
Dictionary containing model, processor, and metadata
"""
# Load model
model = AutoModelForVision2Seq.from_pretrained(path)
model.to(device)

# Load processor
processor = AutoProcessor.from_pretrained(path)

# Load metadata if exists
metadata_path = os.path.join(path, "metadata.pt")
metadata = torch.load(metadata_path) if os.path.exists(metadata_path) else None

return {"model": model, "processor": processor, "metadata": metadata}