35 changes: 33 additions & 2 deletions docs/index.md
@@ -69,10 +69,16 @@ we recommend creating a dedicated Python environment for each model.
pip install "maestro[qwen_2_5_vl]"
```

=== "SmolVLM2"

```bash
pip install "maestro[smolvlm2]"
```

### CLI

Kick off fine-tuning with our command-line interface, which leverages the configuration
and training routines defined in each model's core module. Simply specify key parameters such as
the dataset location, number of epochs, batch size, optimization strategy, and metrics.

=== "Florence-2"
@@ -108,6 +114,17 @@ the dataset location, number of epochs, batch size, optimization strategy, and metrics.
--metrics "edit_distance"
```

=== "SmolVLM2"

```bash
maestro smolvlm2 train \
--dataset "dataset/location" \
--epochs 10 \
--batch-size 4 \
--optimization_strategy "lora" \
--metrics "edit_distance"
```

### Python

For greater control, use the Python API to fine-tune your models.
Expand Down Expand Up @@ -148,7 +165,6 @@ and training setup.
```

=== "Qwen2.5-VL"

```python
from maestro.trainer.models.qwen_2_5_vl.core import train

@@ -162,3 +178,18 @@

train(config)
```

=== "SmolVLM2"
```python
from maestro.trainer.models.smolvlm2.core import train

config = {
"dataset": "dataset/location",
"epochs": 10,
"batch_size": 4,
"optimization_strategy": "lora",
"metrics": ["edit_distance"],
}

train(config)
```
99 changes: 99 additions & 0 deletions docs/models/smolvlm2.md
@@ -0,0 +1,99 @@
---
comments: true
---

## Overview

SmolVLM2 is a lightweight vision-language model developed by Hugging Face. It offers strong multimodal understanding while maintaining a compact size compared to larger VLMs. The model handles tasks such as image captioning, visual question answering, and object detection, making it accessible for applications with limited computational resources.

Built to balance performance and efficiency, SmolVLM2 provides a valuable option for developers seeking to implement vision-language capabilities without the overhead of larger models. The 500M parameter variant delivers practical results while being significantly more resource-friendly than multi-billion parameter alternatives.

## Install

```bash
pip install "maestro[smolvlm2]"
```

## Train

The training routines support various optimization strategies such as LoRA, QLoRA, and freezing the vision encoder. Customize your fine-tuning process via CLI or Python to align with your dataset and task requirements.

### CLI

Kick off training from the command line by running the command below. Be sure to replace the dataset path and adjust the hyperparameters (such as epochs and batch size) to suit your needs.

```bash
maestro smolvlm2 train \
--dataset "dataset/location" \
--epochs 10 \
--batch-size 4 \
--optimization_strategy "qlora" \
--metrics "edit_distance"
```

### Python

For more control, you can fine-tune SmolVLM2 using the Python API. Create a configuration dictionary with your training parameters and pass it to the train function to integrate the process into your custom workflow.

```python
from maestro.trainer.models.smolvlm2.core import train

config = {
"dataset": "dataset/location",
"epochs": 10,
"batch_size": 4,
"optimization_strategy": "qlora",
"metrics": ["edit_distance"],
}

results = train(config)
```

## Inference

Run SmolVLM2 inference on images with either the CLI or the Python API.

### CLI

```bash
maestro smolvlm2 predict \
--image "path/to/image.jpg" \
--prompt "Describe this image"
```

### Python

```python
from maestro.trainer.models.smolvlm2.entrypoint import SmolVLM2

model = SmolVLM2()
result = model.generate(
images="path/to/image.jpg",
prompt="Describe this image",
max_new_tokens=512
)

print(result["text"])
```

## Object Detection

SmolVLM2 can perform object detection on images, identifying and localizing objects with bounding boxes.

```python
from maestro.trainer.models.smolvlm2.entrypoint import SmolVLM2
from maestro.trainer.models.smolvlm2.detection import result_to_detections_formatter

model = SmolVLM2()
result = model.generate(
images="path/to/image.jpg",
prompt="Detect the following objects: person, car, dog"
)

# Convert text output to detections format
boxes, class_ids = result_to_detections_formatter(
text=result["text"],
resolution_wh=(640, 480),
classes=["person", "car", "dog"]
)
```
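
To visualize the detections, the returned boxes and class IDs can be handed to the supervision library. A minimal sketch, assuming the formatter returns absolute xyxy pixel coordinates matching `resolution_wh`; the output path is illustrative:

```python
import cv2
import numpy as np
import supervision as sv

# Load the same image that was passed to the model
image = cv2.imread("path/to/image.jpg")

# Wrap the formatter output in a supervision Detections object;
# xyxy is an (N, 4) float array of pixel coordinates (assumed absolute)
detections = sv.Detections(
    xyxy=np.array(boxes, dtype=np.float32).reshape(-1, 4),
    class_id=np.array(class_ids, dtype=int),
)

# Draw bounding boxes and save the annotated result
annotated = sv.BoxAnnotator().annotate(scene=image.copy(), detections=detections)
cv2.imwrite("annotated.jpg", annotated)
```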
7 changes: 7 additions & 0 deletions maestro/cli/introspection.py
@@ -28,6 +28,13 @@ def find_training_recipes(app: typer.Typer) -> None:
except Exception:
_warn_about_recipe_import_error(model_name="Qwen2.5-VL")

try:
from maestro.trainer.models.smolvlm2.entrypoint import smolvlm2_app

app.add_typer(smolvlm2_app, name="smolvlm2")
except Exception:
_warn_about_recipe_import_error(model_name="SmolVLM2")


def _warn_about_recipe_import_error(model_name: str) -> None:
disable_warnings = str2bool(
55 changes: 55 additions & 0 deletions maestro/trainer/models/smolvlm2/checkpoints.py
@@ -0,0 +1,55 @@
import os
from typing import Optional

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor


def save_checkpoint(
model: AutoModelForVision2Seq, processor: AutoProcessor, path: str, metadata: Optional[dict] = None
) -> None:
"""
Save model checkpoint.

Args:
model: Model to save
processor: Processor to save
path: Path to save checkpoint
metadata: Optional metadata to save
"""
os.makedirs(path, exist_ok=True)

# Save model
model.save_pretrained(path)

# Save processor
processor.save_pretrained(path)

# Save metadata if provided
if metadata is not None:
torch.save(metadata, os.path.join(path, "metadata.pt"))


def load_checkpoint(path: str, device: str = "cuda" if torch.cuda.is_available() else "cpu") -> dict:
"""
Load model checkpoint.

Args:
path: Path to checkpoint
device: Device to load model on

Returns:
Dictionary containing model, processor, and metadata
"""
# Load model
model = AutoModelForVision2Seq.from_pretrained(path)
model.to(device)

# Load processor
processor = AutoProcessor.from_pretrained(path)

# Load metadata if exists
metadata_path = os.path.join(path, "metadata.pt")
metadata = torch.load(metadata_path) if os.path.exists(metadata_path) else None

return {"model": model, "processor": processor, "metadata": metadata}