
Conversation


@westonbrown commented on Aug 7, 2025:

Add LlamaCpp Model Provider Tutorial

Issue #, if available: Related to strands-agents/sdk-python#585

Summary

This PR adds the first comprehensive tutorial showcasing the new LlamaCppModel provider class (merged in #585), demonstrating how to run on-device quantized function-calling models with the Strands Agents SDK. The tutorial fills a gap in our documentation by showing developers how to deploy AI agents locally using efficient quantized models that run on resource-constrained hardware.

Value to the Repository

This tutorial is essential because it:

  1. Demonstrates the LlamaCppModel class - First official tutorial for the newly added model provider
  2. Enables offline AI - Shows how to build agents that work without internet connectivity

Key Features Demonstrated

  • Local Model Deployment: Run quantized GGUF models (4-bit, 8-bit) locally
  • Multimodal Processing: Audio transcription/translation and image analysis
  • Grammar Constraints: GBNF grammars for guaranteed output formats (see the sketch after this list)
  • Performance Optimization: Benchmarking and optimization strategies
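
As a taste of the grammar-constraint feature, here is a minimal sketch of a GBNF grammar that forces the model's output into a fixed JSON shape. The grammar itself is an illustrative example rather than code from the tutorial, and it assumes the model instance created in the Example Code Snippet further below.

# A minimal sketch (hypothetical grammar, not from the tutorial).
# GBNF restricts sampling so only tokens that keep the output valid
# under the grammar can be emitted.
sentiment_grammar = r'''
root      ::= "{" ws "\"sentiment\":" ws sentiment ws "}"
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
ws        ::= [ \t\n]*
'''
model.use_grammar_constraint(sentiment_grammar)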

Tutorial Structure

03-llamacpp-model/
├── README.md                # Setup guide and overview
├── llamacpp_demo.ipynb      # Interactive tutorial notebook
├── requirements.txt         # Python dependencies
└── utils/                   # Helper modules
    ├── audio_recorder.py    # Speech recognition interface
    ├── image_utils.py       # Image processing utilities
    ├── grammar_utils.py     # Grammar constraint demos
    └── benchmark_utils.py   # Performance testing tools

What Users Learn

  1. LlamaCppModel Setup - Configure and use the new strands.models.llamacpp.LlamaCppModel class
  2. Quantized Models - Download and run GGUF models (Qwen, Llama, Mistral, etc.)
  3. Grammar Constraints - Use GBNF grammars for controlled generation
  4. Multimodal Agents - Build agents that process audio, images, and text
  5. Performance Tuning - Optimize inference speed and memory usage
  6. Tool Integration - Add custom functions to local agents (a sketch follows this list)
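
To give a flavor of item 6, the sketch below attaches a custom function to a locally served model using the Strands @tool decorator; the word_count tool is a made-up example, not code from the tutorial.

from strands import Agent, tool
from strands.models.llamacpp import LlamaCppModel

@tool
def word_count(text: str) -> int:
    """Count the number of words in a piece of text."""
    return len(text.split())

# Point the provider at a locally running llama.cpp server
model = LlamaCppModel(base_url="http://localhost:8080")

# The agent can invoke word_count whenever a prompt calls for it
agent = Agent(model=model, tools=[word_count])
agent("How many words are in 'the quick brown fox'?")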

Example Code Snippet

from strands import Agent
from strands.models.llamacpp import LlamaCppModel

# Create local model instance
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={"temperature": 0.7, "max_tokens": 300}
)

# Use grammar constraints
model.use_grammar_constraint('root ::= "yes" | "no"')

# Create agent with local model
agent = Agent(model=model)
response = agent("Is Python a compiled language?")  # Returns: "no"
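
The snippet above assumes a llama.cpp server is already listening on port 8080. As a rough sketch of that setup step (the repository and file names are illustrative placeholders, not necessarily the ones pinned in the tutorial), a quantized GGUF model can be fetched with the huggingface_hub package and then served with llama.cpp:

# Sketch only: repo_id and filename are illustrative placeholders.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",      # example GGUF repository
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",   # 4-bit quantized weights
)

# Serve it locally so LlamaCppModel can connect, e.g.:
#   llama-server -m <model_path> --port 8080
print(f"Model downloaded to {model_path}")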

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@westonbrown (Author) commented:

FYSA @EashanKaushik

westonbrown closed this by deleting the head repository on Aug 13, 2025.