From c4a30dd7a42d2249979ccd9cea9ec2eb3d61adbc Mon Sep 17 00:00:00 2001
From: Aaron Brown
Date: Tue, 9 Sep 2025 12:34:26 -0500
Subject: [PATCH 1/3] Added llamacpp doc;
---
 .../concepts/model-providers/llamacpp.md | 449 ++++++++++++++++++
 1 file changed, 449 insertions(+)
 create mode 100644 docs/user-guide/concepts/model-providers/llamacpp.md

diff --git a/docs/user-guide/concepts/model-providers/llamacpp.md b/docs/user-guide/concepts/model-providers/llamacpp.md
new file mode 100644
index 00000000..1e774675
--- /dev/null
+++ b/docs/user-guide/concepts/model-providers/llamacpp.md
@@ -0,0 +1,449 @@
# llama.cpp

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a high-performance C++ inference engine for running large language models locally. Strands provides native support for llama.cpp servers, enabling you to run quantized models efficiently on resource-constrained hardware, including edge devices.

The [`LlamaCppModel`](../../../api-reference/models.md#strands.models.llamacpp) class in Strands enables seamless integration with llama.cpp's OpenAI-compatible API, supporting:

- Text generation with advanced sampling parameters
- Multimodal capabilities (audio and images)
- Tool/function calling
- Grammar-constrained generation
- Native JSON schema validation
- Streaming responses
- Prompt caching for performance

## Getting Started

### Prerequisites

First, install the Strands Agents packages in your Python environment:
```bash
pip install strands-agents strands-agents-tools
```

Note: llama.cpp support is included in the base Strands package and requires no additional dependencies.

### Model Selection

llama.cpp supports any GGUF-format quantized model. Popular options include:
- **Llama 3**: Meta's latest foundation models
- **Qwen2.5**: Alibaba's multilingual models with multimodal variants
- **Mistral**: High-performance open models
- **Phi**: Microsoft's efficient small models

You can find GGUF models on [Hugging Face](https://huggingface.co/models?search=gguf).

#### Quantization Formats

Choose the right quantization for your needs:
- **Q4_K_M**: Best balance of quality and size (recommended for most users)
- **Q5_K_M**: Higher quality, slightly larger
- **Q8_0**: Near-original quality, much larger
- **Q3_K_S**: Smaller size, reduced quality (for edge devices)

Next, you'll need to install and set up a llama.cpp server.

#### Option 1: Native Installation

1. Build llama.cpp from source:
   ```bash
   git clone https://github.com/ggml-org/llama.cpp
   cd llama.cpp
   make
   ```

2. Download a quantized model using the Hugging Face CLI:
   ```bash
   # Install Hugging Face CLI if needed
   pip install huggingface-hub

   # Create models directory
   mkdir -p models && cd models

   # Example: Download Qwen2.5 7B quantized model
   huggingface-cli download ggml-org/Qwen2.5-7B-GGUF \
     Qwen2.5-7B-Q4_K_M.gguf --local-dir .

   # For multimodal models, also download the projector
   huggingface-cli download ggml-org/Qwen2.5-Omni-7B-GGUF \
     mmproj-Qwen2.5-Omni-7B-Q8_0.gguf --local-dir .

   cd ..
   ```

3. Start the llama.cpp server:
   ```bash
   # Basic text model
   ./llama-server -m models/Qwen2.5-7B-Q4_K_M.gguf \
     --host 0.0.0.0 --port 8080 -c 8192 --jinja

   # Multimodal model (vision/audio support; requires the matching Qwen2.5-Omni GGUF and its mmproj projector)
   ./llama-server -m models/Qwen2.5-Omni-7B-Q4_K_M.gguf \
     --mmproj models/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf \
     --host 0.0.0.0 --port 8080 -c 8192 -ngl 50 --jinja
   ```

#### Option 2: Docker Installation

1. 
Pull the llama.cpp Docker image: + ```bash + docker pull ghcr.io/ggml-org/llama.cpp:server + ``` + +2. Run the llama.cpp container with a model: + ```bash + docker run -d -v /path/to/models:/models -p 8080:8080 \ + ghcr.io/ggml-org/llama.cpp:server \ + -m /models/model.gguf --host 0.0.0.0 --port 8080 + ``` + +3. Verify the server is running: + ```bash + curl http://localhost:8080/health + ``` + +## Basic Usage + +Here's how to create an agent using a llama.cpp model: + +```python +from strands import Agent +from strands.models.llamacpp import LlamaCppModel + +# Create a llama.cpp model instance +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080", # llama.cpp server address + model_id="default" # Model identifier (usually "default") +) + +# Create an agent using the llama.cpp model +agent = Agent(model=llamacpp_model) + +# Use the agent +agent("Tell me about Strands agents.") # Prints model output to stdout by default +``` + +## Configuration Options + +The [`LlamaCppModel`](../../../api-reference/models.md#strands.models.llamacpp) supports extensive [configuration parameters](../../../api-reference/models.md#strands.models.llamacpp.LlamaCppModel.LlamaCppConfig): + +### Standard Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `base_url` | The address of the llama.cpp server | "http://localhost:8080" | +| `model_id` | The model identifier | "default" | +| `temperature` | Controls randomness (0.0-2.0) | None | +| `max_tokens` | Maximum number of tokens to generate | None | +| `top_p` | Nucleus sampling parameter | None | +| `frequency_penalty` | Frequency penalty (-2.0 to 2.0) | None | +| `presence_penalty` | Presence penalty (-2.0 to 2.0) | None | +| `stop` | List of stop sequences | None | +| `seed` | Random seed for reproducibility | None | + +### llama.cpp-Specific Parameters + +| Parameter | Description | Default | +|-----------|-------------|---------| +| `repeat_penalty` | Penalize repeat tokens (1.0 = no penalty) | None | +| `top_k` | Top-k sampling (0 = disabled) | None | +| `min_p` | Min-p sampling threshold (0.0-1.0) | None | +| `typical_p` | Typical-p sampling (0.0-1.0) | None | +| `tfs_z` | Tail-free sampling parameter | None | +| `mirostat` | Mirostat sampling mode (0, 1, or 2) | None | +| `mirostat_lr` | Mirostat learning rate | None | +| `mirostat_ent` | Mirostat target entropy | None | +| `grammar` | GBNF grammar for constrained generation | None | +| `json_schema` | JSON schema for structured output | None | +| `cache_prompt` | Cache prompt for faster generation | None | + +### Example with Configuration + +```python +from strands import Agent +from strands.models.llamacpp import LlamaCppModel + +# Create a configured llama.cpp model +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080", + model_id="default", + params={ + "temperature": 0.7, + "max_tokens": 500, + "repeat_penalty": 1.1, + "top_k": 40, + "cache_prompt": True + } +) + +# Create an agent with the configured model +agent = Agent(model=llamacpp_model) + +# Use the agent +response = agent("Write a short story about an AI assistant.") +``` + +## Advanced Features + +### Grammar-Constrained Generation + +llama.cpp supports GBNF grammar constraints to ensure output follows specific patterns: + +```python +from strands import Agent +from strands.models.llamacpp import LlamaCppModel + +# Create model with grammar constraint +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080", + params={ + "grammar": ''' + root ::= answer + 
answer ::= "yes" | "no" | "maybe" + ''' + } +) + +agent = Agent(model=llamacpp_model) + +# Response will be constrained to "yes", "no", or "maybe" +response = agent("Is the Earth flat?") +``` + +### Advanced Sampling Parameters + +llama.cpp offers sophisticated sampling control for fine-tuning output quality. Here are recommended configurations for different use cases: + +```python +# High Quality (Slower, more accurate) +high_quality_model = LlamaCppModel( + base_url="http://localhost:8080", + params={ + "temperature": 0.3, + "top_k": 10, + "repeat_penalty": 1.2, + "max_tokens": 500 + } +) + +# Balanced Performance (Good quality, reasonable speed) +balanced_model = LlamaCppModel( + base_url="http://localhost:8080", + params={ + "temperature": 0.7, + "top_k": 40, + "min_p": 0.05, + "cache_prompt": True + } +) + +# Creative Writing (More varied output) +creative_model = LlamaCppModel( + base_url="http://localhost:8080", + params={ + "temperature": 0.9, + "top_p": 0.95, + "typical_p": 0.95, + "repeat_penalty": 1.1, + "mirostat": 2, + "mirostat_ent": 5.0 + } +) + +# Speed Optimized (Fastest inference) +speed_model = LlamaCppModel( + base_url="http://localhost:8080", + params={ + "temperature": 0.8, + "top_k": 20, + "cache_prompt": True, + "n_probs": 0 # Disable probability computation + } +) +``` + +### Updating Configuration at Runtime + +You can update the model configuration during runtime: + +```python +# Create the model with initial configuration +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080", + params={"temperature": 0.7} +) + +# Update configuration later +llamacpp_model.update_config( + params={ + "temperature": 0.9, + "top_k": 50 + } +) +``` + +### Structured Output + +llama.cpp supports structured output through native JSON schema validation. When you use [`Agent.structured_output()`](../../../api-reference/agent.md#strands.agent.agent.Agent.structured_output), the model constrains its output to match your schema: + +```python +from pydantic import BaseModel, Field +from strands import Agent +from strands.models.llamacpp import LlamaCppModel + +class BookAnalysis(BaseModel): + """Analyze a book's key information.""" + title: str = Field(description="The book's title") + author: str = Field(description="The book's author") + genre: str = Field(description="Primary genre or category") + summary: str = Field(description="Brief summary of the book") + rating: int = Field(description="Rating from 1-10", ge=1, le=10) + +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080" +) + +agent = Agent(model=llamacpp_model) + +result = agent.structured_output( + BookAnalysis, + """ + Analyze this book: "The Hitchhiker's Guide to the Galaxy" by Douglas Adams. + It's a science fiction comedy about Arthur Dent's adventures through space + after Earth is destroyed. It's widely considered a classic of humorous sci-fi. + """ +) + +print(f"Title: {result.title}") +print(f"Author: {result.author}") +print(f"Genre: {result.genre}") +print(f"Rating: {result.rating}") +``` + +### Multimodal Support + +For models that support multimodal input (e.g., Qwen2.5-Omni), llama.cpp can process audio and images. 
The SDK automatically handles the formatting for multimodal content: + +```python +# Audio processing example with Qwen2.5-Omni +audio_message = { + "role": "user", + "content": [ + { + "type": "audio", + "audio": { + "data": base64_encoded_audio, # Base64 encoded audio + "format": "wav" + } + }, + { + "type": "text", + "text": "Please transcribe what was said and identify the language." + } + ] +} + +# Image analysis example +from PIL import Image +import io +import base64 + +# Load and encode image +img = Image.open("example.png") +img_bytes = io.BytesIO() +img.save(img_bytes, format='PNG') +img_base64 = base64.b64encode(img_bytes.getvalue()).decode() + +image_message = { + "role": "user", + "content": [ + { + "type": "image", + "image": { + "data": img_base64, + "format": "png" + } + }, + { + "type": "text", + "text": "Describe this image in detail." + } + ] +} + +response = agent([image_message]) +``` + +Note: Multimodal support requires: +1. A multimodal model (e.g., Qwen2.5-Omni) +2. The multimodal projector file (mmproj) +3. Starting the server with `--mmproj` flag + +## Tool Support + +llama.cpp models with function calling support can use tools through Strands' tool system: + +```python +from strands import Agent +from strands.models.llamacpp import LlamaCppModel +from strands_tools import calculator, current_time + +# Create a llama.cpp model +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080" +) + +# Create an agent with tools +agent = Agent( + model=llamacpp_model, + tools=[calculator, current_time] +) + +# Use the agent with tools +response = agent("What's the square root of 144 plus the current time?") +``` + +## Performance Optimization + +### Prompt Caching + +Enable prompt caching for faster subsequent queries: + +```python +llamacpp_model = LlamaCppModel( + base_url="http://localhost:8080", + params={"cache_prompt": True} +) +``` + +### Server Optimization + +Optimize the llama.cpp server for your hardware: + +```bash +# GPU acceleration (NVIDIA) - offload layers to GPU +./llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 50 + +# Full recommended configuration +./llama-server -m model.gguf \ + --host 0.0.0.0 \ + --port 8080 \ + -c 8192 \ # Larger context window + -ngl 50 \ # GPU layers (adjust based on VRAM) + --jinja \ # Enable Jinja templating + --batch-size 512 # Optimize batch processing +``` + +Key optimization flags: +- `-ngl`: Number of layers to offload to GPU (0-100, adjust based on VRAM) +- `-c`: Context size (default 512, increase for longer conversations) +- `--jinja`: Required for proper chat template processing +- `--parallel`: Number of parallel slots for concurrent requests +- `--batch-size`: Batch size for prompt processing + +## Related Resources + +- [llama.cpp Documentation](https://github.com/ggml-org/llama.cpp) +- [llama.cpp Server Documentation](https://github.com/ggml-org/llama.cpp/tree/master/tools/server) +- [GGUF Model Format](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) +- [Hugging Face GGUF Models](https://huggingface.co/models?search=gguf) \ No newline at end of file From c78731ae681c431842f981df7c29a8f3f2770e50 Mon Sep 17 00:00:00 2001 From: Aaron Brown Date: Tue, 9 Sep 2025 12:51:25 -0500 Subject: [PATCH 2/3] updated mkdocs with llamacpp; --- mkdocs.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/mkdocs.yml b/mkdocs.yml index 6ae5af57..e64906ee 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -91,6 +91,7 @@ nav: - Amazon Bedrock: user-guide/concepts/model-providers/amazon-bedrock.md - Anthropic: 
user-guide/concepts/model-providers/anthropic.md - LiteLLM: user-guide/concepts/model-providers/litellm.md + - llama.cpp: user-guide/concepts/model-providers/llamacpp.md - LlamaAPI: user-guide/concepts/model-providers/llamaapi.md - MistralAI: user-guide/concepts/model-providers/mistral.md - Ollama: user-guide/concepts/model-providers/ollama.md From 050f5bcc2f81cc56520abb2a54b5328da9c14679 Mon Sep 17 00:00:00 2001 From: Aaron Brown Date: Mon, 15 Sep 2025 11:34:20 -0500 Subject: [PATCH 3/3] refined documentation to align with required headings format; --- .../concepts/model-providers/llamacpp.md | 484 +++++------------- 1 file changed, 126 insertions(+), 358 deletions(-) diff --git a/docs/user-guide/concepts/model-providers/llamacpp.md b/docs/user-guide/concepts/model-providers/llamacpp.md index 1e774675..b6c7df33 100644 --- a/docs/user-guide/concepts/model-providers/llamacpp.md +++ b/docs/user-guide/concepts/model-providers/llamacpp.md @@ -1,200 +1,147 @@ # llama.cpp -[llama.cpp](https://github.com/ggml-org/llama.cpp) is a high-performance C++ inference engine for running large language models locally. Strands provides native support for llama.cpp servers, enabling you to run quantized models efficiently on resource-constrained hardware including edge devices. +[llama.cpp](https://github.com/ggml-org/llama.cpp) is a high-performance C++ inference engine for running large language models locally. The Strands Agents SDK implements a llama.cpp provider, allowing you to run agents against any llama.cpp server with quantized models. -The [`LlamaCppModel`](../../../api-reference/models.md#strands.models.llamacpp) class in Strands enables seamless integration with llama.cpp's OpenAI-compatible API, supporting: +## Installation -- Text generation with advanced sampling parameters -- Multimodal capabilities (audio and images) -- Tool/function calling -- Grammar-constrained generation -- Native JSON schema validation -- Streaming responses -- Prompt caching for performance +llama.cpp support is included in the base Strands Agents package. To install, run: -## Getting Started - -### Prerequisites - -First install the python client into your python environment: ```bash pip install strands-agents strands-agents-tools ``` -Note: llama.cpp support is included in the base Strands package and requires no additional dependencies. - -### Model Selection - -llama.cpp supports any GGUF-format quantized model. Popular options include: -- **Llama 3**: Meta's latest foundation models -- **Qwen2.5**: Alibaba's multilingual models with multimodal variants -- **Mistral**: High-performance open models -- **Phi**: Microsoft's efficient small models - -You can find GGUF models on [Hugging Face](https://huggingface.co/models?search=gguf). - -#### Quantization Formats - -Choose the right quantization for your needs: -- **Q4_K_M**: Best balance of quality and size (recommended for most users) -- **Q5_K_M**: Higher quality, slightly larger -- **Q8_0**: Near-original quality, much larger -- **Q3_K_S**: Smaller size, reduced quality (for edge devices) - -Next, you'll need to install and setup a llama.cpp server. - -#### Option 1: Native Installation - -1. Build llama.cpp from source: - ```bash - git clone https://github.com/ggml-org/llama.cpp - cd llama.cpp - make - ``` - -2. 
Download a quantized model using Hugging Face CLI: - ```bash - # Install Hugging Face CLI if needed - pip install huggingface-hub - - # Create models directory - mkdir -p models && cd models - - # Example: Download Qwen2.5 7B quantized model - huggingface-cli download ggml-org/Qwen2.5-7B-GGUF \ - Qwen2.5-7B-Q4_K_M.gguf --local-dir . - - # For multimodal models, also download the projector - huggingface-cli download ggml-org/Qwen2.5-Omni-7B-GGUF \ - mmproj-Qwen2.5-Omni-7B-Q8_0.gguf --local-dir . - - cd .. - ``` - -3. Start the llama.cpp server: - ```bash - # Basic text model - ./llama-server -m models/Qwen2.5-7B-Q4_K_M.gguf \ - --host 0.0.0.0 --port 8080 -c 8192 --jinja - - # Multimodal model (with vision/audio support) - ./llama-server -m models/Qwen2.5-Omni-7B-Q4_K_M.gguf \ - --mmproj models/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf \ - --host 0.0.0.0 --port 8080 -c 8192 -ngl 50 --jinja - ``` - -#### Option 2: Docker Installation - -1. Pull the llama.cpp Docker image: - ```bash - docker pull ghcr.io/ggml-org/llama.cpp:server - ``` - -2. Run the llama.cpp container with a model: - ```bash - docker run -d -v /path/to/models:/models -p 8080:8080 \ - ghcr.io/ggml-org/llama.cpp:server \ - -m /models/model.gguf --host 0.0.0.0 --port 8080 - ``` - -3. Verify the server is running: - ```bash - curl http://localhost:8080/health - ``` - -## Basic Usage - -Here's how to create an agent using a llama.cpp model: +## Usage + +After setting up a llama.cpp server, you can import and initialize the Strands Agents' llama.cpp provider as follows: ```python from strands import Agent from strands.models.llamacpp import LlamaCppModel +from strands_tools import calculator -# Create a llama.cpp model instance -llamacpp_model = LlamaCppModel( - base_url="http://localhost:8080", # llama.cpp server address - model_id="default" # Model identifier (usually "default") +model = LlamaCppModel( + base_url="http://localhost:8080", + # **model_config + model_id="default", + params={ + "max_tokens": 1000, + "temperature": 0.7, + "repeat_penalty": 1.1, + } ) -# Create an agent using the llama.cpp model -agent = Agent(model=llamacpp_model) - -# Use the agent -agent("Tell me about Strands agents.") # Prints model output to stdout by default +agent = Agent(model=model, tools=[calculator]) +response = agent("What is 2+2") +print(response) ``` -## Configuration Options - -The [`LlamaCppModel`](../../../api-reference/models.md#strands.models.llamacpp) supports extensive [configuration parameters](../../../api-reference/models.md#strands.models.llamacpp.LlamaCppModel.LlamaCppConfig): - -### Standard Parameters - -| Parameter | Description | Default | -|-----------|-------------|---------| -| `base_url` | The address of the llama.cpp server | "http://localhost:8080" | -| `model_id` | The model identifier | "default" | -| `temperature` | Controls randomness (0.0-2.0) | None | -| `max_tokens` | Maximum number of tokens to generate | None | -| `top_p` | Nucleus sampling parameter | None | -| `frequency_penalty` | Frequency penalty (-2.0 to 2.0) | None | -| `presence_penalty` | Presence penalty (-2.0 to 2.0) | None | -| `stop` | List of stop sequences | None | -| `seed` | Random seed for reproducibility | None | - -### llama.cpp-Specific Parameters - -| Parameter | Description | Default | -|-----------|-------------|---------| -| `repeat_penalty` | Penalize repeat tokens (1.0 = no penalty) | None | -| `top_k` | Top-k sampling (0 = disabled) | None | -| `min_p` | Min-p sampling threshold (0.0-1.0) | None | -| `typical_p` | Typical-p sampling 
(0.0-1.0) | None | -| `tfs_z` | Tail-free sampling parameter | None | -| `mirostat` | Mirostat sampling mode (0, 1, or 2) | None | -| `mirostat_lr` | Mirostat learning rate | None | -| `mirostat_ent` | Mirostat target entropy | None | -| `grammar` | GBNF grammar for constrained generation | None | -| `json_schema` | JSON schema for structured output | None | -| `cache_prompt` | Cache prompt for faster generation | None | - -### Example with Configuration +To connect to a remote llama.cpp server, you can specify a different base URL: ```python -from strands import Agent -from strands.models.llamacpp import LlamaCppModel - -# Create a configured llama.cpp model -llamacpp_model = LlamaCppModel( - base_url="http://localhost:8080", +model = LlamaCppModel( + base_url="http://your-server:8080", model_id="default", params={ "temperature": 0.7, - "max_tokens": 500, - "repeat_penalty": 1.1, - "top_k": 40, "cache_prompt": True } ) +``` + +## Configuration -# Create an agent with the configured model -agent = Agent(model=llamacpp_model) +### Server Setup -# Use the agent -response = agent("Write a short story about an AI assistant.") +Before using LlamaCppModel, you need a running llama.cpp server with a GGUF model: + +```bash +# Download a model (e.g., using Hugging Face CLI) +huggingface-cli download ggml-org/Qwen2.5-7B-GGUF \ + Qwen2.5-7B-Q4_K_M.gguf --local-dir ./models + +# Start the server +llama-server -m models/Qwen2.5-7B-Q4_K_M.gguf \ + --host 0.0.0.0 --port 8080 -c 8192 --jinja ``` +### Model Configuration + +The `model_config` configures the underlying model selected for inference. The supported configurations are: + +| Parameter | Description | Example | Default | +|-----------|-------------|---------|---------| +| `base_url` | llama.cpp server URL | `http://localhost:8080` | `http://localhost:8080` | +| `model_id` | Model identifier | `default` | `default` | +| `params` | Model parameters | `{"temperature": 0.7, "max_tokens": 1000}` | `None` | + +### Supported Parameters + +Standard parameters: + +- `temperature`, `max_tokens`, `top_p`, `frequency_penalty`, `presence_penalty`, `stop`, `seed` + +llama.cpp-specific parameters: + +- `repeat_penalty`, `top_k`, `min_p`, `typical_p`, `tfs_z`, `mirostat`, `grammar`, `json_schema`, `cache_prompt` + +## Troubleshooting + +### Connection Refused + +If you encounter connection errors, ensure: + +1. The llama.cpp server is running (`llama-server` command) +2. The server URL and port are correct +3. No firewall is blocking the connection + +### Context Window Overflow + +If you get context overflow errors: + +- Increase context size with `-c` flag when starting server +- Reduce input size +- Enable prompt caching with `cache_prompt: True` + ## Advanced Features -### Grammar-Constrained Generation +### Structured Output -llama.cpp supports GBNF grammar constraints to ensure output follows specific patterns: +llama.cpp models support structured output through native JSON schema validation. 
When you use [`Agent.structured_output()`](../../../api-reference/agent.md#strands.agent.agent.Agent.structured_output), the SDK uses llama.cpp's json_schema parameter to constrain output: ```python +from pydantic import BaseModel, Field from strands import Agent from strands.models.llamacpp import LlamaCppModel -# Create model with grammar constraint -llamacpp_model = LlamaCppModel( +class PersonInfo(BaseModel): + """Extract person information from text.""" + name: str = Field(description="Full name of the person") + age: int = Field(description="Age in years") + occupation: str = Field(description="Job or profession") + +model = LlamaCppModel( + base_url="http://localhost:8080", + model_id="default", +) + +agent = Agent(model=model) + +result = agent.structured_output( + PersonInfo, + "John Smith is a 30-year-old software engineer working at a tech startup." +) + +print(f"Name: {result.name}") # "John Smith" +print(f"Age: {result.age}") # 30 +print(f"Job: {result.occupation}") # "software engineer" +``` + +### Grammar Constraints + +llama.cpp supports GBNF grammar constraints to ensure output follows specific patterns: + +```python +model = LlamaCppModel( base_url="http://localhost:8080", params={ "grammar": ''' @@ -204,246 +151,67 @@ llamacpp_model = LlamaCppModel( } ) -agent = Agent(model=llamacpp_model) - -# Response will be constrained to "yes", "no", or "maybe" -response = agent("Is the Earth flat?") +agent = Agent(model=model) +response = agent("Is the Earth flat?") # Will only output "yes", "no", or "maybe" ``` -### Advanced Sampling Parameters +### Advanced Sampling -llama.cpp offers sophisticated sampling control for fine-tuning output quality. Here are recommended configurations for different use cases: +llama.cpp offers sophisticated sampling parameters for fine-tuning output: ```python -# High Quality (Slower, more accurate) -high_quality_model = LlamaCppModel( +# High-quality output (slower) +model = LlamaCppModel( base_url="http://localhost:8080", params={ "temperature": 0.3, "top_k": 10, "repeat_penalty": 1.2, - "max_tokens": 500 - } -) - -# Balanced Performance (Good quality, reasonable speed) -balanced_model = LlamaCppModel( - base_url="http://localhost:8080", - params={ - "temperature": 0.7, - "top_k": 40, - "min_p": 0.05, - "cache_prompt": True } ) -# Creative Writing (More varied output) -creative_model = LlamaCppModel( +# Creative writing +model = LlamaCppModel( base_url="http://localhost:8080", params={ "temperature": 0.9, "top_p": 0.95, - "typical_p": 0.95, - "repeat_penalty": 1.1, "mirostat": 2, - "mirostat_ent": 5.0 - } -) - -# Speed Optimized (Fastest inference) -speed_model = LlamaCppModel( - base_url="http://localhost:8080", - params={ - "temperature": 0.8, - "top_k": 20, - "cache_prompt": True, - "n_probs": 0 # Disable probability computation + "mirostat_ent": 5.0, } ) ``` -### Updating Configuration at Runtime - -You can update the model configuration during runtime: - -```python -# Create the model with initial configuration -llamacpp_model = LlamaCppModel( - base_url="http://localhost:8080", - params={"temperature": 0.7} -) - -# Update configuration later -llamacpp_model.update_config( - params={ - "temperature": 0.9, - "top_k": 50 - } -) -``` - -### Structured Output - -llama.cpp supports structured output through native JSON schema validation. 
When you use [`Agent.structured_output()`](../../../api-reference/agent.md#strands.agent.agent.Agent.structured_output), the model constrains its output to match your schema: - -```python -from pydantic import BaseModel, Field -from strands import Agent -from strands.models.llamacpp import LlamaCppModel - -class BookAnalysis(BaseModel): - """Analyze a book's key information.""" - title: str = Field(description="The book's title") - author: str = Field(description="The book's author") - genre: str = Field(description="Primary genre or category") - summary: str = Field(description="Brief summary of the book") - rating: int = Field(description="Rating from 1-10", ge=1, le=10) - -llamacpp_model = LlamaCppModel( - base_url="http://localhost:8080" -) - -agent = Agent(model=llamacpp_model) - -result = agent.structured_output( - BookAnalysis, - """ - Analyze this book: "The Hitchhiker's Guide to the Galaxy" by Douglas Adams. - It's a science fiction comedy about Arthur Dent's adventures through space - after Earth is destroyed. It's widely considered a classic of humorous sci-fi. - """ -) - -print(f"Title: {result.title}") -print(f"Author: {result.author}") -print(f"Genre: {result.genre}") -print(f"Rating: {result.rating}") -``` - ### Multimodal Support -For models that support multimodal input (e.g., Qwen2.5-Omni), llama.cpp can process audio and images. The SDK automatically handles the formatting for multimodal content: +For multimodal models like Qwen2.5-Omni, llama.cpp can process images and audio: ```python -# Audio processing example with Qwen2.5-Omni -audio_message = { - "role": "user", - "content": [ - { - "type": "audio", - "audio": { - "data": base64_encoded_audio, # Base64 encoded audio - "format": "wav" - } - }, - { - "type": "text", - "text": "Please transcribe what was said and identify the language." - } - ] -} - -# Image analysis example +# Requires multimodal model and --mmproj flag when starting server from PIL import Image -import io import base64 +import io -# Load and encode image +# Image analysis img = Image.open("example.png") img_bytes = io.BytesIO() img.save(img_bytes, format='PNG') img_base64 = base64.b64encode(img_bytes.getvalue()).decode() image_message = { - "role": "user", + "role": "user", "content": [ - { - "type": "image", - "image": { - "data": img_base64, - "format": "png" - } - }, - { - "type": "text", - "text": "Describe this image in detail." - } + {"type": "image", "image": {"data": img_base64, "format": "png"}}, + {"type": "text", "text": "Describe this image"} ] } response = agent([image_message]) ``` -Note: Multimodal support requires: -1. A multimodal model (e.g., Qwen2.5-Omni) -2. The multimodal projector file (mmproj) -3. 
Starting the server with `--mmproj` flag - -## Tool Support - -llama.cpp models with function calling support can use tools through Strands' tool system: - -```python -from strands import Agent -from strands.models.llamacpp import LlamaCppModel -from strands_tools import calculator, current_time - -# Create a llama.cpp model -llamacpp_model = LlamaCppModel( - base_url="http://localhost:8080" -) - -# Create an agent with tools -agent = Agent( - model=llamacpp_model, - tools=[calculator, current_time] -) - -# Use the agent with tools -response = agent("What's the square root of 144 plus the current time?") -``` - -## Performance Optimization - -### Prompt Caching - -Enable prompt caching for faster subsequent queries: - -```python -llamacpp_model = LlamaCppModel( - base_url="http://localhost:8080", - params={"cache_prompt": True} -) -``` - -### Server Optimization - -Optimize the llama.cpp server for your hardware: - -```bash -# GPU acceleration (NVIDIA) - offload layers to GPU -./llama-server -m model.gguf --host 0.0.0.0 --port 8080 -ngl 50 - -# Full recommended configuration -./llama-server -m model.gguf \ - --host 0.0.0.0 \ - --port 8080 \ - -c 8192 \ # Larger context window - -ngl 50 \ # GPU layers (adjust based on VRAM) - --jinja \ # Enable Jinja templating - --batch-size 512 # Optimize batch processing -``` - -Key optimization flags: -- `-ngl`: Number of layers to offload to GPU (0-100, adjust based on VRAM) -- `-c`: Context size (default 512, increase for longer conversations) -- `--jinja`: Required for proper chat template processing -- `--parallel`: Number of parallel slots for concurrent requests -- `--batch-size`: Batch size for prompt processing - -## Related Resources +## References -- [llama.cpp Documentation](https://github.com/ggml-org/llama.cpp) +- [API](../../../api-reference/models.md) +- [llama.cpp](https://github.com/ggml-org/llama.cpp) - [llama.cpp Server Documentation](https://github.com/ggml-org/llama.cpp/tree/master/tools/server) -- [GGUF Model Format](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) -- [Hugging Face GGUF Models](https://huggingface.co/models?search=gguf) \ No newline at end of file +- [GGUF Models on Hugging Face](https://huggingface.co/models?search=gguf) \ No newline at end of file