llamacpp-model-tutorial #139
Open
westonbrown wants to merge 1 commit into strands-agents:main from westonbrown:llamacpp-model-tutorial
39 changes: 39 additions & 0 deletions
01-tutorials/01-fundamentals/02-model-providers/03-llamacpp-model/.gitignore

@@ -0,0 +1,39 @@
# Model files
models/*.gguf

# Python
__pycache__/
*.py[cod]
*$py.class
.Python
venv/
env/

# Jupyter
.ipynb_checkpoints/
*.ipynb_checkpoints

# Audio files
*.wav
*.mp3
*.m4a

# Image files
*.png
*.jpg
*.jpeg
!docs/*.png

# Logs
*.log
server.log

# OS
.DS_Store
Thumbs.db

# IDE
.vscode/
.idea/
*.swp
*.swo
202 changes: 202 additions & 0 deletions
01-tutorials/01-fundamentals/02-model-providers/03-llamacpp-model/README.md

@@ -0,0 +1,202 @@
# LlamaCpp Provider for Strands SDK

This tutorial demonstrates how to use the LlamaCpp provider with Strands Agents. LlamaCpp runs quantized models locally and offers advanced features such as grammar constraints, multimodal support, and custom sampling parameters.

## Prerequisites

- Python 3.8+
- llama.cpp with server support ([Installation Guide](https://github.com/ggerganov/llama.cpp))
- 16 GB RAM recommended (8 GB minimum)
- 8 GB storage for model files

## Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Download Model Files

Download the quantized Qwen2.5-Omni model for multimodal capabilities:

```bash
# Create models directory
mkdir -p models && cd models

# Download main model (4.68 GB)
huggingface-cli download ggml-org/Qwen2.5-Omni-7B-GGUF \
    Qwen2.5-Omni-7B-Q4_K_M.gguf --local-dir .

# Download multimodal projector (1.55 GB)
huggingface-cli download ggml-org/Qwen2.5-Omni-7B-GGUF \
    mmproj-Qwen2.5-Omni-7B-Q8_0.gguf --local-dir .

cd ..
```

Both files are required for audio and vision support.

### 3. Start LlamaCpp Server

```bash
llama-server -m models/Qwen2.5-Omni-7B-Q4_K_M.gguf \
    --mmproj models/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf \
    --host 0.0.0.0 --port 8080 -c 8192 -ngl 50 --jinja
```

Key parameters:
- `-m`: Path to the main model file
- `--mmproj`: Path to the multimodal projector
- `-c`: Context window size (set to 8192 here)
- `-ngl`: Number of layers to offload to the GPU (0 for CPU-only)
- `--jinja`: Enable Jinja chat-template support (needed for tool calling)

The server logs "loaded multimodal model" when ready.
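
Before opening the notebook, you can optionally confirm the server is reachable. A minimal sketch that polls llama-server's `/health` endpoint (present in recent llama.cpp builds; the response body varies by version, and the `requests` package is assumed to be installed):

```python
import requests

# llama-server answers /health with HTTP 200 once the model has finished loading.
resp = requests.get("http://localhost:8080/health", timeout=5)
print(resp.status_code, resp.text)
```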

### 4. Run the Tutorial

```bash
jupyter notebook llamacpp_demo.ipynb
```

## Key Features

### Grammar Constraints

Control output format using GBNF (GGML BNF, an extension of Backus-Naur Form) grammars:

```python
model.use_grammar_constraint('root ::= "yes" | "no"')
```
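
For a fuller picture, here is a hedged sketch of wiring the constraint into an agent. The `LlamaCppModel` class, its import path, and the `base_url` argument are assumptions about the provider API (the notebook shows the authoritative setup); `use_grammar_constraint` comes from the snippet above:

```python
from strands import Agent
from strands.models.llamacpp import LlamaCppModel  # import path is an assumption

# Point the provider at the llama-server instance started in step 3.
model = LlamaCppModel(base_url="http://localhost:8080")

# Restrict every completion to exactly "yes" or "no".
model.use_grammar_constraint('root ::= "yes" | "no"')

agent = Agent(model=model)
print(agent("Is 17 a prime number?"))  # output is forced to match the grammar
```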

### Advanced Sampling

LlamaCpp provides fine-grained control over text generation (a configuration sketch follows this list):

- **Mirostat**: Dynamic perplexity control
- **TFS**: Tail-free sampling for quality improvement
- **Min-p**: Minimum probability threshold
- **Custom sampler ordering**: Control over the sampling pipeline
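
The dictionary keys below mirror llama.cpp's option names as listed in the parameter reference later in this README; how the provider forwards a `params` dictionary is an assumption:

```python
# Sampling configuration sketch; exact plumbing through the provider may differ.
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={
        "mirostat": 2,          # Mirostat 2.0: dynamic perplexity control
        "min_p": 0.05,          # drop tokens below 5% of the top token's probability
        "repeat_penalty": 1.1,  # mildly penalize token repetition
    },
)
```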

### Structured Output

Generate validated JSON output using Pydantic models:

```python
agent.structured_output(MyModel, "Generate user data")
```
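
As a concrete stand-in for `MyModel`, a minimal Pydantic schema (the `UserProfile` fields are illustrative, and `agent` is reused from the grammar example above):

```python
from pydantic import BaseModel

class UserProfile(BaseModel):
    name: str
    age: int
    email: str

# Returns a validated UserProfile instance rather than raw text.
profile = agent.structured_output(UserProfile, "Generate user data for a test account")
print(profile.name, profile.age, profile.email)
```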

### Multimodal Capabilities

- **Audio Input**: Process speech and audio files
- **Vision Input**: Analyze images (a sketch follows this list)
- **Combined Processing**: Simultaneous audio-visual understanding
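
A sketch of passing an image alongside text; the content-block shape follows the Bedrock-style message format Strands uses elsewhere, and its exact form for this provider is an assumption (see the notebook for the authoritative version):

```python
# Read an image and send it with a text prompt; the block shape is assumed.
with open("photo.png", "rb") as f:
    image_bytes = f.read()

result = agent([
    {"text": "Describe what you see in this image."},
    {"image": {"format": "png", "source": {"bytes": image_bytes}}},
])
```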

### Performance Optimization

- Prompt caching for repeated queries (see the sketch after this list)
- Slot-based session management
- GPU acceleration with configurable layers
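
A sketch of the caching knobs; `cache_prompt` and `slot_id` appear in the parameter reference below, while passing them through a `params` dictionary is an assumption:

```python
# Reuse the server-side prompt cache across calls and pin this agent to slot 0,
# so repeated queries sharing a prefix skip re-processing that prefix.
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={"cache_prompt": True, "slot_id": 0},
)
```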

## Tutorial Content

The Jupyter notebook demonstrates:

1. **Grammar Constraints**: Enforce specific output formats
2. **Sampling Strategies**: Compare generation quality with different parameters
3. **Structured Output**: Type-safe data generation
4. **Tool Integration**: Function calling with LlamaCpp (a sketch follows this list)
5. **Audio Processing**: Speech recognition and understanding
6. **Image Analysis**: Visual content interpretation
7. **Multimodal Agents**: Combined audio-visual processing
8. **Performance Testing**: Optimization techniques and benchmarks
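
For item 4, a hedged sketch of registering a tool; the `@tool` decorator follows the Strands SDK's documented pattern, the `word_count` function is illustrative, and `model` is reused from the grammar example:

```python
from strands import Agent, tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# Tool calling requires the server to be started with --jinja (see step 3).
agent = Agent(model=model, tools=[word_count])
print(agent("Use the word_count tool on: 'the quick brown fox'"))
```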

## Additional Examples

The `examples/` directory contains standalone Python scripts demonstrating specific features.

> **Review comment:** examples directory doesn't exist. could you please remove this from readme?

## Parameter Reference

### Standard Parameters

- `temperature`: Controls randomness (0.0-2.0)
- `max_tokens`: Maximum response length
- `top_p`: Nucleus sampling threshold
- `frequency_penalty`: Reduce repetition
- `presence_penalty`: Encourage topic diversity

### LlamaCpp-Specific Parameters

Parameters specific to the llama.cpp backend (combined usage is sketched after the list):

- `grammar`: GBNF grammar string
- `json_schema`: JSON schema for structured output
- `mirostat`: Enable Mirostat sampling (0, 1, or 2)
- `min_p`: Minimum probability cutoff
- `repeat_penalty`: Penalize token repetition
- `cache_prompt`: Enable prompt caching
- `slot_id`: Session slot for multi-user support
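
A sketch mixing standard and LlamaCpp-specific parameters in one configuration (names are taken from the lists above; pass-through via `params` is an assumption):

```python
model = LlamaCppModel(
    base_url="http://localhost:8080",
    params={
        "temperature": 0.7,        # standard: randomness (0.0-2.0)
        "top_p": 0.9,              # standard: nucleus sampling threshold
        "frequency_penalty": 0.2,  # standard: reduce repetition
        "min_p": 0.05,             # llama.cpp-specific: minimum probability cutoff
    },
)
```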

See the notebook for detailed parameter usage examples.

## Hardware Requirements

### Minimum Configuration

- 8 GB RAM
- 4 GB VRAM (or CPU-only mode)
- 8 GB storage

### Recommended Configuration

- 16 GB RAM
- 8 GB+ VRAM
- CUDA-capable GPU or Apple Silicon

### GPU Acceleration

- **NVIDIA**: Requires the CUDA toolkit
- **Apple Silicon**: Metal support included
- **AMD**: ROCm support (experimental)
- **CPU Mode**: Set `-ngl 0` when starting the server

## About Quantized Models

### What is Quantization?

Quantization reduces model size by storing weights at lower numerical precision (e.g., 4-bit instead of 16-bit). This makes it possible to run large language models on consumer hardware with minimal quality loss.
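
A rough size check makes the trade-off concrete. Assuming Q4_K_M averages about 4.9 bits per weight (an estimate inferred from the figures below, not an official number): 7.6 × 10⁹ weights × 4.9 bits ÷ 8 bits/byte ≈ 4.7 GB, which lines up with the 4.68 GB file size listed for the model.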

### Qwen2.5-Omni-7B Model

- **Parameters**: 7.6 billion
- **Quantization**: 4-bit (Q4_K_M format)
- **Size**: 4.68 GB (vs ~15 GB unquantized)
- **Context**: 8,192 tokens (expandable to 32K)
- **Languages**: 23 languages supported

### LlamaCpp vs Ollama

| Feature | LlamaCpp | Ollama |
|---------|----------|--------|
| **Model Format** | GGUF files | Modelfile abstraction |
| **Control** | Full parameter access | Simplified interface |
| **Features** | Grammar, multimodal, sampling | Basic generation |
| **Use Case** | Advanced applications | Quick prototyping |

LlamaCpp provides lower-level control suitable for production applications that require specific output formats or advanced features.

## Troubleshooting

### Common Issues

1. **Server won't start**: Verify the llama.cpp installation and model file paths
2. **Out of memory**: Reduce GPU layers with the `-ngl` parameter
3. **No multimodal support**: Ensure both model files are downloaded
4. **Slow performance**: Enable GPU acceleration or reduce the context size

### Additional Resources

- [LlamaCpp Documentation](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [Strands SDK Documentation](https://docs.strands.dev)

## License

MIT
> **Review comment:** wrong file name