TensorZero creates a feedback loop for optimizing LLM applications — turning production data into smarter, faster, and cheaper models.
- Integrate our model gateway
- Send metrics or feedback
- Optimize prompts, models, and inference strategies
- Watch your LLMs improve over time
It provides a data & learning flywheel for LLMs by unifying:
- Inference: one API for all LLMs, with <1ms P99 overhead
- Observability: inference & feedback → your database
- Optimization: from prompts to fine-tuning and RL
- Experimentation: built-in A/B testing, routing, fallbacks
Website · Docs · Twitter · Slack · Discord
Quick Start (5min) · Comprehensive Tutorial · Deployment Guide · API Reference · Configuration Reference
Integrate with TensorZero once and access every major LLM provider.
Model Providers
The TensorZero Gateway natively supports every major LLM provider. Need something else? Your provider is most likely supported because TensorZero integrates with any OpenAI-compatible API (e.g. Ollama).
Features
The TensorZero Gateway supports advanced features like A/B testing, routing, and fallbacks, and is written in Rust 🦀 with performance in mind (<1ms P99 latency overhead @ 10k QPS). See Benchmarks.
You can run inference using the TensorZero client (recommended), the OpenAI client, or the HTTP API.
Usage: TensorZero Python Client (Recommended)
You can access any provider using the TensorZero Python client.
- Deploy `tensorzero/gateway` using Docker. Detailed instructions →
- Optional: Set up the TensorZero configuration.
- Run inference:
from tensorzero import TensorZeroGateway

# The gateway URL below assumes the Docker deployment from step 1 (http://localhost:3000).
with TensorZeroGateway("http://localhost:3000") as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",
        input={
            "messages": [
                {
                    "role": "user",
                    "content": "Write a haiku about artificial intelligence.",
                }
            ]
        },
    )
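The returned object includes the generated content plus identifiers that tie the inference to observability and feedback. As a rough sketch (the attribute names below reflect the client's chat response shape and are worth verifying against the API Reference):

```python
# Rough sketch of reading the response (attribute names assumed from the chat response shape)
print(response.inference_id)     # ID you can reference later, e.g. when sending feedback
print(response.content[0].text)  # the generated haiku (first content block)
```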
See Quick Start for more information.
Usage: OpenAI Python Client
You can access any provider using the OpenAI Python client with TensorZero.
- Deploy `tensorzero/gateway` using Docker. Detailed instructions →
- Set up the TensorZero configuration.
- Run inference:
from openai import OpenAI

with OpenAI(base_url="http://localhost:3000/openai/v1") as client:
    response = client.chat.completions.create(
        model="tensorzero::your_function_name",  # defined in configuration (step 2)
        messages=[
            {
                "role": "user",
                "content": "Write a haiku about artificial intelligence.",
            }
        ],
    )
See Quick Start for more information.
Usage: Other Languages & Platforms (HTTP)
TensorZero supports virtually any programming language or platform via its HTTP API.
- Deploy `tensorzero/gateway` using Docker. Detailed instructions →
- Optional: Set up the TensorZero configuration.
- Run inference:
curl -X POST "http://localhost:3000/inference" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "openai::gpt-4o-mini",
    "input": {
      "messages": [
        {
          "role": "user",
          "content": "Write a haiku about artificial intelligence."
        }
      ]
    }
  }'
See Quick Start for more information.
Send production metrics and human feedback to easily optimize your prompts, models, and inference strategies — using the UI or programmatically.
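For example, feedback can be attached to an inference programmatically with the TensorZero client. The snippet below is a minimal sketch: it assumes a boolean metric named `task_success` has been defined in your configuration, and it reuses the `inference_id` returned by the inference call.

```python
from tensorzero import TensorZeroGateway

# Minimal sketch: assumes a boolean metric named `task_success` is defined in your
# TensorZero configuration; the gateway URL matches the Docker deployment above.
with TensorZeroGateway("http://localhost:3000") as client:
    response = client.inference(
        model_name="openai::gpt-4o-mini",
        input={"messages": [{"role": "user", "content": "Write a haiku about artificial intelligence."}]},
    )

    client.feedback(
        metric_name="task_success",          # metric name defined in your configuration
        inference_id=response.inference_id,  # ties the feedback to the inference above
        value=True,                          # e.g. the user accepted the generated haiku
    )
```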
Optimize closed-source and open-source models using supervised fine-tuning (SFT) and preference fine-tuning (DPO).
- Supervised Fine-tuning — UI
- Preference Fine-tuning (DPO) — Jupyter Notebook
Boost performance by dynamically updating your prompts with relevant examples, combining responses from multiple inferences, and more. (A conceptual sketch of best-of-N sampling follows the list below.)
- Best-of-N Sampling
- Mixture-of-N Sampling
- Dynamic In-Context Learning (DICL)
- More coming soon...
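In TensorZero these optimizations are built-in, configuration-driven variant types. The sketch below only illustrates the idea behind best-of-N sampling using the gateway's inference API; it is not TensorZero's implementation, and the judge prompt is a made-up example.

```python
from tensorzero import TensorZeroGateway

# Conceptual sketch of best-of-N sampling (not TensorZero's built-in implementation):
# generate N candidates, then ask a judge to pick the most promising one.
N = 3
prompt = {"messages": [{"role": "user", "content": "Write a haiku about artificial intelligence."}]}

with TensorZeroGateway("http://localhost:3000") as client:
    # `.content[0].text` is assumed to hold the generated text, as in the earlier sketch.
    candidates = [
        client.inference(model_name="openai::gpt-4o-mini", input=prompt).content[0].text
        for _ in range(N)
    ]

    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    judge = client.inference(
        model_name="openai::gpt-4o-mini",
        input={"messages": [{"role": "user", "content": f"Pick the best haiku below. Reply with its number only.\n{numbered}"}]},
    )
    # A robust implementation would validate the judge's reply before indexing.
    best = candidates[int(judge.content[0].text.strip()) - 1]
    print(best)
```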
Optimize your prompts programmatically using research-driven optimization techniques.
Today we provide a sample integration with DSPy.
More coming soon...
Zoom in to debug individual API calls, or zoom out to monitor metrics across models and prompts over time — all using the open-source TensorZero UI.
- Observability » Inference
- Observability » Function
Watch LLMs get better at data extraction in real time with TensorZero!
Dynamic in-context learning (DICL) is a powerful inference-time optimization available out of the box with TensorZero. It enhances LLM performance by automatically incorporating relevant historical examples into the prompt, without the need for model fine-tuning.
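Conceptually, DICL retrieves historical examples that are similar to the incoming query (e.g. by embedding similarity) and inserts them into the prompt as few-shot demonstrations. The sketch below only illustrates that general idea; it is not TensorZero's implementation, and the embedding model and helper names are placeholder assumptions.

```python
import numpy as np
from openai import OpenAI

# Conceptual sketch of dynamic in-context learning (DICL), not TensorZero's implementation.
# `history` is assumed to be (input, good_output) pairs curated from past inferences.
client = OpenAI()

def embed(text: str) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def dicl_messages(query: str, history: list[tuple[str, str]], k: int = 3) -> list[dict]:
    # Rank historical examples by cosine similarity to the query...
    query_vec = embed(query)
    example_vecs = [embed(example_input) for example_input, _ in history]
    scores = [
        float(np.dot(vec, query_vec) / (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
        for vec in example_vecs
    ]
    top_k = sorted(range(len(history)), key=lambda i: scores[i], reverse=True)[:k]

    # ...and insert the most relevant ones as few-shot demonstrations before the query.
    messages = []
    for i in top_k:
        example_input, example_output = history[i]
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": query})
    return messages
```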
- The TensorZero Gateway is a high-performance model gateway written in Rust 🦀 that provides a unified API interface for all major LLM providers, allowing for seamless cross-platform integration and fallbacks.
- It handles structured schema-based inference with <1ms P99 latency overhead (see Benchmarks) and built-in observability, experimentation, and inference-time optimizations.
- It also collects downstream metrics and feedback associated with these inferences, with first-class support for multi-step LLM systems.
- Everything is stored in a ClickHouse data warehouse that you control for real-time, scalable, and developer-friendly analytics (see the query sketch after this list).
- Over time, TensorZero Recipes leverage this structured dataset to optimize your prompts and models: run pre-built recipes for common workflows like fine-tuning, or create your own with complete flexibility using any language and platform.
- Finally, the gateway's experimentation features and GitOps orchestration enable you to iterate and deploy with confidence, be it a single LLM or thousands of LLMs.
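For instance, you can analyze the stored inferences with any ClickHouse client. This is only a rough sketch: the database, table, and column names (`tensorzero`, `ChatInference`, `function_name`, `variant_name`) are assumptions for illustration, so verify them against your deployment's actual schema.

```python
import clickhouse_connect

# Rough sketch of querying the TensorZero ClickHouse database directly.
# Database, table, and column names are assumptions; check your deployment's schema.
client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", database="tensorzero"
)

rows = client.query(
    "SELECT function_name, variant_name, count() AS inferences "
    "FROM ChatInference GROUP BY function_name, variant_name"
).result_rows

for function_name, variant_name, inferences in rows:
    print(f"{function_name} / {variant_name}: {inferences} inferences")
```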
Our goal is to help engineers build, manage, and optimize the next generation of LLM applications: systems that learn from real-world experience. Read more about our Vision & Roadmap.
Start building today. The Quick Start shows it's easy to set up an LLM application with TensorZero. If you want to dive deeper, the Tutorial teaches how to build a simple chatbot, an email copilot, a weather RAG system, and a structured data extraction pipeline.
Questions? Ask us on Slack or Discord.
Using TensorZero at work? Email us at [email protected] to set up a Slack or Teams channel with your team (free).
Work with us. We're hiring in NYC. We'd also welcome open-source contributions!
We are working on a series of complete runnable examples illustrating TensorZero's data & learning flywheel.
Optimizing Data Extraction (NER) with TensorZero
This example shows how to use TensorZero to optimize a data extraction pipeline. We demonstrate techniques like fine-tuning and dynamic in-context learning (DICL). In the end, an optimized GPT-4o Mini model outperforms GPT-4o on this task — at a fraction of the cost and latency — using a small amount of training data.
Writing Haikus to Satisfy a Judge with Hidden Preferences
This example fine-tunes GPT-4o Mini to generate haikus tailored to a specific taste. You'll see TensorZero's "data flywheel in a box" in action: better variants lead to better data, and better data leads to better variants. You'll see the progress as the LLM is fine-tuned multiple times.
Improving LLM Chess Ability with Best-of-N Sampling
This example showcases how best-of-N sampling can significantly enhance an LLM's chess-playing abilities by selecting the most promising moves from multiple generated options.
Improving Math Reasoning with a Custom Recipe for Automated Prompt Engineering (DSPy)
TensorZero provides a number of pre-built optimization recipes covering common LLM engineering workflows. But you can also easily create your own recipes and workflows! This example shows how to optimize a TensorZero function using an arbitrary tool — here, DSPy.
& many more on the way!