DeepEval provides a Pythonic way to run offline evaluations on your LLM pipelines so you can launch into production with confidence. The guiding philosophy is a "Pytest for LLMs" that aims to make productionizing and evaluating LLMs as easy as ensuring all tests pass.
Would you like to be onboarded, get a demo, or hear about our roadmap? Feel free to book a time on our Calendly here: https://calendly.com/d/z7h-75h-6dz/confident-ai-demo
We highly recommend getting started by reading our documentation here: https://docs.confident-ai.com/docs/
Join our Discord: https://discord.gg/a3K9c8GRGt
- Opinionated tests for answer relevancy, factual consistency, toxicity, and bias
- Web UI to view tests, implementations, and comparisons
- Auto-evaluation through synthetic query-answer creation
pip install deepeval
Watch a YouTube video on how to get started here: Get started in under 1 minute
# Optional - if you want a web UI
deepeval login
# Or log in directly with your API key and implementation name
deepeval login --api-key $API_KEY --implementation-name "sample"
# Generate a sample test file
deepeval test generate --output-file test_sample.py
# Run the generated test
deepeval test run test_sample.py
# You can also point the runner at a test file under a tests/ folder
deepeval test run tests/test_sample.py
Grab your API key from https://app.confident-ai.com to start logging!
# test_example.py
import os
import openai

from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test

openai.api_key = "sk-XXX"

# Write a sample ChatGPT function
def generate_chatgpt_output(query: str):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "assistant", "content": "The customer success phone line is 1200-231-231 and the customer success state is in Austin."},
            {"role": "user", "content": query}
        ]
    )
    output = response.choices[0].message.content
    return output

def test_llm_output():
    query = "What is the customer success phone line?"
    expected_output = "Our customer success phone line is 1200-231-231."
    output = generate_chatgpt_output(query)
    test_case = LLMTestCase(query=query, output=output, expected_output=expected_output)
    metric = FactualConsistencyMetric()
    assert_test(test_case, metrics=[metric])
Once you have set that up, you can run the test just as you would with pytest:
deepeval test run test_example.py
# Output
Running tests ... ✅
Once you have run your tests, you should be able to see your dashboard at https://app.confident-ai.com
To define a custom metric, you simply need to implement the measure and is_successful methods.
from deepeval.test_case import LLMTestCase
from deepeval.metrics.metric import Metric
from deepeval.run_test import assert_test

class LengthMetric(Metric):
    """This metric checks if the output is more than 3 letters"""

    def __init__(self, minimum_length: int = 3):
        self.minimum_length = minimum_length

    def measure(self, test_case: LLMTestCase):
        # The score is the raw length of the output; success means it exceeds the minimum
        text = test_case.output
        score = len(text)
        self.success = bool(score > self.minimum_length)
        return score

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Length"

def test_length_metric():
    metric = LengthMetric()
    test_case = LLMTestCase(
        output="This is a long sentence that is more than 3 letters"
    )
    assert_test(test_case, [metric])
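For a quick sanity check outside the test runner, you can also call a custom metric directly. Here is a minimal sketch, assuming the LengthMetric class defined above is in scope:

# Call the custom metric above directly, without deepeval's test runner
metric = LengthMetric(minimum_length=10)
test_case = LLMTestCase(output="Short")

score = metric.measure(test_case)     # the raw score is the output length, here 5
print(score, metric.is_successful())  # -> 5 False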
We integrate DeepEval tightly with common frameworks such as LangChain and LlamaIndex.
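As a rough sketch of what this can look like (not the official integration API), you can treat a LangChain chain's output like any other string and wrap it in an LLMTestCase. The snippet below assumes a pre-1.0 langchain release and an OpenAI API key in your environment:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test

# A toy chain that answers questions from a fixed context
prompt = PromptTemplate(
    input_variables=["question"],
    template=(
        "Answer using only this context: "
        "The customer success phone line is 1200-231-231.\n"
        "Question: {question}"
    ),
)
chain = LLMChain(llm=ChatOpenAI(model_name="gpt-3.5-turbo"), prompt=prompt)

def test_langchain_output():
    query = "What is the customer success phone line?"
    output = chain.run(question=query)
    test_case = LLMTestCase(
        query=query,
        output=output,
        expected_output="Our customer success phone line is 1200-231-231.",
    )
    assert_test(test_case, metrics=[FactualConsistencyMetric()])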
Generating synthetic queries allows you to quickly evaluate your prompts against a broad set of realistic inputs without writing every query by hand. We help developers get up and running with plenty of example queries.
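The deepeval test generate command shown in the quickstart scaffolds such tests for you. As an illustration of the underlying idea (the generate_synthetic_queries helper below is hypothetical, not a DeepEval API), you can also generate synthetic queries with a plain openai call and push each one through the same test-case-plus-metric flow shown earlier:

import openai

from deepeval.metrics.factual_consistency import FactualConsistencyMetric
from deepeval.test_case import LLMTestCase
from deepeval.run_test import assert_test

# The sample ChatGPT function from test_example.py above
from test_example import generate_chatgpt_output

CONTEXT = "The customer success phone line is 1200-231-231."

# Hypothetical helper: ask the model to invent questions answerable from the context
def generate_synthetic_queries(context: str, n: int = 3):
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write {n} short user questions answerable from this context, one per line: {context}",
        }],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def test_synthetic_queries():
    metric = FactualConsistencyMetric()
    for query in generate_synthetic_queries(CONTEXT):
        output = generate_chatgpt_output(query)
        # For this toy context, the context sentence itself stands in for the expected answer
        test_case = LLMTestCase(query=query, output=output, expected_output=CONTEXT)
        assert_test(test_case, metrics=[metric])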
Set up a simple dashboard in just 1 line of code. You can read more about how to do this in our documentation.
Our upcoming roadmap:
- Project view in the web UI
- Integration with HumanEval
- Integration with Microsoft Guidance
- Guardrail integrations (NVIDIA NeMo, Guardrails AI)
Our motivation behind this is to streamline the testing process for Large Language Model (LLM) applications such as Retrieval-Augmented Generation (RAG). We intend to accomplish this by making the process of writing tests as straightforward as authoring unit tests in Python.
Any seasoned Python developer knows the importance of having something like Pytest, a default testing suite renowned for its clean, user-friendly interface that makes test-writing an efficient and hassle-free task. Yet, when we transition from traditional software development to the realm of Machine Learning (ML), this streamlined process becomes, perhaps surprisingly, a rarity.
In the ML world, feedback often arrives as a raw evaluation loss, which deviates from the structured way feedback is usually consumed in software development.
As we see a rising tide in the deployment of agents, LLMs, and AI, an unfulfilled necessity emerges: a tool that extends the same familiar abstractions and tooling found in general software development to ML engineers. The aim? Facilitate a more rapid feedback loop that expedites iterative improvements.
A gap like this in the ML landscape underscores the importance of establishing a new type of testing framework tailored specifically to LLMs, ensuring engineers can keep evolving their prompts, agents, and LLMs while continually adding new items to their test suite.
Introducing DeepEval: your tool for easy and efficient LLM testing. Time to revolutionize how we write, run, automate, and manage our LLM tests!
Built by the Confident AI Team. For any questions or business enquiries, please contact [email protected]
@misc{deepeval,
  author = {Jacky Wong},
  title = {DeepEval: Framework to unit test LLMs},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/confident-ai/deepeval}},
}