The Prompt Evaluator is a test suite that helps evaluate prompt templates and AI models. It enables Product Managers and Developers to create prompt templates with custom variables, define test cases with specific variable values and expected responses, and match generated responses against expectations exactly or fuzzily. The suite also supports comparing GraphQL query responses and measuring the accuracy of prompt templates across different AI models. With the Prompt Evaluator, Product Managers and Developers can make informed decisions, iterate on their prompt designs, and improve the overall quality and accuracy of their AI-powered applications.
- Experiments - The experiment feature allows users to create collections of prompt templates. Users can define their own conversations with various roles and prompts, incorporating variables where necessary, and evaluate prompt performance by executing prompts with different OpenAI models and associated test cases.
- Prompt Templates - Prompt templates are the building blocks of an Experiment and allow users to define their own prompts. They are highly customizable, giving users the flexibility to modify the content, format, and variables according to their requirements.
- Test Cases - The cases against which the accuracy of a prompt is evaluated. Users can define their own test cases, associate them with prompts, and express each as a list of inputs and expected outputs.
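To make these concepts concrete, here is a minimal sketch of how a prompt template with variables and a test case might be represented and rendered. The field names (`conversation`, `inputs`, `expected`) and the `$variable` syntax are illustrative assumptions, not the tool's actual schema.

```python
from string import Template

# Hypothetical prompt template: a conversation with roles and a variable.
prompt_template = {
    "name": "summarize",
    "conversation": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the following text: $text"},
    ],
}

# Hypothetical test case: variable values plus an expected output.
test_case = {
    "inputs": {"text": "Django is a Python web framework."},
    "expected": "Django is a web framework for Python.",
}

def render(template, inputs):
    """Substitute variable values into each message of the conversation."""
    return [
        {"role": m["role"], "content": Template(m["content"]).substitute(inputs)}
        for m in template["conversation"]
    ]

messages = render(prompt_template, test_case["inputs"])
```

The rendered `messages` list is what would be sent to a model, and the model's reply would then be compared against `expected`.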
By running prompt templates with different models and test cases, users gain valuable insights into the performance and suitability of their prompts for different scenarios. For detailed information on the features, please refer to the product guide.
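The exact and fuzzy response matching mentioned above can be sketched with the Python standard library. This is not the tool's implementation; the 0.8 similarity threshold is an assumed default for illustration.

```python
from difflib import SequenceMatcher

def exact_match(generated, expected):
    """Exact comparison after trimming surrounding whitespace."""
    return generated.strip() == expected.strip()

def fuzzy_match(generated, expected, threshold=0.8):
    """Similarity ratio in [0, 1]; the 0.8 threshold is an assumed default."""
    ratio = SequenceMatcher(None, generated.strip(), expected.strip()).ratio()
    return ratio >= threshold

# A response differing only by a trailing period still clears the threshold.
close_enough = fuzzy_match("The capital is Paris.", "The capital is Paris")
```

Fuzzy matching like this tolerates minor wording differences in model output, while exact matching is useful when the response must be byte-for-byte identical.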
Prompt Evaluator has two components: a backend and a frontend.
This repository contains the backend component of the Prompt Evaluator tool. It is built using Django and MongoDB and exposes a GraphQL API for the frontend to consume. The frontend is built using Next.js, and the two communicate via the GraphQL API. The backend is a standalone application that can be deployed separately. For a better understanding of the architecture, please refer to the following diagrams:
- Language: - Python 3.9
- Framework: - Django 3.2.7
- Database: - MongoDB 5.0.3
- API: - GraphQL
Follow the instructions below for installation:
- Check that you have Python version 3.9 or later installed.
- Install MongoDB - If MongoDB is not installed on your system, download the installer from the official website, run it, and follow the installation instructions. Make sure MongoDB is installed before proceeding.
- Go to the project directory, copy the contents of the .env.sample file into a .env file, and fill in values for all environment variables.
cd prompt-eval-be
# For Linux/macOS
cp .env.sample .env
# For Windows
copy .env.sample .env
- Generate an OPENAI_API_KEY using this link and update it in the .env file.
- Run the command below to set up Git LFS, which manages large files with Git.
# For macOS (Homebrew)
brew install git-lfs
- Create a Python virtual environment
python3 -m venv .venv
- Activate the virtual environment
source .venv/bin/activate
- Upgrade to the latest pip version
python -m pip install --upgrade pip
- Clone the evals submodule
git submodule update --init --recursive
pip install -e evals_framework
- Install the dependencies
pip install -r requirements.txt
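After installing the dependencies, a quick sanity check of the environment can be sketched as follows. The required variable list is an assumption based on the steps above; your .env may define more.

```python
import os
import sys

# Assumed minimum set of required variables; your .env may define more.
REQUIRED_VARS = ["OPENAI_API_KEY"]

def missing_env(required, env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]

def check(required=REQUIRED_VARS):
    """Report on the Python version and any missing environment variables."""
    if sys.version_info < (3, 9):
        return "Python 3.9 or later is required."
    missing = missing_env(required)
    if missing:
        return "Missing environment variables: " + ", ".join(missing)
    return "Environment looks OK."

print(check())
```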
Run the test cases using the following command
python manage.py test graphQL
Run the following commands to generate a test coverage report
coverage run manage.py test
coverage report
- Activate the virtual environment if not already activated
source .venv/bin/activate
- Run the API server using the following command
python manage.py runserver 8000
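Once the server is running, the GraphQL API can be exercised with a plain HTTP POST. The endpoint path and the example query below are illustrative assumptions; consult the actual schema for real field names.

```python
import json
from urllib.request import Request, urlopen

# Assumed endpoint path; check your Django URL configuration.
GRAPHQL_URL = "http://localhost:8000/graphql"

def graphql_payload(query, variables=None):
    """Encode a GraphQL query and its variables as a JSON request body."""
    return json.dumps({"query": query, "variables": variables or {}}).encode()

def run_query(query, variables=None):
    """POST a GraphQL query to the running server and return the decoded JSON."""
    request = Request(
        GRAPHQL_URL,
        data=graphql_payload(query, variables),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)

# Hypothetical query; field names depend on the actual schema.
EXAMPLE_QUERY = "query { experiments { id name } }"
```

Calling `run_query(EXAMPLE_QUERY)` against a running server would return a standard GraphQL response dict with a `data` key (and `errors` on failure).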
- Sequence diagrams: - docs/sequenceDiagram.mermaid
- DBML diagrams: - docs/db.dbml
- Product usage guide: - docs/productGuide.md
- Deployment to AWS guide: - docs/deployment.md
We welcome more helping hands to make Prompt Evaluator better. Feel free to report issues and raise PRs for fixes and enhancements. We are constantly working to address broader, more generic issues and to provide a clear, user-centric solution that unleashes your full potential. Stay tuned for exciting updates as we continue to enhance the tool.
Built with ❤️ by True Sparrow