|
1 | | -# Lightspeed Core Evaluation |
2 | | -Evaluation tooling for lightspeed-core project |
| 1 | +# LightSpeed Evaluation Framework |
3 | 2 |
|
4 | | -## Installation |
5 | | -- **Requires Python 3.11** |
6 | | -- Install `uv` |
7 | | -- Check `uv --version` is working |
8 | | -- If running Python 3.11 from `venv`, make sure no conflicting packages are installed. In case of problems create a clean venv for Python 3.11 and `uv`. |
9 | | -- Run `uv sync` |
10 | | -- Optional: For development, run `make install-tools` |
11 | | - + if `uv` is not installed this will install `uv` by running `pip install uv` in your current Python environment. |
| 3 | +A comprehensive framework for evaluating GenAI applications. |
12 | 4 |
|
| 5 | +## 🎯 Key Features |
13 | 6 |
|
14 | | -## Description |
15 | | -Currently we have 2 types of evaluations. |
16 | | -1. `consistency`: Ability to compare responses against ground-truth answer for specific provider+model. Objective of this evaluation is to flag any variation in specific provider+model response. Currently a combination of similarity distances are used to calculate final score. Cut-off scores are used to flag any deviations. This also stores a .csv file with query, pre-defined answer, API response & score. Input for this is a [json file](eval_data/question_answer_pair.json) |
| 7 | +- **Multi-Framework Support**: Seamlessly use metrics from Ragas, DeepEval, and custom implementations |
| 8 | +- **Turn & Conversation-Level Evaluation**: Support for both individual queries and multi-turn conversations |
| 9 | +- **LLM Provider Flexibility**: OpenAI, Anthropic, Watsonx, Azure, Gemini, Ollama via LiteLLM |
| 10 | +- **Flexible Configuration**: YAML-driven environment and metric metadata |
| 11 | +- **Rich Output**: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps) |
| 12 | +- **Early Validation**: Catch configuration errors before expensive LLM calls |
| 13 | +- **Statistical Analysis**: Summary statistics for every metric, with score distribution analysis |
| 14 | +- **Agent Evaluation**: Framework for evaluating AI agent performance (future integration planned) |
17 | 15 |
|
18 | | -2. `model`: Ability to compare responses against single ground-truth answer. Here we can do evaluation for more than one provider+model at a time. This creates a json file as summary report with scores (f1-score) for each provider+model. Along with selected QnAs from above json file, we can also provide additional QnAs using a parquet file (optional). [Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title. |
| 16 | +## 🚀 Quick Start |
19 | 17 |
|
20 | | -  |
| 18 | +### Installation |
21 | 19 |
|
22 | | -**Notes** |
23 | | -- QnAs should `not` be used for model training or tuning. This is created only for evaluation purpose. |
24 | | -- QnAs were generated from OCP docs by LLMs. It is possible that some of the questions/answers are not entirely correct. We are constantly trying to verify both Questions & Answers manually. If you find any QnA pair to be modified or removed, please create a PR. |
25 | | -- OLS API should be ready/live with all the required provider+model configured. |
26 | | -- It is possible that we want to run both consistency and model evaluation together. To avoid multiple API calls for same query, *model* evaluation first checks .csv file generated by *consistency* evaluation. If response is not present in csv file, then only we call API to get the response. |
| 20 | +```bash |
| 21 | +# From Git |
| 22 | +pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git |
27 | 23 |
|
28 | | -### e2e test case |
29 | | - |
30 | | -These evaluations are also part of **e2e test cases**. Currently *consistency* evaluation is parimarily used to gate PRs. Final e2e suite will also invoke *model* evaluation which will use .csv files generated by earlier suites, if any file is not present then last suite will fail. |
31 | | - |
32 | | -### Usage |
33 | | -``` |
34 | | -uv run evaluate |
| 24 | +# Local Development |
| 25 | +uv sync |
35 | 26 | ``` |
36 | 27 |
|
37 | | -### Input Data/QnA pool |
38 | | -[Json file](eval_data/question_answer_pair.json) |
39 | | - |
40 | | -[Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet) |
41 | | - |
42 | | -Please refer above files for the structure, add new data accordingly. |
43 | | - |
44 | | -### Arguments |
45 | | -**eval_type**: This will control which evaluation, we want to do. Currently we have 3 options. |
46 | | -1. `consistency` -> Compares model specific answer for QnAs provided in json file |
47 | | -2. `model` -> Compares set of models based on their response and generates a summary report. For this we can provide additional QnAs in parquet format, along with json file. |
48 | | -3. `all` -> Both of the above evaluations. |
49 | | - |
50 | | -**eval_api_url**: OLS API url. Default is `http://localhost:8080`. If deployed in a cluster, then pass cluster API url. |
51 | | - |
52 | | -**eval_api_token_file**: Path to a text file containing OLS API token. Required, if OLS is deployed in cluster. |
53 | | - |
54 | | -**eval_scenario**: This is primarily required to indetify which pre-defined answers need to be compared. Values can be `with_rag`, `without_rag`. Currently we always do evaluation for the API with rag. |
55 | | - |
56 | | -**eval_query_ids**: Option to give set of query ids for evaluation. By default all queries are processed. |
57 | | - |
58 | | -**eval_provider_model_id**: We can provide set of provider/model combinations as ids for comparison. |
59 | | - |
60 | | -**qna_pool_file**: Applicable only for `model` evaluation. Provide file path to the parquet file having additional QnAs. Default is None. |
61 | | - |
62 | | -**eval_out_dir**: Directory, where output csv/json files will be saved. |
63 | | - |
64 | | -**eval_metrics**: By default all scores/metrics are calculated, but this decides which scores will be used to create the graph. |
65 | | -This is a list of metrics. Ex: cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score. |
66 | | - |
67 | | -**judge_provider / judge_model**: Provider / Model for judge LLM. This is required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through config yaml file. [Sample provider/model configuration](../../examples/olsconfig.yaml) |
68 | | - |
69 | | -**eval_modes**: Apart from OLS api, we may want to evaluate vanilla model or with just OLS paramaters/prompt/RAG so that we can have baseline score. This is a list of modes. Ex: vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api). |
| 28 | +### Basic Usage |
| 29 | + |
| 30 | +```bash |
| 31 | +# Set API key |
| 32 | +export OPENAI_API_KEY="your-key" |
| 33 | + |
| 34 | +# Run evaluation |
| 35 | +lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml |
| 36 | +``` |
| | + |
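| | +The same run can also be scripted, for example in CI. The sketch below simply shells out to the CLI with Python's `subprocess` and checks the exit code; it assumes a non-zero exit status signals a failed evaluation, which should be verified against the actual CLI behaviour. |
| | + |
| | +```python |
| | +"""Run lightspeed-eval from a script (illustrative sketch only).""" |
| | +import subprocess |
| | +import sys |
| | + |
| | +cmd = [ |
| | +    "lightspeed-eval", |
| | +    "--system-config", "config/system.yaml", |
| | +    "--eval-data", "config/evaluation_data.yaml", |
| | +] |
| | + |
| | +# Assumption: a non-zero exit code means the evaluation run failed. |
| | +result = subprocess.run(cmd, capture_output=True, text=True) |
| | +print(result.stdout) |
| | +if result.returncode != 0: |
| | +    print(result.stderr, file=sys.stderr) |
| | +    sys.exit("Evaluation run failed") |
| | +``` |
| | + |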
| 37 | +## 📊 Supported Metrics |
| 38 | + |
| 39 | +### Turn-Level (Single Query) |
| 40 | +- **Ragas** |
| 41 | + - Response Evaluation |
| 42 | + - `faithfulness` |
| 43 | + - `response_relevancy` |
| 44 | + - Context Evaluation |
| 45 | + - `context_recall` |
| 46 | + - `context_relevance` |
| 47 | + - `context_precision_without_reference` |
| 48 | + - `context_precision_with_reference` |
| 49 | +- **Custom** |
| 50 | + - Response Evaluation |
| 51 | + - `answer_correctness` |
| 52 | + |
| 53 | +### Conversation-Level (Multi-turn) |
| 54 | +- **DeepEval** |
| 55 | + - `conversation_completeness` |
| 56 | + - `conversation_relevancy` |
| 57 | + - `knowledge_retention` |
| 58 | + |
| 59 | +## ⚙️ Configuration |
| 60 | + |
| 61 | +### System Config (`config/system.yaml`) |
| 62 | +```yaml |
| 63 | +llm: |
| 64 | + provider: "openai" |
| 65 | + model: "gpt-4o-mini" |
| 66 | + temperature: 0.0 |
| 67 | + timeout: 120 |
| 68 | +
|
| 69 | +metrics_metadata: |
| 70 | + turn_level: |
| 71 | + "ragas:faithfulness": |
| 72 | + threshold: 0.8 |
| 73 | + type: "turn" |
| 74 | + framework: "ragas" |
| 75 | + |
| 76 | + conversation_level: |
| 77 | + "deepeval:conversation_completeness": |
| 78 | + threshold: 0.8 |
| 79 | + type: "conversation" |
| 80 | + framework: "deepeval" |
| 81 | +``` |
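| | + |
| | +The `provider`/`model` pair follows LiteLLM's `provider/model` routing convention, which is what lets the single `llm` block cover OpenAI, Anthropic, Watsonx, Azure, Gemini, and Ollama. The sketch below illustrates that routing in isolation; it is not the framework's internal code, and the judge prompt is purely illustrative. |
| | + |
| | +```python |
| | +"""Illustration of the provider/model -> LiteLLM mapping (not framework internals).""" |
| | +import yaml |
| | +from litellm import completion |
| | + |
| | +with open("config/system.yaml", encoding="utf-8") as f: |
| | +    llm = yaml.safe_load(f)["llm"] |
| | + |
| | +response = completion( |
| | +    model=f"{llm['provider']}/{llm['model']}",  # e.g. "openai/gpt-4o-mini" |
| | +    temperature=llm.get("temperature", 0.0), |
| | +    timeout=llm.get("timeout", 120), |
| | +    messages=[{"role": "user", "content": "Score this answer for faithfulness."}], |
| | +) |
| | +print(response.choices[0].message.content) |
| | +``` |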
70 | 82 |
|
71 | | -### Outputs |
72 | | -Evaluation scripts creates below files. |
73 | | -- CSV file with response for given provider/model & modes. |
74 | | -- response evaluation result with scores (for consistency check). |
75 | | -- Final csv file with all results, json score summary & graph (for model evaluation) |
| 83 | +### Evaluation Data (`config/evaluation_data.yaml`) |
| 84 | +```yaml |
| 85 | +- conversation_group_id: "test_conversation" |
| 86 | + description: "Sample evaluation" |
| 87 | + |
| 88 | + # Turn-level metrics (empty list = skip turn evaluation) |
| 89 | + turn_metrics: |
| 90 | + - "ragas:faithfulness" |
| 91 | + - "custom:answer_correctness" |
| 92 | + |
| 93 | + # Turn-level metrics metadata (threshold + other properties) |
| 94 | + turn_metrics_metadata: |
| 95 | + "ragas:response_relevancy": |
| 96 | + threshold: 0.8 |
| 97 | + weight: 1.0 |
| 98 | + "custom:answer_correctness": |
| 99 | + threshold: 0.75 |
| 100 | + |
| 101 | + # Conversation-level metrics (empty list = skip conversation evaluation) |
| 102 | + conversation_metrics: |
| 103 | + - "deepeval:conversation_completeness" |
| 104 | + |
| 105 | + turns: |
| 106 | + - turn_id: 1 |
| 107 | + query: "What is OpenShift?" |
| 108 | + response: "Red Hat OpenShift powers the entire application lifecycle...." |
| 109 | + contexts: |
| 110 | + - content: "Red Hat OpenShift powers...." |
| 111 | + expected_response: "Red Hat OpenShift...." |
| 112 | +``` |
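| | + |
| | +A metric's effective threshold can therefore come from two places: the per-conversation `turn_metrics_metadata` shown above or the system-level `metrics_metadata` defaults. The sketch below shows one plausible resolution order (conversation-level entries win over system defaults); the framework's actual precedence rules may differ. |
| | + |
| | +```python |
| | +"""Plausible threshold resolution (assumed precedence, not the actual implementation).""" |
| | + |
| | +SYSTEM_DEFAULTS = {  # from metrics_metadata.turn_level in system.yaml |
| | +    "ragas:faithfulness": {"threshold": 0.8}, |
| | +} |
| | +CONVERSATION_OVERRIDES = {  # from turn_metrics_metadata in evaluation_data.yaml |
| | +    "ragas:response_relevancy": {"threshold": 0.8, "weight": 1.0}, |
| | +    "custom:answer_correctness": {"threshold": 0.75}, |
| | +} |
| | + |
| | +def resolve_threshold(metric: str, default: float = 0.7) -> float: |
| | +    """Assumption: conversation-level metadata wins over system defaults.""" |
| | +    for layer in (CONVERSATION_OVERRIDES, SYSTEM_DEFAULTS): |
| | +        if "threshold" in layer.get(metric, {}): |
| | +            return layer[metric]["threshold"] |
| | +    return default |
| | + |
| | +def status(metric: str, score: float) -> str: |
| | +    return "PASS" if score >= resolve_threshold(metric) else "FAIL" |
| | + |
| | +print(status("custom:answer_correctness", 0.8))  # PASS (threshold 0.75) |
| | +print(status("ragas:faithfulness", 0.7))         # FAIL (threshold 0.8) |
| | +``` |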
76 | 113 |
|
77 | | -[Evaluation Result](eval_data/result/README.md) |
| 114 | +## 📈 Output & Visualization |
| 115 | + |
| 116 | +### Generated Reports |
| 117 | +- **CSV**: Detailed results with status, scores, reasons |
| 118 | +- **JSON**: Summary statistics with score distributions |
| 119 | +- **TXT**: Human-readable summary |
| 120 | +- **PNG**: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown) |
| 121 | + |
| 122 | +### Key Metrics in Output |
| 123 | +- **PASS/FAIL/ERROR**: Status based on thresholds |
| 124 | +- **Actual Reasons**: DeepEval provides LLM-generated explanations; custom metrics provide detailed reasoning |
| 125 | +- **Score Statistics**: Mean, median, standard deviation, min/max for every metric |
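| | + |
| | +The CSV report can also be post-processed directly. The sketch below reproduces the per-metric statistics using only the standard library; the file path and the column names (`metric`, `score`, `status`) are assumptions, so check the header of the generated file first. |
| | + |
| | +```python |
| | +"""Summarise an evaluation CSV report (path and column names are assumptions).""" |
| | +import csv |
| | +import statistics |
| | +from collections import defaultdict |
| | + |
| | +scores, passed, total = defaultdict(list), defaultdict(int), defaultdict(int) |
| | + |
| | +with open("eval_output/results.csv", newline="", encoding="utf-8") as f:  # hypothetical path |
| | +    for row in csv.DictReader(f): |
| | +        metric = row["metric"]                      # assumed column name |
| | +        total[metric] += 1 |
| | +        passed[metric] += row["status"] == "PASS"   # assumed column name |
| | +        if row["score"]:                            # ERROR rows may lack a score |
| | +            scores[metric].append(float(row["score"])) |
| | + |
| | +for metric, values in scores.items(): |
| | +    print( |
| | +        f"{metric}: pass_rate={passed[metric] / total[metric]:.0%} " |
| | +        f"mean={statistics.mean(values):.3f} median={statistics.median(values):.3f} " |
| | +        f"stdev={statistics.pstdev(values):.3f} min={min(values):.3f} max={max(values):.3f}" |
| | +    ) |
| | +``` |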
| 126 | + |
| 127 | +## 🧪 Development |
| 128 | + |
| 129 | +### Development Tools |
| 130 | +```bash |
| 131 | +uv sync --group dev |
| 132 | +uv run black . |
| 133 | +uv run ruff check . |
| 134 | +uv run mypy . |
| 135 | +uv run pyright . |
| 136 | +uv run pylint . |
| 137 | +uv run pytest tests --cov=src |
| 138 | +``` |
78 | 139 |
|
79 | 140 | ## Agent Evaluation |
80 | 141 | For a detailed walkthrough of the new agent-evaluation framework, refer to |
81 | 142 | [lsc_agent_eval/README.md](lsc_agent_eval/README.md) |
82 | 143 |
|
83 | | -## RAG retrieval script |
84 | | -``` |
85 | | -python -m scripts.evaluation.query_rag |
86 | | -``` |
87 | | -This is used to generate a .csv file having retrieved chunks for given set of queries with similarity score. This is not part of actual evaluation. But useful to do a spot check to understand the text that we send to LLMs as context (this may explain any deviation in the response) |
88 | | - |
89 | | -#### Arguments |
90 | | -*db-path*: Path to the RAG index |
91 | | - |
92 | | -*product-index*: RAG index ID |
93 | | - |
94 | | -*model-path*: Path or name of the embedding model |
| 144 | +## Generate answers (optional, for creating test data) |
| 145 | +To generate answers for test data, refer to [README-generate-answers](README-generate-answers.md). |
95 | 146 |
|
96 | | -*queries*: Set of queries separated by space. If not passed default queries are used. |
| 147 | +## 📄 License & Contributing |
97 | 148 |
|
98 | | -*top-k*: How many chunks we want to retrieve. Default is 10. |
| 149 | +This project is licensed under the Apache License 2.0. See the LICENSE file for details. |
99 | 150 |
|
100 | | -*output_dir*: To save the .csv file. |
| 151 | +Contributions are welcome; see the development setup above for code quality tooling. |