
Commit aa729db

make new lsc eval as primary eval package
1 parent c7cfbe4 commit aa729db

File tree

37 files changed: +1124 -4242 lines changed


.gitignore

Lines changed: 2 additions & 3 deletions

@@ -178,10 +178,9 @@ llm_cache/
 
 # Evaluation output folder
 eval_output*/
-lsc_eval/eval_output*/
 
 # DeepEval telemetry and configuration
-lsc_eval/.deepeval/
+.deepeval/
 
 # Keeping experimental changes here
-wip*/
+wip*/

README.md

Lines changed: 132 additions & 81 deletions

@@ -1,100 +1,151 @@
-# Lightspeed Core Evaluation
-Evaluation tooling for lightspeed-core project
+# LightSpeed Evaluation Framework
 
-## Installation
-- **Requires Python 3.11**
-- Install `uv`
-- Check `uv --version` is working
-- If running Python 3.11 from `venv`, make sure no conflicting packages are installed. In case of problems create a clean venv for Python 3.11 and `uv`.
-- Run `uv sync`
-- Optional: For development, run `make install-tools`
-  + if `uv` is not installed this will install `uv` by running `pip install uv` in your current Python environment.
+A comprehensive framework for evaluating GenAI applications.
 
+## 🎯 Key Features
 
-## Description
-Currently we have 2 types of evaluations.
-1. `consistency`: Ability to compare responses against ground-truth answer for specific provider+model. Objective of this evaluation is to flag any variation in specific provider+model response. Currently a combination of similarity distances are used to calculate final score. Cut-off scores are used to flag any deviations. This also stores a .csv file with query, pre-defined answer, API response & score. Input for this is a [json file](eval_data/question_answer_pair.json)
+- **Multi-Framework Support**: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
+- **Turn & Conversation-Level Evaluation**: Support for both individual queries and multi-turn conversations
+- **LLM Provider Flexibility**: OpenAI, Anthropic, Watsonx, Azure, Gemini, Ollama via LiteLLM
+- **Flexible Configuration**: Configurable environment & metric metadata
+- **Rich Output**: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
+- **Early Validation**: Catch configuration errors before expensive LLM calls
+- **Statistical Analysis**: Statistics for every metric with score distribution analysis
+- **Agent Evaluation**: Framework for evaluating AI agent performance (future integration planned)
 
-2. `model`: Ability to compare responses against single ground-truth answer. Here we can do evaluation for more than one provider+model at a time. This creates a json file as summary report with scores (f1-score) for each provider+model. Along with selected QnAs from above json file, we can also provide additional QnAs using a parquet file (optional). [Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.
+## 🚀 Quick Start
 
-![Evaluation Metric & flow](assets/response_eval_flow.png)
+### Installation
 
-**Notes**
-- QnAs should `not` be used for model training or tuning. This is created only for evaluation purpose.
-- QnAs were generated from OCP docs by LLMs. It is possible that some of the questions/answers are not entirely correct. We are constantly trying to verify both Questions & Answers manually. If you find any QnA pair to be modified or removed, please create a PR.
-- OLS API should be ready/live with all the required provider+model configured.
-- It is possible that we want to run both consistency and model evaluation together. To avoid multiple API calls for same query, *model* evaluation first checks .csv file generated by *consistency* evaluation. If response is not present in csv file, then only we call API to get the response.
+```bash
+# From Git
+pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git
 
-### e2e test case
-
-These evaluations are also part of **e2e test cases**. Currently *consistency* evaluation is parimarily used to gate PRs. Final e2e suite will also invoke *model* evaluation which will use .csv files generated by earlier suites, if any file is not present then last suite will fail.
-
-### Usage
-```
-uv run evaluate
+# Local Development
+uv sync
 ```
 
-### Input Data/QnA pool
-[Json file](eval_data/question_answer_pair.json)
-
-[Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet)
-
-Please refer above files for the structure, add new data accordingly.
-
-### Arguments
-**eval_type**: This will control which evaluation, we want to do. Currently we have 3 options.
-1. `consistency` -> Compares model specific answer for QnAs provided in json file
-2. `model` -> Compares set of models based on their response and generates a summary report. For this we can provide additional QnAs in parquet format, along with json file.
-3. `all` -> Both of the above evaluations.
-
-**eval_api_url**: OLS API url. Default is `http://localhost:8080`. If deployed in a cluster, then pass cluster API url.
-
-**eval_api_token_file**: Path to a text file containing OLS API token. Required, if OLS is deployed in cluster.
-
-**eval_scenario**: This is primarily required to indetify which pre-defined answers need to be compared. Values can be `with_rag`, `without_rag`. Currently we always do evaluation for the API with rag.
-
-**eval_query_ids**: Option to give set of query ids for evaluation. By default all queries are processed.
-
-**eval_provider_model_id**: We can provide set of provider/model combinations as ids for comparison.
-
-**qna_pool_file**: Applicable only for `model` evaluation. Provide file path to the parquet file having additional QnAs. Default is None.
-
-**eval_out_dir**: Directory, where output csv/json files will be saved.
-
-**eval_metrics**: By default all scores/metrics are calculated, but this decides which scores will be used to create the graph.
-This is a list of metrics. Ex: cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.
-
-**judge_provider / judge_model**: Provider / Model for judge LLM. This is required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through config yaml file. [Sample provider/model configuration](../../examples/olsconfig.yaml)
-
-**eval_modes**: Apart from OLS api, we may want to evaluate vanilla model or with just OLS paramaters/prompt/RAG so that we can have baseline score. This is a list of modes. Ex: vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api).
+### Basic Usage
+
+```bash
+# Set API key
+export OPENAI_API_KEY="your-key"
+
+# Run evaluation
+lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+```
+
+## 📊 Supported Metrics
+
+### Turn-Level (Single Query)
+- **Ragas**
+  - Response Evaluation
+    - `faithfulness`
+    - `response_relevancy`
+  - Context Evaluation
+    - `context_recall`
+    - `context_relevance`
+    - `context_precision_without_reference`
+    - `context_precision_with_reference`
+- **Custom**
+  - Response Evaluation
+    - `answer_correctness`
+
+### Conversation-Level (Multi-turn)
+- **DeepEval**
+  - `conversation_completeness`
+  - `conversation_relevancy`
+  - `knowledge_retention`
+
+## ⚙️ Configuration
+
+### System Config (`config/system.yaml`)
+```yaml
+llm:
+  provider: "openai"
+  model: "gpt-4o-mini"
+  temperature: 0.0
+  timeout: 120
+
+metrics_metadata:
+  turn_level:
+    "ragas:faithfulness":
+      threshold: 0.8
+      type: "turn"
+      framework: "ragas"
+
+  conversation_level:
+    "deepeval:conversation_completeness":
+      threshold: 0.8
+      type: "conversation"
+      framework: "deepeval"
+```
 
-### Outputs
-Evaluation scripts creates below files.
-- CSV file with response for given provider/model & modes.
-- response evaluation result with scores (for consistency check).
-- Final csv file with all results, json score summary & graph (for model evaluation)
+### Evaluation Data (`config/evaluation_data.yaml`)
+```yaml
+- conversation_group_id: "test_conversation"
+  description: "Sample evaluation"
+
+  # Turn-level metrics (empty list = skip turn evaluation)
+  turn_metrics:
+    - "ragas:faithfulness"
+    - "custom:answer_correctness"
+
+  # Turn-level metrics metadata (threshold + other properties)
+  turn_metrics_metadata:
+    "ragas:response_relevancy":
+      threshold: 0.8
+      weight: 1.0
+    "custom:answer_correctness":
+      threshold: 0.75
+
+  # Conversation-level metrics (empty list = skip conversation evaluation)
+  conversation_metrics:
+    - "deepeval:conversation_completeness"
+
+  turns:
+    - turn_id: 1
+      query: "What is OpenShift?"
+      response: "Red Hat OpenShift powers the entire application lifecycle...."
+      contexts:
+        - content: "Red Hat OpenShift powers...."
+      expected_response: "Red Hat OpenShift...."
+```
 
-[Evaluation Result](eval_data/result/README.md)
+## 📈 Output & Visualization
+
+### Generated Reports
+- **CSV**: Detailed results with status, scores, reasons
+- **JSON**: Summary statistics with score distributions
+- **TXT**: Human-readable summary
+- **PNG**: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)
+
+### Key Metrics in Output
+- **PASS/FAIL/ERROR**: Status based on thresholds
+- **Actual Reasons**: DeepEval provides LLM-generated explanations, Custom metrics provide detailed reasoning
+- **Score Statistics**: Mean, median, standard deviation, min/max for every metric
+
+## 🧪 Development
+
+### Development Tools
+```bash
+uv sync --group dev
+uv run black .
+uv run ruff check .
+uv run mypy .
+uv run pyright .
+uv run pylint .
+uv run pytest tests --cov=src
+```
 
 ## Agent Evaluation
 For a detailed walkthrough of the new agent-evaluation framework, refer
 [lsc_agent_eval/README.md](lsc_agent_eval/README.md)
 
-## RAG retrieval script
-```
-python -m scripts.evaluation.query_rag
-```
-This is used to generate a .csv file having retrieved chunks for given set of queries with similarity score. This is not part of actual evaluation. But useful to do a spot check to understand the text that we send to LLMs as context (this may explain any deviation in the response)
-
-#### Arguments
-*db-path*: Path to the RAG index
-
-*product-index*: RAG index ID
-
-*model-path*: Path or name of the embedding model
+## Generate answers (optional - for creating test data)
+For generating answers (optional) refer [README-generate-answers](README-generate-answers.md)
 
-*queries*: Set of queries separated by space. If not passed default queries are used.
+## 📄 License & Contributing
 
-*top-k*: How many chunks we want to retrieve. Default is 10.
+This project is licensed under the Apache License 2.0. See the LICENSE file for details.
 
-*output_dir*: To save the .csv file.
+Contributions welcome - see development setup above for code quality tools.
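
The new README's output section describes threshold-based PASS/FAIL status and per-metric summary statistics (mean, median, standard deviation, min/max). A minimal sketch of that logic in plain Python, illustrative only and not taken from the framework's code:

```python
# Illustrative sketch of threshold gating and per-metric statistics as described
# in the README's output section; the framework's actual implementation may differ.
from statistics import mean, median, pstdev


def status(score: float, threshold: float) -> str:
    """Return PASS or FAIL for one metric score against its configured threshold."""
    return "PASS" if score >= threshold else "FAIL"


def summarize(scores: list[float]) -> dict[str, float]:
    """Summary statistics reported for every metric."""
    return {
        "mean": mean(scores),
        "median": median(scores),
        "std_dev": pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


faithfulness_scores = [0.91, 0.72, 0.85]
print([status(s, 0.8) for s in faithfulness_scores])  # ['PASS', 'FAIL', 'PASS']
print(summarize(faithfulness_scores))
```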

archive/pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -72,4 +72,4 @@ requires = ["pdm-backend"]
 build-backend = "pdm.backend"
 
 [tool.pdm]
-distribution = true
+distribution = true

lsc_eval/config/evaluation_data.yaml renamed to config/evaluation_data.yaml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-# LSC Evaluation Framework - Sample/Mock Data
+# LightSpeed Evaluation Framework - Sample/Mock Data
 
 - conversation_group_id: "conv_group_1"
   description: "conversation group description"
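
Both this sample file and the README example use `framework:metric` identifiers (for example `ragas:faithfulness`). A minimal sketch of reading the renamed file, assuming PyYAML and the field names shown in the samples; this is not the framework's own loader:

```python
# Hypothetical loader sketch (not the framework's code): reads config/evaluation_data.yaml
# and splits the "framework:metric" identifiers shown in the sample data.
import yaml

with open("config/evaluation_data.yaml", encoding="utf-8") as fh:
    conversation_groups = yaml.safe_load(fh)

for group in conversation_groups:
    print(group["conversation_group_id"])
    for metric_id in group.get("turn_metrics", []):
        framework, metric = metric_id.split(":", 1)  # e.g. "ragas", "faithfulness"
        print(f"  turn metric: {metric} via {framework}")
```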

lsc_eval/config/system.yaml renamed to config/system.yaml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-# LSC Evaluation Framework Configuration
+# LightSpeed Evaluation Framework Configuration
 
 # LLM Configuration
 llm:

lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py

Lines changed: 4 additions & 2 deletions

@@ -99,8 +99,10 @@ def evaluate_response(self, prompt: str) -> Optional[str]:
         choices = getattr(response, "choices", None)
         if choices and len(choices) > 0:
             message = getattr(
-                choices[0], "message", None
-            )  # pylint: disable=unsubscriptable-object
+                choices[0],  # pylint: disable=unsubscriptable-object
+                "message",
+                None,
+            )
             if message:
                 content = getattr(message, "content", None)
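
The reformatted call keeps the same defensive pattern: each level of `response.choices[0].message.content` is read with `getattr` and a `None` default. A simplified standalone sketch of that pattern (not the module's full code):

```python
# Simplified sketch of the defensive attribute access shown in the diff above;
# the real judge.py wraps this in a class method with additional error handling.
from typing import Any, Optional


def extract_content(response: Any) -> Optional[str]:
    """Safely pull choices[0].message.content from an LLM response object."""
    choices = getattr(response, "choices", None)
    if choices and len(choices) > 0:
        message = getattr(choices[0], "message", None)
        if message:
            content = getattr(message, "content", None)
            if content:
                return str(content).strip()
    return None
```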
