Merged
5 changes: 2 additions & 3 deletions .gitignore
@@ -178,10 +178,9 @@ llm_cache/

# Evaluation output folder
eval_output*/
lsc_eval/eval_output*/

# DeepEval telemetry and configuration
lsc_eval/.deepeval/
.deepeval/

# Keeping experimental changes here
wip*/
wip*/
214 changes: 133 additions & 81 deletions README.md
@@ -1,100 +1,152 @@
# Lightspeed Core Evaluation
Evaluation tooling for the lightspeed-core project
# LightSpeed Evaluation Framework

## Installation
- **Requires Python 3.11**
- Install `uv`
- Check `uv --version` is working
- If running Python 3.11 from a `venv`, make sure no conflicting packages are installed. If problems occur, create a clean venv for Python 3.11 and `uv`.
- Run `uv sync`
- Optional: For development, run `make install-tools`
  + If `uv` is not installed, this installs it by running `pip install uv` in your current Python environment.
A comprehensive framework for evaluating GenAI applications.

## 🎯 Key Features

## Description
Currently we have 2 types of evaluations.
1. `consistency`: Compares responses against a ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in that provider+model's responses. A combination of similarity distances is used to calculate the final score, and cut-off scores flag deviations. This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a [json file](eval_data/question_answer_pair.json)
- **Multi-Framework Support**: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- **Turn & Conversation-Level Evaluation**: Support for both individual queries and multi-turn conversations
- **LLM Provider Flexibility**: OpenAI, Anthropic, Watsonx, Azure, Gemini, Ollama via LiteLLM
- **Flexible Configuration**: Configurable environment & metric metadata
- **Rich Output**: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- **Early Validation**: Catch configuration errors before expensive LLM calls
- **Statistical Analysis**: Statistics for every metric with score distribution analysis
- **Agent Evaluation**: Framework for evaluating AI agent performance (future integration planned)

2. `model`: Compares responses against a single ground-truth answer, for more than one provider+model at a time. This creates a json summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, additional QnAs can be provided via an optional parquet file. [Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.
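As a hedged sketch (not the project's actual scoring code), the consistency flow above — compute a similarity between the API response and the pre-defined answer, then apply a cut-off score — could look like this; the real evaluation combines several distances (e.g. cosine and euclidean) over text embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def consistency_check(score: float, cutoff: float = 0.8) -> str:
    """Flag a response whose similarity to the ground truth drops below the cutoff."""
    return "ok" if score >= cutoff else "deviation"

# Toy example with 3-dimensional "embeddings" (illustrative values only)
truth = [0.9, 0.1, 0.0]
answer = [0.85, 0.15, 0.05]
score = cosine_similarity(truth, answer)
print(consistency_check(score))  # → ok
```

The cutoff value here is illustrative; the actual cut-off scores are configured per metric in the evaluation.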
## 🚀 Quick Start

![Evaluation Metric & flow](assets/response_eval_flow.png)
### Installation

**Notes**
- QnAs should `not` be used for model training or tuning; they were created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs, so some questions/answers may not be entirely correct. We are continuously verifying both questions and answers manually; if you find a QnA pair that should be modified or removed, please create a PR.
- The OLS API should be up and running with all required provider+model combinations configured.
- When running both consistency and model evaluation together, the *model* evaluation first checks the .csv file generated by the *consistency* evaluation to avoid multiple API calls for the same query; the API is called only when a response is not present in the csv file.
```bash
# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git

# Local Development
pip install uv
uv sync
```

### e2e test case

These evaluations are also part of **e2e test cases**. Currently the *consistency* evaluation is primarily used to gate PRs. The final e2e suite will also invoke the *model* evaluation, which uses the .csv files generated by earlier suites; if any file is not present, the last suite will fail.

### Usage
```
uv run evaluate
```

### Input Data/QnA pool
[Json file](eval_data/question_answer_pair.json)

[Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet)

Please refer to the above files for the expected structure and add new data accordingly.

### Arguments
**eval_type**: Controls which evaluation to run. Currently there are 3 options.
1. `consistency` -> Compares model-specific answers for QnAs provided in the json file
2. `model` -> Compares a set of models based on their responses and generates a summary report. Additional QnAs can be provided in parquet format, along with the json file.
3. `all` -> Both of the above evaluations.

**eval_api_url**: OLS API URL. Default is `http://localhost:8080`. If OLS is deployed in a cluster, pass the cluster API URL.

**eval_api_token_file**: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.

**eval_scenario**: Identifies which pre-defined answers need to be compared. Values can be `with_rag`, `without_rag`. Currently evaluation is always done for the API with RAG.

**eval_query_ids**: Optional set of query ids to evaluate. By default all queries are processed.

**eval_provider_model_id**: Set of provider/model combination ids to compare.

**qna_pool_file**: Applicable only to `model` evaluation. Path to a parquet file with additional QnAs. Default is None.

**eval_out_dir**: Directory where output csv/json files are saved.

**eval_metrics**: By default all scores/metrics are calculated; this decides which scores are used to create the graph. This is a list of metrics, e.g. cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.

**judge_provider / judge_model**: Provider / model for the judge LLM, required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through the config yaml file. [Sample provider/model configuration](../../examples/olsconfig.yaml)

**eval_modes**: Apart from the OLS api, we may want to evaluate the vanilla model, or with just OLS parameters/prompt/RAG, to establish baseline scores. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api).
### Basic Usage

```bash
# Set API key
export OPENAI_API_KEY="your-key"

# Run evaluation
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
```

## 📊 Supported Metrics

### Turn-Level (Single Query)
- **Ragas**
- Response Evaluation
- `faithfulness`
- `response_relevancy`
- Context Evaluation
- `context_recall`
- `context_relevance`
- `context_precision_without_reference`
- `context_precision_with_reference`
- **Custom**
- Response Evaluation
- `answer_correctness`

### Conversation-Level (Multi-turn)
- **DeepEval**
- `conversation_completeness`
- `conversation_relevancy`
- `knowledge_retention`
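The custom `answer_correctness` metric's actual scoring lives in the framework; as a rough, hypothetical illustration of what a turn-level correctness score can look like, here is a token-overlap F1 sketch (not the framework's implementation):

```python
def token_f1(response: str, expected: str) -> float:
    """Token-overlap F1 between a response and the expected response.
    A hypothetical stand-in for a correctness score, not the framework's metric."""
    resp_tokens = set(response.lower().split())
    exp_tokens = set(expected.lower().split())
    common = resp_tokens & exp_tokens
    if not common:
        return 0.0
    precision = len(common) / len(resp_tokens)
    recall = len(common) / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("OpenShift is a Kubernetes platform",
               "OpenShift is a container platform"))  # → 0.8
```

A score like this would then be compared against the configured threshold (e.g. 0.75) to decide PASS/FAIL.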

## ⚙️ Configuration

### System Config (`config/system.yaml`)
```yaml
llm:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.0
  timeout: 120

metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
      type: "turn"
      framework: "ragas"

  conversation_level:
    "deepeval:conversation_completeness":
      threshold: 0.8
      type: "conversation"
      framework: "deepeval"
```
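The "Early Validation" feature catches configuration errors before expensive LLM calls are made. A minimal sketch of that idea, assuming the metric metadata above has already been loaded from YAML into a dict (the function name and error-message format are illustrative, not the framework's API):

```python
KNOWN_FRAMEWORKS = {"ragas", "deepeval", "custom"}

def validate_metric_metadata(metadata: dict) -> list[str]:
    """Collect configuration errors before any (expensive) LLM call is made."""
    errors = []
    for level in ("turn_level", "conversation_level"):
        for name, meta in metadata.get(level, {}).items():
            threshold = meta.get("threshold")
            # Thresholds are scores, so they must fall in [0, 1]
            if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
                errors.append(f"{name}: threshold must be in [0, 1]")
            if meta.get("framework") not in KNOWN_FRAMEWORKS:
                errors.append(f"{name}: unknown framework {meta.get('framework')!r}")
    return errors

metadata = {
    "turn_level": {"ragas:faithfulness": {"threshold": 0.8, "framework": "ragas"}},
    "conversation_level": {"deepeval:conversation_completeness": {"threshold": 1.5, "framework": "deepeval"}},
}
print(validate_metric_metadata(metadata))  # flags the out-of-range threshold
```

Running validation up front means a typo in `config/system.yaml` fails fast instead of after a batch of judge-LLM calls.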

### Outputs
The evaluation scripts create the following files:
- CSV file with responses for the given provider/model & modes.
- Response evaluation results with scores (for the consistency check).
- Final csv file with all results, json score summary & graph (for the model evaluation)
### Evaluation Data (`config/evaluation_data.yaml`)
```yaml
- conversation_group_id: "test_conversation"
  description: "Sample evaluation"

  # Turn-level metrics (empty list = skip turn evaluation)
  turn_metrics:
    - "ragas:faithfulness"
    - "custom:answer_correctness"

  # Turn-level metrics metadata (threshold + other properties)
  turn_metrics_metadata:
    "ragas:response_relevancy":
      threshold: 0.8
      weight: 1.0
    "custom:answer_correctness":
      threshold: 0.75

  # Conversation-level metrics (empty list = skip conversation evaluation)
  conversation_metrics:
    - "deepeval:conversation_completeness"

  turns:
    - turn_id: 1
      query: "What is OpenShift?"
      response: "Red Hat OpenShift powers the entire application lifecycle...."
      contexts:
        - content: "Red Hat OpenShift powers...."
      expected_response: "Red Hat OpenShift...."
```
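A minimal sketch of the "empty list = skip" semantics noted in the comments above (the function name is hypothetical; the framework decides this internally):

```python
def plan_evaluation(conversation: dict) -> list[str]:
    """Decide which evaluation stages to run for one conversation group.
    An empty metric list means that stage is skipped entirely."""
    stages = []
    if conversation.get("turn_metrics"):
        stages.append("turn")
    if conversation.get("conversation_metrics"):
        stages.append("conversation")
    return stages

conv = {
    "conversation_group_id": "test_conversation",
    "turn_metrics": ["ragas:faithfulness", "custom:answer_correctness"],
    "conversation_metrics": [],  # empty list = skip conversation evaluation
}
print(plan_evaluation(conv))  # → ['turn']
```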

[Evaluation Result](eval_data/result/README.md)
## 📈 Output & Visualization

### Generated Reports
- **CSV**: Detailed results with status, scores, reasons
- **JSON**: Summary statistics with score distributions
- **TXT**: Human-readable summary
- **PNG**: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)

### Key Metrics in Output
- **PASS/FAIL/ERROR**: Status based on thresholds
- **Actual Reasons**: DeepEval provides LLM-generated explanations; custom metrics provide detailed reasoning
- **Score Statistics**: Mean, median, standard deviation, min/max for every metric
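The per-metric statistics can be reproduced with Python's standard `statistics` module; a minimal sketch (the field names are illustrative, not the exact JSON schema):

```python
import statistics

def score_summary(scores: list[float]) -> dict:
    """Per-metric summary statistics of the kind reported in the JSON/TXT outputs."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical faithfulness scores across four turns
faithfulness_scores = [0.92, 0.81, 0.77, 0.95]
print(score_summary(faithfulness_scores))
```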

## 🧪 Development

### Development Tools
```bash
uv sync --group dev
uv run black .
uv run ruff check .
uv run mypy .
uv run pyright .
uv run pylint .
uv run pytest tests --cov=src
```

## Agent Evaluation
For a detailed walkthrough of the new agent-evaluation framework, refer to
[lsc_agent_eval/README.md](lsc_agent_eval/README.md)

## RAG retrieval script
```
python -m scripts.evaluation.query_rag
```
This generates a .csv file of retrieved chunks (with similarity scores) for a given set of queries. It is not part of the actual evaluation, but it is useful for spot-checking the text we send to LLMs as context, which may explain deviations in the response.
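The spot check above can be sketched as ranking chunk embeddings by similarity to the query embedding (toy 2-dimensional vectors here; the real script uses the configured embedding model and RAG index):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, chunks, top_k=2):
    """Rank (text, embedding) chunks by similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical chunk embeddings
chunks = [
    ("Networking overview", [0.1, 0.9]),
    ("Install OpenShift", [0.9, 0.2]),
    ("Upgrade cluster", [0.7, 0.6]),
]
for score, text in retrieve([1.0, 0.1], chunks):
    print(f"{score:.3f}  {text}")
```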

#### Arguments
*db-path*: Path to the RAG index

*product-index*: RAG index ID

*model-path*: Path or name of the embedding model
*queries*: Set of queries separated by spaces. If not passed, default queries are used.

*top-k*: Number of chunks to retrieve. Default is 10.

*output_dir*: Directory where the .csv file is saved.

## Generate answers (optional - for creating test data)
For generating answers (optional) refer to [README-generate-answers](README-generate-answers.md)

## 📄 License & Contributing

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Contributions welcome - see development setup above for code quality tools.
99 changes: 99 additions & 0 deletions archive/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Lightspeed Core Evaluation
Evaluation tooling for the lightspeed-core project. [Refer to the latest README.md](../README.md).

**This is not maintained anymore.**

## Installation
- **Requires Python 3.11**
- Install `pdm`
- Check `pdm --version` is working
- If running Python 3.11 from a `venv`, make sure no conflicting packages are installed. If problems occur, create a clean venv for Python 3.11 and `pdm`.
- Run `pdm install`
- Optional: For development, run `make install-tools`
  + If `pdm` is not installed, this installs it by running `pip install pdm` in your current Python environment.


## Description
Currently we have 2 types of evaluations.
1. `consistency`: Compares responses against a ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in that provider+model's responses. A combination of similarity distances is used to calculate the final score, and cut-off scores flag deviations. This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a [json file](../eval_data/question_answer_pair.json)

2. `model`: Compares responses against a single ground-truth answer, for more than one provider+model at a time. This creates a json summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, additional QnAs can be provided via an optional parquet file. [Sample QnA set (parquet)](../eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.

![Evaluation Metric & flow](assets/response_eval_flow.png)

**Notes**
- QnAs should `not` be used for model training or tuning; they were created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs, so some questions/answers may not be entirely correct. We are continuously verifying both questions and answers manually; if you find a QnA pair that should be modified or removed, please create a PR.
- The OLS API should be up and running with all required provider+model combinations configured.
- When running both consistency and model evaluation together, the *model* evaluation first checks the .csv file generated by the *consistency* evaluation to avoid multiple API calls for the same query; the API is called only when a response is not present in the csv file.

### e2e test case

These evaluations are also part of **e2e test cases**. Currently the *consistency* evaluation is primarily used to gate PRs. The final e2e suite will also invoke the *model* evaluation, which uses the .csv files generated by earlier suites; if any file is not present, the last suite will fail.

### Usage
```bash
pdm run evaluate
```

### Input Data/QnA pool
[Json file](../eval_data/question_answer_pair.json)

[Sample QnA set (parquet)](../eval_data/interview_qna_30_per_title.parquet)

Please refer to the above files for the expected structure and add new data accordingly.

## Arguments
**eval_type**: Controls which evaluation to run. Currently there are 3 options.
1. `consistency` -> Compares model-specific answers for QnAs provided in the json file
2. `model` -> Compares a set of models based on their responses and generates a summary report. Additional QnAs can be provided in parquet format, along with the json file.
3. `all` -> Both of the above evaluations.

**eval_api_url**: OLS API URL. Default is `http://localhost:8080`. If OLS is deployed in a cluster, pass the cluster API URL.

**eval_api_token_file**: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.

**eval_scenario**: Identifies which pre-defined answers need to be compared. Values can be `with_rag`, `without_rag`. Currently evaluation is always done for the API with RAG.

**eval_query_ids**: Optional set of query ids to evaluate. By default all queries are processed.

**eval_provider_model_id**: Set of provider/model combination ids to compare.

**qna_pool_file**: Applicable only to `model` evaluation. Path to a parquet file with additional QnAs. Default is None.

**eval_out_dir**: Directory where output csv/json files are saved.

**eval_metrics**: By default all scores/metrics are calculated; this decides which scores are used to create the graph. This is a list of metrics, e.g. cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.

**judge_provider / judge_model**: Provider / model for the judge LLM, required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through the config yaml file. [Sample provider/model configuration](https://github.com/road-core/service/blob/main/examples/rcsconfig.yaml)

**eval_modes**: Apart from the OLS api, we may want to evaluate the vanilla model, or with just OLS parameters/prompt/RAG, to establish baseline scores. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api).

### Outputs
The evaluation scripts create the following files:
- CSV file with responses for the given provider/model & modes.
- Response evaluation results with scores (for the consistency check).
- Final csv file with all results, json score summary & graph (for the model evaluation)

[Evaluation Result](example_result/README.md)


# RAG retrieval script
```
python -m lightspeed_core_evaluation.evaluation.query_rag
```
This generates a .csv file of retrieved chunks (with similarity scores) for a given set of queries. It is not part of the actual evaluation, but it is useful for spot-checking the text we send to LLMs as context, which may explain deviations in the response.

#### Arguments
*db-path*: Path to the RAG index

*product-index*: RAG index ID

*model-path*: Path or name of the embedding model

*queries*: Set of queries separated by spaces. If not passed, default queries are used.

*top-k*: Number of chunks to retrieve. Default is 10.

*output_dir*: Directory where the .csv file is saved.
File renamed without changes
@@ -11,7 +11,7 @@
- (watsonx) ibm/granite-3-2-8b-instruct (API Version: 2025-04-02)
- (azure) gpt-4o-mini (Model Version: 2024-07-18, API Version: 2024-02-15-preview)
- Judge provider/model (LLM based eval): (watsonx) llama-3-1-8b-instruct
- QnA evaluation dataset: [QnAs from OCP doc](../ocp_doc_qna-edited.parquet)
- QnA evaluation dataset: [QnAs from OCP doc](../../eval_data/ocp_doc_qna-edited.parquet)
- API run mode: without tool calling (streaming internally)
- RAG SHA: 56269892dcf5279b9857c04918e8fba587008990b09146e907d7af9303bd6c9e
- OCP doc: 4.18