Merged
5 changes: 2 additions & 3 deletions .gitignore
@@ -178,10 +178,9 @@ llm_cache/

# Evaluation output folder
eval_output*/
lsc_eval/eval_output*/

# DeepEval telemetry and configuration
lsc_eval/.deepeval/
.deepeval/

# Keeping experimental changes here
wip*/
wip*/
214 changes: 133 additions & 81 deletions README.md
@@ -1,100 +1,152 @@
# Lightspeed Core Evaluation
Evaluation tooling for the lightspeed-core project
# LightSpeed Evaluation Framework

## Installation
- **Requires Python 3.11**
- Install `uv`
- Check `uv --version` is working
- If running Python 3.11 from a `venv`, make sure no conflicting packages are installed. If problems occur, create a clean venv for Python 3.11 and `uv`.
- Run `uv sync`
- Optional: For development, run `make install-tools`
  + If `uv` is not installed, this installs it by running `pip install uv` in your current Python environment.
A comprehensive framework for evaluating GenAI applications.

## 🎯 Key Features

## Description
Currently we have 2 types of evaluations.
1. `consistency`: Compares responses against a ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in that provider+model's responses. A combination of similarity distances is used to calculate the final score, and cut-off scores flag deviations. This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a [json file](eval_data/question_answer_pair.json)
- **Multi-Framework Support**: Seamlessly use metrics from Ragas, DeepEval, and custom implementations
- **Turn & Conversation-Level Evaluation**: Support for both individual queries and multi-turn conversations
- **LLM Provider Flexibility**: OpenAI, Anthropic, Watsonx, Azure, Gemini, Ollama via LiteLLM
- **Flexible Configuration**: Configurable environment & metric metadata
- **Rich Output**: CSV, JSON, TXT reports + visualization graphs (pass rates, distributions, heatmaps)
- **Early Validation**: Catch configuration errors before expensive LLM calls
- **Statistical Analysis**: Statistics for every metric with score distribution analysis
- **Agent Evaluation**: Framework for evaluating AI agent performance (future integration planned)

2. `model`: Compares responses against a single ground-truth answer, for more than one provider+model at a time. This creates a json summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, additional QnAs can be provided via an optional parquet file. [Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.
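As a hedged sketch (not the project's actual scoring code), the consistency flow above — compute a similarity between the API response and the pre-defined answer, then apply a cut-off score — could look like this; the real evaluation combines several distances (e.g. cosine and euclidean) over text embeddings:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def consistency_check(score: float, cutoff: float = 0.8) -> str:
    """Flag a response whose similarity to the ground truth drops below the cutoff."""
    return "ok" if score >= cutoff else "deviation"

# Toy example with 3-dimensional "embeddings" (illustrative values only)
truth = [0.9, 0.1, 0.0]
answer = [0.85, 0.15, 0.05]
score = cosine_similarity(truth, answer)
print(consistency_check(score))  # → ok
```

The cutoff value here is illustrative; the actual cut-off scores are configured per metric in the evaluation.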
## 🚀 Quick Start

![Evaluation Metric & flow](assets/response_eval_flow.png)
### Installation

**Notes**
- QnAs should `not` be used for model training or tuning; they were created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs, so some questions/answers may not be entirely correct. We are continuously verifying both questions and answers manually; if you find a QnA pair that should be modified or removed, please create a PR.
- The OLS API should be up and running with all required provider+model combinations configured.
- When running both consistency and model evaluation together, the *model* evaluation first checks the .csv file generated by the *consistency* evaluation to avoid multiple API calls for the same query; the API is called only when a response is not present in the csv file.
```bash
# From Git
pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git

# Local Development
pip install uv
uv sync
```

### e2e test case

These evaluations are also part of **e2e test cases**. Currently the *consistency* evaluation is primarily used to gate PRs. The final e2e suite will also invoke the *model* evaluation, which uses the .csv files generated by earlier suites; if any file is not present, the last suite will fail.

### Usage
```
uv run evaluate
```

### Input Data/QnA pool
[Json file](eval_data/question_answer_pair.json)

[Sample QnA set (parquet)](eval_data/interview_qna_30_per_title.parquet)

Please refer to the above files for the expected structure and add new data accordingly.

### Arguments
**eval_type**: Controls which evaluation to run. Currently there are 3 options.
1. `consistency` -> Compares model-specific answers for QnAs provided in the json file
2. `model` -> Compares a set of models based on their responses and generates a summary report. Additional QnAs can be provided in parquet format, along with the json file.
3. `all` -> Both of the above evaluations.

**eval_api_url**: OLS API URL. Default is `http://localhost:8080`. If OLS is deployed in a cluster, pass the cluster API URL.

**eval_api_token_file**: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.

**eval_scenario**: Identifies which pre-defined answers need to be compared. Values can be `with_rag`, `without_rag`. Currently evaluation is always done for the API with RAG.

**eval_query_ids**: Optional set of query ids to evaluate. By default all queries are processed.

**eval_provider_model_id**: Set of provider/model combination ids to compare.

**qna_pool_file**: Applicable only to `model` evaluation. Path to a parquet file with additional QnAs. Default is None.

**eval_out_dir**: Directory where output csv/json files are saved.

**eval_metrics**: By default all scores/metrics are calculated; this decides which scores are used to create the graph. This is a list of metrics, e.g. cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.

**judge_provider / judge_model**: Provider / model for the judge LLM, required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through the config yaml file. [Sample provider/model configuration](../../examples/olsconfig.yaml)

**eval_modes**: Apart from the OLS api, we may want to evaluate the vanilla model, or with just OLS parameters/prompt/RAG, to establish baseline scores. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api).
### Basic Usage

```bash
# Set API key
export OPENAI_API_KEY="your-key"

# Run evaluation
lightspeed-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
```

## 📊 Supported Metrics

### Turn-Level (Single Query)
- **Ragas**
- Response Evaluation
- `faithfulness`
- `response_relevancy`
- Context Evaluation
- `context_recall`
- `context_relevance`
- `context_precision_without_reference`
- `context_precision_with_reference`
- **Custom**
- Response Evaluation
- `answer_correctness`

### Conversation-Level (Multi-turn)
- **DeepEval**
- `conversation_completeness`
- `conversation_relevancy`
- `knowledge_retention`
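The custom `answer_correctness` metric's actual scoring lives in the framework; as a rough, hypothetical illustration of what a turn-level correctness score can look like, here is a token-overlap F1 sketch (not the framework's implementation):

```python
def token_f1(response: str, expected: str) -> float:
    """Token-overlap F1 between a response and the expected response.
    A hypothetical stand-in for a correctness score, not the framework's metric."""
    resp_tokens = set(response.lower().split())
    exp_tokens = set(expected.lower().split())
    common = resp_tokens & exp_tokens
    if not common:
        return 0.0
    precision = len(common) / len(resp_tokens)
    recall = len(common) / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("OpenShift is a Kubernetes platform",
               "OpenShift is a container platform"))  # → 0.8
```

A score like this would then be compared against the configured threshold (e.g. 0.75) to decide PASS/FAIL.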

## ⚙️ Configuration

### System Config (`config/system.yaml`)
```yaml
llm:
  provider: "openai"
  model: "gpt-4o-mini"
  temperature: 0.0
  timeout: 120

metrics_metadata:
  turn_level:
    "ragas:faithfulness":
      threshold: 0.8
      type: "turn"
      framework: "ragas"

  conversation_level:
    "deepeval:conversation_completeness":
      threshold: 0.8
      type: "conversation"
      framework: "deepeval"
```
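The "Early Validation" feature catches configuration errors before expensive LLM calls are made. A minimal sketch of that idea, assuming the metric metadata above has already been loaded from YAML into a dict (the function name and error-message format are illustrative, not the framework's API):

```python
KNOWN_FRAMEWORKS = {"ragas", "deepeval", "custom"}

def validate_metric_metadata(metadata: dict) -> list[str]:
    """Collect configuration errors before any (expensive) LLM call is made."""
    errors = []
    for level in ("turn_level", "conversation_level"):
        for name, meta in metadata.get(level, {}).items():
            threshold = meta.get("threshold")
            # Thresholds are scores, so they must fall in [0, 1]
            if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
                errors.append(f"{name}: threshold must be in [0, 1]")
            if meta.get("framework") not in KNOWN_FRAMEWORKS:
                errors.append(f"{name}: unknown framework {meta.get('framework')!r}")
    return errors

metadata = {
    "turn_level": {"ragas:faithfulness": {"threshold": 0.8, "framework": "ragas"}},
    "conversation_level": {"deepeval:conversation_completeness": {"threshold": 1.5, "framework": "deepeval"}},
}
print(validate_metric_metadata(metadata))  # flags the out-of-range threshold
```

Running validation up front means a typo in `config/system.yaml` fails fast instead of after a batch of judge-LLM calls.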

### Outputs
The evaluation scripts create the following files:
- CSV file with responses for the given provider/model & modes.
- Response evaluation results with scores (for the consistency check).
- Final csv file with all results, json score summary & graph (for the model evaluation)
### Evaluation Data (`config/evaluation_data.yaml`)
```yaml
- conversation_group_id: "test_conversation"
  description: "Sample evaluation"

  # Turn-level metrics (empty list = skip turn evaluation)
  turn_metrics:
    - "ragas:faithfulness"
    - "custom:answer_correctness"

  # Turn-level metrics metadata (threshold + other properties)
  turn_metrics_metadata:
    "ragas:response_relevancy":
      threshold: 0.8
      weight: 1.0
    "custom:answer_correctness":
      threshold: 0.75

  # Conversation-level metrics (empty list = skip conversation evaluation)
  conversation_metrics:
    - "deepeval:conversation_completeness"

  turns:
    - turn_id: 1
      query: "What is OpenShift?"
      response: "Red Hat OpenShift powers the entire application lifecycle...."
      contexts:
        - content: "Red Hat OpenShift powers...."
      expected_response: "Red Hat OpenShift...."
```
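A minimal sketch of the "empty list = skip" semantics noted in the comments above (the function name is hypothetical; the framework decides this internally):

```python
def plan_evaluation(conversation: dict) -> list[str]:
    """Decide which evaluation stages to run for one conversation group.
    An empty metric list means that stage is skipped entirely."""
    stages = []
    if conversation.get("turn_metrics"):
        stages.append("turn")
    if conversation.get("conversation_metrics"):
        stages.append("conversation")
    return stages

conv = {
    "conversation_group_id": "test_conversation",
    "turn_metrics": ["ragas:faithfulness", "custom:answer_correctness"],
    "conversation_metrics": [],  # empty list = skip conversation evaluation
}
print(plan_evaluation(conv))  # → ['turn']
```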

[Evaluation Result](eval_data/result/README.md)
## 📈 Output & Visualization

### Generated Reports
- **CSV**: Detailed results with status, scores, reasons
- **JSON**: Summary statistics with score distributions
- **TXT**: Human-readable summary
- **PNG**: 4 visualization types (pass rates, score distributions, heatmaps, status breakdown)

### Key Metrics in Output
- **PASS/FAIL/ERROR**: Status based on thresholds
- **Actual Reasons**: DeepEval provides LLM-generated explanations; custom metrics provide detailed reasoning
- **Score Statistics**: Mean, median, standard deviation, min/max for every metric
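The per-metric statistics can be reproduced with Python's standard `statistics` module; a minimal sketch (the field names are illustrative, not the exact JSON schema):

```python
import statistics

def score_summary(scores: list[float]) -> dict:
    """Per-metric summary statistics of the kind reported in the JSON/TXT outputs."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }

# Hypothetical faithfulness scores across four turns
faithfulness_scores = [0.92, 0.81, 0.77, 0.95]
print(score_summary(faithfulness_scores))
```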

## 🧪 Development

### Development Tools
```bash
uv sync --group dev
uv run black .
uv run ruff check .
uv run mypy .
uv run pyright .
uv run pylint .
uv run pytest tests --cov=src
```

## Agent Evaluation
For a detailed walkthrough of the new agent-evaluation framework, refer to
[lsc_agent_eval/README.md](lsc_agent_eval/README.md)

## RAG retrieval script
```
python -m scripts.evaluation.query_rag
```
This generates a .csv file of retrieved chunks (with similarity scores) for a given set of queries. It is not part of the actual evaluation, but it is useful for spot-checking the text we send to LLMs as context, which may explain deviations in the response.
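The spot check above can be sketched as ranking chunk embeddings by similarity to the query embedding (toy 2-dimensional vectors here; the real script uses the configured embedding model and RAG index):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def retrieve(query_vec, chunks, top_k=2):
    """Rank (text, embedding) chunks by similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical chunk embeddings
chunks = [
    ("Networking overview", [0.1, 0.9]),
    ("Install OpenShift", [0.9, 0.2]),
    ("Upgrade cluster", [0.7, 0.6]),
]
for score, text in retrieve([1.0, 0.1], chunks):
    print(f"{score:.3f}  {text}")
```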

#### Arguments
*db-path*: Path to the RAG index

*product-index*: RAG index ID

*model-path*: Path or name of the embedding model
*queries*: Set of queries separated by spaces. If not passed, default queries are used.

*top-k*: Number of chunks to retrieve. Default is 10.

*output_dir*: Directory where the .csv file is saved.

## Generate answers (optional - for creating test data)
For generating answers (optional) refer to [README-generate-answers](README-generate-answers.md)

## 📄 License & Contributing

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Contributions welcome - see development setup above for code quality tools.
99 changes: 99 additions & 0 deletions archive/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,99 @@
# Lightspeed Core Evaluation
Evaluation tooling for the lightspeed-core project. [Refer to the latest README.md](../README.md).

**This is not maintained anymore.**

## Installation
- **Requires Python 3.11**
- Install `pdm`
- Check `pdm --version` is working
- If running Python 3.11 from a `venv`, make sure no conflicting packages are installed. If problems occur, create a clean venv for Python 3.11 and `pdm`.
- Run `pdm install`
- Optional: For development, run `make install-tools`
  + If `pdm` is not installed, this installs it by running `pip install pdm` in your current Python environment.


## Description
Currently we have 2 types of evaluations.
1. `consistency`: Compares responses against a ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in that provider+model's responses. A combination of similarity distances is used to calculate the final score, and cut-off scores flag deviations. This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a [json file](../eval_data/question_answer_pair.json)

2. `model`: Compares responses against a single ground-truth answer, for more than one provider+model at a time. This creates a json summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, additional QnAs can be provided via an optional parquet file. [Sample QnA set (parquet)](../eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.

![Evaluation Metric & flow](assets/response_eval_flow.png)

**Notes**
- QnAs should `not` be used for model training or tuning; they were created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs, so some questions/answers may not be entirely correct. We are continuously verifying both questions and answers manually; if you find a QnA pair that should be modified or removed, please create a PR.
- The OLS API should be up and running with all required provider+model combinations configured.
- When running both consistency and model evaluation together, the *model* evaluation first checks the .csv file generated by the *consistency* evaluation to avoid multiple API calls for the same query; the API is called only when a response is not present in the csv file.

### e2e test case

These evaluations are also part of **e2e test cases**. Currently the *consistency* evaluation is primarily used to gate PRs. The final e2e suite will also invoke the *model* evaluation, which uses the .csv files generated by earlier suites; if any file is not present, the last suite will fail.

### Usage
```bash
pdm run evaluate
```

### Input Data/QnA pool
[Json file](../eval_data/question_answer_pair.json)

[Sample QnA set (parquet)](../eval_data/interview_qna_30_per_title.parquet)

Please refer to the above files for the expected structure and add new data accordingly.

## Arguments
**eval_type**: Controls which evaluation to run. Currently there are 3 options.
1. `consistency` -> Compares model-specific answers for QnAs provided in the json file
2. `model` -> Compares a set of models based on their responses and generates a summary report. Additional QnAs can be provided in parquet format, along with the json file.
3. `all` -> Both of the above evaluations.

**eval_api_url**: OLS API URL. Default is `http://localhost:8080`. If OLS is deployed in a cluster, pass the cluster API URL.

**eval_api_token_file**: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.

**eval_scenario**: Identifies which pre-defined answers need to be compared. Values can be `with_rag`, `without_rag`. Currently evaluation is always done for the API with RAG.

**eval_query_ids**: Optional set of query ids to evaluate. By default all queries are processed.

**eval_provider_model_id**: Set of provider/model combination ids to compare.

**qna_pool_file**: Applicable only to `model` evaluation. Path to a parquet file with additional QnAs. Default is None.

**eval_out_dir**: Directory where output csv/json files are saved.

**eval_metrics**: By default all scores/metrics are calculated; this decides which scores are used to create the graph. This is a list of metrics, e.g. cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.

**judge_provider / judge_model**: Provider / model for the judge LLM, required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through the config yaml file. [Sample provider/model configuration](https://github.com/road-core/service/blob/main/examples/rcsconfig.yaml)

**eval_modes**: Apart from the OLS api, we may want to evaluate the vanilla model, or with just OLS parameters/prompt/RAG, to establish baseline scores. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag, & ols (actual api).

### Outputs
The evaluation scripts create the following files:
- CSV file with responses for the given provider/model & modes.
- Response evaluation results with scores (for the consistency check).
- Final csv file with all results, json score summary & graph (for the model evaluation)

[Evaluation Result](example_result/README.md)


# RAG retrieval script
```
python -m lightspeed_core_evaluation.evaluation.query_rag
```
This generates a .csv file of retrieved chunks (with similarity scores) for a given set of queries. It is not part of the actual evaluation, but it is useful for spot-checking the text we send to LLMs as context, which may explain deviations in the response.

#### Arguments
*db-path*: Path to the RAG index

*product-index*: RAG index ID

*model-path*: Path or name of the embedding model

*queries*: Set of queries separated by spaces. If not passed, default queries are used.

*top-k*: Number of chunks to retrieve. Default is 10.

*output_dir*: Directory where the .csv file is saved.
File renamed without changes
@@ -11,7 +11,7 @@
- (watsonx) ibm/granite-3-2-8b-instruct (API Version: 2025-04-02)
- (azure) gpt-4o-mini (Model Version: 2024-07-18, API Version: 2024-02-15-preview)
- Judge provider/model (LLM based eval): (watsonx) llama-3-1-8b-instruct
- QnA evaluation dataset: [QnAs from OCP doc](../ocp_doc_qna-edited.parquet)
- QnA evaluation dataset: [QnAs from OCP doc](../../eval_data/ocp_doc_qna-edited.parquet)
- API run mode: without tool calling (streaming internally)
- RAG SHA: 56269892dcf5279b9857c04918e8fba587008990b09146e907d7af9303bd6c9e
- OCP doc: 4.18