
Commit c7cfbe4

archive old ols eval tool
1 parent ca9a863 commit c7cfbe4

28 files changed: +175 -1 lines changed

archive/README.md

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
# Lightspeed Core Evaluation
Evaluation tooling for the lightspeed-core project. [Refer to the latest README.md](../README.md).

**This is no longer maintained.**

## Installation
- **Requires Python 3.11**
- Install `pdm`
- Check that `pdm --version` works
- If running Python 3.11 from a `venv`, make sure no conflicting packages are installed. If problems occur, create a clean venv for Python 3.11 and reinstall `pdm` there.
- Run `pdm install`
- Optional: For development, run `make install-tools`
  - If `pdm` is not installed, this will install it by running `pip install pdm` in your current Python environment.
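
A typical sequence for the steps above might look like the following (a minimal sketch assuming `pip` is used to install `pdm` and that you want a fresh venv; adjust for your environment):

```
# optional: create a clean Python 3.11 venv
python3.11 -m venv .venv
source .venv/bin/activate

# install pdm and check it works
pip install pdm
pdm --version

# install project dependencies
pdm install

# optional: development tooling
make install-tools
```
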
## Description
Currently we have two types of evaluations:
1. `consistency`: Compares responses against the ground-truth answer for a specific provider+model. The objective of this evaluation is to flag any variation in a specific provider+model response. Currently a combination of similarity distances is used to calculate the final score, and cut-off scores are used to flag any deviations. This also stores a .csv file with the query, pre-defined answer, API response & score. Input for this is a [json file](../eval_data/question_answer_pair.json).

2. `model`: Compares responses against a single ground-truth answer. Here we can evaluate more than one provider+model at a time. This creates a json file as a summary report with scores (f1-score) for each provider+model. Along with selected QnAs from the above json file, we can also provide additional QnAs using a parquet file (optional). [Sample QnA set (parquet)](../eval_data/interview_qna_30_per_title.parquet) with 30 queries per OCP documentation title.

![Evaluation Metric & flow](assets/response_eval_flow.png)

**Notes**
- QnAs should `not` be used for model training or tuning. They were created only for evaluation purposes.
- QnAs were generated from OCP docs by LLMs. It is possible that some of the questions/answers are not entirely correct. We are constantly trying to verify both questions & answers manually. If you find any QnA pair that should be modified or removed, please create a PR.
- The OLS API should be ready/live with all the required provider+model combinations configured.
- It is possible that we want to run both consistency and model evaluations together. To avoid multiple API calls for the same query, the *model* evaluation first checks the .csv file generated by the *consistency* evaluation, and only calls the API for a response if it is not present in the csv file.

### e2e test case

These evaluations are also part of the **e2e test cases**. Currently the *consistency* evaluation is primarily used to gate PRs. The final e2e suite will also invoke the *model* evaluation, which uses the .csv files generated by earlier suites; if any file is not present, the last suite will fail.

### Usage
```
pdm run evaluate
```
### Input Data/QnA pool
[Json file](../eval_data/question_answer_pair.json)

[Sample QnA set (parquet)](../eval_data/interview_qna_30_per_title.parquet)

Please refer to the above files for the structure, and add new data accordingly.

### Arguments
**eval_type**: Controls which evaluation we want to run. Currently there are 3 options:
1. `consistency` -> Compares model-specific answers for the QnAs provided in the json file
2. `model` -> Compares a set of models based on their responses and generates a summary report. For this we can provide additional QnAs in parquet format, along with the json file.
3. `all` -> Both of the above evaluations.

**eval_api_url**: OLS API url. Default is `http://localhost:8080`. If deployed in a cluster, pass the cluster API url.

**eval_api_token_file**: Path to a text file containing the OLS API token. Required if OLS is deployed in a cluster.

**eval_scenario**: This is primarily required to identify which pre-defined answers need to be compared. Values can be `with_rag` or `without_rag`. Currently we always evaluate the API with rag.

**eval_query_ids**: Option to give a set of query ids for evaluation. By default all queries are processed.

**eval_provider_model_id**: We can provide a set of provider/model combinations as ids for comparison.

**qna_pool_file**: Applicable only for `model` evaluation. Provide the file path to the parquet file containing additional QnAs. Default is None.

**eval_out_dir**: Directory where output csv/json files will be saved.

**eval_metrics**: By default all scores/metrics are calculated, but this decides which scores will be used to create the graph. This is a list of metrics, e.g. cosine, euclidean distance, precision/recall/F1 score, answer relevancy score, LLM based similarity score.

**judge_provider / judge_model**: Provider / model for the judge LLM. This is required for LLM based evaluation (answer relevancy score, LLM based similarity score). This needs to be configured correctly through the config yaml file. [Sample provider/model configuration](https://github.com/road-core/service/blob/main/examples/rcsconfig.yaml)

**eval_modes**: Apart from the OLS api, we may want to evaluate the vanilla model, or the model with just OLS parameters/prompt/RAG, so that we have baseline scores. This is a list of modes, e.g. vanilla, ols_param, ols_prompt, ols_rag & ols (actual api).
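
As an illustration, a model evaluation run combining several of the arguments above could look roughly like the sketch below. The argument names come from this README, but the exact flag syntax and example values are assumptions, so consult the evaluation script itself for the real interface:

```
pdm run evaluate \
  --eval_type model \
  --eval_api_url http://localhost:8080 \
  --qna_pool_file eval_data/interview_qna_30_per_title.parquet \
  --eval_out_dir ./eval_output
```
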
### Outputs
The evaluation script creates the files below:
- CSV file with responses for the given provider/model & modes
- Response evaluation result with scores (for consistency check)
- Final csv file with all results, json score summary & graph (for model evaluation)

[Evaluation Result](example_result/README.md)
# RAG retrieval script
```
python -m lightspeed_core_evaluation.evaluation.query_rag
```
This is used to generate a .csv file with the retrieved chunks and their similarity scores for a given set of queries. It is not part of the actual evaluation, but it is useful for a spot check to understand the text that we send to LLMs as context (this may explain any deviation in the response).
#### Arguments
*db-path*: Path to the RAG index

*product-index*: RAG index ID

*model-path*: Path or name of the embedding model

*queries*: Set of queries separated by spaces. If not passed, default queries are used.

*top-k*: How many chunks we want to retrieve. Default is 10.

*output_dir*: Directory to save the .csv file.
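
For reference, an invocation might look roughly like the following. The option names are taken from this README, but the exact CLI syntax and the example values (index path, index ID, embedding model, queries) are assumptions; consult the script for the real interface:

```
python -m lightspeed_core_evaluation.evaluation.query_rag \
  --db-path ./vector_db \
  --product-index ocp-product-docs-4_18 \
  --model-path sentence-transformers/all-mpnet-base-v2 \
  --queries "How do I scale a deployment?" "What is an operator?" \
  --top-k 10 \
  --output_dir ./rag_check
```
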
File renamed without changes.

example_result/README.md renamed to archive/example_result/README.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 - (watsonx) ibm/granite-3-2-8b-instruct (API Version: 2025-04-02)
 - (azure) gpt-4o-mini (Model Version: 2024-07-18, API Version: 2024-02-15-preview)
 - Judge provider/model (LLM based eval): (watsonx) llama-3-1-8b-instruct
-- QnA evaluation dataset: [QnAs from OCP doc](../ocp_doc_qna-edited.parquet)
+- QnA evaluation dataset: [QnAs from OCP doc](../eval_data/ocp_doc_qna-edited.parquet)
 - API run mode: without tool calling (streaming internally)
 - RAG SHA: 56269892dcf5279b9857c04918e8fba587008990b09146e907d7af9303bd6c9e
 - OCP doc: 4.18
