
Commit 13cd754

Merge pull request #22 from asamal4/multi-turn-eval
agent eval: multi-turn & refactoring
2 parents f1b9877 + 79cf74e commit 13cd754

26 files changed, +2162 -1565 lines changed

lsc_agent_eval/README.md

Lines changed: 154 additions & 71 deletions
@@ -1,16 +1,17 @@
 # Lightspeed Agent Evaluation

-A standalone package for evaluating agent-based systems, specifically designed for evaluating agent goal achievement.
+A framework for evaluating AI agent performance.

 ## Features

 - **Agent Goal Evaluation**: Evaluate whether an agent successfully achieves specified goals
+- **Multi-turn Evaluation**: Organize evaluations into conversation groups for multi-turn testing
 - **Multi-type Evaluation**: Support for different evaluation types:
   - `judge-llm`: LLM-based evaluation using a judge model
   - `script`: Script-based evaluation using verification scripts (similar to [k8s-bench](https://github.com/GoogleCloudPlatform/kubectl-ai/tree/main/k8s-bench))
-  - `sub-string`: Simple substring matching evaluation
+  - `sub-string`: Simple substring matching evaluation (ALL keywords must be present in response)
 - **Setup/Cleanup Scripts**: Support for running setup and cleanup scripts before/after evaluation
-- **Result Tracking**: Result tracking and CSV output
+- **Result Tracking**: Result tracking with CSV output and JSON statistics
 - **Standalone Package**: Can be installed and used independently of the main lightspeed-core-evaluation package
 - **LiteLLM Integration**: Unified interface for Judge LLM
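A minimal sketch of the `sub-string` check described in the features above (a hypothetical helper; whether matching is case-insensitive is an assumption, not stated here):

```python
def substring_eval(response: str, expected_keywords: list[str]) -> bool:
    """Pass only if ALL expected keywords appear in the agent's response."""
    text = response.lower()
    return all(keyword.lower() in text for keyword in expected_keywords)


# Passes: both 'yes' and 'openshift-monitoring' occur in the response
substring_eval("Yes, the openshift-monitoring namespace exists.", ["yes", "openshift-monitoring"])
```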

@@ -45,13 +46,102 @@ pip install -e .
 pdm install
 ```

-## Usage
+## Data Configuration
+
+The evaluation is configured using a YAML file that defines conversations. Each conversation contains one or more evaluations and includes:
+
+- `conversation_group`: Identifier for grouping related evaluations/conversations
+- `description`: Description of the conversation (Optional)
+- `setup_script`: Setup script to run before the conversation (Optional)
+- `cleanup_script`: Cleanup script to run after the conversation (Optional)
+- `conversation`: List of evaluations in this conversation
+
+Each evaluation within a conversation can include:
+- `eval_id`: Unique identifier for the evaluation
+- `eval_query`: The query/task to send to the agent
+- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
+- `expected_response`: Expected response (for judge-llm evaluation)
+- `expected_keywords`: Keywords to look for (for sub-string evaluation)
+- `eval_verify_script`: Verification script (for script evaluation)
+- `description`: Description of the evaluation (Optional)
+
+Note: `eval_id` values must be unique within a conversation group; duplicates across conversation groups are allowed (a warning is logged for awareness).
+
+### Example Data Configuration
+
+```yaml
+# Multi-turn Conversations
+- conversation_group: conv1
+  description: Basic conversation flow testing cluster operations
+  conversation:
+    - eval_id: eval1
+      eval_query: Hi!
+      eval_type: judge-llm
+      expected_response: Hello! I'm an AI assistant for the Installer.
+      description: Initial greeting to start conversation
+    - eval_id: eval2
+      eval_query: Get me active clusters
+      eval_type: judge-llm
+      expected_response: Active clusters are x1, x2.
+      description: Request for cluster information
+
+- conversation_group: conv2
+  description: Multi-turn conversation with setup/cleanup
+  setup_script: sample_data/script/setup_environment.sh
+  cleanup_script: sample_data/script/cleanup_environment.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: Hi! Can you help me manage pods?
+      eval_type: judge-llm
+      expected_response: Hello! I can help you manage pods.
+      description: Initial greeting
+    - eval_id: eval2
+      eval_query: Create a pod named test-pod
+      eval_type: script
+      eval_verify_script: sample_data/script/verify_pod.sh
+      description: Create pod and verify
+    - eval_id: eval3
+      eval_query: List all pods
+      eval_type: sub-string
+      expected_keywords: ['test-pod']
+      description: Verify pod is listed
+
+# Single-turn Conversations
+- conversation_group: conv3
+  description: Test namespace creation and detection with scripts
+  setup_script: sample_data/script/conv3/setup.sh
+  cleanup_script: sample_data/script/conv3/cleanup.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-lightspeed namespace ?
+      eval_type: sub-string
+      expected_keywords:
+        - 'yes'
+        - 'lightspeed'
+      description: Check for openshift-lightspeed namespace after setup
+```
+
+The `sample_data/` directory contains example configurations:
+- `agent_goal_eval_example.yaml`: Examples with various evaluation types
+- `script/`: Example setup, cleanup, and verify scripts

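As a rough sketch of how such a configuration could be loaded and the `eval_id` uniqueness rule checked (using `pyyaml`; the helper below is illustrative, not the package's actual loader):

```python
from collections import Counter

import yaml


def load_eval_config(path: str) -> list[dict]:
    """Load conversation groups and reject duplicate eval_ids within a group."""
    with open(path, encoding="utf-8") as f:
        groups = yaml.safe_load(f)
    for group in groups:
        ids = [e["eval_id"] for e in group["conversation"]]
        dupes = [i for i, n in Counter(ids).items() if n > 1]
        if dupes:
            raise ValueError(
                f"duplicate eval_id {dupes} in conversation_group {group['conversation_group']}"
            )
    return groups
```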
+## Judge LLM
+
+For judge-llm evaluations, LiteLLM is currently used.
+
+### Judge LLM - Setup
+
+The framework expects that access to a third-party inference provider is already available, or that a local model inference endpoint is already running; the eval framework does not set this up.
+
+- **OpenAI**: Set `OPENAI_API_KEY` environment variable
+- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
+- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
+- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
+- **Any Other Provider**: Check [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
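A minimal sketch of what a LiteLLM-backed judge call might look like (the prompt wording, model name, and the 1/0 pass convention are assumptions for illustration, not the framework's actual implementation):

```python
from litellm import completion


def judge(query: str, response: str, expected: str, model: str = "gpt-4o-mini") -> bool:
    """Ask the judge model whether the agent response matches the expected answer."""
    prompt = (
        "You are an evaluator. Reply with only 1 (pass) or 0 (fail).\n"
        f"Question: {query}\nExpected answer: {expected}\nAgent answer: {response}"
    )
    result = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return result.choices[0].message.content.strip().startswith("1")
```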

-### Command Line Interface
+## Usage

 ```bash
-# Run agent evaluation with basic configuration
-lsc-agent-eval \
+lsc_agent_eval \
     --eval_data_yaml agent_goal_eval.yaml \
     --agent_endpoint http://localhost:8080 \
     --agent_provider watsonx \
@@ -61,8 +151,6 @@ lsc-agent-eval \
     --result_dir ./eval_output
 ```

-### Python API
-
 ```python
 from lsc_agent_eval import AgentGoalEval

@@ -84,44 +172,7 @@ evaluator = AgentGoalEval(args)
 evaluator.run_evaluation()
 ```

-## Configuration
-
-The evaluation is configured using a YAML file that defines test cases. Each test case can include:
-
-- `eval_id`: Unique identifier for the evaluation
-- `eval_query`: The query/task to send to the agent
-- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
-- `expected_response`: Expected response (for judge-llm evaluation)
-- `expected_keywords`: Keywords to look for (for sub-string evaluation)
-- `eval_verify_script`: Verification script (for script evaluation)
-- `eval_setup_script`: Optional setup script to run before evaluation
-- `eval_cleanup_script`: Optional cleanup script to run after evaluation
-
-### Example YAML Configuration
-
-```yaml
-# data/example_eval.yaml
-- eval_id: eval1
-  eval_query: "is there a openshift-monitoring namespace?"
-  eval_type: sub-string
-  expected_keywords:
-    - 'yes'
-    - openshift-monitoring
-
-- eval_id: eval2
-  eval_query: "is there a openshift-monitoring namespace?"
-  eval_type: judge-llm
-  expected_response: "there is a openshift-monitoring namespace."
-
-- eval_id: eval3
-  eval_query: "create a namespace called openshift-lightspeed"
-  eval_setup_script: script/eval3/setup.sh
-  eval_type: script
-  eval_verify_script: script/eval3/verify.sh
-  eval_cleanup_script: script/eval3/cleanup.sh
-```
-
-## Command Line Arguments
+### Key Arguments

 - `--eval_data_yaml`: Path to the YAML file containing evaluation data
 - `--agent_endpoint`: Endpoint URL for the agent API (default: <http://localhost:8080>)
@@ -133,33 +184,60 @@ The evaluation is configured using a YAML file that defines test cases. Each tes
 - `--result_dir`: Directory to save evaluation results (default: eval_output/)
 - `--kubeconfig`: Path to kubeconfig file (if needed for scripts)

-## Output
+## Evaluation Flow

-The evaluation results are saved to a CSV file containing:
-- `eval_id`: Evaluation identifier
-- `query`: The query sent to the agent
-- `response`: The agent's response
-- `eval_type`: Type of evaluation performed
-- `result`: Result (pass/fail)
+### Conversation Processing Order

-## Dependencies
+1. **Load Configuration**: Parse and validate YAML configuration
+2. **Process Conversations**: For each conversation group:
+   - Run setup script (if provided)
+   - Run all evaluations sequentially:
+     - For the first evaluation: Send query without conversation ID, receive new conversation ID from API
+     - For subsequent evaluations: Use the conversation ID from the first evaluation to maintain context
+     - Execute evaluation based on eval_type (either sub-string, judge-llm or script)
+   - Run cleanup script (if provided)
+3. **Save Results**: Export to CSV and JSON with statistics
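The conversation-ID handling described above could look roughly like this (a sketch using `httpx`; the `/v1/query` path and the request/response field names are assumptions about the agent API, not documented here):

```python
import httpx


def run_conversation(endpoint: str, queries: list[str]) -> list[str]:
    """Send queries in order, reusing the conversation ID returned by the first call."""
    responses: list[str] = []
    conversation_id: str | None = None
    with httpx.Client(base_url=endpoint, timeout=300) as client:
        for query in queries:
            payload = {"query": query}
            if conversation_id:  # the first call omits the ID; the API creates one
                payload["conversation_id"] = conversation_id
            resp = client.post("/v1/query", json=payload)
            resp.raise_for_status()
            data = resp.json()
            conversation_id = data["conversation_id"]
            responses.append(data["response"])
    return responses
```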

-This package depends on:
-- `pandas`: Data manipulation and analysis
-- `httpx`: HTTP client for API calls
-- `tqdm`: Progress bars
-- `pyyaml`: YAML file processing
-- `litellm`: Unified interface to 100+ LLM providers
+### Script Execution

-## LiteLLM Integration (Judge LLM)
+- **Setup Scripts**: Run once before all evaluations in a conversation
+  - If setup fails, all evaluations in the conversation are marked as ERROR
+- **Cleanup Scripts**: Run once after all evaluations in a conversation
+  - Cleanup failures are logged as warnings (non-critical)
+  - Always executed regardless of evaluation results
+- **Verify Scripts**: Run per individual evaluation for script type evaluations
+  - Used to verify the agent's action is successful
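A simplified sketch of how a setup, cleanup, or verify script could be executed, assuming (as in k8s-bench-style checks) that a zero exit code means success; the helper is illustrative only:

```python
import os
import subprocess


def run_script(path: str, kubeconfig: str | None = None, timeout: int = 300) -> bool:
    """Run a shell script and treat a zero exit code as success."""
    env = os.environ.copy()
    if kubeconfig:
        env["KUBECONFIG"] = kubeconfig  # scripts may need cluster access
    result = subprocess.run(
        ["bash", path], env=env, capture_output=True, text=True, timeout=timeout
    )
    return result.returncode == 0
```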

-For judge-llm evaluations, you can use any of the 100+ supported providers:
+### Error Handling

-- **OpenAI**: Set `OPENAI_API_KEY` environment variable
-- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
-- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
-- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
-- **And many more**: See [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
+- **Setup Failure**: Marks all evaluations in the conversation as ERROR
+- **Cleanup Failure**: Logged as warning, does not affect evaluation results
+- **API Errors**: Evaluation marked as ERROR
+- **Evaluation Failure**: Individual evaluation marked as ERROR or FAIL
+- **Configuration Errors**: Detailed validation message
+
+## Output
+
+The framework generates two types of output:
+
+### CSV Results (`agent_goal_eval_results_YYYYMMDD_HHMMSS.csv`)
+
+Contains detailed results with columns:
+- `conversation_group`: The conversation group identifier
+- `conversation_id`: The conversation ID returned by the Agent API
+- `eval_id`: Individual evaluation identifier
+- `result`: PASS, FAIL, or ERROR
+- `eval_type`: Type of evaluation performed
+- `query`: The question/task sent to the agent
+- `response`: The agent's response
+- `error`: Error message (if any)
+
+### JSON Statistics (`agent_goal_eval_summary_YYYYMMDD_HHMMSS.json`)
+
+Result statistics:
+- **Overall Summary**: Total evaluations, pass/fail/error counts, success rate
+- **By Conversation**: Breakdown of results for each conversation group
+- **By Evaluation Type**: Performance metrics for each evaluation type (judge-llm, script, sub-string)
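For quick post-processing, the CSV can be summarized with pandas (column names follow the table above; the file name is just an example):

```python
import pandas as pd

df = pd.read_csv("eval_output/agent_goal_eval_results_20250101_120000.csv")

# Overall success rate
print((df["result"] == "PASS").mean())

# Result counts per conversation group and per evaluation type
print(df.groupby("conversation_group")["result"].value_counts().unstack(fill_value=0))
print(df.groupby("eval_type")["result"].value_counts().unstack(fill_value=0))
```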

 ## Development

@@ -174,10 +252,15 @@ cd lightspeed-evaluation/lsc_agent_eval
 pdm install --dev

 # Run tests
-pdm run pytest
+pdm run pytest tests --cov=src

 # Run linting
 pdm run ruff check
+pdm run isort src tests
+pdm run black src tests
+pdm run mypy src
+pdm run pyright src
+pdm run pylint src
 ```

 ### Contributing
@@ -186,7 +269,7 @@ pdm run ruff check
 2. Create a feature branch
 3. Make your changes
 4. Add tests for new functionality
-5. Run the test suite
+5. Run tests and lint checks
 6. Submit a pull request

 ## License
@@ -195,4 +278,4 @@ This project is licensed under the Apache License 2.0. See the LICENSE file for

 ## Support

-For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.
+For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.
Lines changed: 65 additions & 23 deletions
@@ -1,26 +1,68 @@
-- eval_id: eval1
-  eval_query: is there a openshift-monitoring namespace ?
-  eval_type: sub-string
-  expected_keywords:
-    - 'yes'
-    - openshift-monitoring
+- conversation_group: conv1
+  description: Test namespace detection using substring matching
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-monitoring namespace ?
+      eval_type: sub-string
+      expected_keywords:
+        - 'yes'
+        - openshift-monitoring
+      description: Check for openshift-monitoring namespace existence

-- eval_id: eval2
-  eval_query: is there a openshift-monitoring namespace ?
-  eval_type: judge-llm
-  expected_response: there is a openshift-monitoring namespace.
+- conversation_group: conv2
+  description: Test namespace detection using LLM judge
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-monitoring namespace ?
+      eval_type: judge-llm
+      expected_response: there is a openshift-monitoring namespace.
+      description: Verify openshift-monitoring namespace with LLM evaluation

-- eval_id: eval3
-  eval_query: is there a openshift-lightspeed namespace ?
-  eval_setup_script: sample_data/script/eval3/setup.sh
-  eval_type: sub-string
-  expected_keywords:
-    - 'yes'
-  eval_cleanup_script: sample_data/script/eval3/cleanup.sh
+- conversation_group: conv3
+  description: Test namespace creation and detection with scripts
+  setup_script: sample_data/script/conv3/setup.sh
+  cleanup_script: sample_data/script/conv3/cleanup.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-lightspeed namespace ?
+      eval_type: sub-string
+      expected_keywords:
+        - 'yes'
+      description: Check for openshift-lightspeed namespace after setup

-- eval_id: eval4
-  eval_query: create a namespace called openshift-lightspeed
-  eval_setup_script: sample_data/script/eval4/setup.sh
-  eval_type: script
-  eval_verify_script: sample_data/script/eval4/verify.sh
-  eval_cleanup_script: sample_data/script/eval4/cleanup.sh
+- conversation_group: conv4
+  description: Test namespace creation with full script validation
+  setup_script: sample_data/script/conv4/setup.sh
+  cleanup_script: sample_data/script/conv4/cleanup.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: create a namespace called openshift-lightspeed
+      eval_type: script
+      eval_verify_script: sample_data/script/conv4/eval1/verify.sh
+      description: Create namespace and verify with script
+
+- conversation_group: conv5
+  description: Test conversation retention - multi turn success
+  conversation:
+    - eval_id: eval1
+      eval_query: what is openshift virtualization ?
+      eval_type: sub-string
+      expected_keywords:
+        - virtualization
+      description: Test first conversation
+    - eval_id: eval2
+      eval_query: what was my previous query ?
+      eval_type: sub-string
+      expected_keywords:
+        - virtualization
+      description: Test second conversation
+
+- conversation_group: conv6
+  description: Test conversation retention - new conversation
+  conversation:
+    - eval_id: eval1
+      eval_query: what was my previous query ?
+      eval_type: sub-string
+      expected_keywords:
+        - virtualization
+      description: new conversation (failure)
