
Commit 89e849e

agent eval: multi-turn & refactoring
1 parent f1b9877 commit 89e849e

25 files changed: +2182 / -1506 lines changed


lsc_agent_eval/README.md

Lines changed: 155 additions & 71 deletions
@@ -1,16 +1,17 @@
 # Lightspeed Agent Evaluation

-A standalone package for evaluating agent-based systems, specifically designed for evaluating agent goal achievement.
+A framework for evaluating AI agent performance.

 ## Features

 - **Agent Goal Evaluation**: Evaluate whether an agent successfully achieves specified goals
+- **Conversation-based Evaluation**: Organize evaluations into conversation groups for context-aware multi-turn testing
 - **Multi-type Evaluation**: Support for different evaluation types:
   - `judge-llm`: LLM-based evaluation using a judge model
   - `script`: Script-based evaluation using verification scripts (similar to [k8s-bench](https://github.com/GoogleCloudPlatform/kubectl-ai/tree/main/k8s-bench))
-  - `sub-string`: Simple substring matching evaluation
+  - `sub-string`: Simple substring matching evaluation (ALL keywords must be present in response)
 - **Setup/Cleanup Scripts**: Support for running setup and cleanup scripts before/after evaluation
-- **Result Tracking**: Result tracking and CSV output
+- **Result Tracking**: Result tracking with CSV output and JSON statistics
 - **Standalone Package**: Can be installed and used independently of the main lightspeed-core-evaluation package
 - **LiteLLM Integration**: Unified interface for Judge LLM
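For the `sub-string` type above, the check passes only when every expected keyword occurs in the agent response. A minimal sketch of that rule (the case-insensitive matching here is an illustrative assumption, not taken from the implementation):

```python
# Minimal sketch of the sub-string rule: every expected keyword must appear in
# the agent's response. Case-insensitive matching is an assumption made here
# for illustration only.
response = "Yes, the openshift-monitoring namespace exists."
expected_keywords = ["yes", "openshift-monitoring"]

passed = all(keyword.lower() in response.lower() for keyword in expected_keywords)
print("PASS" if passed else "FAIL")  # prints: PASS
```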

@@ -45,13 +46,100 @@ pip install -e .
 pdm install
 ```

-## Usage
+## Data Configuration
+
+The evaluation is configured using a YAML file that defines conversations. Each conversation contains one or more evaluations and includes:
+
+- `conversation_group`: Identifier for grouping related evaluations into a conversation
+- `description`: Description of the conversation (Optional)
+- `setup_script`: Setup script to run before the conversation (Optional)
+- `cleanup_script`: Cleanup script to run after the conversation (Optional)
+- `conversation`: List of evaluations in this conversation
+
+Each evaluation within a conversation can include:
+- `eval_id`: Unique identifier for the evaluation
+- `eval_query`: The query/task to send to the agent
+- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
+- `expected_response`: Expected response (for judge-llm evaluation)
+- `expected_keywords`: Keywords to look for (for sub-string evaluation)
+- `eval_verify_script`: Verification script (for script evaluation)
+- `description`: Description of the evaluation (Optional)
+
+### Example Data Configuration
+
+```yaml
+# Multi-turn conversations (multiple evaluations per conversation)
+- conversation_group: conv1
+  description: Basic conversation flow testing cluster operations
+  conversation:
+    - eval_id: eval1
+      eval_query: Hi!
+      eval_type: judge-llm
+      expected_response: Hello! I'm an AI assistant for the Installer.
+      description: Initial greeting to start conversation
+    - eval_id: eval2
+      eval_query: Get me active clusters
+      eval_type: judge-llm
+      expected_response: Active clusters are x1, x2.
+      description: Request for cluster information
+
+- conversation_group: conv2
+  description: Multi-turn conversation with setup/cleanup
+  setup_script: sample_data/script/setup_environment.sh
+  cleanup_script: sample_data/script/cleanup_environment.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: Hi! Can you help me manage pods?
+      eval_type: judge-llm
+      expected_response: Hello! I can help you manage pods.
+      description: Initial greeting
+    - eval_id: eval2
+      eval_query: Create a pod named test-pod
+      eval_type: script
+      eval_verify_script: sample_data/script/verify_pod.sh
+      description: Create pod and verify
+    - eval_id: eval3
+      eval_query: List all pods
+      eval_type: sub-string
+      expected_keywords: ['test-pod']
+      description: Verify pod is listed
+
+# Single-turn conversations (one evaluation per conversation)
+- conversation_group: conv3
+  description: Test namespace creation and detection with scripts
+  setup_script: sample_data/script/conv3/setup.sh
+  cleanup_script: sample_data/script/conv3/cleanup.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-lightspeed namespace ?
+      eval_type: sub-string
+      expected_keywords:
+        - 'yes'
+        - 'lightspeed'
+      description: Check for openshift-lightspeed namespace after setup
+```

-### Command Line Interface
+The `sample_data/` directory contains example configurations:
+- `agent_goal_eval_example.yaml`: Examples with various evaluation types
+- `script/`: Example setup, cleanup, and verify scripts
+
+## Judge LLM
+
+For judge-llm evaluations, LiteLLM is currently used.
+
+### Judge LLM - Setup
+The framework expects that access to a third-party inference provider is already available, or that a local model inference endpoint is already running; the eval framework does not handle this. Set the credentials for your provider as environment variables:
+
+- **OpenAI**: Set `OPENAI_API_KEY` environment variable
+- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
+- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
+- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
+- **Any Other Provider**: Check [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
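As a minimal sketch of this setup (the provider choice, key value, and flag subset are illustrative; see the full invocation under Usage below):

```python
import os
import subprocess

# Illustrative only: use OpenAI as the judge (the key below is a placeholder)
# and launch the CLI with a subset of the flags documented under Usage.
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder value

subprocess.run(
    [
        "lsc_agent_eval",
        "--eval_data_yaml", "agent_goal_eval.yaml",
        "--agent_endpoint", "http://localhost:8080",
        "--result_dir", "./eval_output",
    ],
    check=True,
)
```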
+## Usage

 ```bash
-# Run agent evaluation with basic configuration
-lsc-agent-eval \
+lsc_agent_eval \
   --eval_data_yaml agent_goal_eval.yaml \
   --agent_endpoint http://localhost:8080 \
   --agent_provider watsonx \
@@ -61,8 +149,6 @@ lsc-agent-eval \
   --result_dir ./eval_output
 ```

-### Python API
-
 ```python
 from lsc_agent_eval import AgentGoalEval

@@ -84,44 +170,7 @@ evaluator = AgentGoalEval(args)
 evaluator.run_evaluation()
 ```

-## Configuration
-
-The evaluation is configured using a YAML file that defines test cases. Each test case can include:
-
-- `eval_id`: Unique identifier for the evaluation
-- `eval_query`: The query/task to send to the agent
-- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
-- `expected_response`: Expected response (for judge-llm evaluation)
-- `expected_keywords`: Keywords to look for (for sub-string evaluation)
-- `eval_verify_script`: Verification script (for script evaluation)
-- `eval_setup_script`: Optional setup script to run before evaluation
-- `eval_cleanup_script`: Optional cleanup script to run after evaluation
-
-### Example YAML Configuration
-
-```yaml
-# data/example_eval.yaml
-- eval_id: eval1
-  eval_query: "is there a openshift-monitoring namespace?"
-  eval_type: sub-string
-  expected_keywords:
-    - 'yes'
-    - openshift-monitoring
-
-- eval_id: eval2
-  eval_query: "is there a openshift-monitoring namespace?"
-  eval_type: judge-llm
-  expected_response: "there is a openshift-monitoring namespace."
-
-- eval_id: eval3
-  eval_query: "create a namespace called openshift-lightspeed"
-  eval_setup_script: script/eval3/setup.sh
-  eval_type: script
-  eval_verify_script: script/eval3/verify.sh
-  eval_cleanup_script: script/eval3/cleanup.sh
-```
-
-## Command Line Arguments
+### Key Arguments

 - `--eval_data_yaml`: Path to the YAML file containing evaluation data
 - `--agent_endpoint`: Endpoint URL for the agent API (default: <http://localhost:8080>)
@@ -133,33 +182,63 @@ The evaluation is configured using a YAML file that defines test cases. Each tes
 - `--result_dir`: Directory to save evaluation results (default: eval_output/)
 - `--kubeconfig`: Path to kubeconfig file (if needed for scripts)

-## Output
+## Evaluation Flow

-The evaluation results are saved to a CSV file containing:
-- `eval_id`: Evaluation identifier
-- `query`: The query sent to the agent
-- `response`: The agent's response
-- `eval_type`: Type of evaluation performed
-- `result`: Result (pass/fail)
+### Conversation Processing Order

-## Dependencies
+1. **Load Configuration**: Parse and validate YAML configuration
+2. **Generate UUIDs**: Create unique conversation UUIDs for each conversation group
+3. **Process Conversations**: For each conversation group:
+   - Run setup script (if provided)
+   - Run all evaluations
+     - Get Agent API response (with a shared conversation UUID)
+     - Execute evaluation based on eval_type (either sub-string, judge-llm or script)
+   - Run cleanup script (if provided)
+4. **Save Results**: Export to CSV and JSON with statistics
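A small sketch of the shared-UUID idea in steps 2 and 3 (the dictionaries mirror the YAML fields above; the loop is illustrative, not the framework's actual code):

```python
import uuid

# Illustrative sketch of steps 2-3: every evaluation in a conversation group is
# sent with the same conversation UUID, so the agent treats the group as one
# multi-turn conversation. Field names mirror the YAML configuration.
conversations = [
    {"conversation_group": "conv1", "conversation": [{"eval_id": "eval1"}, {"eval_id": "eval2"}]},
    {"conversation_group": "conv3", "conversation": [{"eval_id": "eval1"}]},
]

for group in conversations:
    conversation_uuid = str(uuid.uuid4())  # one UUID per conversation group
    for evaluation in group["conversation"]:
        # The framework would call the agent API here, passing conversation_uuid.
        print(group["conversation_group"], evaluation["eval_id"], conversation_uuid)
```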
-This package depends on:
-- `pandas`: Data manipulation and analysis
-- `httpx`: HTTP client for API calls
-- `tqdm`: Progress bars
-- `pyyaml`: YAML file processing
-- `litellm`: Unified interface to 100+ LLM providers
+### Script Execution

-## LiteLLM Integration (Judge LLM)
+- **Setup Scripts**: Run once before all evaluations in a conversation
+  - If setup fails, all evaluations in the conversation are marked as ERROR
+- **Cleanup Scripts**: Run once after all evaluations in a conversation
+  - Cleanup failures are logged as warnings (non-critical)
+  - Always executed regardless of evaluation results
+- **Verify Scripts**: Run per individual evaluation for script type evaluations
+  - Used to verify that the agent's action was successful
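The sample verify scripts in this repository are shell scripts; purely as an illustration of the same idea (and assuming the framework reads the script's exit code), an equivalent check might look like:

```python
#!/usr/bin/env python3
"""Illustrative verify-script logic; the samples shipped with this package are
shell scripts. Assumption: a zero exit code is read as a successful check."""
import subprocess
import sys

# Verify that the namespace the agent was asked to create actually exists.
result = subprocess.run(
    ["kubectl", "get", "namespace", "openshift-lightspeed"],
    capture_output=True,
    text=True,
)
sys.exit(result.returncode)
```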
-For judge-llm evaluations, you can use any of the 100+ supported providers:
+### Error Handling

-- **OpenAI**: Set `OPENAI_API_KEY` environment variable
-- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
-- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
-- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
-- **And many more**: See [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
+- **Setup Failure**: Marks all evaluations in the conversation as ERROR
+- **Cleanup Failure**: Logged as a warning, does not affect evaluation results
+- **API Errors**: Evaluation marked as ERROR
+- **Evaluation Failure**: Individual evaluation marked as ERROR or FAIL
+- **Configuration Errors**: Reported with a detailed validation message
+
+## Output
+
+The framework generates two types of output:
+
+### CSV Results (`agent_goal_eval_results_YYYYMMDD_HHMMSS.csv`)
+
+Contains detailed results with columns:
+- `conversation_group`: The conversation group identifier
+- `conversation_uuid`: The UUID used for API calls (internal)
+- `eval_id`: Individual evaluation identifier
+- `result`: PASS, FAIL, or ERROR
+- `eval_type`: Type of evaluation performed
+- `query`: The question/task sent to the agent
+- `response`: The agent's response
+- `expected_response`: Expected response (for judge-llm evaluations)
+- `expected_keywords`: Expected keywords (for sub-string evaluations)
+- `eval_verify_script`: Verification script path (for script evaluations)
+- `error`: Error message (if any)
+
+### JSON Statistics (`agent_goal_eval_summary_YYYYMMDD_HHMMSS.json`)
+
+Result statistics:
+- **Overall Summary**: Total evaluations, pass/fail/error counts, success rate
+- **By Conversation**: Breakdown of results for each conversation group
+- **By Evaluation Type**: Performance metrics for each evaluation type (judge-llm, script, sub-string)
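A quick way to inspect a finished run, using only the column names and file naming pattern documented above (an illustrative snippet, not a bundled tool):

```python
import csv
import glob
from collections import Counter

# Pick the newest results file written by the framework.
latest = sorted(glob.glob("eval_output/agent_goal_eval_results_*.csv"))[-1]

with open(latest, newline="") as handle:
    rows = list(csv.DictReader(handle))

# Count PASS / FAIL / ERROR, then list every evaluation that did not pass.
print(Counter(row["result"] for row in rows))
for row in rows:
    if row["result"] != "PASS":
        print(row["conversation_group"], row["eval_id"], row["result"], row["error"])
```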

 ## Development

@@ -174,10 +253,15 @@ cd lightspeed-evaluation/lsc_agent_eval
 pdm install --dev

 # Run tests
-pdm run pytest
+pdm run pytest tests --cov=src

 # Run linting
 pdm run ruff check
+pdm run isort src tests
+pdm run black src tests
+pdm run mypy src
+pdm run pyright src
+pdm run pylint src
 ```

 ### Contributing
@@ -186,7 +270,7 @@ pdm run ruff check
 2. Create a feature branch
 3. Make your changes
 4. Add tests for new functionality
-5. Run the test suite
+5. Run the lint checks
 6. Submit a pull request

 ## License
@@ -195,4 +279,4 @@ This project is licensed under the Apache License 2.0. See the LICENSE file for

 ## Support

-For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.
+For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.
Lines changed: 65 additions & 23 deletions
@@ -1,26 +1,68 @@
-- eval_id: eval1
-  eval_query: is there a openshift-monitoring namespace ?
-  eval_type: sub-string
-  expected_keywords:
-    - 'yes'
-    - openshift-monitoring
+- conversation_group: conv1
+  description: Test namespace detection using substring matching
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-monitoring namespace ?
+      eval_type: sub-string
+      expected_keywords:
+        - 'yes'
+        - openshift-monitoring
+      description: Check for openshift-monitoring namespace existence

-- eval_id: eval2
-  eval_query: is there a openshift-monitoring namespace ?
-  eval_type: judge-llm
-  expected_response: there is a openshift-monitoring namespace.
+- conversation_group: conv2
+  description: Test namespace detection using LLM judge
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-monitoring namespace ?
+      eval_type: judge-llm
+      expected_response: there is a openshift-monitoring namespace.
+      description: Verify openshift-monitoring namespace with LLM evaluation

-- eval_id: eval3
-  eval_query: is there a openshift-lightspeed namespace ?
-  eval_setup_script: sample_data/script/eval3/setup.sh
-  eval_type: sub-string
-  expected_keywords:
-    - 'yes'
-  eval_cleanup_script: sample_data/script/eval3/cleanup.sh
+- conversation_group: conv3
+  description: Test namespace creation and detection with scripts
+  setup_script: sample_data/script/conv3/setup.sh
+  cleanup_script: sample_data/script/conv3/cleanup.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: is there a openshift-lightspeed namespace ?
+      eval_type: sub-string
+      expected_keywords:
+        - 'yes'
+      description: Check for openshift-lightspeed namespace after setup

-- eval_id: eval4
-  eval_query: create a namespace called openshift-lightspeed
-  eval_setup_script: sample_data/script/eval4/setup.sh
-  eval_type: script
-  eval_verify_script: sample_data/script/eval4/verify.sh
-  eval_cleanup_script: sample_data/script/eval4/cleanup.sh
+- conversation_group: conv4
+  description: Test namespace creation with full script validation
+  setup_script: sample_data/script/conv4/setup.sh
+  cleanup_script: sample_data/script/conv4/cleanup.sh
+  conversation:
+    - eval_id: eval1
+      eval_query: create a namespace called openshift-lightspeed
+      eval_type: script
+      eval_verify_script: sample_data/script/conv4/eval1/verify.sh
+      description: Create namespace and verify with script
+
+- conversation_group: conv5
+  description: Test conversation retention - multi turn success
+  conversation:
+    - eval_id: eval1
+      eval_query: what is openshift virtualization ?
+      eval_type: sub-string
+      expected_keywords:
+        - virtualization
+      description: Test first conversation
+    - eval_id: eval2
+      eval_query: what was my previous query ?
+      eval_type: sub-string
+      expected_keywords:
+        - virtualization
+      description: Test second conversation
+
+- conversation_group: conv6
+  description: Test conversation retention - new conversation
+  conversation:
+    - eval_id: eval1
+      eval_query: what was my previous query ?
+      eval_type: sub-string
+      expected_keywords:
+        - virtualization
+      description: new conversation (failure)
