- **Multi-turn Evaluation**: Organize evaluations into conversation groups for multi-turn testing
- **Multi-type Evaluation**: Support for different evaluation types:
  - `judge-llm`: LLM-based evaluation using a judge model
  - `script`: Script-based evaluation using verification scripts (similar to [k8s-bench](https://github.com/GoogleCloudPlatform/kubectl-ai/tree/main/k8s-bench))
- `description`: Description of the evaluation (Optional)

Note: `eval_id` values must be unique within a conversation group; duplicates across conversation groups are allowed (a warning is logged for awareness).

```yaml
eval_query: is there a openshift-lightspeed namespace ?
eval_type: sub-string
expected_keywords:
  - 'yes'
  - 'lightspeed'
description: Check for openshift-lightspeed namespace after setup
```

The `sample_data/` directory contains example configurations:
- `agent_goal_eval_example.yaml`: Examples with various evaluation types
- `script/`: Example setup, cleanup, and verify scripts
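
As an illustration of what a script-based evaluation can look like, here is a minimal, hypothetical verify script (not one of the shipped examples) that checks the namespace from the YAML snippet above. It assumes the k8s-bench-style convention that exit code 0 means the verification passed:

```bash
#!/usr/bin/env bash
# Hypothetical verify script: checks that the openshift-lightspeed namespace exists.
# Exit code 0 signals a passing verification, non-zero signals failure
# (assumed convention, mirroring k8s-bench-style verify scripts).
set -euo pipefail

if kubectl get namespace openshift-lightspeed >/dev/null 2>&1; then
  echo "openshift-lightspeed namespace found"
  exit 0
else
  echo "openshift-lightspeed namespace not found" >&2
  exit 1
fi
```

Setup and cleanup scripts in the same directory would typically create and remove the resources such a verification depends on.
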
## Judge LLM

For judge-llm evaluations, LiteLLM is currently used.

### Judge LLM - Setup

Access to a third-party inference provider, or a locally served model, is expected to be in place already; the eval framework does not handle this.

- **OpenAI**: Set the `OPENAI_API_KEY` environment variable
- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
- **Any Other Provider**: Check [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
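
For example, a minimal shell setup for two of the providers above might look like the sketch below (the key value is a placeholder, and the Ollama URL assumes its default local endpoint):

```bash
# Judge LLM credentials: export the variables for your chosen provider
# before running the evaluation.

# OpenAI (placeholder key; substitute your real API key)
export OPENAI_API_KEY="sk-your-key-here"

# Ollama, for a locally served judge model (assumes the default local endpoint)
# export OLLAMA_API_BASE="http://localhost:11434"
```
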
Evaluation results are summarized at several levels:

- **Overall Summary**: Total evaluations, pass/fail/error counts, success rate
- **By Conversation**: Breakdown of results for each conversation group
- **By Evaluation Type**: Performance metrics for each evaluation type (judge-llm, script, sub-string)

## Development

```bash
cd lightspeed-evaluation/lsc_agent_eval
pdm install --dev

# Run tests
pdm run pytest tests --cov=src

# Run linting
pdm run ruff check
pdm run isort src tests
pdm run black src tests
pdm run mypy src
pdm run pyright src
pdm run pylint src
```

### Contributing
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run tests and lint checks
6. Submit a pull request

## License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.

## Support

For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.