- **Conversation-based Evaluation**: Organize evaluations into conversation groups for context-aware multi-turn testing
- **Multi-type Evaluation**: Support for different evaluation types:
  - `judge-llm`: LLM-based evaluation using a judge model
  - `script`: Script-based evaluation using verification scripts (similar to [k8s-bench](https://github.com/GoogleCloudPlatform/kubectl-ai/tree/main/k8s-bench))
  - `sub-string`: Keyword matching against the agent response
For example, a `sub-string` evaluation entry can be defined like this:

```yaml
eval_query: is there a openshift-lightspeed namespace ?
eval_type: sub-string
expected_keywords:
  - 'yes'
  - 'lightspeed'
description: Check for openshift-lightspeed namespace after setup
```
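For comparison, a `judge-llm` entry might look roughly like the sketch below; `eval_query`, `eval_type`, and `description` mirror the example above, while `expected_response` is an assumed field name that may differ from the framework's actual schema.

```yaml
# Illustrative sketch only; `expected_response` is an assumed field name.
eval_query: how do I list pods in the openshift-lightspeed namespace ?
eval_type: judge-llm
expected_response: Run `oc get pods -n openshift-lightspeed` (or the kubectl equivalent)
description: Judge model compares the agent answer against the expected response
```

With such an entry, the judge model decides pass/fail instead of simple keyword matching.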
The `sample_data/` directory contains example configurations:
- `agent_goal_eval_example.yaml`: Examples with various evaluation types
- `script/`: Example setup, cleanup, and verify scripts

## Judge LLM
For judge-llm evaluations, LiteLLM is currently used.
### Judge LLM - Setup
The expectation is that access to a third-party inference provider or a local model inference endpoint is already set up; the eval framework doesn't handle this.
- **OpenAI**: Set `OPENAI_API_KEY` environment variable
- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
- **Any Other Provider**: Check [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
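For example, with OpenAI as the judge provider, exporting the API key before running the evaluation is all the framework needs; the inference endpoint itself must already exist, as noted above.

```bash
# Assumes an existing OpenAI API key; the eval framework only reads the variable.
export OPENAI_API_KEY="sk-..."

# Or, for a local Ollama model (assuming Ollama's default port):
# export OLLAMA_API_BASE="http://localhost:11434"
```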
Evaluation results are reported at several levels:

- **Overall Summary**: Total evaluations, pass/fail/error counts, success rate
- **By Conversation**: Breakdown of results for each conversation group
- **By Evaluation Type**: Performance metrics for each evaluation type (judge-llm, script, sub-string)

## Development
```bash
cd lightspeed-evaluation/lsc_agent_eval
pdm install --dev

# Run tests
pdm run pytest tests --cov=src

# Run linting
pdm run ruff check
pdm run isort src tests
pdm run black src tests
pdm run mypy src
pdm run pyright src
pdm run pylint src
```
### Contributing
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite and lint checks
6. Submit a pull request
## License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.

## Support
For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.