225 changes: 154 additions & 71 deletions lsc_agent_eval/README.md
@@ -1,16 +1,17 @@
# Lightspeed Agent Evaluation

A standalone package for evaluating agent-based systems, specifically designed for evaluating agent goal achievement.
A framework for evaluating AI agent performance.

## Features

- **Agent Goal Evaluation**: Evaluate whether an agent successfully achieves specified goals
- **Multi-turn Evaluation**: Organize evaluations into conversation groups for multi-turn testing
- **Multi-type Evaluation**: Support for different evaluation types:
  - `judge-llm`: LLM-based evaluation using a judge model
  - `script`: Script-based evaluation using verification scripts (similar to [k8s-bench](https://github.com/GoogleCloudPlatform/kubectl-ai/tree/main/k8s-bench))
  - `sub-string`: Simple substring matching evaluation
  - `sub-string`: Simple substring matching evaluation (ALL keywords must be present in the response; see the sketch after this list)
- **Setup/Cleanup Scripts**: Support for running setup and cleanup scripts before/after evaluation
- **Result Tracking**: Result tracking and CSV output
- **Result Tracking**: Result tracking with CSV output and JSON statistics
- **Standalone Package**: Can be installed and used independently of the main lightspeed-core-evaluation package
- **LiteLLM Integration**: Unified interface for Judge LLM
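
As a rough illustration of the sub-string rule above, a minimal sketch (case-insensitive matching is an assumption of this sketch, not a documented guarantee):

```python
def substring_match(response: str, expected_keywords: list[str]) -> bool:
    # Pass only if every expected keyword occurs somewhere in the agent's response.
    # Case-insensitive comparison is an assumption for illustration.
    return all(keyword.lower() in response.lower() for keyword in expected_keywords)
```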

@@ -45,13 +46,102 @@ pip install -e .
pdm install
```

## Usage
## Data Configuration

The evaluation is configured using a YAML file that defines conversations. Each conversation contains one or more evaluations and includes:

- `conversation_group`: Identifier that groups related evaluations into a conversation
- `description`: Description of the conversation (Optional)
- `setup_script`: Setup script to run before the conversation (Optional)
- `cleanup_script`: Cleanup script to run after the conversation (Optional)
- `conversation`: List of evaluations in this conversation

Each evaluation within a conversation can include:
- `eval_id`: Unique identifier for the evaluation
- `eval_query`: The query/task to send to the agent
- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
- `expected_response`: Expected response (for judge-llm evaluation)
- `expected_keywords`: Keywords to look for (for sub-string evaluation)
- `eval_verify_script`: Verification script (for script evaluation)
- `description`: Description of the evaluation (Optional)

Note: `eval_id` values must be unique within a conversation group. Reusing an `eval_id` across conversation groups is allowed, although a warning is logged for awareness.
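
A minimal sketch of that uniqueness rule, using the YAML keys above (illustrative only, not the package's actual validator):

```python
import logging
from collections import Counter


def check_eval_ids(conversations: list[dict]) -> None:
    """Reject duplicate eval_ids within a group; warn about reuse across groups."""
    seen_elsewhere: set[str] = set()
    for conv in conversations:
        ids = [ev["eval_id"] for ev in conv["conversation"]]
        duplicates = [eval_id for eval_id, count in Counter(ids).items() if count > 1]
        if duplicates:
            raise ValueError(
                f"Duplicate eval_id in {conv['conversation_group']}: {duplicates}"
            )
        for eval_id in ids:
            if eval_id in seen_elsewhere:
                logging.warning("eval_id %s reused across conversation groups", eval_id)
        seen_elsewhere.update(ids)
```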

### Example Data Configuration

```yaml
# Multi-turn Conversations
- conversation_group: conv1
  description: Basic conversation flow testing cluster operations
  conversation:
    - eval_id: eval1
      eval_query: Hi!
      eval_type: judge-llm
      expected_response: Hello! I'm an AI assistant for the Installer.
      description: Initial greeting to start conversation
    - eval_id: eval2
      eval_query: Get me active clusters
      eval_type: judge-llm
      expected_response: Active clusters are x1, x2.
      description: Request for cluster information

- conversation_group: conv2
  description: Multi-turn conversation with setup/cleanup
  setup_script: sample_data/script/setup_environment.sh
  cleanup_script: sample_data/script/cleanup_environment.sh
  conversation:
    - eval_id: eval1
      eval_query: Hi! Can you help me manage pods?
      eval_type: judge-llm
      expected_response: Hello! I can help you manage pods.
      description: Initial greeting
    - eval_id: eval2
      eval_query: Create a pod named test-pod
      eval_type: script
      eval_verify_script: sample_data/script/verify_pod.sh
      description: Create pod and verify
    - eval_id: eval3
      eval_query: List all pods
      eval_type: sub-string
      expected_keywords: ['test-pod']
      description: Verify pod is listed

# Single-turn Conversations
- conversation_group: conv3
  description: Test namespace creation and detection with scripts
  setup_script: sample_data/script/conv3/setup.sh
  cleanup_script: sample_data/script/conv3/cleanup.sh
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-lightspeed namespace ?
      eval_type: sub-string
      expected_keywords:
        - 'yes'
        - 'lightspeed'
      description: Check for openshift-lightspeed namespace after setup
```

The `sample_data/` directory contains example configurations:
- `agent_goal_eval_example.yaml`: Examples with various evaluation types
- `script/`: Example setup, cleanup, and verify scripts

## Judge LLM

Judge-llm evaluations currently use LiteLLM as the interface to the judge model.

### Judge LLM - Setup
The framework expects that access to a third-party inference provider or a locally served model is already in place; the eval framework does not set this up.

- **OpenAI**: Set `OPENAI_API_KEY` environment variable
- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
- **Any Other Provider**: Check [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
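
A quick way to confirm that the judge credentials are usable before running an evaluation is to call LiteLLM directly; the model string below is only an example and should be adapted per the provider docs:

```python
from litellm import completion

# Assumes the relevant environment variables (e.g. OPENAI_API_KEY) are already set.
response = completion(
    model="gpt-4o-mini",  # example; use a watsonx/, azure/, ollama/... model string as needed
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(response.choices[0].message.content)
```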

### Command Line Interface
## Usage

```bash
# Run agent evaluation with basic configuration
lsc-agent-eval \
lsc_agent_eval \
--eval_data_yaml agent_goal_eval.yaml \
--agent_endpoint http://localhost:8080 \
--agent_provider watsonx \
@@ -61,8 +151,6 @@ lsc-agent-eval \
--result_dir ./eval_output
```

### Python API

```python
from lsc_agent_eval import AgentGoalEval

@@ -84,44 +172,7 @@ evaluator = AgentGoalEval(args)
evaluator.run_evaluation()
```
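
The folded lines above construct the `args` object passed to `AgentGoalEval`. A hedged sketch follows; the attribute names are assumed to mirror the CLI flags and may not match the package's argument parser exactly:

```python
from types import SimpleNamespace

# Hypothetical args object -- verify the attribute names against the CLI parser.
args = SimpleNamespace(
    eval_data_yaml="agent_goal_eval.yaml",
    agent_endpoint="http://localhost:8080",
    agent_provider="watsonx",
    result_dir="./eval_output",
)
```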

## Configuration

The evaluation is configured using a YAML file that defines test cases. Each test case can include:

- `eval_id`: Unique identifier for the evaluation
- `eval_query`: The query/task to send to the agent
- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
- `expected_response`: Expected response (for judge-llm evaluation)
- `expected_keywords`: Keywords to look for (for sub-string evaluation)
- `eval_verify_script`: Verification script (for script evaluation)
- `eval_setup_script`: Optional setup script to run before evaluation
- `eval_cleanup_script`: Optional cleanup script to run after evaluation

### Example YAML Configuration

```yaml
# data/example_eval.yaml
- eval_id: eval1
  eval_query: "is there a openshift-monitoring namespace?"
  eval_type: sub-string
  expected_keywords:
    - 'yes'
    - openshift-monitoring

- eval_id: eval2
  eval_query: "is there a openshift-monitoring namespace?"
  eval_type: judge-llm
  expected_response: "there is a openshift-monitoring namespace."

- eval_id: eval3
  eval_query: "create a namespace called openshift-lightspeed"
  eval_setup_script: script/eval3/setup.sh
  eval_type: script
  eval_verify_script: script/eval3/verify.sh
  eval_cleanup_script: script/eval3/cleanup.sh
```

## Command Line Arguments
### Key Arguments

- `--eval_data_yaml`: Path to the YAML file containing evaluation data
- `--agent_endpoint`: Endpoint URL for the agent API (default: <http://localhost:8080>)
@@ -133,33 +184,60 @@ The evaluation is configured using a YAML file that defines test cases. Each tes
- `--result_dir`: Directory to save evaluation results (default: eval_output/)
- `--kubeconfig`: Path to kubeconfig file (if needed for scripts)

## Output
## Evaluation Flow

The evaluation results are saved to a CSV file containing:
- `eval_id`: Evaluation identifier
- `query`: The query sent to the agent
- `response`: The agent's response
- `eval_type`: Type of evaluation performed
- `result`: Result (pass/fail)
### Conversation Processing Order

## Dependencies
1. **Load Configuration**: Parse and validate YAML configuration
2. **Process Conversations**: For each conversation group:
   - Run setup script (if provided)
   - Run all evaluations sequentially:
     - For the first evaluation: Send the query without a conversation ID and receive a new conversation ID from the API
     - For subsequent evaluations: Use the conversation ID from the first evaluation to maintain context
     - Execute the evaluation based on `eval_type` (sub-string, judge-llm, or script)
   - Run cleanup script (if provided)
3. **Save Results**: Export to CSV and JSON with statistics
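
The conversation ID handling in step 2 can be pictured roughly as follows; the helper names are hypothetical and only the control flow mirrors the description above:

```python
def run_conversation(group: dict, query_agent, evaluate) -> list:
    """Illustrative control flow only -- not the package's actual internals."""
    conversation_id = None  # the first query lets the agent API create a new conversation
    results = []
    for ev in group["conversation"]:
        # Subsequent queries reuse the conversation ID returned by the first call.
        response, conversation_id = query_agent(ev["eval_query"], conversation_id)
        results.append(evaluate(ev, response))  # sub-string, judge-llm, or script
    return results
```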

This package depends on:
- `pandas`: Data manipulation and analysis
- `httpx`: HTTP client for API calls
- `tqdm`: Progress bars
- `pyyaml`: YAML file processing
- `litellm`: Unified interface to 100+ LLM providers
### Script Execution

## LiteLLM Integration (Judge LLM)
- **Setup Scripts**: Run once before all evaluations in a conversation
  - If setup fails, all evaluations in the conversation are marked as ERROR
- **Cleanup Scripts**: Run once after all evaluations in a conversation
  - Cleanup failures are logged as warnings (non-critical)
  - Always executed regardless of evaluation results
- **Verify Scripts**: Run per individual evaluation for script type evaluations
  - Used to verify that the agent's action succeeded
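
A small sketch of the behaviour described above; the exact invocation (shell, arguments, error handling) is an assumption:

```python
import subprocess


def run_script(path: str) -> bool:
    """Run a setup/cleanup/verify script; True means it exited with code 0."""
    result = subprocess.run(["bash", path], capture_output=True, text=True, check=False)
    return result.returncode == 0

# Assumed handling, per the rules above (illustrative only):
#   setup fails   -> mark every evaluation in the conversation as ERROR
#   cleanup fails -> log a warning; evaluation results are unaffected
#   verify script -> its exit code decides PASS/FAIL for that single evaluation
```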

For judge-llm evaluations, you can use any of the 100+ supported providers:
### Error Handling

- **OpenAI**: Set `OPENAI_API_KEY` environment variable
- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
- **And many more**: See [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
- **Setup Failure**: Marks all evaluations in conversation as ERROR
- **Cleanup Failure**: Logged as warning, does not affect evaluation results
- **API Errors**: Evaluation marked as ERROR
- **Evaluation Failure**: Individual evaluation marked as ERROR or FAIL
- **Configuration Errors**: Reported with detailed validation messages

## Output

The framework generates two types of output:

### CSV Results (`agent_goal_eval_results_YYYYMMDD_HHMMSS.csv`)

Contains detailed results with columns:
- `conversation_group`: The conversation group identifier
- `conversation_id`: The conversation ID returned by the Agent API
- `eval_id`: Individual evaluation identifier
- `result`: PASS, FAIL, or ERROR
- `eval_type`: Type of evaluation performed
- `query`: The question/task sent to the agent
- `response`: The agent's response
- `error`: Error message (if any)

### JSON Statistics (`agent_goal_eval_summary_YYYYMMDD_HHMMSS.json`)

Result statistics:
- **Overall Summary**: Total evaluations, pass/fail/error counts, success rate
- **By Conversation**: Breakdown of results for each conversation group
- **By Evaluation Type**: Performance metrics for each evaluation type (judge-llm, script, sub-string)
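
The summary can be reproduced from the CSV columns above; a sketch using pandas (the results filename is an example):

```python
import pandas as pd

df = pd.read_csv("eval_output/agent_goal_eval_results_20250101_120000.csv")  # example name

counts = df["result"].value_counts()
summary = {
    "total": int(len(df)),
    "pass": int(counts.get("PASS", 0)),
    "fail": int(counts.get("FAIL", 0)),
    "error": int(counts.get("ERROR", 0)),
}
summary["success_rate"] = summary["pass"] / summary["total"] if summary["total"] else 0.0

by_conversation = df.groupby("conversation_group")["result"].value_counts().unstack(fill_value=0)
by_type = df.groupby("eval_type")["result"].value_counts().unstack(fill_value=0)
print(summary)
```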

## Development

@@ -174,10 +252,15 @@ cd lightspeed-evaluation/lsc_agent_eval
pdm install --dev

# Run tests
pdm run pytest
pdm run pytest tests --cov=src

# Run linting
pdm run ruff check
pdm run isort src tests
pdm run black src tests
pdm run mypy src
pdm run pyright src
pdm run pylint src
```

### Contributing
@@ -186,7 +269,7 @@ pdm run ruff check
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
5. Run tests and lint checks
6. Submit a pull request

## License
@@ -195,4 +278,4 @@ This project is licensed under the Apache License 2.0. See the LICENSE file for

## Support

For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.
88 changes: 65 additions & 23 deletions lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
@@ -1,26 +1,68 @@
- eval_id: eval1
  eval_query: is there a openshift-monitoring namespace ?
  eval_type: sub-string
  expected_keywords:
    - 'yes'
    - openshift-monitoring
- conversation_group: conv1
  description: Test namespace detection using substring matching
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-monitoring namespace ?
      eval_type: sub-string
      expected_keywords:
        - 'yes'
        - openshift-monitoring
      description: Check for openshift-monitoring namespace existence

- eval_id: eval2
  eval_query: is there a openshift-monitoring namespace ?
  eval_type: judge-llm
  expected_response: there is a openshift-monitoring namespace.
- conversation_group: conv2
  description: Test namespace detection using LLM judge
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-monitoring namespace ?
      eval_type: judge-llm
      expected_response: there is a openshift-monitoring namespace.
      description: Verify openshift-monitoring namespace with LLM evaluation

- eval_id: eval3
  eval_query: is there a openshift-lightspeed namespace ?
  eval_setup_script: sample_data/script/eval3/setup.sh
  eval_type: sub-string
  expected_keywords:
    - 'yes'
  eval_cleanup_script: sample_data/script/eval3/cleanup.sh
- conversation_group: conv3
  description: Test namespace creation and detection with scripts
  setup_script: sample_data/script/conv3/setup.sh
  cleanup_script: sample_data/script/conv3/cleanup.sh
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-lightspeed namespace ?
      eval_type: sub-string
      expected_keywords:
        - 'yes'
      description: Check for openshift-lightspeed namespace after setup

- eval_id: eval4
  eval_query: create a namespace called openshift-lightspeed
  eval_setup_script: sample_data/script/eval4/setup.sh
  eval_type: script
  eval_verify_script: sample_data/script/eval4/verify.sh
  eval_cleanup_script: sample_data/script/eval4/cleanup.sh
- conversation_group: conv4
  description: Test namespace creation with full script validation
  setup_script: sample_data/script/conv4/setup.sh
  cleanup_script: sample_data/script/conv4/cleanup.sh
  conversation:
    - eval_id: eval1
      eval_query: create a namespace called openshift-lightspeed
      eval_type: script
      eval_verify_script: sample_data/script/conv4/eval1/verify.sh
      description: Create namespace and verify with script

- conversation_group: conv5
  description: Test conversation retention - multi turn success
  conversation:
    - eval_id: eval1
      eval_query: what is openshift virtualization ?
      eval_type: sub-string
      expected_keywords:
        - virtualization
      description: Test first conversation
    - eval_id: eval2
      eval_query: what was my previous query ?
      eval_type: sub-string
      expected_keywords:
        - virtualization
      description: Test second conversation

- conversation_group: conv6
  description: Test conversation retention - new conversation
  conversation:
    - eval_id: eval1
      eval_query: what was my previous query ?
      eval_type: sub-string
      expected_keywords:
        - virtualization
      description: new conversation (failure)