225 changes: 154 additions & 71 deletions lsc_agent_eval/README.md
@@ -1,16 +1,17 @@
# Lightspeed Agent Evaluation

A standalone package for evaluating agent-based systems, specifically designed for evaluating agent goal achievement.
A framework for evaluating AI agent performance.

## Features

- **Agent Goal Evaluation**: Evaluate whether an agent successfully achieves specified goals
- **Multi-turn Evaluation**: Organize evaluations into conversation groups for multi-turn testing
- **Multi-type Evaluation**: Support for different evaluation types:
  - `judge-llm`: LLM-based evaluation using a judge model
  - `script`: Script-based evaluation using verification scripts (similar to [k8s-bench](https://github.com/GoogleCloudPlatform/kubectl-ai/tree/main/k8s-bench))
  - `sub-string`: Simple substring matching evaluation
  - `sub-string`: Simple substring matching evaluation (ALL keywords must be present in the response; see the sketch after this list)
- **Setup/Cleanup Scripts**: Support for running setup and cleanup scripts before/after evaluation
- **Result Tracking**: Result tracking and CSV output
- **Result Tracking**: Result tracking with CSV output and JSON statistics
- **Standalone Package**: Can be installed and used independently of the main lightspeed-core-evaluation package
- **LiteLLM Integration**: Unified interface for Judge LLM
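
As a rough illustration of the sub-string rule above, a minimal sketch (case-insensitive matching is an assumption of this sketch, not a documented guarantee):

```python
def substring_match(response: str, expected_keywords: list[str]) -> bool:
    # Pass only if every expected keyword occurs somewhere in the agent's response.
    # Case-insensitive comparison is an assumption for illustration.
    return all(keyword.lower() in response.lower() for keyword in expected_keywords)
```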

@@ -45,13 +46,102 @@ pip install -e .
pdm install
```

## Usage
## Data Configuration

The evaluation is configured using a YAML file that defines conversations. Each conversation contains one or more evaluations and includes:

- `conversation_group`: Identifier that groups related evaluations into a conversation
- `description`: Description of the conversation (Optional)
- `setup_script`: Setup script to run before the conversation (Optional)
- `cleanup_script`: Cleanup script to run after the conversation (Optional)
- `conversation`: List of evaluations in this conversation

Each evaluation within a conversation can include:
- `eval_id`: Unique identifier for the evaluation
- `eval_query`: The query/task to send to the agent
- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
- `expected_response`: Expected response (for judge-llm evaluation)
- `expected_keywords`: Keywords to look for (for sub-string evaluation)
- `eval_verify_script`: Verification script (for script evaluation)
- `description`: Description of the evaluation (Optional)

Note: `eval_id` values must be unique within a conversation group. Reusing an `eval_id` across conversation groups is allowed, although a warning is logged for awareness.
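
A minimal sketch of that uniqueness rule, using the YAML keys above (illustrative only, not the package's actual validator):

```python
import logging
from collections import Counter


def check_eval_ids(conversations: list[dict]) -> None:
    """Reject duplicate eval_ids within a group; warn about reuse across groups."""
    seen_elsewhere: set[str] = set()
    for conv in conversations:
        ids = [ev["eval_id"] for ev in conv["conversation"]]
        duplicates = [eval_id for eval_id, count in Counter(ids).items() if count > 1]
        if duplicates:
            raise ValueError(
                f"Duplicate eval_id in {conv['conversation_group']}: {duplicates}"
            )
        for eval_id in ids:
            if eval_id in seen_elsewhere:
                logging.warning("eval_id %s reused across conversation groups", eval_id)
        seen_elsewhere.update(ids)
```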

### Example Data Configuration

```yaml
# Multi-turn Conversations
- conversation_group: conv1
  description: Basic conversation flow testing cluster operations
  conversation:
    - eval_id: eval1
      eval_query: Hi!
      eval_type: judge-llm
      expected_response: Hello! I'm an AI assistant for the Installer.
      description: Initial greeting to start conversation
    - eval_id: eval2
      eval_query: Get me active clusters
      eval_type: judge-llm
      expected_response: Active clusters are x1, x2.
      description: Request for cluster information

- conversation_group: conv2
  description: Multi-turn conversation with setup/cleanup
  setup_script: sample_data/script/setup_environment.sh
  cleanup_script: sample_data/script/cleanup_environment.sh
  conversation:
    - eval_id: eval1
      eval_query: Hi! Can you help me manage pods?
      eval_type: judge-llm
      expected_response: Hello! I can help you manage pods.
      description: Initial greeting
    - eval_id: eval2
      eval_query: Create a pod named test-pod
      eval_type: script
      eval_verify_script: sample_data/script/verify_pod.sh
      description: Create pod and verify
    - eval_id: eval3
      eval_query: List all pods
      eval_type: sub-string
      expected_keywords: ['test-pod']
      description: Verify pod is listed

# Single-turn Conversations
- conversation_group: conv3
  description: Test namespace creation and detection with scripts
  setup_script: sample_data/script/conv3/setup.sh
  cleanup_script: sample_data/script/conv3/cleanup.sh
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-lightspeed namespace ?
      eval_type: sub-string
      expected_keywords:
        - 'yes'
        - 'lightspeed'
      description: Check for openshift-lightspeed namespace after setup
```

The `sample_data/` directory contains example configurations:
- `agent_goal_eval_example.yaml`: Examples with various evaluation types
- `script/`: Example setup, cleanup, and verify scripts

## Judge LLM

Judge-llm evaluations currently use LiteLLM as the interface to the judge model.

### Judge LLM - Setup
The framework expects that access to a third-party inference provider or a locally served model is already in place; the eval framework does not set this up.

- **OpenAI**: Set `OPENAI_API_KEY` environment variable
- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
- **Any Other Provider**: Check [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
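
A quick way to confirm that the judge credentials are usable before running an evaluation is to call LiteLLM directly; the model string below is only an example and should be adapted per the provider docs:

```python
from litellm import completion

# Assumes the relevant environment variables (e.g. OPENAI_API_KEY) are already set.
response = completion(
    model="gpt-4o-mini",  # example; use a watsonx/, azure/, ollama/... model string as needed
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(response.choices[0].message.content)
```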

### Command Line Interface
## Usage

```bash
# Run agent evaluation with basic configuration
lsc-agent-eval \
lsc_agent_eval \
--eval_data_yaml agent_goal_eval.yaml \
--agent_endpoint http://localhost:8080 \
--agent_provider watsonx \
@@ -61,8 +151,6 @@ lsc-agent-eval \
--result_dir ./eval_output
```

### Python API

```python
from lsc_agent_eval import AgentGoalEval

@@ -84,44 +172,7 @@ evaluator = AgentGoalEval(args)
evaluator.run_evaluation()
```
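
The folded lines above construct the `args` object passed to `AgentGoalEval`. A hedged sketch follows; the attribute names are assumed to mirror the CLI flags and may not match the package's argument parser exactly:

```python
from types import SimpleNamespace

# Hypothetical args object -- verify the attribute names against the CLI parser.
args = SimpleNamespace(
    eval_data_yaml="agent_goal_eval.yaml",
    agent_endpoint="http://localhost:8080",
    agent_provider="watsonx",
    result_dir="./eval_output",
)
```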

## Configuration

The evaluation is configured using a YAML file that defines test cases. Each test case can include:

- `eval_id`: Unique identifier for the evaluation
- `eval_query`: The query/task to send to the agent
- `eval_type`: Type of evaluation (judge-llm, script, sub-string)
- `expected_response`: Expected response (for judge-llm evaluation)
- `expected_keywords`: Keywords to look for (for sub-string evaluation)
- `eval_verify_script`: Verification script (for script evaluation)
- `eval_setup_script`: Optional setup script to run before evaluation
- `eval_cleanup_script`: Optional cleanup script to run after evaluation

### Example YAML Configuration

```yaml
# data/example_eval.yaml
- eval_id: eval1
  eval_query: "is there a openshift-monitoring namespace?"
  eval_type: sub-string
  expected_keywords:
    - 'yes'
    - openshift-monitoring

- eval_id: eval2
  eval_query: "is there a openshift-monitoring namespace?"
  eval_type: judge-llm
  expected_response: "there is a openshift-monitoring namespace."

- eval_id: eval3
  eval_query: "create a namespace called openshift-lightspeed"
  eval_setup_script: script/eval3/setup.sh
  eval_type: script
  eval_verify_script: script/eval3/verify.sh
  eval_cleanup_script: script/eval3/cleanup.sh
```

## Command Line Arguments
### Key Arguments

- `--eval_data_yaml`: Path to the YAML file containing evaluation data
- `--agent_endpoint`: Endpoint URL for the agent API (default: <http://localhost:8080>)
@@ -133,33 +184,60 @@ The evaluation is configured using a YAML file that defines test cases. Each tes
- `--result_dir`: Directory to save evaluation results (default: eval_output/)
- `--kubeconfig`: Path to kubeconfig file (if needed for scripts)

## Output
## Evaluation Flow

The evaluation results are saved to a CSV file containing:
- `eval_id`: Evaluation identifier
- `query`: The query sent to the agent
- `response`: The agent's response
- `eval_type`: Type of evaluation performed
- `result`: Result (pass/fail)
### Conversation Processing Order

## Dependencies
1. **Load Configuration**: Parse and validate YAML configuration
2. **Process Conversations**: For each conversation group:
   - Run setup script (if provided)
   - Run all evaluations sequentially:
     - For the first evaluation: Send the query without a conversation ID and receive a new conversation ID from the API
     - For subsequent evaluations: Use the conversation ID from the first evaluation to maintain context
     - Execute the evaluation based on `eval_type` (sub-string, judge-llm, or script)
   - Run cleanup script (if provided)
3. **Save Results**: Export to CSV and JSON with statistics
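
The conversation ID handling in step 2 can be pictured roughly as follows; the helper names are hypothetical and only the control flow mirrors the description above:

```python
def run_conversation(group: dict, query_agent, evaluate) -> list:
    """Illustrative control flow only -- not the package's actual internals."""
    conversation_id = None  # the first query lets the agent API create a new conversation
    results = []
    for ev in group["conversation"]:
        # Subsequent queries reuse the conversation ID returned by the first call.
        response, conversation_id = query_agent(ev["eval_query"], conversation_id)
        results.append(evaluate(ev, response))  # sub-string, judge-llm, or script
    return results
```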

This package depends on:
- `pandas`: Data manipulation and analysis
- `httpx`: HTTP client for API calls
- `tqdm`: Progress bars
- `pyyaml`: YAML file processing
- `litellm`: Unified interface to 100+ LLM providers
### Script Execution

## LiteLLM Integration (Judge LLM)
- **Setup Scripts**: Run once before all evaluations in a conversation
  - If setup fails, all evaluations in the conversation are marked as ERROR
- **Cleanup Scripts**: Run once after all evaluations in a conversation
  - Cleanup failures are logged as warnings (non-critical)
  - Always executed regardless of evaluation results
- **Verify Scripts**: Run per individual evaluation for script type evaluations
  - Used to verify that the agent's action succeeded
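
A small sketch of the behaviour described above; the exact invocation (shell, arguments, error handling) is an assumption:

```python
import subprocess


def run_script(path: str) -> bool:
    """Run a setup/cleanup/verify script; True means it exited with code 0."""
    result = subprocess.run(["bash", path], capture_output=True, text=True, check=False)
    return result.returncode == 0

# Assumed handling, per the rules above (illustrative only):
#   setup fails   -> mark every evaluation in the conversation as ERROR
#   cleanup fails -> log a warning; evaluation results are unaffected
#   verify script -> its exit code decides PASS/FAIL for that single evaluation
```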

For judge-llm evaluations, you can use any of the 100+ supported providers:
### Error Handling

- **OpenAI**: Set `OPENAI_API_KEY` environment variable
- **Azure OpenAI**: Set `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_ENDPOINT`
- **IBM Watsonx**: Set `WATSONX_API_KEY`, `WATSONX_API_BASE`, `WATSONX_PROJECT_ID`
- **Ollama**: Set `OLLAMA_API_BASE` (for local models)
- **And many more**: See [LiteLLM documentation](https://docs.litellm.ai/docs/providers)
- **Setup Failure**: Marks all evaluations in conversation as ERROR
- **Cleanup Failure**: Logged as warning, does not affect evaluation results
- **API Errors**: Evaluation marked as ERROR
- **Evaluation Failure**: Individual evaluation marked as ERROR or FAIL
- **Configuration Errors**: Reported with detailed validation messages

## Output

The framework generates two types of output:

### CSV Results (`agent_goal_eval_results_YYYYMMDD_HHMMSS.csv`)

Contains detailed results with columns:
- `conversation_group`: The conversation group identifier
- `conversation_id`: The conversation ID returned by the Agent API
- `eval_id`: Individual evaluation identifier
- `result`: PASS, FAIL, or ERROR
- `eval_type`: Type of evaluation performed
- `query`: The question/task sent to the agent
- `response`: The agent's response
- `error`: Error message (if any)

### JSON Statistics (`agent_goal_eval_summary_YYYYMMDD_HHMMSS.json`)

Result statistics:
- **Overall Summary**: Total evaluations, pass/fail/error counts, success rate
- **By Conversation**: Breakdown of results for each conversation group
- **By Evaluation Type**: Performance metrics for each evaluation type (judge-llm, script, sub-string)
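
The summary can be reproduced from the CSV columns above; a sketch using pandas (the results filename is an example):

```python
import pandas as pd

df = pd.read_csv("eval_output/agent_goal_eval_results_20250101_120000.csv")  # example name

counts = df["result"].value_counts()
summary = {
    "total": int(len(df)),
    "pass": int(counts.get("PASS", 0)),
    "fail": int(counts.get("FAIL", 0)),
    "error": int(counts.get("ERROR", 0)),
}
summary["success_rate"] = summary["pass"] / summary["total"] if summary["total"] else 0.0

by_conversation = df.groupby("conversation_group")["result"].value_counts().unstack(fill_value=0)
by_type = df.groupby("eval_type")["result"].value_counts().unstack(fill_value=0)
print(summary)
```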

## Development

@@ -174,10 +252,15 @@ cd lightspeed-evaluation/lsc_agent_eval
pdm install --dev

# Run tests
pdm run pytest
pdm run pytest tests --cov=src

# Run linting
pdm run ruff check
pdm run isort src tests
pdm run black src tests
pdm run mypy src
pdm run pyright src
pdm run pylint src
```

### Contributing
@@ -186,7 +269,7 @@ pdm run ruff check
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
5. Run tests and lint checks
6. Submit a pull request

## License
@@ -195,4 +278,4 @@ This project is licensed under the Apache License 2.0. See the LICENSE file for

## Support

For issues and questions, please use the [GitHub Issues](https://github.com/lightspeed-core/lightspeed-evaluation/issues) tracker.
88 changes: 65 additions & 23 deletions lsc_agent_eval/sample_data/agent_goal_eval_example.yaml
@@ -1,26 +1,68 @@
- eval_id: eval1
  eval_query: is there a openshift-monitoring namespace ?
  eval_type: sub-string
  expected_keywords:
    - 'yes'
    - openshift-monitoring
- conversation_group: conv1
  description: Test namespace detection using substring matching
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-monitoring namespace ?
      eval_type: sub-string
      expected_keywords:
        - 'yes'
        - openshift-monitoring
      description: Check for openshift-monitoring namespace existence

- eval_id: eval2
  eval_query: is there a openshift-monitoring namespace ?
  eval_type: judge-llm
  expected_response: there is a openshift-monitoring namespace.
- conversation_group: conv2
  description: Test namespace detection using LLM judge
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-monitoring namespace ?
      eval_type: judge-llm
      expected_response: there is a openshift-monitoring namespace.
      description: Verify openshift-monitoring namespace with LLM evaluation

- eval_id: eval3
  eval_query: is there a openshift-lightspeed namespace ?
  eval_setup_script: sample_data/script/eval3/setup.sh
  eval_type: sub-string
  expected_keywords:
    - 'yes'
  eval_cleanup_script: sample_data/script/eval3/cleanup.sh
- conversation_group: conv3
  description: Test namespace creation and detection with scripts
  setup_script: sample_data/script/conv3/setup.sh
  cleanup_script: sample_data/script/conv3/cleanup.sh
  conversation:
    - eval_id: eval1
      eval_query: is there a openshift-lightspeed namespace ?
      eval_type: sub-string
      expected_keywords:
        - 'yes'
      description: Check for openshift-lightspeed namespace after setup

- eval_id: eval4
  eval_query: create a namespace called openshift-lightspeed
  eval_setup_script: sample_data/script/eval4/setup.sh
  eval_type: script
  eval_verify_script: sample_data/script/eval4/verify.sh
  eval_cleanup_script: sample_data/script/eval4/cleanup.sh
- conversation_group: conv4
  description: Test namespace creation with full script validation
  setup_script: sample_data/script/conv4/setup.sh
  cleanup_script: sample_data/script/conv4/cleanup.sh
  conversation:
    - eval_id: eval1
      eval_query: create a namespace called openshift-lightspeed
      eval_type: script
      eval_verify_script: sample_data/script/conv4/eval1/verify.sh
      description: Create namespace and verify with script

- conversation_group: conv5
  description: Test conversation retention - multi turn success
  conversation:
    - eval_id: eval1
      eval_query: what is openshift virtualization ?
      eval_type: sub-string
      expected_keywords:
        - virtualization
      description: Test first conversation
    - eval_id: eval2
      eval_query: what was my previous query ?
      eval_type: sub-string
      expected_keywords:
        - virtualization
      description: Test second conversation

- conversation_group: conv6
  description: Test conversation retention - new conversation
  conversation:
    - eval_id: eval1
      eval_query: what was my previous query ?
      eval_type: sub-string
      expected_keywords:
        - virtualization
      description: new conversation (failure)