# LSC Evaluation Framework - Test Suite

This directory contains the test suite for the LSC Evaluation Framework, generated from the `system.yaml` configuration file.

## Test Structure

```
tests/
├── conftest.py                  # Pytest fixtures and configuration
├── test_runner.py               # Test runner script
├── README.md                    # This file
├── core/                        # Core functionality tests
│   ├── test_config_loader.py    # ConfigLoader class tests
│   ├── test_models.py           # Pydantic models tests
│   └── test_data_validator.py   # DataValidator class tests
├── llm_managers/                # LLM manager tests
│   └── test_llm_manager.py      # LLMManager class tests
├── metrics/                     # Metrics component tests
│   └── test_custom_metrics.py   # Custom metrics tests
└── output/                      # Output component tests
    └── test_utils.py            # Output utilities tests
```

## Test Categories

The tests are organized into categories using pytest markers (registered as sketched after this list):

- **`unit`**: Unit tests for individual components
- **`integration`**: Integration tests across components
- **`config`**: Configuration loading and validation tests
- **`models`**: Pydantic model validation tests
- **`validation`**: Data validation tests
- **`output`**: Output generation and formatting tests
- **`slow`**: Tests that take longer to run
- **`llm`**: Tests requiring LLM API calls (may be skipped in CI)
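
If the markers are not already declared in `pytest.ini` or `pyproject.toml`, they can be registered in `conftest.py` so pytest does not warn about unknown marks. The following is a minimal sketch; the registration in this repository's `conftest.py` may differ:

```python
# conftest.py -- registers the suite's custom markers (illustrative sketch).

MARKERS = {
    "unit": "Unit tests for individual components",
    "integration": "Integration tests across components",
    "config": "Configuration loading and validation tests",
    "models": "Pydantic model validation tests",
    "validation": "Data validation tests",
    "output": "Output generation and formatting tests",
    "slow": "Tests that take longer to run",
    "llm": "Tests requiring LLM API calls",
}


def pytest_configure(config):
    # addinivalue_line is the standard pytest hook for registering markers.
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```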

## Running Tests

### Using the Test Runner Script

The easiest way to run the tests is with the provided runner script:

```bash
# Run all tests
python tests/test_runner.py all

# Run specific test categories
python tests/test_runner.py unit
python tests/test_runner.py config
python tests/test_runner.py models
python tests/test_runner.py validation

# Run tests with coverage
python tests/test_runner.py coverage

# Run a specific test file
python tests/test_runner.py file tests/core/test_models.py

# Run fast tests only (exclude slow tests)
python tests/test_runner.py fast
```
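
Conceptually the runner is a thin wrapper that maps a category name to a pytest invocation. The sketch below shows how such a dispatcher could look; it is not the shipped `test_runner.py`, and the command-to-argument mapping is an assumption based on the usage shown above:

```python
# Illustrative dispatcher sketch -- not the actual test_runner.py.
import subprocess
import sys

# Hypothetical mapping from runner commands to pytest arguments.
COMMANDS = {
    "all": ["tests/"],
    "unit": ["-m", "unit", "tests/"],
    "config": ["-m", "config", "tests/"],
    "models": ["-m", "models", "tests/"],
    "validation": ["-m", "validation", "tests/"],
    "fast": ["-m", "not slow", "tests/"],
    "coverage": ["--cov=lsc_eval", "--cov-report=html", "tests/"],
}


def main() -> int:
    command = sys.argv[1] if len(sys.argv) > 1 else "all"
    if command == "file":
        args = sys.argv[2:]  # e.g. tests/core/test_models.py
    else:
        args = COMMANDS.get(command, COMMANDS["all"])
    return subprocess.call([sys.executable, "-m", "pytest", *args])


if __name__ == "__main__":
    sys.exit(main())
```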

### Using pytest Directly

You can also run the tests directly with pytest:

```bash
# Run all tests
pytest tests/

# Run with verbose output
pytest -v tests/

# Run a specific test file
pytest tests/core/test_config_loader.py

# Run tests with specific markers
pytest -m "config" tests/
pytest -m "not slow" tests/

# Run with coverage
pytest --cov=lsc_eval --cov-report=html tests/
```

## Test Configuration

### Environment Setup

Tests use shared fixtures to set up a clean environment (sketched after this list):

- **`clean_environment`**: Clears environment variables before and after each test
- **`temp_dir`**: Provides a temporary directory for test files
- **`sample_system_config`**: Provides a sample system configuration
- **`sample_evaluation_data`**: Provides sample evaluation data
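
A compressed sketch of what these fixtures might look like in `conftest.py` is shown below. The fixture bodies, environment variable names, and data shapes are assumptions for illustration; the real fixtures cover more of `system.yaml`:

```python
# conftest.py fixtures -- illustrative sketch; the real fixture bodies are richer.
import pytest


@pytest.fixture
def clean_environment(monkeypatch):
    """Remove provider credentials so tests start from a known-clean environment."""
    for var in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "WATSONX_APIKEY"):  # assumed variable names
        monkeypatch.delenv(var, raising=False)
    yield


@pytest.fixture
def temp_dir(tmp_path):
    """Temporary directory for files written during a test."""
    return tmp_path


@pytest.fixture
def sample_system_config():
    """Minimal stand-in for the structure loaded from system.yaml (fields assumed)."""
    return {
        "llm": {"provider": "openai", "model": "gpt-4o"},
        "metrics": [{"name": "answer_relevancy", "threshold": 0.7}],
        "output": {"formats": ["csv", "json"]},
    }


@pytest.fixture
def sample_evaluation_data():
    """One multi-turn conversation with context and expected response (shape assumed)."""
    return [
        {
            "conversation_id": "conv-1",
            "turns": [
                {
                    "query": "What is LSC?",
                    "response": "LSC is ...",
                    "contexts": ["LSC stands for ..."],
                    "expected_response": "LSC is ...",
                }
            ],
        }
    ]
```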

### Mock Data

Tests use realistic mock data based on the actual `system.yaml` configuration:

- **LLM Configuration**: OpenAI, Azure, Anthropic, Gemini, WatsonX, and Ollama providers
- **Metrics**: Ragas, DeepEval, and custom metrics as defined in `system.yaml`
- **Output Formats**: CSV, JSON, and TXT formats with visualization options
- **Evaluation Data**: Multi-turn conversations with contexts and expected responses

## Test Coverage

The test suite covers:

### Core Components
- **ConfigLoader**: System configuration loading, environment setup, logging configuration
- **Models**: Pydantic model validation for `TurnData`, `EvaluationData`, and `EvaluationResult`
- **DataValidator**: Evaluation data validation, metric requirements checking

### LLM Managers
- **LLMManager**: Provider-specific configuration, environment validation, model name construction

### Metrics
- **CustomMetrics**: LLM-based evaluation, score parsing, prompt generation

### Output Components
- **Utils**: Statistics calculation, result aggregation, evaluation scoping

## Key Test Scenarios

### Configuration Testing
- Valid and invalid system configurations
- Environment variable setup and validation
- Logging configuration with different levels
- Metric mapping and validation
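
For illustration, a configuration test in this style might look as follows. The `ConfigLoader` import path, constructor, method names, and exception type are assumptions; the tests in `tests/core/test_config_loader.py` are authoritative:

```python
# Illustrative sketch -- import path and ConfigLoader API are assumed.
import pytest
import yaml  # PyYAML; assumed available since the framework reads YAML config

from lsc_eval.core.config_loader import ConfigLoader  # assumed import path


@pytest.mark.config
def test_load_valid_system_config(sample_system_config, temp_dir):
    """A well-formed system.yaml should load without errors."""
    config_path = temp_dir / "system.yaml"
    config_path.write_text(yaml.safe_dump(sample_system_config))

    config = ConfigLoader(config_path).load()  # assumed constructor and method names
    assert config is not None


@pytest.mark.config
def test_missing_config_file_raises(temp_dir):
    """Loading a non-existent configuration file should fail loudly."""
    with pytest.raises(FileNotFoundError):  # assumed exception type
        ConfigLoader(temp_dir / "missing.yaml").load()
```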

### Model Validation Testing
- Field validation for all Pydantic models
- Edge cases and boundary conditions
- Required field validation
- Data type validation
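
A minimal example of this style of test, assuming a `TurnData` model with `query` and `response` fields (the actual field names and import path live in the framework's models module):

```python
# Illustrative sketch -- TurnData import path and field names are assumptions.
import pytest
from pydantic import ValidationError

from lsc_eval.core.models import TurnData  # assumed import path


@pytest.mark.models
def test_turn_data_accepts_valid_fields():
    turn = TurnData(query="What is LSC?", response="LSC is ...")  # assumed fields
    assert turn.query == "What is LSC?"


@pytest.mark.models
def test_turn_data_rejects_missing_required_field():
    with pytest.raises(ValidationError):
        TurnData(query="What is LSC?")  # response omitted on purpose
```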

### Data Validation Testing
- Evaluation data structure validation
- Metric requirement checking
- Context and expected response validation
- Multi-conversation validation

### LLM Manager Testing
- Provider-specific environment validation
- Model name construction for different providers
- Error handling for missing credentials
- Configuration parsing
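
A credentials-handling test of this kind could be sketched as below; the `LLMManager` import path, constructor signature, and exception type are assumptions:

```python
# Illustrative sketch -- LLMManager's import path, constructor, and exception type are assumed.
import pytest

from lsc_eval.llm_managers import LLMManager  # assumed import path


@pytest.mark.unit
def test_missing_openai_key_is_rejected(monkeypatch):
    """An OpenAI-backed manager should fail fast when OPENAI_API_KEY is absent."""
    monkeypatch.delenv("OPENAI_API_KEY", raising=False)

    with pytest.raises(ValueError):                    # assumed exception type
        LLMManager(provider="openai", model="gpt-4o")  # assumed constructor signature
```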

### Metrics Testing
- Custom metric evaluation
- LLM response parsing
- Score normalization
- Error handling for failed evaluations
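
Response-parsing tests tend to be table-driven. A hedged sketch, assuming a `parse_score` helper on `CustomMetrics` (the helper name, import path, and accepted formats are assumptions):

```python
# Illustrative sketch -- CustomMetrics import path and parse_score helper are assumed.
import pytest

from lsc_eval.metrics.custom_metrics import CustomMetrics  # assumed import path


@pytest.mark.unit
@pytest.mark.parametrize(
    "raw_response, expected",
    [
        ("Score: 0.8", 0.8),  # plain "Score:" prefix
        ("score=1.0", 1.0),   # alternative formatting
    ],
)
def test_score_parsing(raw_response, expected):
    assert CustomMetrics.parse_score(raw_response) == pytest.approx(expected)  # assumed helper
```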

### Output Testing
- Statistics calculation
- Result aggregation by metric and conversation
- Score statistics computation
- Edge cases with empty or error results
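
A statistics test in this spirit might look like the sketch below; the `utils` import path, function name, and empty-input behaviour are assumptions:

```python
# Illustrative sketch -- utils import path and function names are assumed.
import pytest

from lsc_eval.output import utils  # assumed import path


@pytest.mark.output
def test_score_statistics_over_simple_results():
    scores = [0.2, 0.4, 0.6, 0.8]
    stats = utils.calculate_statistics(scores)  # assumed function name

    assert stats["mean"] == pytest.approx(0.5)
    assert stats["min"] == pytest.approx(0.2)
    assert stats["max"] == pytest.approx(0.8)


@pytest.mark.output
def test_statistics_handle_empty_results():
    assert utils.calculate_statistics([]) == {}  # assumed empty-input behaviour
```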

## Running Tests in CI/CD

For continuous integration, typical invocations are:

```bash
# Run fast tests only (exclude slow/LLM tests)
pytest -m "not slow and not llm" tests/

# Run with XML output for CI systems
pytest --junitxml=test-results.xml tests/

# Run with coverage for code quality metrics
pytest --cov=lsc_eval --cov-report=xml --cov-report=term tests/
```

## Adding New Tests

When adding new functionality (a skeleton test file follows the checklist):

1. Create test files following the naming convention `test_*.py`
2. Use appropriate pytest markers to categorize tests
3. Follow the existing fixture patterns for setup/teardown
4. Include both positive and negative test cases
5. Test edge cases and error conditions
6. Update this README if adding new test categories
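
A skeleton for a new test file that follows these conventions might look like this; the assertions and the error case are placeholders to replace with calls to your new code:

```python
# tests/<component>/test_new_feature.py -- skeleton following the suite's conventions.
import pytest


@pytest.mark.unit
class TestNewFeature:
    def test_happy_path(self, sample_system_config):
        """Positive case: the feature behaves as expected with valid input."""
        assert sample_system_config is not None  # replace with real assertions

    def test_invalid_input_raises(self):
        """Negative case: invalid input is rejected with a clear error."""
        with pytest.raises(ValueError):
            raise ValueError("replace with a call to your new code")
```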

## Test Data

Test fixtures provide realistic data based on `system.yaml`:

- **Metrics**: All metrics defined in `system.yaml`, with their configured thresholds
- **Providers**: All LLM providers, with their required environment variables
- **Output Formats**: All output formats and visualization options
- **Evaluation Scenarios**: Multi-turn conversations with various metric combinations

This keeps the tests aligned with the actual system configuration and usage patterns.