PR 3/3 — Unit tests for GEval #98
Walkthrough

The PR integrates GEval metrics into the evaluation system through a registry-based approach. It introduces configuration models, a new GEvalHandler for metric evaluation, registry loading logic, and routing updates to support both turn-level and conversation-level GEval metrics alongside existing DeepEval metrics.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Evaluator
    participant DeepEvalMetrics
    participant GEvalHandler
    participant Registry
    participant GEval
    Evaluator->>DeepEvalMetrics: evaluate(metric_name)
    alt metric in standard metrics
        DeepEvalMetrics->>DeepEvalMetrics: use standard handler
    else metric not found
        DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, ...)
        GEvalHandler->>GEvalHandler: _get_geval_config(metric_name)
        alt turn metadata exists
            GEvalHandler->>GEvalHandler: use turn metadata config
        else conversation metadata exists
            GEvalHandler->>GEvalHandler: use conversation metadata config
        else registry available
            GEvalHandler->>Registry: load metric config
        end
        GEvalHandler->>GEvalHandler: _convert_evaluation_params(params)
        alt is_conversation
            GEvalHandler->>GEvalHandler: _evaluate_conversation()
        else turn-level
            GEvalHandler->>GEvalHandler: _evaluate_turn()
        end
        GEvalHandler->>GEval: metric.measure(test_case)
        GEval-->>GEvalHandler: score, reason
        GEvalHandler-->>DeepEvalMetrics: score, reason
    end
    DeepEvalMetrics-->>Evaluator: result
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- config/registry/geval_metrics.yaml (1 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
- src/lightspeed_evaluation/core/models/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/models/system.py (2 hunks)
- src/lightspeed_evaluation/core/system/loader.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
- tests/unit/core/metrics/test_geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (8)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/geval.py
- src/lightspeed_evaluation/core/system/loader.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/geval.py
- src/lightspeed_evaluation/core/system/loader.py
config/system.yaml
📄 CodeRabbit inference engine (AGENTS.md)
config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Files:
config/system.yaml
config/**/*.{yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Use YAML for configuration files under config/
Files:
- config/system.yaml
- config/registry/geval_metrics.yaml
tests/unit/**
📄 CodeRabbit inference engine (AGENTS.md)
Place unit tests under tests/unit/ mirroring the source structure
Files:
tests/unit/core/metrics/test_geval.py
tests/**/test_*.py
📄 CodeRabbit inference engine (AGENTS.md)
Name test files as test_*.py
Files:
tests/unit/core/metrics/test_geval.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for mocking (pytest-mock’s mocker), not unittest.mock
Name test functions as test_*
Name test classes as Test*
Add comprehensive tests for new features with mocked LLM calls using pytest
Files:
tests/unit/core/metrics/test_geval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/system/loader.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/system/loader.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/core/models/__init__.py
- src/lightspeed_evaluation/core/system/loader.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Applied to files:
- config/system.yaml
- config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)
Applied to files:
- config/system.yaml
- config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Applied to files:
config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: When adding a new feature, also add a new sample evaluation data YAML file
Applied to files:
config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to tests/**/*.py : Add comprehensive tests for new features with mocked LLM calls using pytest
Applied to files:
tests/unit/core/metrics/test_geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/system/loader.py
🧬 Code graph analysis (6)

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - DeepEvalMetrics (30-184)

tests/unit/core/metrics/test_geval.py (2)
- src/lightspeed_evaluation/core/metrics/geval.py (3)
  - GEvalHandler (20-499)
  - _convert_evaluation_params (195-238)
  - _get_geval_config (435-499)
- tests/conftest.py (1)
  - mock_llm_manager (84-103)

src/lightspeed_evaluation/core/metrics/deepeval.py (3)
- src/lightspeed_evaluation/core/metrics/geval.py (2)
  - GEvalHandler (20-499)
  - evaluate (120-193)
- src/lightspeed_evaluation/core/llm/manager.py (4)
  - LLMManager (10-117)
  - get_config (105-107)
  - get_model_name (91-93)
  - get_llm_params (95-103)
- src/lightspeed_evaluation/core/llm/deepeval.py (1)
  - DeepEvalLLMManager (8-42)

src/lightspeed_evaluation/core/models/__init__.py (1)
- src/lightspeed_evaluation/core/models/system.py (1)
  - GEvalConfig (267-287)

src/lightspeed_evaluation/core/metrics/geval.py (2)
- src/lightspeed_evaluation/core/llm/deepeval.py (1)
  - DeepEvalLLMManager (8-42)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - evaluate (92-129)

src/lightspeed_evaluation/core/system/loader.py (1)
- src/lightspeed_evaluation/core/models/system.py (1)
  - GEvalConfig (267-287)
```python
# Handle missing or invalid registry
if path is None or not path.exists():
    logger.warning(
        f"GEval metric registry not found at expected locations. "
        f"Tried: {[str(p) for p in possible_paths]}. "
        f"Will fall back to runtime metadata only."
    )
    GEvalHandler._registry = {}
    return
```
Fix NameError when registry_path is provided.
If callers supply a registry path that doesn’t exist, this logger.warning references possible_paths, which is undefined in that code path (only set in the else branch above). That raises a NameError, preventing the handler from falling back to runtime metadata. Wrap the message to use either the provided path or the fallback list, e.g.:
```diff
-        if path is None or not path.exists():
-            logger.warning(
-                f"GEval metric registry not found at expected locations. "
-                f"Tried: {[str(p) for p in possible_paths]}. "
-                f"Will fall back to runtime metadata only."
-            )
+        if path is None or not path.exists():
+            tried = (
+                [str(path)] if registry_path else [str(p) for p in possible_paths]
+            )
+            logger.warning(
+                "GEval metric registry not found at expected locations. "
+                f"Tried: {tried}. Will fall back to runtime metadata only."
+            )
```

🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/metrics/geval.py around lines 95 to 103, the
logger.warning uses possible_paths which is only defined in the other branch and
causes a NameError when a caller passes a non-existent registry path; change the
warning to reference the provided path when present (e.g., str(path) or the
provided path variable) or fall back to the possible_paths list only when it
exists — construct the message conditionally so it never references an undefined
variable, then leave GEvalHandler._registry = {} and return as before.
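The fixed branch can be sketched as a self-contained helper (the function name `handle_missing_registry` and its exact signature are hypothetical; only the `tried` construction mirrors the suggested diff):

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)


def handle_missing_registry(
    registry_path: Optional[str],
    path: Optional[Path],
    possible_paths: Optional[list[Path]],
) -> dict:
    """Build the 'Tried: ...' message without referencing possible_paths
    when the caller supplied an explicit registry path.

    possible_paths is only populated in the auto-discovery branch, so the
    warning must not touch it when registry_path was provided.
    """
    tried = [str(path)] if registry_path else [str(p) for p in (possible_paths or [])]
    logger.warning(
        "GEval metric registry not found at expected locations. "
        "Tried: %s. Will fall back to runtime metadata only.",
        tried,
    )
    return {}  # empty registry -> runtime-metadata fallback
```

With this shape, a caller-supplied but nonexistent path logs `Tried: ['<that path>']` instead of raising NameError, and the fallback to runtime metadata proceeds as intended.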
```python
registry_path: str = Field(
    default="config/geval_metrics.yaml",
    description="Path to GEval metrics registry YAML file",
)
default_turn_metrics: list[str] = Field(
    default_factory=list,
    description="Default turn-level GEval metrics to auto-apply (e.g., ['geval:technical_accuracy'])",
)
default_conversation_metrics: list[str] = Field(
    default_factory=list,
    description="Default conversation-level GEval metrics to auto-apply (e.g., ['geval:conversation_coherence'])",
)
```
Fix the default GEval registry path
The default should point to the actual registry at config/registry/geval_metrics.yaml. As written it targets config/geval_metrics.yaml, so GEvalHandler loads an empty registry and caches that result; subsequent attempts with the correct path never reload because the class-level cache is already set. This breaks default configs that rely on the built-in registry. Please update the default path.
```diff
-    registry_path: str = Field(
-        default="config/geval_metrics.yaml",
+    registry_path: str = Field(
+        default="config/registry/geval_metrics.yaml",
```

🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/models/system.py around lines 276 to 287, the
default registry_path is incorrect; change the Field default from
"config/geval_metrics.yaml" to "config/registry/geval_metrics.yaml" so the
built-in GEval registry is loaded by default (this prevents the class-level
cache from storing an empty registry); update the default value only and keep
the rest of the Field definition unchanged.
```python
# Load GEval metrics from registry if enabled
if system_config.geval.enabled:
    _load_geval_metrics(system_config.geval.registry_path)
```
Registry path mismatch breaks GEval discovery.
GEvalConfig.registry_path still defaults to "config/geval_metrics.yaml" (see src/lightspeed_evaluation/core/models/system.py), but this PR ships the registry at config/registry/geval_metrics.yaml. With the current code, system_config.geval.enabled + default settings will never load the new metrics, so every geval:* metric will fail validation. Please either move the YAML to the default location or update both the default value and any config files to the new path before merging.
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/system/loader.py around lines 108-111 the code
calls _load_geval_metrics(system_config.geval.registry_path) but the
GEvalConfig.registry_path default still points to "config/geval_metrics.yaml"
while the shipped registry lives at "config/registry/geval_metrics.yaml"; update
the GEvalConfig default to "config/registry/geval_metrics.yaml" (edit
src/lightspeed_evaluation/core/models/system.py) and update any config files
that set the old path to the new one so system_config.geval.registry_path
matches the actual file location.
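For reference, a `geval` section in config/system.yaml matching the shipped registry location could look like the sketch below (field names follow the GEvalConfig model introduced in this PR; treat the exact schema as an assumption):

```yaml
# Hypothetical geval section for config/system.yaml
geval:
  enabled: true
  # Must point at the shipped registry, not config/geval_metrics.yaml
  registry_path: config/registry/geval_metrics.yaml
  default_turn_metrics:
    - "geval:technical_accuracy"
  default_conversation_metrics:
    - "geval:conversation_coherence"
```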
```python
from pathlib import Path
from unittest.mock import MagicMock, patch
```
Align mocks with pytest guidelines.
Please switch these mocks to pytest-mock (the mocker fixture) instead of importing MagicMock/patch from unittest.mock. The repo guideline for tests/**/*.py explicitly requires pytest-based mocking, so sticking with MagicMock here will block compliance. Refactor to accept mocker where needed (e.g., mocker.MagicMock(), mocker.patch(...)) and drop the direct unittest.mock imports.
🤖 Prompt for AI Agents
In tests/unit/core/metrics/test_geval.py lines 3-4, remove the direct imports of
MagicMock and patch from unittest.mock and refactor tests to use the pytest-mock
fixture instead: replace any MagicMock() calls with mocker.MagicMock() and any
patch(...) usages with mocker.patch(...), add the mocker parameter to the test
functions that use these mocks, and delete the unused unittest.mock imports from
the file.
Consolidating into one comprehensive PR: #97
Part 3 of 3 in the GEval Integration Series
This is the third and final pull request in the series introducing GEval (DeepEval) LLM-as-a-judge support into the LSC Evaluation Framework. This PR focuses on robust unit testing and validation coverage.
Summary
Adds a comprehensive unit test suite for all major GEval functionality
Why? 🤔
Key Changes
tests/unit/core/metrics/test_geval.py

Summary by CodeRabbit
New Features
Tests