
Conversation

Contributor

@arin-deloatch arin-deloatch commented Nov 9, 2025

Introductory Note — Part 1 of 3 in the GEval Integration Series

This is the first pull request in a three-part series introducing GEval (DeepEval) LLM-as-a-judge support into the LSC Evaluation Framework.

The overall goal of the series is to extend the framework with DeepEval-based GEval metrics, configurable system settings, and comprehensive unit tests.

PR Series Overview:

PR 1 — Core Integration (this PR): GEvalHandler; DeepEvalMetrics integration

PR 2 — Configuration Integration: config and system settings for GEval; custom GEval metric registry

PR 3 — Unit Tests for GEvalHandler

Each PR builds on the previous to enable incremental review and minimize code churn.

Summary

Adds first-class GEval support via the DeepEval framework to the LSC Evaluation Framework. Introduces a GEvalHandler, core metric wiring, and pipeline extension points so downstream runs can evaluate with LLM-as-a-judge (GEval) metrics alongside existing metrics.

Why? 🤔

  • Empower users of the LSC evaluation framework to bring their own metrics tailored to their domain (e.g., RHEL command correctness or technical accuracy).
  • Provide a registry-based system for defining and versioning metric suites – including criteria, evaluation steps, and thresholds – to encourage consistency and reproducibility.
  • Leverage DeepEval’s GEval engine as a flexible, model-agnostic evaluation layer.

Out of Scope 🚫

  • Registry definition (PR 2)
  • Configuration integration (PR 2)
  • Unit tests (PR 3)

Key Changes

  • src/lightspeed_evaluation/core/metrics/deepeval.py
    • Initialize GEval handler with shared LLM manager in DeepEvalMetrics class
  • src/lightspeed_evaluation/core/metrics/geval.py
    • GEvalHandler class
  • See docstrings for explanation of functionality

Summary by CodeRabbit

  • New Features
    • GEval metrics integrated alongside existing DeepEval metrics for expanded evaluations
    • Optional registry path to load custom metric definitions
    • Automatic routing between standard and GEval evaluation engines
    • Support for both turn-level and conversation-level evaluations
    • Shared model resources for unified evaluation flows
    • Improved runtime feedback, logging, and caching to speed evaluations and reduce repeated work

Contributor

coderabbitai bot commented Nov 9, 2025

Walkthrough

Adds GEval support by introducing GEvalHandler, extends DeepEvalMetrics to accept an optional registry_path and route evaluate() calls between built-in DeepEval metrics and GEval metrics, and reformats the module exports list without changing exported symbols.

Changes

  • Exports formatting (src/lightspeed_evaluation/core/metrics/__init__.py): Reflowed the __all__ declaration from a single line to a multi-line, trailing-comma style; no change to exported names.
  • DeepEval metrics extension (src/lightspeed_evaluation/core/metrics/deepeval.py): DeepEvalMetrics.__init__ extended to accept an optional registry_path, and evaluate() now routes between built-in DeepEval metrics and GEval metrics.
  • GEval integration (src/lightspeed_evaluation/core/metrics/geval.py): New GEvalHandler class that loads/caches a YAML registry (with discovery and fallback; a sketch follows below), resolves configs from runtime metadata or the registry, converts string params to LLMTestCaseParams, supports turn- and conversation-level evaluation, exposes evaluate() and _get_geval_config(), and includes logging and error handling.
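As context for the registry discovery and fallback behaviour summarised above, here is a minimal, self-contained sketch (assuming PyYAML). The candidate paths mirror those quoted in the review comments later in this thread; the function name and exact semantics are assumptions, not the PR's actual code.

from pathlib import Path
from typing import Any, Optional

import yaml


def load_geval_registry(registry_path: Optional[str] = None) -> dict[str, Any]:
    """Locate and load a GEval metric registry, degrading to an empty dict."""
    if registry_path is not None:
        candidates = [Path(registry_path)]
    else:
        candidates = [Path.cwd() / "config" / "registry" / "geval_metrics.yaml"]
        parents = Path(__file__).resolve().parents
        if len(parents) > 3:
            # Package-root candidate, mirroring the path quoted in the review below.
            candidates.append(parents[3] / "config" / "registry" / "geval_metrics.yaml")
    for candidate in candidates:
        if candidate.exists():
            with candidate.open(encoding="utf-8") as handle:
                return yaml.safe_load(handle) or {}
    # Nothing found: fall back to an empty registry rather than raising.
    return {}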

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant DeepEvalMetrics
    participant GEvalHandler
    participant GEval
    Caller->>DeepEvalMetrics: evaluate(metric_name, conv_data, turn_idx, turn_data)
    alt standard DeepEval metric
        DeepEvalMetrics->>DeepEvalMetrics: run built-in metric handler
        DeepEvalMetrics-->>Caller: score, reason
    else GEval metric
        DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, turn_data, is_conversation)
        GEvalHandler->>GEvalHandler: _get_geval_config(runtime metadata → registry)
        alt config found
            GEvalHandler->>GEvalHandler: _convert_evaluation_params
            alt conversation-level
                GEvalHandler->>GEval: evaluate(conversation test case)
            else turn-level
                GEvalHandler->>GEval: evaluate(turn test case)
            end
            GEval-->>GEvalHandler: score, reason
            GEvalHandler-->>DeepEvalMetrics: score, reason
        else config missing
            GEvalHandler-->>DeepEvalMetrics: None, error message
        end
        DeepEvalMetrics-->>Caller: score, reason
    end
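The routing shown in the diagram can be summarised as a small dispatcher. The sketch below is illustrative only: the "geval:" prefix and the (score, reason) return shape follow the walkthrough and review comments, while the handler-lookup details and names are assumptions rather than the PR's implementation.

from typing import Any, Callable, Optional

ScoreAndReason = tuple[Optional[float], str]


def route_evaluate(
    metric_name: str,
    builtin_handlers: dict[str, Callable[..., ScoreAndReason]],
    geval_evaluate: Callable[..., ScoreAndReason],
    **kwargs: Any,
) -> ScoreAndReason:
    """Dispatch a metric either to a built-in DeepEval handler or to GEval."""
    if metric_name.startswith("geval:"):
        # Normalize by stripping the prefix before delegating to the GEval path.
        return geval_evaluate(metric_name.removeprefix("geval:"), **kwargs)
    handler = builtin_handlers.get(metric_name)
    if handler is None:
        return None, f"Unsupported metric: {metric_name}"
    return handler(**kwargs)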

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing attention:
    • GEvalHandler._load_registry() — YAML discovery, caching, and fallback behavior
    • _get_geval_config() — precedence and edge cases for runtime vs registry configs
    • _convert_evaluation_params() — mapping to LLMTestCaseParams and invalid-param handling (see the sketch after this list)
    • DeepEvalMetrics.evaluate() — routing correctness and shared LLM manager / litellm cache setup
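As a reference point for the _convert_evaluation_params() item above, here is a minimal sketch of the string-to-enum mapping described in the review comments: standard names become LLMTestCaseParams members, and a custom string makes the function return None so GEval can auto-detect the fields it needs. The body is an assumption, not the PR's code.

from typing import Optional

from deepeval.test_case import LLMTestCaseParams


def convert_evaluation_params(params: list[str]) -> Optional[list[LLMTestCaseParams]]:
    """Map parameter strings to LLMTestCaseParams; None signals custom params."""
    converted: list[LLMTestCaseParams] = []
    for name in params:
        try:
            # Standard names such as "input" or "actual_output" map to enum members.
            converted.append(LLMTestCaseParams(name.lower()))
        except ValueError:
            # Non-standard (custom) parameter: return None so the caller leaves
            # evaluation_params unset and GEval auto-detects what it needs.
            return None
    return converted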

Poem

🐇 I hopped in code with eager paws,
GEval rules and registry laws,
Turns and threads now pass the test,
Metrics routed east and west,
A little rabbit cheers the cause! 🥕

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped because CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed. The title accurately describes the primary change: adding GEval (DeepEval) core integration through a new handler and wiring changes.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which clears the required threshold of 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9fd462d and 96e39ba.

📒 Files selected for processing (2)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/deepeval.py (3)
src/lightspeed_evaluation/core/metrics/geval.py (2)
  • GEvalHandler (20-528)
  • evaluate (133-204)
src/lightspeed_evaluation/core/llm/manager.py (4)
  • LLMManager (10-117)
  • get_config (105-107)
  • get_model_name (91-93)
  • get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (2)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (92-133)
🔇 Additional comments (12)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)

1-27: LGTM!

The module docstring clearly describes the dual integration (standard DeepEval + GEval), and the logging setup follows best practices.


30-44: LGTM!

The class docstring and __init__ signature are well-documented. Type hints use modern Python 3.10+ union syntax appropriately.


45-68: LGTM!

The shared LLM manager approach efficiently supports both standard DeepEval and GEval metrics. The GEvalHandler initialization correctly passes through the registry_path parameter.


92-133: LGTM!

The routing logic cleanly separates standard DeepEval metrics from GEval metrics. The normalization at lines 122-126 correctly handles the geval: prefix, addressing the previous review comment.

src/lightspeed_evaluation/core/metrics/geval.py (8)

1-18: LGTM!

Module docstring clearly describes the GEval integration purpose, and all necessary imports are present. Logging setup follows best practices.


20-50: LGTM!

The class design with class-level registry caching is efficient. Type hints are present, and the docstring follows Google style as required by coding guidelines.


51-132: LGTM!

The registry loading logic is robust with comprehensive error handling. It correctly searches multiple common paths and gracefully falls back to an empty registry with appropriate warnings. The broad exception catching is intentional and properly documented.


133-204: LGTM!

The evaluate method has complete type hints and an excellent docstring. The unused _turn_idx parameter is appropriately prefixed with an underscore for interface compatibility. Configuration validation and delegation logic are clean.


206-250: LGTM!

The parameter conversion logic correctly handles both standard enum values and custom strings. Returning None for custom parameters allows GEval to auto-detect required fields, which is the intended behavior.


252-356: LGTM!

The turn evaluation logic correctly implements the parameter handling that was addressed in the previous review. The conditional logic at lines 299-309 properly preserves custom parameters while providing sensible defaults when needed. Error handling is comprehensive with structured logging.


358-452: LGTM!

The conversation evaluation logic mirrors the turn-level implementation and correctly handles parameter conversion. The aggregation of conversation turns into a single test case is straightforward and well-documented.
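To make the aggregation concrete, below is a minimal sketch of one way to fold conversation turns into a single GEval test case as a multi-turn narrative; the turn field names ("query", "response") and the formatting are assumptions, not the PR's implementation.

from deepeval.test_case import LLMTestCase


def build_conversation_test_case(turns: list[dict[str, str]]) -> LLMTestCase:
    """Aggregate a conversation into one test case as a multi-turn transcript."""
    transcript_input = "\n".join(f"User: {turn['query']}" for turn in turns)
    transcript_output = "\n".join(f"Assistant: {turn['response']}" for turn in turns)
    return LLMTestCase(input=transcript_input, actual_output=transcript_output)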


454-528: LGTM!

The configuration resolution correctly implements the priority order (turn metadata → conversation metadata → registry). The metric_key construction at line 483 appropriately adds the geval: prefix for metadata lookups, while the registry lookup uses the bare metric name, which is the correct behavior.
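For reference, here is a minimal sketch of that priority order, assuming plain dict metadata and the prefixed-vs-bare key convention described above; the parameter names are hypothetical.

from typing import Any, Optional


def resolve_geval_config(
    metric_name: str,
    turn_metadata: dict[str, Any],
    conversation_metadata: dict[str, Any],
    registry: dict[str, Any],
) -> Optional[dict[str, Any]]:
    """Resolve a GEval config: turn metadata, then conversation metadata, then registry."""
    metadata_key = f"geval:{metric_name}"  # metadata lookups use the prefixed key
    for source in (turn_metadata, conversation_metadata):
        config = source.get(metadata_key)
        if config:
            return config
    # Registry entries are keyed by the bare metric name.
    return registry.get(metric_name)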



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64f8892 and a4370b3.

📒 Files selected for processing (3)
  • src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/deepeval.py (3)
src/lightspeed_evaluation/core/metrics/geval.py (2)
  • GEvalHandler (20-501)
  • evaluate (122-195)
src/lightspeed_evaluation/core/llm/manager.py (4)
  • LLMManager (10-117)
  • get_config (105-107)
  • get_model_name (91-93)
  • get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (2)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (92-129)

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/metrics/geval.py (2)

387-393: Critical: Same custom evaluation_params issue in conversation-level evaluation.

This method has the same bug as _evaluate_turn: when converted_params is None (custom parameters), the condition on line 388 forces the enum defaults, preventing custom GEval metrics from using their intended parameters.

Apply the same fix pattern:

 converted_params = self._convert_evaluation_params(evaluation_params)
-        if not converted_params:
-            # If no valid params, use sensible defaults for conversation evaluation
-            converted_params = [
-                LLMTestCaseParams.INPUT,
-                LLMTestCaseParams.ACTUAL_OUTPUT,
-            ]
 
         # Configure the GEval metric for conversation-level evaluation
         metric_kwargs: dict[str, Any] = {
             "name": "GEval Conversation Metric",
             "criteria": criteria,
-            "evaluation_params": converted_params,
             "model": self.deepeval_llm_manager.get_llm(),
             "threshold": threshold,
             "top_logprobs": 5,  # Vertex/Gemini throws an error if over 20.
         }
+
+        if converted_params is None:
+            if not evaluation_params:
+                metric_kwargs["evaluation_params"] = [
+                    LLMTestCaseParams.INPUT,
+                    LLMTestCaseParams.ACTUAL_OUTPUT,
+                ]
+        else:
+            metric_kwargs["evaluation_params"] = converted_params

287-293: Critical: Custom GEval evaluation_params are still being replaced with defaults.

Despite the past review comment being marked as "Addressed in commit 4e37984", the code still has the same issue. When _convert_evaluation_params() returns None (indicating custom string parameters that GEval should auto-detect), line 288's condition if not converted_params: evaluates to True and forces the enum defaults, wiping out the caller's intent.

This means registry-defined or runtime-provided custom parameters can never be used—they always get replaced with [INPUT, ACTUAL_OUTPUT].

Apply the fix from the previous review (or similar logic):

 converted_params = self._convert_evaluation_params(evaluation_params)
-        if not converted_params:
-            # If no valid params, use sensible defaults for turn evaluation
-            converted_params = [
-                LLMTestCaseParams.INPUT,
-                LLMTestCaseParams.ACTUAL_OUTPUT,
-            ]
 
         # Create GEval metric with runtime configuration
         metric_kwargs: dict[str, Any] = {
             "name": "GEval Turn Metric",
             "criteria": criteria,
-            "evaluation_params": converted_params,
             "model": self.deepeval_llm_manager.get_llm(),
             "threshold": threshold,
             "top_logprobs": 5,
         }
+
+        # Only set evaluation_params if we have valid enum conversions
+        # or if no params were provided at all (then use defaults)
+        if converted_params is None:
+            if not evaluation_params:
+                metric_kwargs["evaluation_params"] = [
+                    LLMTestCaseParams.INPUT,
+                    LLMTestCaseParams.ACTUAL_OUTPUT,
+                ]
+            # else: leave unset so GEval can auto-detect from custom strings
+        else:
+            metric_kwargs["evaluation_params"] = converted_params
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/metrics/geval.py (2)

76-101: Consider simplifying the path resolution logic.

The current logic initializes path and possible_paths defensively (lines 76-78) and then has overlapping conditionals that make the flow harder to follow. When an explicit registry_path is provided but doesn't exist, the code still iterates through possible_paths containing only that single invalid path.

Consider restructuring to separate explicit path handling from auto-discovery:

-        # Ensure variables are always bound for static analysis -
-        path: Optional[Path] = None
-        possible_paths: list[Path] = []
-
-        # Normalize user-specified path vs. auto-discovery
         if registry_path is not None:
             try:
                 path = Path(registry_path)
+                if path.exists():
+                    # Use explicit path directly
+                    self._load_from_path(path)
+                    return
+                else:
+                    logger.warning(
+                        "Explicit registry path does not exist: %s. Falling back to auto-discovery.",
+                        path
+                    )
             except TypeError:
-                # Bad type passed in; treat as no path provided
-                path = None
-            if path is not None:
-                possible_paths = [path]
-        else:
-            package_root = Path(__file__).resolve().parents[3]
-            possible_paths = [
-                Path.cwd() / "config" / "registry" / "geval_metrics.yaml",
-                package_root / "config" / "registry" / "geval_metrics.yaml",
-            ]
-
-        # If no explicit file exists yet, search candidates
-        if path is None or not path.exists():
-            for candidate in possible_paths:
-                if candidate.exists():
-                    path = candidate
-                    break
+                logger.warning("Invalid registry_path type: %s", type(registry_path))
+
+        # Auto-discovery
+        package_root = Path(__file__).resolve().parents[3]
+        possible_paths = [
+            Path.cwd() / "config" / "registry" / "geval_metrics.yaml",
+            package_root / "config" / "registry" / "geval_metrics.yaml",
+        ]
+
+        path = None
+        for candidate in possible_paths:
+            if candidate.exists():
+                path = candidate
+                break

Then extract the loading logic into a helper method to avoid duplication.


450-524: LGTM for configuration retrieval logic.

The priority order (turn metadata → conversation metadata → registry) is well-designed and properly documented. The method correctly handles all three configuration sources with appropriate logging.

The multiple pylint: disable comments (lines 505-508, 510, 514) suggest the type checker struggles with the class-level _registry: dict[str, Any] | None. Consider extracting registry access into a helper property to centralize the None-check:

@property
def _loaded_registry(self) -> dict[str, Any]:
    """Get loaded registry or empty dict."""
    return GEvalHandler._registry or {}

Then use self._loaded_registry throughout to eliminate the disables.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e37984 and 9fd462d.

📒 Files selected for processing (1)
  • src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/metrics/geval.py (2)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (92-129)
🔇 Additional comments (6)
src/lightspeed_evaluation/core/metrics/geval.py (6)

1-18: LGTM!

Module setup follows best practices with clear documentation, appropriate imports, and proper logger initialization.


20-50: LGTM!

Class structure is well-designed with appropriate use of class-level caching for the registry and clear documentation. Type hints and docstrings follow project guidelines.


133-204: LGTM!

The main evaluation entry point is well-structured with clear delegation logic, comprehensive documentation, and appropriate error handling. The unused _turn_idx parameter is appropriately marked and documented as interface compatibility.


206-250: LGTM!

The parameter conversion logic correctly handles both standard enum-based parameters and custom string parameters. Returning None for custom params allows GEval to auto-detect required fields, which is the intended behavior.


312-356: LGTM for test case construction and evaluation logic.

The test case building, context normalization, and error handling are well-implemented. The code properly handles optional fields and provides detailed logging for debugging. The broad exception catch is acceptable here given the external library call and comprehensive error logging.


395-448: LGTM for aggregation and evaluation logic.

The conversation turn aggregation strategy is sound, creating a multi-turn narrative for GEval to evaluate. Error handling follows the same robust pattern as turn-level evaluation.

Contributor Author

Consolidating into one comprehensive PR: #97

@arin-deloatch arin-deloatch mentioned this pull request Nov 13, 2025
@arin-deloatch arin-deloatch deleted the feature/geval-support branch November 13, 2025 21:03
