GEval Integration #97
Conversation
Walkthrough
Adds GEval metric support: new GEvalHandler class, routes "geval:" metrics through DeepEvalMetrics, expands MetricManager to return full metric metadata, updates evaluator wiring and configuration with two GEval metrics in system.yaml, and adjusts exports formatting and tests.
Sequence Diagram(s)
sequenceDiagram
participant Evaluator
participant DeepEvalMetrics
participant GEvalHandler
participant MetricManager
participant DeepEvalLib
Evaluator->>DeepEvalMetrics: evaluate(metric_name, conv_data, turn_idx, turn_data, ...)
alt metric_name starts with "geval:" or routed as geval
DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, turn_data, is_conversation)
GEvalHandler->>MetricManager: get_metric_metadata(metric_name, level, conv_data, turn_data)
MetricManager-->>GEvalHandler: metadata (criteria, params, steps, threshold)
alt is_conversation
GEvalHandler->>GEvalHandler: _evaluate_conversation(...)
GEvalHandler->>DeepEvalLib: GEval.measure(test_case, params...)
else
GEvalHandler->>GEvalHandler: _evaluate_turn(...)
GEvalHandler->>DeepEvalLib: GEval.measure(test_case, params...)
end
DeepEvalLib-->>GEvalHandler: score, reason
GEvalHandler-->>DeepEvalMetrics: (score, reason)
else standard DeepEval metric
DeepEvalMetrics->>DeepEvalLib: evaluate via supported_metrics
DeepEvalLib-->>DeepEvalMetrics: score, reason
end
DeepEvalMetrics-->>Evaluator: (score, reason)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/system/loader.py (1)
67-80: Consider distinguishing turn-level vs conversation-level metrics in the registry.
The current implementation adds all GEval metrics to both TURN_LEVEL_METRICS and CONVERSATION_LEVEL_METRICS sets (lines 78-79), with validation deferred to evaluation time. While the comment explains this design choice, it may lead to confusing validation errors when users specify a conversation-level metric for turn-level evaluation (or vice versa). Consider enhancing the registry schema to include a level field (e.g., level: "turn" or level: "conversation"), allowing the loader to route metrics to the appropriate set. This would provide earlier validation and clearer error messages for mismatched metric usage.
Example registry enhancement:

technical_accuracy:
  level: "turn"  # or "conversation" or "both"
  criteria: |
    ...

And in the loader:

for metric_name, metric_config in registry.items():
    metric_identifier = f"geval:{metric_name}"
    level = metric_config.get("level", "both")
    if level in ("turn", "both"):
        TURN_LEVEL_METRICS.add(metric_identifier)
    if level in ("conversation", "both"):
        CONVERSATION_LEVEL_METRICS.add(metric_identifier)

src/lightspeed_evaluation/core/metrics/geval.py (1)
323-331: Consider returning None for missing scores instead of defaulting to 0.0.
Line 325 defaults the score to 0.0 when metric.score is None:

score = metric.score if metric.score is not None else 0.0

This could mask evaluation failures where GEval couldn't compute a score. Since the return type is tuple[float | None, str], consider returning None for the score when it's unavailable, allowing the caller to handle this case explicitly (e.g., marking as ERROR status).
Apply this diff:

 try:
     metric.measure(test_case)
-    score = metric.score if metric.score is not None else 0.0
+    score = metric.score  # Can be None if evaluation fails
     reason = (
         str(metric.reason)
         if hasattr(metric, "reason") and metric.reason

Note: this same pattern appears in _evaluate_conversation (line 418), so consider updating both methods consistently.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
config/registry/geval_metrics.yaml (1 hunks)
config/system.yaml (1 hunks)
src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
src/lightspeed_evaluation/core/models/__init__.py (2 hunks)
src/lightspeed_evaluation/core/models/system.py (2 hunks)
src/lightspeed_evaluation/core/system/loader.py (5 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
config/**/*.{yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Use YAML for configuration files under config/
Files:
config/registry/geval_metrics.yaml, config/system.yaml
config/system.yaml
📄 CodeRabbit inference engine (AGENTS.md)
config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Files:
config/system.yaml
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py, src/lightspeed_evaluation/core/models/__init__.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py, src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Applied to files:
config/registry/geval_metrics.yaml, config/system.yaml, src/lightspeed_evaluation/core/models/system.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)
Applied to files:
config/registry/geval_metrics.yaml, config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: When adding a new feature, also add a new sample evaluation data YAML file
Applied to files:
config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Applied to files:
config/system.yaml
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/core/models/__init__.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/system/loader.py (1)
src/lightspeed_evaluation/core/models/system.py (1)
GEvalConfig(267-287)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
DeepEvalMetrics(30-184)
src/lightspeed_evaluation/core/metrics/deepeval.py (3)
src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (20-499), evaluate (120-193)
src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (2)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (92-129)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/system.py (1)
GEvalConfig(267-287)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/metrics/__init__.py (1)
8-13: No issues found. The __all__ exports are complete and correct.
The multi-line formatting with trailing commas is good practice. GEvalHandler is intentionally not exported because it's an internal component of DeepEvalMetrics, not a top-level metrics class like the others. The architecture is clean: GEval metrics are accessed through DeepEvalMetrics.evaluate(), which routes metrics by name to either standard DeepEval implementations or to the internal GEvalHandler. All top-level metric classes are properly exported.
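For readers skimming the thread, a rough, self-contained sketch of that name-based routing is shown below; the function and parameter names are illustrative assumptions, not the project's actual API.

from typing import Callable

GEVAL_PREFIX = "geval:"  # prefix convention introduced in this PR

def route_metric(
    metric_name: str,
    standard_metrics: dict[str, Callable[[], tuple[float | None, str]]],
    geval_evaluate: Callable[[str], tuple[float | None, str]],
) -> tuple[float | None, str]:
    """Dispatch 'geval:*' metrics to the GEval handler, others to standard DeepEval."""
    if metric_name.startswith(GEVAL_PREFIX):
        # Strip the prefix so the handler receives the bare registry name.
        return geval_evaluate(metric_name[len(GEVAL_PREFIX):])
    return standard_metrics[metric_name]()

In the real code this dispatch lives inside DeepEvalMetrics.evaluate(), which is why GEvalHandler never needs to appear in the package's public exports.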
config/system.yaml (1)
46-52: Verify GEval metrics metadata design.
The GEval metrics defined in the registry (e.g., geval:technical_accuracy, geval:command_validity) do not have corresponding entries in the metrics_metadata section (lines 54-117). While GEval metrics carry their own metadata structure in the registry file (criteria, evaluation_params, evaluation_steps, threshold), the coding guidelines state: "Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics." Please confirm whether this deviation is intentional, as GEval uses a separate registry-based metadata approach, or whether these metrics should also have entries in the metrics_metadata section for consistency.
Based on coding guidelines.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
46-49: LGTM!
The integration of the GEval registry path into DeepEval metrics initialization is clean and correctly propagates configuration from the system config.
57-57: LGTM!
The routing of GEval metrics through the existing deepeval_metrics handler is an elegant design choice that unifies metric evaluation without requiring a separate handler entry point.
src/lightspeed_evaluation/core/models/__init__.py (1)
24-24: LGTM!
The GEvalConfig export follows the established pattern for exposing configuration models in the public API.
Also applies to: 43-43
config/registry/geval_metrics.yaml (1)
1-136: LGTM with observation.
The GEval metrics registry is well-structured and follows a consistent schema across all metric definitions. The inclusion of usage examples and the caveat about generated metrics (lines 5-6) is helpful for users.
One observation: the evaluation_params use lowercase identifiers like actual_output and expected_output, which need conversion to LLMTestCaseParams enum values. Verified that src/lightspeed_evaluation/core/metrics/geval.py (lines 195-238) handles this conversion with appropriate fallback behavior when params don't match enum values.
src/lightspeed_evaluation/core/models/system.py (1)
314-317: LGTM!The
gevalfield integration intoSystemConfigfollows the established pattern for configuration sections and properly usesFieldwith appropriate defaults and description.src/lightspeed_evaluation/core/metrics/deepeval.py (2)
38-68: LGTM!
The initialization properly sets up shared resources between standard DeepEval metrics and GEval, with appropriate cache configuration and clear separation of concerns. The use of a shared LLM manager promotes resource efficiency.
92-129: LGTM!The unified evaluation routing between standard DeepEval metrics and GEval is well-implemented. The routing logic is clear, and the comment clarifying that GEval metric names should not include the "geval:" prefix is helpful for maintainability.
src/lightspeed_evaluation/core/system/loader.py (2)
108-110: LGTM!
The conditional loading of GEval metrics based on the enabled flag is correctly placed in the metric mapping population flow, and properly propagates the registry path from configuration.
183-183: LGTM!
The GEvalConfig construction follows the established pattern for configuration sections, with appropriate defaulting to an empty dictionary when the section is absent.
src/lightspeed_evaluation/core/metrics/geval.py (5)
35-49: LGTM!
The class design with a shared registry cache is efficient and appropriate for this use case. The initialization properly delegates registry loading and stores the shared LLM manager.
120-193: LGTM!
The main evaluation method has a clear, well-documented flow: configuration retrieval, validation, and delegation to the appropriate evaluation level. The error messages are informative and help with debugging.
195-238: LGTM!
The parameter conversion logic with fallback to auto-detection is a smart design choice that provides flexibility for custom evaluation parameters while maintaining type safety for standard parameters.
400-413: LGTM!
The conversation aggregation strategy of concatenating all turns into a single test case is appropriate for GEval's evaluation model. The clear turn numbering in the aggregated input/output helps maintain context.
435-499: LGTM!
The configuration resolution with clear priority order (runtime metadata overrides registry) is well-implemented. The logging at different levels (debug for source selection, warning for missing config) provides good observability.
e7758e3 to 88ba187
lpiwowar
left a comment
I'm just passing by 🚶. Two things crossed my mind while taking a look at this PR.
Thanks for working on this:) 👍
config/system.yaml (outdated excerpt)

# GEval Configuration
# Configurable custom metrics using DeepEval's GEval framework
geval:
question(s): Is there a way to make the geval metrics configuration align with the already existing metrics so that we can extend the metrics_metadata section? Would something like this work?
metrics_metadata:
  geval_registry_path: "config/registry/geval_metrics.yaml"
  turn_metrics:
    "geval:technical_accuracy":
      ...

I'm just wondering whether we are going to introduce a separate config option for each evaluation framework we might add in the future.
I understand the rationale behind the implementation: it's driven by the evaluation technique and the non-static nature of the metrics, not the framework.
However, I agree with Lukas about integrating this into the current metric_metadata. This would simplify both the user experience and maintenance. Furthermore, users can add the entire geval metric directly to the existing metadata (avoiding a separate registry). With this we can fully utilize our existing flow for metric requirements validation, override logic, and data field conversion within the overall deepeval frameworks.
Will do; thank you for the suggestions. I will get this updated.
Refactored/Additions
- Added field name mapping logic in geval.py to be more aligned with the current input values (a rough sketch of the idea appears below this list)
- Updated all evaluation_params in system.yaml GEval metrics to use data field names: query, response, expected_response, contexts
- Updated docstrings to reflect the new naming convention
- Standardized on MetricManager for GEval
- Added get_metric_metadata to get the complete metadata dictionary (simplifies GEval integration); refactored get_effective_threshold to use the unified metadata getter

Removed
- Obsolete GEval registry configuration and any additional GEval dependencies: GEvalConfig, registry loading in loader.py
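As referenced in the first item above, here is a rough sketch of that kind of field-name mapping. It assumes DeepEval's LLMTestCaseParams enum; the dictionary contents and function name are illustrative, and mapping contexts to RETRIEVAL_CONTEXT in particular is an assumption rather than a confirmed detail.

from deepeval.test_case import LLMTestCaseParams

# Illustrative mapping from evaluation-data field names to DeepEval parameters.
FIELD_TO_PARAM = {
    "query": LLMTestCaseParams.INPUT,
    "response": LLMTestCaseParams.ACTUAL_OUTPUT,
    "expected_response": LLMTestCaseParams.EXPECTED_OUTPUT,
    "contexts": LLMTestCaseParams.RETRIEVAL_CONTEXT,  # assumed mapping
}

def convert_evaluation_params(names: list[str]) -> list[LLMTestCaseParams] | None:
    """Map data field names to LLMTestCaseParams; None lets GEval auto-detect."""
    converted = []
    for name in names:
        if name not in FIELD_TO_PARAM:
            # Unrecognized field: fall back to GEval's own auto-detection.
            return None
        converted.append(FIELD_TO_PARAM[name])
    return converted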
Thank you. This simplifies the logic a lot.
config/registry/geval_metrics.yaml (outdated excerpt)

  - actual_output
  - expected_output
evaluation_steps:
  - "Verify that the Ansible syntax is valid and follows YAML formatting rules"
suggestion (non-blocking): I would personally keep only product-agnostic metrics in this default file, i.e. metrics that can theoretically be used by anybody.
Any examples of product-specific metrics should live in the docs, showcasing how people can write their own metrics, IMO.
asamal4
left a comment
Thank you for adding this..
config/registry/geval_metrics.yaml (outdated excerpt)

evaluation_params:
  - input
  - actual_output
  - expected_output
These names are as per DeepEval. However, they will create confusion, as our eval data expects different names.
Can we align these with the input data field names and rename them internally?
✅ Fixed this here; GEval is now aligned with the rest of the framework.
88ba187 to c83e295
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
97-138: Clarify docstring regarding metric_name prefix handling.
The docstring states that metric_name "should NOT include 'geval:' prefix" (line 110), but the implementation defensively strips the prefix if present (lines 127-131). While this defensive approach is good for robustness, consider updating the docstring to reflect the actual behavior:

 Args:
-    metric_name: Name of metric (for GEval, this should NOT include "geval:" prefix)
+    metric_name: Name of metric (for GEval, "geval:" prefix will be stripped if present)
     conv_data: Conversation data object
     scope: EvaluationScope containing turn info and conversation flag
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
config/system.yaml (4 hunks)
src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
DeepEvalMetrics(31-193)
src/lightspeed_evaluation/core/metrics/manager.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(264-311)TurnData(35-261)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)
src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (19-433), evaluate (45-116)
src/lightspeed_evaluation/core/metrics/manager.py (1): MetricManager (17-166)
src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (3)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/manager.py (3): MetricLevel (10-14), MetricManager (17-166), get_metric_metadata (50-82)
src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (97-138)
🔇 Additional comments (12)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
46-49: LGTM! DeepEvalMetrics initialization updated correctly.
The addition of the metric_manager parameter aligns with the GEval integration, enabling access to metric metadata for both standard and custom metrics.
57-58: LGTM! GEval routing configured correctly.
The routing configuration correctly directs GEval metrics through the unified deepeval_metrics handler, and the comment clearly documents this design decision.
src/lightspeed_evaluation/core/metrics/deepeval.py (2)
1-29: LGTM! Module setup and imports are appropriate.
The updated docstring clearly describes the dual support for standard and GEval metrics, and the logger initialization follows best practices.
39-66: LGTM! Constructor properly extended for GEval integration.
The addition of the metric_manager parameter and the initialization of GEvalHandler with shared LLM resources follow the unified design. Type hints comply with coding guidelines.
src/lightspeed_evaluation/core/metrics/manager.py (2)
50-82: LGTM! Metadata retrieval method is well-designed.
The method properly implements the priority hierarchy (runtime metadata → system defaults) and includes comprehensive documentation. Type hints and docstring comply with coding guidelines.
84-110: LGTM! Threshold extraction properly refactored.
The method cleanly delegates to get_metric_metadata() and safely extracts the threshold value. This refactoring improves code organization and supports the richer metadata requirements for GEval integration.
src/lightspeed_evaluation/core/metrics/geval.py (6)
1-44: LGTM! Class structure and initialization are well-designed.
The module docstring clearly describes the GEval integration, and the constructor properly accepts the required dependencies. Type hints and Google-style docstrings comply with coding guidelines.
45-116: LGTM! Main evaluation method is well-structured.
The method properly validates configuration, extracts parameters with sensible defaults, and routes to the appropriate evaluation level. Documentation is comprehensive and error handling is appropriate.
118-176: Parameter conversion logic is robust.
The method properly maps evaluation data field names to DeepEval enum values with appropriate fallbacks. The decision to return None when encountering any unrecognized parameter (line 173) allows GEval to auto-detect fields, which is a reasonable defensive approach.
178-283: LGTM! Turn-level evaluation is comprehensive.
The method properly handles parameter conversion, builds the GEval metric with appropriate configuration, constructs test cases with optional fields, and includes robust error handling with detailed logging. The conditional logic for evaluation_params (lines 228-236) correctly handles all scenarios.
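To make the turn-level flow concrete, here is a minimal stand-alone example built directly on DeepEval's public GEval and LLMTestCase classes. The metric name, criteria, and threshold are placeholders, and the judge model is left to DeepEval's default rather than the framework's shared LLM manager:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def evaluate_turn_example(query: str, response: str, expected: str) -> tuple[float | None, str]:
    """Score a single turn with a custom GEval metric and return (score, reason)."""
    metric = GEval(
        name="technical_accuracy",  # placeholder metric name
        criteria="Judge whether the response answers the query accurately.",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.7,
    )
    test_case = LLMTestCase(input=query, actual_output=response, expected_output=expected)
    metric.measure(test_case)  # calls the judge LLM
    reason = str(metric.reason) if metric.reason else "No reason provided"
    return metric.score, reason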
285-381: LGTM! Conversation-level evaluation properly aggregates turns.
The method correctly aggregates conversation turns into a single test case and evaluates them holistically. The turn aggregation logic (lines 350-352) is clear, and error handling includes helpful diagnostic logging. The comment about the top_logprobs limit (line 326) is useful context.
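A simplified sketch of the turn-aggregation idea; the numbering format and dictionary keys are assumptions rather than the exact strings the handler produces:

def aggregate_turns(turns: list[dict[str, str]]) -> tuple[str, str]:
    """Concatenate per-turn queries and responses into one input/output pair."""
    inputs, outputs = [], []
    for i, turn in enumerate(turns, start=1):
        inputs.append(f"Turn {i}: {turn['query']}")
        outputs.append(f"Turn {i}: {turn['response']}")
    return "\n".join(inputs), "\n".join(outputs)

# The aggregated pair would then back a single LLMTestCase for conversation-level GEval.
conv_input, conv_output = aggregate_turns(
    [{"query": "How do I list pods?", "response": "Run `kubectl get pods`."}]
)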
383-433: LGTM! Configuration retrieval properly leverages MetricManager.
The method correctly constructs the metric identifier with the geval: prefix and delegates to MetricManager.get_metric_metadata() to leverage the priority hierarchy. Logging at appropriate levels aids debugging.
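To illustrate the resolution order being described, here is a deliberately simplified, hypothetical sketch; the real MetricManager API takes richer arguments and also distinguishes turn-level from conversation-level metadata:

from typing import Any

def resolve_geval_config(
    metric_name: str,
    runtime_metadata: dict[str, dict[str, Any]],
    system_defaults: dict[str, dict[str, Any]],
) -> dict[str, Any] | None:
    """Look up a GEval metric's metadata, preferring runtime overrides."""
    metric_id = f"geval:{metric_name}"  # e.g. "geval:technical_accuracy"
    if metric_id in runtime_metadata:
        return runtime_metadata[metric_id]
    return system_defaults.get(metric_id)

# Example: no runtime override, so the system default wins.
config = resolve_geval_config(
    "technical_accuracy",
    runtime_metadata={},
    system_defaults={"geval:technical_accuracy": {"criteria": "Be accurate.", "threshold": 0.7}},
)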
config/system.yaml (outdated excerpt)

-  provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
-  model: "gpt-4o-mini" # Model name for the provider
+  provider: vertex # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
+  model: gemini-2.0-flash # Model name for the provider
Please keep the existing defaults as they are.
This slipped through, sorry about that. I'll get that fixed.
…nd additions to system.yaml structure
c83e295 to 0e330ef
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/metrics/manager.py (1)
76-83: Avoid leaking shared metadata state.
We now hand out the same dict object that lives inside turn_metrics_metadata / system defaults. Any caller that tweaks the returned metadata (for example, to add derived fields at runtime) will mutate the underlying config for all subsequent evaluations. Please return a defensive copy before handing it to callers to keep per-request state isolated.
You could apply something like:

+import copy
 ...
-    if metric_identifier in level_metadata:
-        return level_metadata[metric_identifier]
+    if metric_identifier in level_metadata:
+        return copy.deepcopy(level_metadata[metric_identifier])
 ...
-    return system_metadata.get(metric_identifier)
+    metadata = system_metadata.get(metric_identifier)
+    return copy.deepcopy(metadata) if metadata else None

src/lightspeed_evaluation/core/metrics/geval.py (1)
217-243: Consider extracting duplicated evaluation_params logic.
The logic for conditionally setting evaluation_params (lines 228-236) is duplicated in _evaluate_conversation (lines 329-336). While the behavior is correct, extracting this into a small helper would reduce duplication and make the logic easier to maintain.
Consider creating a helper method:

def _prepare_metric_kwargs(
    self,
    criteria: str,
    converted_params: list[LLMTestCaseParams] | None,
    evaluation_params: list[str],
    evaluation_steps: list[str] | None,
    threshold: float,
    name: str,
) -> dict[str, Any]:
    """Prepare kwargs for GEval metric instantiation."""
    metric_kwargs: dict[str, Any] = {
        "name": name,
        "criteria": criteria,
        "model": self.deepeval_llm_manager.get_llm(),
        "threshold": threshold,
        "top_logprobs": 5,
    }
    if converted_params is None:
        if not evaluation_params:
            metric_kwargs["evaluation_params"] = [
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
            ]
    else:
        metric_kwargs["evaluation_params"] = converted_params
    if evaluation_steps:
        metric_kwargs["evaluation_steps"] = evaluation_steps
    return metric_kwargs

Then call it from both _evaluate_turn and _evaluate_conversation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
config/system.yaml (4 hunks)
src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
tests/unit/core/metrics/test_geval.py (1 hunks)
tests/unit/core/metrics/test_manager.py (1 hunks)
tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- config/system.yaml
- src/lightspeed_evaluation/core/metrics/__init__.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/core/metrics/test_manager.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py, tests/unit/core/metrics/test_geval.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/core/metrics/test_geval.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(264-311)TurnData(35-261)
src/lightspeed_evaluation/core/metrics/deepeval.py (5)
src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (19-433), evaluate (45-116)
src/lightspeed_evaluation/core/metrics/manager.py (1): MetricManager (17-166)
src/lightspeed_evaluation/core/models/data.py (2): EvaluationScope (356-367), TurnData (35-261)
src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3): MetricManager (17-166), get_metric_metadata (50-82), MetricLevel (10-14)
src/lightspeed_evaluation/core/models/data.py (2): TurnData (35-261), EvaluationData (264-311)
src/lightspeed_evaluation/core/models/system.py (1): SystemConfig (267-297)
src/lightspeed_evaluation/core/metrics/geval.py (3)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/manager.py (3): MetricLevel (10-14), MetricManager (17-166), get_metric_metadata (50-82)
src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (97-138)
tests/unit/core/metrics/test_geval.py (2)
src/lightspeed_evaluation/core/metrics/geval.py (3): GEvalHandler (19-433), _convert_evaluation_params (118-176), _get_geval_config (383-433)
src/lightspeed_evaluation/core/metrics/manager.py (2): MetricLevel (10-14), get_metric_metadata (50-82)
🔇 Additional comments (5)
src/lightspeed_evaluation/core/metrics/geval.py (5)
1-43: LGTM! Clean initialization and documentation.
The module setup, imports, and constructor are well-structured. The handler correctly accepts both DeepEvalLLMManager and MetricManager dependencies for runtime metric evaluation.
45-116: LGTM! Solid orchestration logic.
The evaluate() method properly retrieves configuration, validates criteria, and delegates to the appropriate evaluator based on the evaluation level.
118-176: LGTM! Robust parameter conversion with fallback.
The helper correctly maps evaluation data field names to LLMTestCaseParams enums and falls back to direct enum lookup for backward compatibility. Returning None for unrecognized params allows GEval's auto-detection to take over.
285-381: LGTM! Conversation aggregation logic is sound.
The method correctly aggregates all turns into a single test case for conversation-level evaluation. The formatting is straightforward and should work well with GEval's processing.
Note: the score-defaulting and duplicate evaluation_params logic issues mentioned in previous comments also apply here (line 363 and lines 329-336).
383-433: LGTM! Clean integration with MetricManager.
The method properly delegates configuration retrieval to MetricManager, respecting the priority hierarchy (turn-level → conversation-level → system defaults). The logging is helpful for debugging missing configurations.
score = metric.score if metric.score is not None else 0.0
reason = (
    str(metric.reason)
    if hasattr(metric, "reason") and metric.reason
    else "No reason provided"
)
return score, reason
Defaulting None score to 0.0 might mask evaluation failures.
If metric.score is None, the current implementation returns 0.0 as the score. This could make it difficult to distinguish between a metric that legitimately evaluated to zero versus one that failed to produce a score at all. Consider returning None as the score when metric.score is None, which better signals that evaluation didn't complete successfully.
Apply this diff:
-    score = metric.score if metric.score is not None else 0.0
+    score = metric.score

The same pattern appears in _evaluate_conversation at line 363 and should be updated consistently.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/metrics/geval.py around lines 264 to 270 (and
likewise update the similar pattern at _evaluate_conversation around line 363),
the code currently substitutes metric.score None with 0.0 which masks failed
evaluations; change the logic to return None when metric.score is None (keep
numeric scores unchanged) and preserve reason extraction as-is so callers can
detect a missing score vs a real zero; update both places to return (None,
reason) when metric.score is None instead of (0.0, reason).
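As a hypothetical illustration of the caller-side handling this suggestion enables (the status labels are assumptions, not the framework's actual constants):

def status_from_score(score: float | None, threshold: float) -> str:
    """Distinguish a failed evaluation (None) from a legitimate low score."""
    if score is None:
        return "ERROR"  # the metric never produced a score
    return "PASS" if score >= threshold else "FAIL"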
Additions/Modifications in 0e330ef
asamal4
left a comment
LGTM.
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
LGTM, thank you!
# LLM as a judge configuration
llm:
-  provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
+  provider: "openai"  # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
nit: only spaces, I'd keep the original indent
# But can be easily integrated with other APIs with minimal change.
api:
-  enabled: true # Enable API calls instead of using pre-filled data
+  enabled: true  # Enable API calls instead of using pre-filled data
nit: only spaces, I'd keep the original indent
tisnik
left a comment
LGTM
GEval Integration
This work originally began as a three-part PR series. After review and iteration, we’ve consolidated #96 and #98 into a single comprehensive pull request that delivers full GEval integration and accompanying unit tests within the existing evaluation framework.
Why? 🤔
Summary by CodeRabbit
New Features
Tests