
Conversation

@arin-deloatch (Contributor) commented Nov 9, 2025

GEval Integration

This work originally began as a three-part PR series. After review and iteration, we’ve consolidated #96 and #98 into a single comprehensive pull request that delivers full GEval integration and accompanying unit tests within the existing evaluation framework.

Why? 🤔

  • GEval provides a standardized mechanism for using LLMs as evaluators, with custom grading criteria and rich evaluation parameters (a minimal sketch follows this list).
  • Teams can define nuanced scoring logic tailored to their domain, product, or compliance requirements.
  • Targeted unit tests included in this consolidation help ensure framework stability and correctness.
  • GEval provides the foundation for future enhancements such as panels of LLM judges, weighted scoring, or more advanced judge orchestration.
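
As a rough illustration of what this enables, here is a minimal sketch of a custom LLM-judged metric built with DeepEval's GEval. The metric name, criteria, threshold, and test data below are illustrative only, not the metrics shipped in this PR:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Hypothetical custom metric; uses DeepEval's default judge model unless model= is passed.
technical_accuracy = GEval(
    name="Technical Accuracy",
    criteria="Check whether the response is technically correct and consistent with the expected answer.",
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="How do I restart a systemd service?",
    actual_output="Run `systemctl restart <service>`.",
    expected_output="Use `systemctl restart <service-name>`.",
)

technical_accuracy.measure(test_case)  # calls the judge LLM
print(technical_accuracy.score, technical_accuracy.reason)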

Summary by CodeRabbit

  • New Features

    • Added GEval evaluation metrics: turn-level technical accuracy and conversation-level coherence for improved automated quality checks.
    • GEval metrics integrated alongside existing DeepEval metrics for expanded evaluations
    • Automatic routing between standard and GEval evaluation engines
    • Shared model resources for unified evaluation flows
    • Improved runtime feedback, logging, and caching to speed evaluations and reduce repeated work
  • Tests

    • Added comprehensive unit tests covering GEval handler, metric metadata resolution, and evaluator routing for these metrics.

coderabbitai bot commented Nov 9, 2025

Walkthrough

Adds GEval metric support: new GEvalHandler class, routes "geval:" metrics through DeepEvalMetrics, expands MetricManager to return full metric metadata, updates evaluator wiring and configuration with two GEval metrics in system.yaml, and adjusts exports formatting and tests.
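
A rough sketch of the "geval:" routing described above, with simplified handler objects (the actual DeepEvalMetrics and GEvalHandler signatures in this PR may differ):

def route_metric(metric_name: str, geval_handler, standard_handlers: dict) -> tuple:
    """Dispatch a metric either to the GEval handler or to a standard DeepEval handler.

    Illustrative only: the handler shapes here are assumptions, not the PR's API.
    """
    if metric_name.startswith("geval:"):
        # GEval metrics are configured with a "geval:" prefix; the handler gets the bare name.
        return geval_handler.evaluate(metric_name.removeprefix("geval:"))
    return standard_handlers[metric_name]()


class _StubGEvalHandler:
    def evaluate(self, name: str) -> tuple:
        return 0.9, f"stub evaluation of {name}"


print(route_metric("geval:technical_accuracy", _StubGEvalHandler(), {}))
# -> (0.9, 'stub evaluation of technical_accuracy')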

Changes

  • Configuration (config/system.yaml): Added two GEval metric entries: turn-level geval:technical_accuracy and conversation-level geval:conversation_coherence (criteria, evaluation_params, evaluation_steps, threshold, description). Minor re-spacing of LLM/API provider lines.
  • GEval integration (src/lightspeed_evaluation/core/metrics/geval.py): New GEvalHandler class with a public evaluate(...), config lookup, param conversion, turn-level and conversation-level evaluators, error handling, and logging.
  • DeepEval integration (src/lightspeed_evaluation/core/metrics/deepeval.py): DeepEvalMetrics updated to accept metric_manager, initialize GEvalHandler, share the LLM manager/cache, and route evaluation to GEval or standard DeepEval metrics.
  • Metric resolution (src/lightspeed_evaluation/core/metrics/manager.py): Replaced the single-threshold lookup with get_metric_metadata(...), which returns a full metadata dict; added get_effective_threshold(...) to extract the threshold from that metadata (see the sketch after this list).
  • Evaluator wiring (src/lightspeed_evaluation/pipeline/evaluation/evaluator.py): Instantiates DeepEvalMetrics with metric_manager; registers the "geval" routing key to the deepeval handler.
  • Exports formatting (src/lightspeed_evaluation/core/metrics/__init__.py): Reformatted __all__ into a multi-line trailing-comma list; no behavioral change.
  • Unit tests (tests/unit/core/metrics/test_geval.py, tests/unit/core/metrics/test_manager.py, tests/unit/pipeline/evaluation/test_evaluator.py): Added comprehensive tests for GEvalHandler behavior and MetricManager metadata resolution, and updated the evaluator handler-count expectation for "geval" routing.
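
A hedged sketch of the metadata resolution described in the Metric resolution entry above. The function names match the summary, but the arguments and lookup order are assumptions rather than the PR's actual MetricManager API:

def get_metric_metadata(metric_identifier, runtime_metadata, system_defaults):
    """Resolve the full metadata dict for a metric, preferring runtime overrides.

    Both arguments are assumed to be dicts keyed by metric identifier, e.g.
    "geval:technical_accuracy" -> {"criteria": "...", "threshold": 0.7, ...}.
    """
    if metric_identifier in runtime_metadata:
        return dict(runtime_metadata[metric_identifier])  # copy to avoid sharing mutable state
    metadata = system_defaults.get(metric_identifier)
    return dict(metadata) if metadata else None


def get_effective_threshold(metric_identifier, runtime_metadata, system_defaults):
    """Extract just the threshold from the resolved metadata, if present."""
    metadata = get_metric_metadata(metric_identifier, runtime_metadata, system_defaults)
    return metadata.get("threshold") if metadata is not None else None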

Sequence Diagram(s)

sequenceDiagram
    participant Evaluator
    participant DeepEvalMetrics
    participant GEvalHandler
    participant MetricManager
    participant DeepEvalLib

    Evaluator->>DeepEvalMetrics: evaluate(metric_name, conv_data, turn_idx, turn_data, ...)
    alt metric_name starts with "geval:" or routed as geval
        DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, turn_data, is_conversation)
        GEvalHandler->>MetricManager: get_metric_metadata(metric_name, level, conv_data, turn_data)
        MetricManager-->>GEvalHandler: metadata (criteria, params, steps, threshold)
        alt is_conversation
            GEvalHandler->>GEvalHandler: _evaluate_conversation(...)
            GEvalHandler->>DeepEvalLib: GEval.measure(test_case, params...)
        else
            GEvalHandler->>GEvalHandler: _evaluate_turn(...)
            GEvalHandler->>DeepEvalLib: GEval.measure(test_case, params...)
        end
        DeepEvalLib-->>GEvalHandler: score, reason
        GEvalHandler-->>DeepEvalMetrics: (score, reason)
    else standard DeepEval metric
        DeepEvalMetrics->>DeepEvalLib: evaluate via supported_metrics
        DeepEvalLib-->>DeepEvalMetrics: score, reason
    end
    DeepEvalMetrics-->>Evaluator: (score, reason)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • New GEvalHandler (geval.py): review param mapping, metadata lookup, test-case construction, error paths.
  • DeepEvalMetrics changes (deepeval.py): confirm resource sharing, routing, constructor change impact.
  • MetricManager refactor (manager.py): validate metadata resolution and fallback logic.
  • Tests: ensure mocks align with new behavior and edge cases covered.
  • Config: verify GEval metric schema in system.yaml matches runtime expectations.

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • asamal4
  • tisnik

Poem

🐰 I hopped into metrics bright and new,
GEval seeds planted, thresholds in view,
DeepEval and handlers now dance as one,
Turns and conversations kissed by the sun,
A rabbit's small hop — tests passing through.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title 'GEval Integration' accurately summarizes the main objective of the pull request, which is integrating GEval capabilities into the LSC Evaluation Framework through configuration, registry, and system loader changes.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/system/loader.py (1)

67-80: Consider distinguishing turn-level vs conversation-level metrics in the registry.

The current implementation adds all GEval metrics to both TURN_LEVEL_METRICS and CONVERSATION_LEVEL_METRICS sets (lines 78-79), with validation deferred to evaluation time. While the comment explains this design choice, it may lead to confusing validation errors when users specify a conversation-level metric for turn-level evaluation (or vice versa).

Consider enhancing the registry schema to include a level field (e.g., level: "turn" or level: "conversation"), allowing the loader to route metrics to the appropriate set. This would provide earlier validation and clearer error messages for mismatched metric usage.

Example registry enhancement:

technical_accuracy:
  level: "turn"  # or "conversation" or "both"
  criteria: |
    ...

And in the loader:

for metric_name, metric_config in registry.items():
    metric_identifier = f"geval:{metric_name}"
    level = metric_config.get("level", "both")
    
    if level in ("turn", "both"):
        TURN_LEVEL_METRICS.add(metric_identifier)
    if level in ("conversation", "both"):
        CONVERSATION_LEVEL_METRICS.add(metric_identifier)
src/lightspeed_evaluation/core/metrics/geval.py (1)

323-331: Consider returning None for missing scores instead of defaulting to 0.0.

Line 325 defaults the score to 0.0 when metric.score is None:

score = metric.score if metric.score is not None else 0.0

This could mask evaluation failures where GEval couldn't compute a score. Since the return type is tuple[float | None, str], consider returning None for the score when it's unavailable, allowing the caller to handle this case explicitly (e.g., marking as ERROR status).

Apply this diff:

         try:
             metric.measure(test_case)
-            score = metric.score if metric.score is not None else 0.0
+            score = metric.score  # Can be None if evaluation fails
             reason = (
                 str(metric.reason)
                 if hasattr(metric, "reason") and metric.reason

Note: This same pattern appears in _evaluate_conversation (line 418), so consider updating both methods consistently.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64f8892 and e7758e3.

📒 Files selected for processing (9)
  • config/registry/geval_metrics.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/models/system.py (2 hunks)
  • src/lightspeed_evaluation/core/system/loader.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
config/**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Use YAML for configuration files under config/

Files:

  • config/registry/geval_metrics.yaml
  • config/system.yaml
config/system.yaml

📄 CodeRabbit inference engine (AGENTS.md)

config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics

Files:

  • config/system.yaml
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
  • src/lightspeed_evaluation/core/models/__init__.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
  • src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics

Applied to files:

  • config/registry/geval_metrics.yaml
  • config/system.yaml
  • src/lightspeed_evaluation/core/models/system.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)

Applied to files:

  • config/registry/geval_metrics.yaml
  • config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: When adding a new feature, also add a new sample evaluation data YAML file

Applied to files:

  • config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • config/system.yaml
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.

Applied to files:

  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/core/models/__init__.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/system.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/core/metrics/__init__.py
  • src/lightspeed_evaluation/core/system/loader.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/system/loader.py (1)
src/lightspeed_evaluation/core/models/system.py (1)
  • GEvalConfig (267-287)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • DeepEvalMetrics (30-184)
src/lightspeed_evaluation/core/metrics/deepeval.py (3)
src/lightspeed_evaluation/core/metrics/geval.py (2)
  • GEvalHandler (20-499)
  • evaluate (120-193)
src/lightspeed_evaluation/core/llm/manager.py (4)
  • LLMManager (10-117)
  • get_config (105-107)
  • get_model_name (91-93)
  • get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (2)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (92-129)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/system.py (1)
  • GEvalConfig (267-287)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/metrics/__init__.py (1)

8-13: No issues found. The __all__ exports are complete and correct.

The multi-line formatting with trailing commas is good practice. GEvalHandler is intentionally not exported because it's an internal component of DeepEvalMetrics, not a top-level metrics class like the others. The architecture is clean: GEval metrics are accessed through DeepEvalMetrics.evaluate() which routes metrics by name to either standard DeepEval implementations or to the internal GEvalHandler. All top-level metric classes are properly exported.

config/system.yaml (1)

46-52: Verify GEval metrics metadata design.

The GEval metrics defined in the registry (e.g., geval:technical_accuracy, geval:command_validity) do not have corresponding entries in the metrics_metadata section (lines 54-117). While GEval metrics carry their own metadata structure in the registry file (criteria, evaluation_params, evaluation_steps, threshold), the coding guidelines state: "Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics."

Please confirm whether this deviation is intentional, as GEval uses a separate registry-based metadata approach, or if these metrics should also have entries in the metrics_metadata section for consistency.

Based on coding guidelines.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

46-49: LGTM!

The integration of GEval registry path into DeepEval metrics initialization is clean and correctly propagates configuration from the system config.


57-57: LGTM!

The routing of GEval metrics through the existing deepeval_metrics handler is an elegant design choice that unifies metric evaluation without requiring a separate handler entry point.

src/lightspeed_evaluation/core/models/__init__.py (1)

24-24: LGTM!

The GEvalConfig export follows the established pattern for exposing configuration models in the public API.

Also applies to: 43-43

config/registry/geval_metrics.yaml (1)

1-136: LGTM with observation.

The GEval metrics registry is well-structured and follows a consistent schema across all metric definitions. The inclusion of usage examples and the caveat about generated metrics (lines 5-6) is helpful for users.

One observation: The evaluation_params use lowercase identifiers like actual_output and expected_output, which need conversion to LLMTestCaseParams enum values. Verified that src/lightspeed_evaluation/core/metrics/geval.py (lines 195-238) handles this conversion with appropriate fallback behavior when params don't match enum values.

src/lightspeed_evaluation/core/models/system.py (1)

314-317: LGTM!

The geval field integration into SystemConfig follows the established pattern for configuration sections and properly uses Field with appropriate defaults and description.

src/lightspeed_evaluation/core/metrics/deepeval.py (2)

38-68: LGTM!

The initialization properly sets up shared resources between standard DeepEval metrics and GEval, with appropriate cache configuration and clear separation of concerns. The use of a shared LLM manager promotes resource efficiency.


92-129: LGTM!

The unified evaluation routing between standard DeepEval metrics and GEval is well-implemented. The routing logic is clear, and the comment clarifying that GEval metric names should not include the "geval:" prefix is helpful for maintainability.

src/lightspeed_evaluation/core/system/loader.py (2)

108-110: LGTM!

The conditional loading of GEval metrics based on the enabled flag is correctly placed in the metric mapping population flow, and properly propagates the registry path from configuration.


183-183: LGTM!

The GEvalConfig construction follows the established pattern for configuration sections, with appropriate defaulting to an empty dictionary when the section is absent.

src/lightspeed_evaluation/core/metrics/geval.py (5)

35-49: LGTM!

The class design with a shared registry cache is efficient and appropriate for this use case. The initialization properly delegates registry loading and stores the shared LLM manager.


120-193: LGTM!

The main evaluation method has a clear, well-documented flow: configuration retrieval, validation, and delegation to the appropriate evaluation level. The error messages are informative and help with debugging.


195-238: LGTM!

The parameter conversion logic with fallback to auto-detection is a smart design choice that provides flexibility for custom evaluation parameters while maintaining type safety for standard parameters.


400-413: LGTM!

The conversation aggregation strategy of concatenating all turns into a single test case is appropriate for GEval's evaluation model. The clear turn numbering in the aggregated input/output helps maintain context.


435-499: LGTM!

The configuration resolution with clear priority order (runtime metadata overrides registry) is well-implemented. The logging at different levels (debug for source selection, warning for missing config) provides good observability.

@lpiwowar lpiwowar left a comment

I'm just passing by 🚶. Two things crossed my mind while taking a look at this PR.

Thanks for working on this:) 👍


# GEval Configuration
# Configurable custom metrics using DeepEval's GEval framework
geval:

question(s): Is there a way to make the GEval metrics configuration align with the already existing metrics so that we can extend the metrics_metadata section? Would something like this work?

metrics_metadata:
  geval_registry_path: "config/registry/geval_metrics.yaml"
  turn_metrics:
    "geval:technical_accuracy":
    ...

I'm just wondering whether we are going to introduce a separate config option for each evaluation framework we might add in the future.

@asamal4 asamal4 (Collaborator) Nov 11, 2025

I understand the rationale behind the implementation: it's driven by the evaluation technique and the non-static nature of the metrics, not the framework.
However, I agree with Lukas about integrating this into the current metric_metadata. This would simplify both the user experience and maintenance. Furthermore, users could add an entire GEval metric directly to the existing metadata (avoiding a separate registry). With this, we can fully utilize our existing flow for metric requirements validation, override logic, and data field conversion within the overall DeepEval framework.

@arin-deloatch (Contributor, Author)

Will do; thank you for the suggestions. I will get this updated.

@arin-deloatch (Contributor, Author)

Refactored/Additions

  • Added field-name mapping logic in geval.py so it aligns with the framework's existing input values (a minimal sketch follows this list)
    • Updated all evaluation_params in the system.yaml GEval metrics to use the
      data field names: query, response, expected_response, contexts
  • Updated docstrings to reflect new naming convention
  • Standardized on MetricManager for GEval
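
A minimal sketch of this kind of field-name mapping, assuming a direct translation from the framework's data field names to DeepEval's LLMTestCaseParams; the helper name and the fall-back-to-None behavior are illustrative:

from deepeval.test_case import LLMTestCaseParams

# Framework data field names -> DeepEval test-case parameters (illustrative mapping).
FIELD_NAME_MAP = {
    "query": LLMTestCaseParams.INPUT,
    "response": LLMTestCaseParams.ACTUAL_OUTPUT,
    "expected_response": LLMTestCaseParams.EXPECTED_OUTPUT,
    "contexts": LLMTestCaseParams.RETRIEVAL_CONTEXT,
}

def convert_evaluation_params(names: list[str]) -> list[LLMTestCaseParams] | None:
    """Map configured parameter names to enum values.

    Returns None when any name is unrecognized, so GEval can fall back to
    auto-detecting the fields it needs (mirrors the fallback described in review).
    """
    converted = []
    for name in names:
        if name not in FIELD_NAME_MAP:
            return None
        converted.append(FIELD_NAME_MAP[name])
    return converted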

Removed

Collaborator:

Thank you. This simplifies the logic a lot.

- actual_output
- expected_output
evaluation_steps:
- "Verify that the Ansible syntax is valid and follows YAML formatting rules"
Contributor:

suggestion (non-blocking): I would personally keep only product-agnostic metrics in this default file, i.e. metrics that can theoretically be used by anybody.

Any examples of product-specific metrics should live in the docs, showcasing how people can write their own metrics, IMO.

@asamal4 asamal4 left a comment

Thank you for adding this..

Comment on lines 17 to 20
evaluation_params:
- input
- actual_output
- expected_output
Collaborator:

These names follow DeepEval's conventions. However, they will create confusion, as our eval data expects different names.
Can we align these with the input data names and rename them internally?

@arin-deloatch (Contributor, Author)

✅ Fixed this here; GEval is now aligned with the rest of the framework.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)

97-138: Clarify docstring regarding metric_name prefix handling.

The docstring states that metric_name "should NOT include 'geval:' prefix" (line 110), but the implementation defensively strips the prefix if present (lines 127-131). While this defensive approach is good for robustness, consider updating the docstring to reflect the actual behavior:

         Args:
-            metric_name: Name of metric (for GEval, this should NOT include "geval:" prefix)
+            metric_name: Name of metric (for GEval, "geval:" prefix will be stripped if present)
             conv_data: Conversation data object
             scope: EvaluationScope containing turn info and conversation flag
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 88ba187 and c83e295.

📒 Files selected for processing (5)
  • config/system.yaml (4 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • DeepEvalMetrics (31-193)
src/lightspeed_evaluation/core/metrics/manager.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)
src/lightspeed_evaluation/core/metrics/geval.py (2)
  • GEvalHandler (19-433)
  • evaluate (45-116)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-166)
src/lightspeed_evaluation/core/llm/manager.py (4)
  • LLMManager (10-117)
  • get_config (105-107)
  • get_model_name (91-93)
  • get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (3)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-166)
  • get_metric_metadata (50-82)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (97-138)
🔇 Additional comments (12)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

46-49: LGTM! DeepEvalMetrics initialization updated correctly.

The addition of metric_manager parameter aligns with the GEval integration, enabling access to metric metadata for both standard and custom metrics.


57-58: LGTM! GEval routing configured correctly.

The routing configuration correctly directs GEval metrics through the unified deepeval_metrics handler, and the comment clearly documents this design decision.

src/lightspeed_evaluation/core/metrics/deepeval.py (2)

1-29: LGTM! Module setup and imports are appropriate.

The updated docstring clearly describes the dual support for standard and GEval metrics, and the logger initialization follows best practices.


39-66: LGTM! Constructor properly extended for GEval integration.

The addition of metric_manager parameter and initialization of GEvalHandler with shared LLM resources follows the unified design. Type hints comply with coding guidelines.

src/lightspeed_evaluation/core/metrics/manager.py (2)

50-82: LGTM! Metadata retrieval method is well-designed.

The method properly implements the priority hierarchy (runtime metadata → system defaults) and includes comprehensive documentation. Type hints and docstring comply with coding guidelines.


84-110: LGTM! Threshold extraction properly refactored.

The method cleanly delegates to get_metric_metadata() and safely extracts the threshold value. This refactoring improves code organization and supports the richer metadata requirements for GEval integration.

src/lightspeed_evaluation/core/metrics/geval.py (6)

1-44: LGTM! Class structure and initialization are well-designed.

The module docstring clearly describes the GEval integration, and the constructor properly accepts the required dependencies. Type hints and Google-style docstrings comply with coding guidelines.


45-116: LGTM! Main evaluation method is well-structured.

The method properly validates configuration, extracts parameters with sensible defaults, and routes to the appropriate evaluation level. Documentation is comprehensive and error handling is appropriate.


118-176: Parameter conversion logic is robust.

The method properly maps evaluation data field names to DeepEval enum values with appropriate fallbacks. The decision to return None when encountering any unrecognized parameter (line 173) allows GEval to auto-detect fields, which is a reasonable defensive approach.


178-283: LGTM! Turn-level evaluation is comprehensive.

The method properly handles parameter conversion, builds the GEval metric with appropriate configuration, constructs test cases with optional fields, and includes robust error handling with detailed logging. The conditional logic for evaluation_params (lines 228-236) correctly handles all scenarios.


285-381: LGTM! Conversation-level evaluation properly aggregates turns.

The method correctly aggregates conversation turns into a single test case and evaluates them holistically. The turn aggregation logic (lines 350-352) is clear, and error handling includes helpful diagnostic logging. The comment about top_logprobs limit (line 326) is useful context.
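
A hedged sketch of the turn-aggregation idea referenced here; the turn attribute names (.query, .response) and the "Turn N:" labeling are assumptions, not the project's actual TurnData fields:

from deepeval.test_case import LLMTestCase

def build_conversation_test_case(turns) -> LLMTestCase:
    """Concatenate all turns into one test case for conversation-level GEval.

    Illustrative only: `turns` is assumed to be an iterable of objects with
    .query and .response attributes.
    """
    turns = list(turns)
    aggregated_input = "\n".join(
        f"Turn {i}: {turn.query}" for i, turn in enumerate(turns, start=1)
    )
    aggregated_output = "\n".join(
        f"Turn {i}: {turn.response}" for i, turn in enumerate(turns, start=1)
    )
    return LLMTestCase(input=aggregated_input, actual_output=aggregated_output)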


383-433: LGTM! Configuration retrieval properly leverages MetricManager.

The method correctly constructs the metric identifier with the geval: prefix and delegates to MetricManager.get_metric_metadata() to leverage the priority hierarchy. Logging at appropriate levels aids debugging.

provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
model: "gpt-4o-mini" # Model name for the provider
provider: vertex # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
model: gemini-2.0-flash # Model name for the provider
Collaborator:

Please keep the existing defaults as they are.

@arin-deloatch (Contributor, Author)

This slipped through, sorry about that. I'll get that fixed.

@arin-deloatch arin-deloatch changed the title PR 2/3 — GEval Configuration, Registry and Settings Integration GEval Integration Nov 13, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/metrics/manager.py (1)

76-83: Avoid leaking shared metadata state

We now hand out the same dict object that lives inside turn_metrics_metadata / system defaults. Any caller that tweaks the returned metadata (for example, to add derived fields at runtime) will mutate the underlying config for all subsequent evaluations. Please return a defensive copy before handing it to callers to keep per-request state isolated.

You could apply something like:

+import copy
 ...
-        if metric_identifier in level_metadata:
-            return level_metadata[metric_identifier]
+        if metric_identifier in level_metadata:
+            return copy.deepcopy(level_metadata[metric_identifier])
 ...
-        return system_metadata.get(metric_identifier)
+        metadata = system_metadata.get(metric_identifier)
+        return copy.deepcopy(metadata) if metadata else None
src/lightspeed_evaluation/core/metrics/geval.py (1)

217-243: Consider extracting duplicated evaluation_params logic.

The logic for conditionally setting evaluation_params (lines 228-236) is duplicated in _evaluate_conversation (lines 329-336). While the behavior is correct, extracting this into a small helper would reduce duplication and make the logic easier to maintain.

Consider creating a helper method:

def _prepare_metric_kwargs(
    self,
    criteria: str,
    converted_params: list[LLMTestCaseParams] | None,
    evaluation_params: list[str],
    evaluation_steps: list[str] | None,
    threshold: float,
    name: str,
) -> dict[str, Any]:
    """Prepare kwargs for GEval metric instantiation."""
    metric_kwargs: dict[str, Any] = {
        "name": name,
        "criteria": criteria,
        "model": self.deepeval_llm_manager.get_llm(),
        "threshold": threshold,
        "top_logprobs": 5,
    }
    
    if converted_params is None:
        if not evaluation_params:
            metric_kwargs["evaluation_params"] = [
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
            ]
    else:
        metric_kwargs["evaluation_params"] = converted_params
    
    if evaluation_steps:
        metric_kwargs["evaluation_steps"] = evaluation_steps
    
    return metric_kwargs

Then call it from both _evaluate_turn and _evaluate_conversation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c83e295 and 0e330ef.

📒 Files selected for processing (9)
  • config/system.yaml (4 hunks)
  • src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
  • src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
  • tests/unit/core/metrics/test_geval.py (1 hunks)
  • tests/unit/core/metrics/test_manager.py (1 hunks)
  • tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • config/system.yaml
  • src/lightspeed_evaluation/core/metrics/__init__.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/metrics/test_manager.py
  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/geval.py
  • tests/unit/core/metrics/test_geval.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/core/metrics/test_geval.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/metrics/deepeval.py (5)
src/lightspeed_evaluation/core/metrics/geval.py (2)
  • GEvalHandler (19-433)
  • evaluate (45-116)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-166)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationScope (356-367)
  • TurnData (35-261)
src/lightspeed_evaluation/core/llm/manager.py (4)
  • LLMManager (10-117)
  • get_config (105-107)
  • get_model_name (91-93)
  • get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricManager (17-166)
  • get_metric_metadata (50-82)
  • MetricLevel (10-14)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (35-261)
  • EvaluationData (264-311)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (267-297)
src/lightspeed_evaluation/core/metrics/geval.py (3)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-166)
  • get_metric_metadata (50-82)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • evaluate (97-138)
tests/unit/core/metrics/test_geval.py (2)
src/lightspeed_evaluation/core/metrics/geval.py (3)
  • GEvalHandler (19-433)
  • _convert_evaluation_params (118-176)
  • _get_geval_config (383-433)
src/lightspeed_evaluation/core/metrics/manager.py (2)
  • MetricLevel (10-14)
  • get_metric_metadata (50-82)
🔇 Additional comments (5)
src/lightspeed_evaluation/core/metrics/geval.py (5)

1-43: LGTM! Clean initialization and documentation.

The module setup, imports, and constructor are well-structured. The handler correctly accepts both DeepEvalLLMManager and MetricManager dependencies for runtime metric evaluation.


45-116: LGTM! Solid orchestration logic.

The evaluate() method properly retrieves configuration, validates criteria, and delegates to the appropriate evaluator based on the evaluation level.


118-176: LGTM! Robust parameter conversion with fallback.

The helper correctly maps evaluation data field names to LLMTestCaseParams enums and falls back to direct enum lookup for backward compatibility. Returning None for unrecognized params allows GEval's auto-detection to take over.


285-381: LGTM! Conversation aggregation logic is sound.

The method correctly aggregates all turns into a single test case for conversation-level evaluation. The formatting is straightforward and should work well with GEval's processing.

Note: The score defaulting and duplicate evaluation_params logic issues mentioned in previous comments also apply here (line 363 and lines 329-336).


383-433: LGTM! Clean integration with MetricManager.

The method properly delegates configuration retrieval to MetricManager, respecting the priority hierarchy (turn-level → conversation-level → system defaults). The logging is helpful for debugging missing configurations.

Comment on lines +264 to +270
score = metric.score if metric.score is not None else 0.0
reason = (
str(metric.reason)
if hasattr(metric, "reason") and metric.reason
else "No reason provided"
)
return score, reason
Contributor:

⚠️ Potential issue | 🟡 Minor

Defaulting None score to 0.0 might mask evaluation failures.

If metric.score is None, the current implementation returns 0.0 as the score. This could make it difficult to distinguish between a metric that legitimately evaluated to zero versus one that failed to produce a score at all. Consider returning None as the score when metric.score is None, which better signals that evaluation didn't complete successfully.

Apply this diff:

-            score = metric.score if metric.score is not None else 0.0
+            score = metric.score

The same pattern appears in _evaluate_conversation at line 363 and should be updated consistently.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/metrics/geval.py around lines 264 to 270 (and
likewise update the similar pattern at _evaluate_conversation around line 363),
the code currently substitutes metric.score None with 0.0 which masks failed
evaluations; change the logic to return None when metric.score is None (keep
numeric scores unchanged) and preserve reason extraction as-is so callers can
detect a missing score vs a real zero; update both places to return (None,
reason) when metric.score is None instead of (0.0, reason).

@arin-deloatch (Contributor, Author)

Additions/Modifications in 0e330ef

@asamal4 asamal4 left a comment

@VladimirKadlec VladimirKadlec left a comment

LGTM, thank you!

# LLM as a judge configuration
llm:
provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
Contributor:

nit: only spaces, I'd keep the original indent

# But can be easily integrated with other APIs with minimal change.
api:
enabled: true # Enable API calls instead of using pre-filled data
enabled: true # Enable API calls instead of using pre-filled data
Contributor:

nit: only spaces, I'd keep the original indent

@tisnik tisnik left a comment

LGTM

@tisnik tisnik merged commit 39d7119 into lightspeed-core:main Nov 20, 2025
15 checks passed