GEval Integration #97
Conversation
Walkthrough
Adds GEval metric support: new GEvalHandler class, routes "geval:" metrics through DeepEvalMetrics, expands MetricManager to return full metric metadata, updates evaluator wiring and configuration with two GEval metrics in system.yaml, and adjusts exports formatting and tests.
Sequence Diagram(s)
sequenceDiagram
participant Evaluator
participant DeepEvalMetrics
participant GEvalHandler
participant MetricManager
participant DeepEvalLib
Evaluator->>DeepEvalMetrics: evaluate(metric_name, conv_data, turn_idx, turn_data, ...)
alt metric_name starts with "geval:" or routed as geval
DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, turn_data, is_conversation)
GEvalHandler->>MetricManager: get_metric_metadata(metric_name, level, conv_data, turn_data)
MetricManager-->>GEvalHandler: metadata (criteria, params, steps, threshold)
alt is_conversation
GEvalHandler->>GEvalHandler: _evaluate_conversation(...)
GEvalHandler->>DeepEvalLib: GEval.measure(test_case, params...)
else
GEvalHandler->>GEvalHandler: _evaluate_turn(...)
GEvalHandler->>DeepEvalLib: GEval.measure(test_case, params...)
end
DeepEvalLib-->>GEvalHandler: score, reason
GEvalHandler-->>DeepEvalMetrics: (score, reason)
else standard DeepEval metric
DeepEvalMetrics->>DeepEvalLib: evaluate via supported_metrics
DeepEvalLib-->>DeepEvalMetrics: score, reason
end
DeepEvalMetrics-->>Evaluator: (score, reason)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/system/loader.py (1)
67-80: Consider distinguishing turn-level vs conversation-level metrics in the registry.
The current implementation adds all GEval metrics to both TURN_LEVEL_METRICS and CONVERSATION_LEVEL_METRICS sets (lines 78-79), with validation deferred to evaluation time. While the comment explains this design choice, it may lead to confusing validation errors when users specify a conversation-level metric for turn-level evaluation (or vice versa). Consider enhancing the registry schema to include a level field (e.g., level: "turn" or level: "conversation"), allowing the loader to route metrics to the appropriate set. This would provide earlier validation and clearer error messages for mismatched metric usage.
Example registry enhancement:

technical_accuracy:
  level: "turn"  # or "conversation" or "both"
  criteria: |
    ...

And in the loader:

for metric_name, metric_config in registry.items():
    metric_identifier = f"geval:{metric_name}"
    level = metric_config.get("level", "both")
    if level in ("turn", "both"):
        TURN_LEVEL_METRICS.add(metric_identifier)
    if level in ("conversation", "both"):
        CONVERSATION_LEVEL_METRICS.add(metric_identifier)

src/lightspeed_evaluation/core/metrics/geval.py (1)
323-331: Consider returning None for missing scores instead of defaulting to 0.0.
Line 325 defaults the score to 0.0 when metric.score is None:

score = metric.score if metric.score is not None else 0.0

This could mask evaluation failures where GEval couldn't compute a score. Since the return type is tuple[float | None, str], consider returning None for the score when it's unavailable, allowing the caller to handle this case explicitly (e.g., marking as ERROR status).
Apply this diff:

 try:
     metric.measure(test_case)
-    score = metric.score if metric.score is not None else 0.0
+    score = metric.score  # Can be None if evaluation fails
     reason = (
         str(metric.reason)
         if hasattr(metric, "reason") and metric.reason

Note: this same pattern appears in _evaluate_conversation (line 418), so consider updating both methods consistently.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
config/registry/geval_metrics.yaml (1 hunks)
config/system.yaml (1 hunks)
src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
src/lightspeed_evaluation/core/models/__init__.py (2 hunks)
src/lightspeed_evaluation/core/models/system.py (2 hunks)
src/lightspeed_evaluation/core/system/loader.py (5 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
config/**/*.{yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Use YAML for configuration files under config/
Files:
config/registry/geval_metrics.yaml, config/system.yaml
config/system.yaml
📄 CodeRabbit inference engine (AGENTS.md)
config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Files:
config/system.yaml
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py, src/lightspeed_evaluation/core/models/__init__.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py, src/lightspeed_evaluation/core/models/__init__.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Applied to files:
config/registry/geval_metrics.yaml, config/system.yaml, src/lightspeed_evaluation/core/models/system.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)
Applied to files:
config/registry/geval_metrics.yaml, config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: When adding a new feature, also add a new sample evaluation data YAML file
Applied to files:
config/registry/geval_metrics.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Applied to files:
config/system.yaml
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/core/models/__init__.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/system.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
src/lightspeed_evaluation/core/metrics/__init__.py, src/lightspeed_evaluation/core/system/loader.py, src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/deepeval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/system/loader.py (1)
src/lightspeed_evaluation/core/models/system.py (1)
GEvalConfig(267-287)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
DeepEvalMetrics(30-184)
src/lightspeed_evaluation/core/metrics/deepeval.py (3)
src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (20-499), evaluate (120-193)
src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (2)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (92-129)
src/lightspeed_evaluation/core/models/__init__.py (1)
src/lightspeed_evaluation/core/models/system.py (1)
GEvalConfig(267-287)
🔇 Additional comments (16)
src/lightspeed_evaluation/core/metrics/__init__.py (1)
8-13: No issues found. The __all__ exports are complete and correct.
The multi-line formatting with trailing commas is good practice. GEvalHandler is intentionally not exported because it's an internal component of DeepEvalMetrics, not a top-level metrics class like the others. The architecture is clean: GEval metrics are accessed through DeepEvalMetrics.evaluate(), which routes metrics by name to either standard DeepEval implementations or to the internal GEvalHandler. All top-level metric classes are properly exported.
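For readers skimming the thread, a rough, self-contained sketch of that name-based routing is shown below; the function and parameter names are illustrative assumptions, not the project's actual API.

from typing import Callable

GEVAL_PREFIX = "geval:"  # prefix convention introduced in this PR

def route_metric(
    metric_name: str,
    standard_metrics: dict[str, Callable[[], tuple[float | None, str]]],
    geval_evaluate: Callable[[str], tuple[float | None, str]],
) -> tuple[float | None, str]:
    """Dispatch 'geval:*' metrics to the GEval handler, others to standard DeepEval."""
    if metric_name.startswith(GEVAL_PREFIX):
        # Strip the prefix so the handler receives the bare registry name.
        return geval_evaluate(metric_name[len(GEVAL_PREFIX):])
    return standard_metrics[metric_name]()

In the real code this dispatch lives inside DeepEvalMetrics.evaluate(), which is why GEvalHandler never needs to appear in the package's public exports.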
config/system.yaml (1)
46-52: Verify GEval metrics metadata design.
The GEval metrics defined in the registry (e.g., geval:technical_accuracy, geval:command_validity) do not have corresponding entries in the metrics_metadata section (lines 54-117). While GEval metrics carry their own metadata structure in the registry file (criteria, evaluation_params, evaluation_steps, threshold), the coding guidelines state: "Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics." Please confirm whether this deviation is intentional, as GEval uses a separate registry-based metadata approach, or whether these metrics should also have entries in the metrics_metadata section for consistency.
Based on coding guidelines.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
46-49: LGTM!
The integration of the GEval registry path into DeepEval metrics initialization is clean and correctly propagates configuration from the system config.
57-57: LGTM!
The routing of GEval metrics through the existing deepeval_metrics handler is an elegant design choice that unifies metric evaluation without requiring a separate handler entry point.
src/lightspeed_evaluation/core/models/__init__.py (1)
24-24: LGTM!
The GEvalConfig export follows the established pattern for exposing configuration models in the public API.
Also applies to: 43-43
config/registry/geval_metrics.yaml (1)
1-136: LGTM with observation.
The GEval metrics registry is well-structured and follows a consistent schema across all metric definitions. The inclusion of usage examples and the caveat about generated metrics (lines 5-6) is helpful for users.
One observation: the evaluation_params use lowercase identifiers like actual_output and expected_output, which need conversion to LLMTestCaseParams enum values. Verified that src/lightspeed_evaluation/core/metrics/geval.py (lines 195-238) handles this conversion with appropriate fallback behavior when params don't match enum values.
src/lightspeed_evaluation/core/models/system.py (1)
314-317: LGTM!The
gevalfield integration intoSystemConfigfollows the established pattern for configuration sections and properly usesFieldwith appropriate defaults and description.src/lightspeed_evaluation/core/metrics/deepeval.py (2)
38-68: LGTM!
The initialization properly sets up shared resources between standard DeepEval metrics and GEval, with appropriate cache configuration and clear separation of concerns. The use of a shared LLM manager promotes resource efficiency.
92-129: LGTM!The unified evaluation routing between standard DeepEval metrics and GEval is well-implemented. The routing logic is clear, and the comment clarifying that GEval metric names should not include the "geval:" prefix is helpful for maintainability.
src/lightspeed_evaluation/core/system/loader.py (2)
108-110: LGTM!
The conditional loading of GEval metrics based on the enabled flag is correctly placed in the metric mapping population flow, and properly propagates the registry path from configuration.
183-183: LGTM!
The GEvalConfig construction follows the established pattern for configuration sections, with appropriate defaulting to an empty dictionary when the section is absent.
src/lightspeed_evaluation/core/metrics/geval.py (5)
35-49: LGTM!
The class design with a shared registry cache is efficient and appropriate for this use case. The initialization properly delegates registry loading and stores the shared LLM manager.
120-193: LGTM!
The main evaluation method has a clear, well-documented flow: configuration retrieval, validation, and delegation to the appropriate evaluation level. The error messages are informative and help with debugging.
195-238: LGTM!
The parameter conversion logic with fallback to auto-detection is a smart design choice that provides flexibility for custom evaluation parameters while maintaining type safety for standard parameters.
400-413: LGTM!
The conversation aggregation strategy of concatenating all turns into a single test case is appropriate for GEval's evaluation model. The clear turn numbering in the aggregated input/output helps maintain context.
435-499: LGTM!
The configuration resolution with clear priority order (runtime metadata overrides registry) is well-implemented. The logging at different levels (debug for source selection, warning for missing config) provides good observability.
e7758e3 to 88ba187
lpiwowar
left a comment
I'm just passing by 🚶. Two things crossed my mind while taking a look at this PR.
Thanks for working on this:) 👍
config/system.yaml (outdated excerpt)

# GEval Configuration
# Configurable custom metrics using DeepEval's GEval framework
geval:
question(s): Is there a way to make the geval metrics configuration align with the already existing metrics so that we can extend the metrics_metadata section? Would something like this work?
metrics_metadata:
  geval_registry_path: "config/registry/geval_metrics.yaml"
  turn_metrics:
    "geval:technical_accuracy":
      ...

I'm just wondering whether we are going to introduce a separate config option for each evaluation framework we might add in the future.
I understand the rationale behind the implementation: it's driven by the evaluation technique and the non-static nature of the metrics, not the framework.
However, I agree with Lukas about integrating this into the current metric_metadata. This would simplify both the user experience and maintenance. Furthermore, users can add the entire geval metric directly to the existing metadata (avoiding a separate registry). With this we can fully utilize our existing flow for metric requirements validation, override logic, and data field conversion within the overall deepeval frameworks.
Will do; thank you for the suggestions. I will get this updated.
Refactored/Additions
- Added field name mapping logic in geval.py to be more aligned with the current input values (a rough sketch of the idea appears below this list)
- Updated all evaluation_params in system.yaml GEval metrics to use data field names: query, response, expected_response, contexts
- Updated docstrings to reflect the new naming convention
- Standardized on MetricManager for GEval
- Added get_metric_metadata to get the complete metadata dictionary (simplifies GEval integration); refactored get_effective_threshold to use the unified metadata getter

Removed
- Obsolete GEval registry configuration and any additional GEval dependencies: GEvalConfig, registry loading in loader.py
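As referenced in the first item above, here is a rough sketch of that kind of field-name mapping. It assumes DeepEval's LLMTestCaseParams enum; the dictionary contents and function name are illustrative, and mapping contexts to RETRIEVAL_CONTEXT in particular is an assumption rather than a confirmed detail.

from deepeval.test_case import LLMTestCaseParams

# Illustrative mapping from evaluation-data field names to DeepEval parameters.
FIELD_TO_PARAM = {
    "query": LLMTestCaseParams.INPUT,
    "response": LLMTestCaseParams.ACTUAL_OUTPUT,
    "expected_response": LLMTestCaseParams.EXPECTED_OUTPUT,
    "contexts": LLMTestCaseParams.RETRIEVAL_CONTEXT,  # assumed mapping
}

def convert_evaluation_params(names: list[str]) -> list[LLMTestCaseParams] | None:
    """Map data field names to LLMTestCaseParams; None lets GEval auto-detect."""
    converted = []
    for name in names:
        if name not in FIELD_TO_PARAM:
            # Unrecognized field: fall back to GEval's own auto-detection.
            return None
        converted.append(FIELD_TO_PARAM[name])
    return converted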
Thank you. This simplifies the logic a lot.
config/registry/geval_metrics.yaml (outdated excerpt)

  - actual_output
  - expected_output
evaluation_steps:
  - "Verify that the Ansible syntax is valid and follows YAML formatting rules"
suggestion (non-blocking): I would personally keep only product-agnostic metrics in this default file, i.e. metrics that can theoretically be used by anybody.
Any examples of product-specific metrics should live in the docs, showcasing how people can write their own metrics, IMO.
asamal4
left a comment
Thank you for adding this..
config/registry/geval_metrics.yaml (outdated excerpt)

evaluation_params:
  - input
  - actual_output
  - expected_output
These names are as per DeepEval. However, they will create confusion, as our eval data expects different names.
Can we align these with the input data field names and rename them internally?
✅ Fixed this here; GEval is now aligned with the rest of the framework.
88ba187 to c83e295
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
97-138: Clarify docstring regarding metric_name prefix handling.
The docstring states that metric_name "should NOT include 'geval:' prefix" (line 110), but the implementation defensively strips the prefix if present (lines 127-131). While this defensive approach is good for robustness, consider updating the docstring to reflect the actual behavior:

 Args:
-    metric_name: Name of metric (for GEval, this should NOT include "geval:" prefix)
+    metric_name: Name of metric (for GEval, "geval:" prefix will be stripped if present)
     conv_data: Conversation data object
     scope: EvaluationScope containing turn info and conversation flag
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
config/system.yaml (4 hunks)
src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- config/system.yaml
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py, src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (6)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/manager.py, src/lightspeed_evaluation/core/metrics/deepeval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (4)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
DeepEvalMetrics(31-193)
src/lightspeed_evaluation/core/metrics/manager.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(264-311)TurnData(35-261)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)
src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (19-433), evaluate (45-116)
src/lightspeed_evaluation/core/metrics/manager.py (1): MetricManager (17-166)
src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/geval.py (3)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/manager.py (3): MetricLevel (10-14), MetricManager (17-166), get_metric_metadata (50-82)
src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (97-138)
🔇 Additional comments (12)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
46-49: LGTM! DeepEvalMetrics initialization updated correctly.
The addition of the metric_manager parameter aligns with the GEval integration, enabling access to metric metadata for both standard and custom metrics.
57-58: LGTM! GEval routing configured correctly.
The routing configuration correctly directs GEval metrics through the unified deepeval_metrics handler, and the comment clearly documents this design decision.
src/lightspeed_evaluation/core/metrics/deepeval.py (2)
1-29: LGTM! Module setup and imports are appropriate.
The updated docstring clearly describes the dual support for standard and GEval metrics, and the logger initialization follows best practices.
39-66: LGTM! Constructor properly extended for GEval integration.
The addition of the metric_manager parameter and the initialization of GEvalHandler with shared LLM resources follow the unified design. Type hints comply with coding guidelines.
src/lightspeed_evaluation/core/metrics/manager.py (2)
50-82: LGTM! Metadata retrieval method is well-designed.
The method properly implements the priority hierarchy (runtime metadata → system defaults) and includes comprehensive documentation. Type hints and docstring comply with coding guidelines.
84-110: LGTM! Threshold extraction properly refactored.
The method cleanly delegates to get_metric_metadata() and safely extracts the threshold value. This refactoring improves code organization and supports the richer metadata requirements for GEval integration.
src/lightspeed_evaluation/core/metrics/geval.py (6)
1-44: LGTM! Class structure and initialization are well-designed.
The module docstring clearly describes the GEval integration, and the constructor properly accepts the required dependencies. Type hints and Google-style docstrings comply with coding guidelines.
45-116: LGTM! Main evaluation method is well-structured.
The method properly validates configuration, extracts parameters with sensible defaults, and routes to the appropriate evaluation level. Documentation is comprehensive and error handling is appropriate.
118-176: Parameter conversion logic is robust.
The method properly maps evaluation data field names to DeepEval enum values with appropriate fallbacks. The decision to return None when encountering any unrecognized parameter (line 173) allows GEval to auto-detect fields, which is a reasonable defensive approach.
178-283: LGTM! Turn-level evaluation is comprehensive.
The method properly handles parameter conversion, builds the GEval metric with appropriate configuration, constructs test cases with optional fields, and includes robust error handling with detailed logging. The conditional logic for evaluation_params (lines 228-236) correctly handles all scenarios.
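To make the turn-level flow concrete, here is a minimal stand-alone example built directly on DeepEval's public GEval and LLMTestCase classes. The metric name, criteria, and threshold are placeholders, and the judge model is left to DeepEval's default rather than the framework's shared LLM manager:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def evaluate_turn_example(query: str, response: str, expected: str) -> tuple[float | None, str]:
    """Score a single turn with a custom GEval metric and return (score, reason)."""
    metric = GEval(
        name="technical_accuracy",  # placeholder metric name
        criteria="Judge whether the response answers the query accurately.",
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.7,
    )
    test_case = LLMTestCase(input=query, actual_output=response, expected_output=expected)
    metric.measure(test_case)  # calls the judge LLM
    reason = str(metric.reason) if metric.reason else "No reason provided"
    return metric.score, reason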
285-381: LGTM! Conversation-level evaluation properly aggregates turns.
The method correctly aggregates conversation turns into a single test case and evaluates them holistically. The turn aggregation logic (lines 350-352) is clear, and error handling includes helpful diagnostic logging. The comment about the top_logprobs limit (line 326) is useful context.
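A simplified sketch of the turn-aggregation idea; the numbering format and dictionary keys are assumptions rather than the exact strings the handler produces:

def aggregate_turns(turns: list[dict[str, str]]) -> tuple[str, str]:
    """Concatenate per-turn queries and responses into one input/output pair."""
    inputs, outputs = [], []
    for i, turn in enumerate(turns, start=1):
        inputs.append(f"Turn {i}: {turn['query']}")
        outputs.append(f"Turn {i}: {turn['response']}")
    return "\n".join(inputs), "\n".join(outputs)

# The aggregated pair would then back a single LLMTestCase for conversation-level GEval.
conv_input, conv_output = aggregate_turns(
    [{"query": "How do I list pods?", "response": "Run `kubectl get pods`."}]
)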
383-433: LGTM! Configuration retrieval properly leverages MetricManager.
The method correctly constructs the metric identifier with the geval: prefix and delegates to MetricManager.get_metric_metadata() to leverage the priority hierarchy. Logging at appropriate levels aids debugging.
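To illustrate the resolution order being described, here is a deliberately simplified, hypothetical sketch; the real MetricManager API takes richer arguments and also distinguishes turn-level from conversation-level metadata:

from typing import Any

def resolve_geval_config(
    metric_name: str,
    runtime_metadata: dict[str, dict[str, Any]],
    system_defaults: dict[str, dict[str, Any]],
) -> dict[str, Any] | None:
    """Look up a GEval metric's metadata, preferring runtime overrides."""
    metric_id = f"geval:{metric_name}"  # e.g. "geval:technical_accuracy"
    if metric_id in runtime_metadata:
        return runtime_metadata[metric_id]
    return system_defaults.get(metric_id)

# Example: no runtime override, so the system default wins.
config = resolve_geval_config(
    "technical_accuracy",
    runtime_metadata={},
    system_defaults={"geval:technical_accuracy": {"criteria": "Be accurate.", "threshold": 0.7}},
)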
config/system.yaml (outdated excerpt)

-  provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
-  model: "gpt-4o-mini" # Model name for the provider
+  provider: vertex # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
+  model: gemini-2.0-flash # Model name for the provider
Please keep the existing defaults as they are.
This slipped through, sorry about that. I'll get that fixed.
…nd additions to system.yaml structure
c83e295 to 0e330ef
Actionable comments posted: 1
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/metrics/manager.py (1)
76-83: Avoid leaking shared metadata state.
We now hand out the same dict object that lives inside turn_metrics_metadata / system defaults. Any caller that tweaks the returned metadata (for example, to add derived fields at runtime) will mutate the underlying config for all subsequent evaluations. Please return a defensive copy before handing it to callers to keep per-request state isolated.
You could apply something like:

+import copy
 ...
-    if metric_identifier in level_metadata:
-        return level_metadata[metric_identifier]
+    if metric_identifier in level_metadata:
+        return copy.deepcopy(level_metadata[metric_identifier])
 ...
-    return system_metadata.get(metric_identifier)
+    metadata = system_metadata.get(metric_identifier)
+    return copy.deepcopy(metadata) if metadata else None

src/lightspeed_evaluation/core/metrics/geval.py (1)
217-243: Consider extracting duplicated evaluation_params logic.
The logic for conditionally setting evaluation_params (lines 228-236) is duplicated in _evaluate_conversation (lines 329-336). While the behavior is correct, extracting this into a small helper would reduce duplication and make the logic easier to maintain.
Consider creating a helper method:

def _prepare_metric_kwargs(
    self,
    criteria: str,
    converted_params: list[LLMTestCaseParams] | None,
    evaluation_params: list[str],
    evaluation_steps: list[str] | None,
    threshold: float,
    name: str,
) -> dict[str, Any]:
    """Prepare kwargs for GEval metric instantiation."""
    metric_kwargs: dict[str, Any] = {
        "name": name,
        "criteria": criteria,
        "model": self.deepeval_llm_manager.get_llm(),
        "threshold": threshold,
        "top_logprobs": 5,
    }
    if converted_params is None:
        if not evaluation_params:
            metric_kwargs["evaluation_params"] = [
                LLMTestCaseParams.INPUT,
                LLMTestCaseParams.ACTUAL_OUTPUT,
            ]
    else:
        metric_kwargs["evaluation_params"] = converted_params
    if evaluation_steps:
        metric_kwargs["evaluation_steps"] = evaluation_steps
    return metric_kwargs

Then call it from both _evaluate_turn and _evaluate_conversation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
config/system.yaml (4 hunks)
src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1 hunks)
tests/unit/core/metrics/test_geval.py (1 hunks)
tests/unit/core/metrics/test_manager.py (1 hunks)
tests/unit/pipeline/evaluation/test_evaluator.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- config/system.yaml
- src/lightspeed_evaluation/core/metrics/__init__.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/core/metrics/test_manager.py, src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py, tests/unit/core/metrics/test_geval.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/core/metrics/test_geval.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(264-311)TurnData(35-261)
src/lightspeed_evaluation/core/metrics/deepeval.py (5)
src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (19-433), evaluate (45-116)
src/lightspeed_evaluation/core/metrics/manager.py (1): MetricManager (17-166)
src/lightspeed_evaluation/core/models/data.py (2): EvaluationScope (356-367), TurnData (35-261)
src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
tests/unit/core/metrics/test_manager.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3): MetricManager (17-166), get_metric_metadata (50-82), MetricLevel (10-14)
src/lightspeed_evaluation/core/models/data.py (2): TurnData (35-261), EvaluationData (264-311)
src/lightspeed_evaluation/core/models/system.py (1): SystemConfig (267-297)
src/lightspeed_evaluation/core/metrics/geval.py (3)
src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
src/lightspeed_evaluation/core/metrics/manager.py (3): MetricLevel (10-14), MetricManager (17-166), get_metric_metadata (50-82)
src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (97-138)
tests/unit/core/metrics/test_geval.py (2)
src/lightspeed_evaluation/core/metrics/geval.py (3): GEvalHandler (19-433), _convert_evaluation_params (118-176), _get_geval_config (383-433)
src/lightspeed_evaluation/core/metrics/manager.py (2): MetricLevel (10-14), get_metric_metadata (50-82)
🔇 Additional comments (5)
src/lightspeed_evaluation/core/metrics/geval.py (5)
1-43: LGTM! Clean initialization and documentation.
The module setup, imports, and constructor are well-structured. The handler correctly accepts both DeepEvalLLMManager and MetricManager dependencies for runtime metric evaluation.
45-116: LGTM! Solid orchestration logic.
The evaluate() method properly retrieves configuration, validates criteria, and delegates to the appropriate evaluator based on the evaluation level.
118-176: LGTM! Robust parameter conversion with fallback.
The helper correctly maps evaluation data field names to LLMTestCaseParams enums and falls back to direct enum lookup for backward compatibility. Returning None for unrecognized params allows GEval's auto-detection to take over.
285-381: LGTM! Conversation aggregation logic is sound.
The method correctly aggregates all turns into a single test case for conversation-level evaluation. The formatting is straightforward and should work well with GEval's processing.
Note: the score-defaulting and duplicate evaluation_params logic issues mentioned in previous comments also apply here (line 363 and lines 329-336).
383-433: LGTM! Clean integration with MetricManager.
The method properly delegates configuration retrieval to MetricManager, respecting the priority hierarchy (turn-level → conversation-level → system defaults). The logging is helpful for debugging missing configurations.
score = metric.score if metric.score is not None else 0.0
reason = (
    str(metric.reason)
    if hasattr(metric, "reason") and metric.reason
    else "No reason provided"
)
return score, reason
Defaulting None score to 0.0 might mask evaluation failures.
If metric.score is None, the current implementation returns 0.0 as the score. This could make it difficult to distinguish between a metric that legitimately evaluated to zero versus one that failed to produce a score at all. Consider returning None as the score when metric.score is None, which better signals that evaluation didn't complete successfully.
Apply this diff:
-    score = metric.score if metric.score is not None else 0.0
+    score = metric.score

The same pattern appears in _evaluate_conversation at line 363 and should be updated consistently.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/lightspeed_evaluation/core/metrics/geval.py around lines 264 to 270 (and
likewise update the similar pattern at _evaluate_conversation around line 363),
the code currently substitutes metric.score None with 0.0 which masks failed
evaluations; change the logic to return None when metric.score is None (keep
numeric scores unchanged) and preserve reason extraction as-is so callers can
detect a missing score vs a real zero; update both places to return (None,
reason) when metric.score is None instead of (0.0, reason).
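As a hypothetical illustration of the caller-side handling this suggestion enables (the status labels are assumptions, not the framework's actual constants):

def status_from_score(score: float | None, threshold: float) -> str:
    """Distinguish a failed evaluation (None) from a legitimate low score."""
    if score is None:
        return "ERROR"  # the metric never produced a score
    return "PASS" if score >= threshold else "FAIL"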
Additions/Modifications in 0e330ef
asamal4
left a comment
LGTM.
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
LGTM, thank you!
# LLM as a judge configuration
llm:
-  provider: "openai" # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
+  provider: "openai"  # LLM Provider (openai, watsonx, gemini, hosted_vllm etc..)
nit: only spaces, I'd keep the original indent
# But can be easily integrated with other APIs with minimal change.
api:
-  enabled: true # Enable API calls instead of using pre-filled data
+  enabled: true  # Enable API calls instead of using pre-filled data
nit: only spaces, I'd keep the original indent
tisnik
left a comment
LGTM
GEval Integration
This work originally began as a three-part PR series. After review and iteration, we’ve consolidated #96 and #98 into a single comprehensive pull request that delivers full GEval integration and accompanying unit tests within the existing evaluation framework.
Why? 🤔
Summary by CodeRabbit
New Features
Tests